A Guide to Web Scraping Rotten Tomatoes

Sunil Kumar Last Updated : 16 May, 2022

This article was published as a part of the Data Science Blogathon.

Introduction to Web Scraping

Data has fuelled the growth of industries for as long as businesses have existed. Companies have always used data to make decisions that raise profits, capture new markets, and keep them ahead of the curve. What has changed is that we now have a complete arsenal of tools that can efficiently fetch data from many different sources. Where the data comes from matters less than whether it is of good quality, and in the current age there is no richer source for gathering data than the internet itself.

In this article, we will explore how to scrape movie data such as movie names, directors, and box office collections from one of the most popular movie websites, Rotten Tomatoes. We will use Selenium, a web automation tool, as the scraper. The aim is to provide a basic working knowledge of scraping with Selenium.

What is Web Scraping?

The world moves faster than we think, and technology moves even faster. To stay relevant in a world driven by cutting-edge technology, data becomes ever more important. Decision-making relies heavily on data: the better the data, the better the outcomes. The internet generates an enormous volume of data every day, but out of all of it we need only the specific data that can aid our business decisions. It is humanly impossible to go through this much data and pick out the relevant parts; it is harder than finding a needle in a humongous haystack. This is where web scrapers come into play. Scrapers are automated bots or scripts that crawl web pages and extract the relevant data. Many of the data sets you encounter were scraped from some site.

Web Scraping Tools

There are many open-source web scraping tools for collecting data from websites. As I do not have much knowledge of tools from other languages, we will only talk about Python libraries. Popular options include Scrapy, BeautifulSoup, Pyppeteer, and Selenium. For this article, we will use Selenium to scrape data from the famous movie site Rotten Tomatoes.

What is Selenium?

The header of the official Selenium page reads:

“Selenium automates browsers. That’s it!

What you do with that power is entirely up to you.”

For starters, Selenium is a web automation tool used to automate browser interactions, and web developers worldwide use it widely for testing websites. It is not limited to testing, though; it can be put to a plethora of uses, and here we will use Selenium as a web scraper.

Brief Introduction to Selenium Scraping

Selenium has several methods up its sleeve to make our life a little less miserable. As this is a scraping article, we will discuss some of the key element-locating strategies used to find the desired data.

By ID (By.ID): This method finds the element whose id attribute matches the given ID.
By CLASS (By.CLASS_NAME): This works the same way, but instead of the ID, the scraper looks for elements with the given class name.
By CSS_SELECTOR (By.CSS_SELECTOR): A CSS selector combines an element with a selector value to identify web elements within a page.
By XPATH (By.XPATH): XPath is the language used for querying XML documents. As XML and HTML are structurally similar, we can use XPath to locate elements in HTML too. It consists of a path expression that can identify almost any component of a web page.

For this article, we will mainly use the XPath locator. You may try out the other locators too. For a brief primer on XPath, check out this article.
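To get a feel for how an XPath expression picks out elements, here is a minimal sketch that runs without a browser. It uses Python's built-in xml.etree.ElementTree, which supports only a subset of XPath, whereas Selenium hands the expression to the browser's full XPath engine. The markup below is a hypothetical, heavily simplified stand-in for a real page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup; a real page is far larger.
SAMPLE = """
<html><body>
  <h1 id="main-title">Top Movies</h1>
  <table>
    <tr><td><a class="unstyled articleLink" href="/m/movie_one">Movie One</a></td></tr>
    <tr><td><a class="unstyled articleLink" href="/m/movie_two">Movie Two</a></td></tr>
    <tr><td><a class="other" href="/about">About</a></td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(SAMPLE)

# Attribute match on id, comparable in spirit to locating by ID.
title = root.find(".//*[@id='main-title']")

# Path expression plus attribute predicate: every <a> with this exact
# class value that sits somewhere inside a <td>.
movie_links = root.findall(".//td//a[@class='unstyled articleLink']")
hrefs = [a.get("href") for a in movie_links]

print(title.text)
print(hrefs)
```

The predicate in square brackets compares the whole class attribute, so an element whose class is only "unstyled" would not match.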

Step-0: Installation

First things first. If you haven’t already installed Selenium on your system, head over to your Jupyter Notebook and type in

!pip install selenium

Or, from a terminal, type in

pip install selenium

to install Selenium on your local system.

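Next up, download ChromeDriver, choosing the build that matches your Chrome version.

Alternatively, you can use the code below in your script to download and use the Chrome driver on the fly. It relies on the webdriver-manager package, and Selenium 4 expects the driver path to be wrapped in a Service object.

from selenium.webdriver.chrome.service import Service
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

Note: I recommend you keep the Chrome browser installed on your system to avoid unnecessary trouble. Other Chromium-based browsers like Brave may or may not work correctly.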

Step-1: Import Libraries

Now, we will import the necessary libraries into our Jupyter environment. For data handling and cleaning we will use pandas, and for scraping we will use Selenium.

import pandas as pd
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

Step-2: Install or Load Chrome Drivers

If you have already downloaded the Chrome driver, load it in the current session. In Selenium 4, the path is passed through a Service object.

driver = webdriver.Chrome(service=Service('path/chromedriver'))

If you haven’t, use the code we mentioned earlier to download the Chrome driver for the current session.

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))

Step-3: Get the Web Address

Next up, we will be getting the web address we wish to scrape data from. For this article, as mentioned earlier, we will use the Rotten Tomatoes top 100 movies page, and the below code will help you access the web page.

driver.get('https://www.rottentomatoes.com/top/bestofrt/')

This will automatically load the web page in the current chrome session.

Bonus

Instead of taking a screenshot by hand and pasting it here, let’s take a screenshot of the page using Selenium.

from PIL import Image

# Save the screenshot and reopen the same file (the file name is arbitrary)
screenshot_path = 'ss.png'
driver.save_screenshot(screenshot_path)
screenshot = Image.open(screenshot_path)
screenshot.show()

Step-4: Inspect the Page

Now, we will inspect the page and find the elements we need. This page does not carry much information about each movie: there is only the movie name, the Tomatometer score, and the number of reviews. But it does have the links to all the individual movie pages. These are what we will scrape first, so that we can subsequently visit each page in a loop.

Right-click on the page and choose Inspect, or press Ctrl+Shift+J to open the developer tools.

Carefully navigate to the intended part of the site, i.e., the movie table (when you hover over a tag in the developer tools, the matching part of the page is highlighted as well).

Right below the cursor, we have the movie link under the <a> tag. If you move down and look at other movies, you will find the same structure. One thing common to every <a> tag is class = “unstyled articleLink”, and we can leverage this to get the href attributes.

movie_we = driver.find_elements(By.XPATH, '//td//a[@class = "unstyled articleLink"]')

We are using XPath to locate the specific targets. The above code snippet fetches the Selenium web elements that hold the target links. To get the actual URLs, we run the code below.

links = []
for element in movie_we:
    # Each web element is an <a> tag; its href attribute holds the movie URL
    links.append(element.get_attribute('href'))
links[:5]

output: ['https://www.rottentomatoes.com/m/it_happened_one_night',
 'https://www.rottentomatoes.com/m/citizen_kane',
 'https://www.rottentomatoes.com/m/the_wizard_of_oz_1939',
 'https://www.rottentomatoes.com/m/modern_times',
 'https://www.rottentomatoes.com/m/black_panther_2018']

Step-5: Create a Dataframe

To store all the data that we scrape, we need a data frame. So, we will initialise an empty data frame that has only column names. The columns are the pieces of movie information we are going to scrape.

Click on a few movie links and see what data is available. Skimming through a page will tell you what to extract.

From the top of each page we will scrape the movie name, Tomatometer score, and Audience score. The rest of the data comes from the information table below them.

We will need a data frame with this many columns. We will use the same names as the labels on the page to avoid confusion, which also makes it easier to append new data to the data frame.

dta = pd.DataFrame(columns=['Movie Name', 'Audience Score', 'Tomatometer Score', 'Rating', 'Genre',
                            'Original Language', 'Director', 'Producer', 'Writer', 'Release Date (Theaters)',
                           'Release Date (Streaming)', 'Box Office (Gross USA)', 'Runtime', 'Distributor',
                            'Sound Mix', 'Aspect Ratio', 'View the collection'])

Step-6: Web Scraping Data

Finally, we are here at the main event. In this section, we will see how to scrape the relevant data. First of all, open any of the movie pages and inspect it. Follow the same steps as before to reach the section of the page you want to scrape.

First, we will get the movie name, Audience score, and Tomatometer score.

We create a temporary dictionary object to store this data.

info_we = driver.find_elements(By.XPATH, '//score-board')
movie_info = {'Movie Name': '', 'Audience Score': '', 'Tomatometer Score': ''}
# The first line of the score board's text is used as the movie name
movie_info['Movie Name'] = info_we[0].text.split('\n')[0]
movie_info['Audience Score'] = info_we[0].get_attribute('audiencescore')
movie_info['Tomatometer Score'] = info_we[0].get_attribute('tomatometerscore')

Run each line individually to get a better grasp of what it returns.

Now, we will get the rest of the data. For this step, we will scrape both the item labels (Genre, Director, Writer, etc.) and the item values (Crime, director names, writer names, etc.). Repeat the same inspection process and look for a unique attribute on each kind of element we are after.

Inspecting the page, we see that labels and values carry different data-qa attributes, i.e., “movie-info-item-label” for labels and “movie-info-item-value” for values. We can use these to find the labels and values separately.

webelement_list_val = []
webelement_list_key = []
webelement_list_key = driver.find_elements(By.XPATH,'//div[@data-qa="movie-info-item-label"]')
webelement_list_val = driver.find_elements(By.XPATH,'//div[@data-qa="movie-info-item-value"]')

Now convert the data into a dictionary so that, later on, we can feed it to the data frame we created earlier.

key_list, val_list = [], []
for k, v in zip(webelement_list_key, webelement_list_val):
    key_list.append(k.text.strip(':'))  # drop the colon around the label text
    val_list.append(v.text)
info = dict(zip(key_list, val_list))  # converting lists to dictionary

We will merge this dictionary with the one we created earlier.

total_info = {**movie_info,**info}
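The pairing and merging steps can be tried out on plain strings, without a browser. This is a small sketch in which the label and value lists are hypothetical stand-ins for the .text of the scraped web elements. Note that zip stops at the shorter list, so a label without a matching value would be silently dropped.

```python
# Hypothetical label/value text, standing in for the .text of the web elements.
labels = ["Rating:", "Genre:", "Director:"]
values = ["PG", "Comedy, Romance", "Jane Doe"]

# Pair each label with its value, dropping the trailing colon from the label.
info = {label.strip(":"): value for label, value in zip(labels, values)}

# When dictionaries are unpacked with **, later ones win on duplicate keys.
movie_info = {"Movie Name": "Example Movie", "Rating": ""}
total_info = {**movie_info, **info}

print(total_info)
```

Because info is unpacked last, its "Rating" value replaces the empty placeholder from movie_info, while "Movie Name" keeps its position as the first key.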

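Append the final dictionary object to the data frame as a new row. With current pandas versions this is done with pd.concat, since DataFrame.append is no longer available.

dta = pd.concat([dta, pd.DataFrame([total_info])], ignore_index=True)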

We are done with the basic code. Now we will put it inside a loop to scrape the same elements from every movie page.

from tqdm import tqdm
for link in tqdm(links, desc='loading....'):
    driver.get(link)
    # Movie name and both scores come from the score-board element
    info_we = driver.find_elements(By.XPATH,'//score-board')
    movie_info = {'Movie Name':'','Audience Score':'','Tomatometer Score':''}
    movie_info['Movie Name'] = info_we[0].text.split('\n')[0]
    movie_info['Audience Score'] = info_we[0].get_attribute('audiencescore')
    movie_info['Tomatometer Score'] = info_we[0].get_attribute('tomatometerscore')
    # Labels and values of the remaining movie information
    webelement_list_key = driver.find_elements(By.XPATH,'//div[@data-qa="movie-info-item-label"]')
    webelement_list_val = driver.find_elements(By.XPATH,'//div[@data-qa="movie-info-item-value"]')
    key_list, val_list = [],[]
    for k,v in zip(webelement_list_key, webelement_list_val):
        key_list.append(k.text.strip(':'))
        val_list.append(v.text)
    info = dict(zip(key_list,val_list))
    total_info = {**movie_info,**info}
    # Add the merged record to the data frame as a new row
    dta = pd.concat([dta, pd.DataFrame([total_info])], ignore_index=True)
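The loop above grows the data frame one row at a time. As an alternative sketch, not the approach used in this article, the per-movie dictionaries could be collected in a plain list and turned into a data frame once at the end. The records and column names below are hypothetical and shortened for illustration.

```python
import pandas as pd

# Hypothetical records, standing in for the total_info dictionaries built per movie.
records = [
    {"Movie Name": "Example One", "Genre": "Comedy", "Runtime": "1h 45m"},
    {"Movie Name": "Example Two", "Genre": "Drama"},  # no Runtime scraped
]

columns = ["Movie Name", "Genre", "Runtime"]

# Build the frame once from the list of dicts; keys missing from a record become NaN.
frame = pd.DataFrame(records, columns=columns)

print(frame.shape)
print(frame["Runtime"].isna().tolist())
```

Either way, a movie page that lacks one of the fields simply ends up with a missing value in that column.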

It will take a while, depending on the machine you are using. Let’s look at what we finally got.

dta.head()

Looks fairly good. We finally scraped the data we needed and made a data set. Now let’s explore a bit more.

dta.info()

Everything here is of object type, even the Tomatometer and Audience scores, which is not ideal for analysis. But we can always change the data type.

Convert the Audience score and Tomatometer score from string to integer.

dta[['Tomatometer Score','Audience Score']]=dta[['Tomatometer Score','Audience Score']].astype(int)
print(dta[['Tomatometer Score','Audience Score']].dtypes)

output:
Tomatometer Score    int32
Audience Score       int32
dtype: object
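The astype(int) call assumes that every score was scraped successfully. If some page had no score, the conversion would raise an error. Here is a hedged sketch of a more forgiving conversion, assuming a missing score shows up as a blank string; the sample values are hypothetical.

```python
import pandas as pd

# Hypothetical scraped scores: strings, with one blank where no score was found.
scores = pd.Series(["98", "", "87"], name="Audience Score")

# errors="coerce" turns anything unparseable into NaN instead of raising.
numeric = pd.to_numeric(scores, errors="coerce")

# The nullable Int64 dtype keeps whole numbers while still allowing missing values.
as_int = numeric.astype("Int64")

print(as_int.isna().tolist())
print(as_int.dropna().tolist())
```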

Convert the streaming release date to the datetime type. From here on, we work on a copy of the data frame named df.

df = dta.copy()
df['Release Date (Streaming)'] = pd.to_datetime(df['Release Date (Streaming)'])
print(df['Release Date (Streaming)'].head())

output:
0   1999-12-28
1   2010-02-23
2   2003-08-12
3   2010-11-16
4   2018-05-02
Name: Release Date (Streaming), dtype: datetime64[ns]

To convert the theatrical release date column, we first need to remove the extra word at the end of each date.

def func(x):
    # Missing values are NaN (a float); only strings need cleaning
    if type(x) != float:
        li = x.split(' ')
        li = li[:-1]  # drop the trailing word
        time = ' '.join(li).replace(',','')
        return time
df['Release Date (Theaters)'] = df['Release Date (Theaters)'].apply(func)
df['Release Date (Theaters)'] = pd.to_datetime(df['Release Date (Theaters)'])
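To see what the cleaning step does to a single value, here is a small standalone sketch using only the standard library. The raw string is a hypothetical example of a date followed by one extra word; the actual text on the site may differ.

```python
from datetime import datetime

def strip_trailing_word(raw):
    """Drop the last space-separated token and any commas."""
    parts = raw.split(" ")[:-1]
    return " ".join(parts).replace(",", "")

# Hypothetical raw value; the actual text on the site may differ.
raw_value = "Feb 22, 1934 wide"

cleaned = strip_trailing_word(raw_value)
parsed = datetime.strptime(cleaned, "%b %d %Y")

print(cleaned)
print(parsed.date())
```

pd.to_datetime infers the format on its own, which is why the article's code does not need an explicit format string.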

We wish to keep only the rating itself in the Rating column, not the extra description, but the NaN values would break the string operation. So, we first replace the NaN values with the string 'None'.

df['Rating'] = df['Rating'].fillna('None')

Now, we will strip the extra description from the Rating column.

df['Rating'] = df['Rating'].apply(lambda a: a.split(' ')[0])

df.Rating.value_counts()
output:
R        37
None     24
PG-13    18
PG       15
G         6
Name: Rating, dtype: int64
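The two rating steps can also be chained with the pandas string accessor. This is a minimal sketch on hypothetical values, where None stands in for a movie whose rating was not scraped.

```python
import pandas as pd

# Hypothetical raw ratings; None stands in for a movie with no rating scraped.
ratings = pd.Series(["PG-13 (Some Violence)", None, "R (Language)", "G"])

# Fill the gaps first, then keep only the text before the first space.
cleaned = ratings.fillna("None").str.split(" ").str[0]
counts = cleaned.value_counts()

print(cleaned.tolist())
```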

Let’s see our final dataset

df.head()

The data is much better now. Save it as a CSV using the code below.

df.to_csv(r'D:\Data Sets\file.csv', index=False)

Conclusion to Web Scraping

So, this is it. Throughout this web scraping article, we covered a lot of ground, from the basics of Selenium to preparing a dataset. Here are the key takeaways:

We briefly learned about different Selenium locators (By.ID, By.CLASS_NAME, By.XPATH, etc.) to find and retrieve elements of a page.
We saw how to access websites using the Chrome driver.
We saw how to inspect the HTML of a web page to find the elements we need.
We learned to retrieve the required items using the Selenium locator By.XPATH.
We created a data frame and stored the retrieved data in a readable format.
We cleaned the dataset we had just created to enhance its usability and finally saved the data in CSV format for future use.

By now, you must have realized how handy these scrapers can be. With them, you can scrape web pages, create your own dataset, and perform analysis on it.

I hope you liked the article.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Sunil Kumar

Meet your author, Sunil Kumar Dash, a developer and a writer. He has diverse interests in tech, pop culture, wellness, philosophy, and anime. Exploring underrated music is his hobby, and he loves to doomscroll Twitter when bored.
