Hands-On Introduction to Web Scraping in Python: A Powerful Way to Extract Data for your Data Science Project

Last Updated : 07 Jan, 2024


Overview

Web scraping is a highly effective method to extract data from websites (depending on the website’s regulations)
Learn how to perform web scraping in Python using the popular BeautifulSoup library
We will cover different types of data that can be scraped, such as text and images

Introduction

The data we have is too little to build a machine learning model. We need more data!

If this sounds familiar, you’re not alone! It’s the eternal problem of wanting more data to train our machine learning models. We don’t get cleaned and ready-for-use Excel or .csv files in data science projects, right?

So how do we deal with the obstacle of the paucity of data?

One of the most effective and simple ways to do this is through web scraping. I have personally found web scraping a very helpful technique to gather data from multiple websites. Some websites these days also provide APIs for many different types of data you might want to use, such as Tweets or LinkedIn posts.

But there might be occasions when you need to collect data from a website that does not provide a specific API. This is where having the ability to perform web scraping comes in handy. As a data scientist, you can code a simple Python script and extract the data you’re looking for.

So in this article, we will learn the different components of web scraping and then dive straight into Python to see how to perform web scraping using the popular and highly effective BeautifulSoup library.

We have also created a free course for this article – Introduction to Web Scraping using Python. This structured format will help you learn better.

A note of caution here – web scraping is subject to a lot of guidelines and rules. Not every website allows the user to scrape content so there are certain legal restrictions at play. Always ensure you read the website’s terms and conditions on web scraping before you attempt to do it.

Overview
Introduction
3 Popular Tools and Libraries used for Web Scraping in Python
Components of Web Scraping
Scrape URLs and Email IDs from a Web Page
Scrape Images in Python
Scrape Data on Page Load
Conclusion
Frequently Asked Questions

3 Popular Tools and Libraries used for Web Scraping in Python

You’ll come across multiple libraries and frameworks in Python for web scraping. Here are three popular ones that do the task with efficiency and aplomb:

BeautifulSoup
- BeautifulSoup is an amazing parsing library in Python that enables web scraping of HTML and XML documents.
- BeautifulSoup automatically detects encodings and gracefully handles HTML documents, even those with special characters. We can navigate a parsed document and find what we need, which makes it quick and painless to extract data from web pages. In this article, we will learn in detail how to build web scrapers using BeautifulSoup.
Scrapy
- Scrapy is a Python framework for large-scale web scraping. It gives you all the tools you need to efficiently extract data from websites, process it as you want, and store it in your preferred structure and format.
Selenium
- Selenium is another popular tool, built for automating browsers. It’s primarily used for testing in the industry but is also very handy for web scraping, especially on pages that only render their content in a browser.

Components of Web Scraping

Web scraping is made up of three main components: crawling the page, parsing and transforming its content, and storing the extracted data.

Let’s understand these components in detail. We’ll do this by scraping hotel details like the name of the hotel and price per room from the goibibo website:

Note: Always check and follow the robots.txt file of the target website, which implements what is known as the robots exclusion protocol. It tells web robots which pages they should not crawl.
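Here is a minimal sketch of how that check could be automated with Python’s built-in urllib.robotparser module. The rules and URLs below are made up for illustration (they are not the real robots.txt of any site) and are inlined so the example runs without a network connection.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Disallow: /checkout/
"""


def can_scrape(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(can_scrape(SAMPLE_ROBOTS_TXT, "https://www.example.com/hotels/"))
print(can_scrape(SAMPLE_ROBOTS_TXT, "https://www.example.com/admin/users"))
```

For a live website, you would instead call set_url() with the address of the site’s robots.txt file, followed by read(), before calling can_fetch().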

So, it looks like we are allowed to scrape the data from our target URL. We are good to go and write the script for our web robot. Let’s begin!

Step 1: Crawl

The first step in web scraping is to navigate to the target website and download the source code of the web page. We are going to use the requests library to do this. A couple of other libraries that can make requests and download the source code are http.client and urllib.request (urllib2 in Python 2).
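As a rough sketch, the download step with requests might look like the function below. The function name, the User-Agent string, and the timeout value are assumptions made for this example, and the URL in the comment is a placeholder rather than the actual hotel listing page.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Download a web page and return its HTML source as text."""
    # A descriptive User-Agent is an assumption here, not a requirement of requests.
    headers = {"User-Agent": "my-scraper/0.1 (learning project)"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


# Example call (placeholder URL):
# html = fetch_html("https://www.example.com/hotels")
```

Setting a timeout and calling raise_for_status() keeps the script from hanging or silently parsing an error page.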

Once we have downloaded the source code of the web page, we need to filter out the content we are interested in.

Step 2: Parse and Transform

The next step in web scraping is to feed this source code to an HTML parser, and for that we will use the BeautifulSoup library. If you look at our target web page, you will notice that the details of each hotel sit on a separate card, as on most listing pages.

So the next step would be to filter this card data from the complete source code. Next, we will select the card and click on the ‘Inspect Element’ option to get the source code of that particular card. You will get something like this:

The class name of all the cards is the same, so we can get a list of those cards by passing the tag name together with its attributes, such as the class attribute and its value, to BeautifulSoup.
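The snippet below sketches this step on a small, made-up piece of HTML. The tag and class names (div, hotel-card, price) are hypothetical stand-ins; on a real page, you would use whatever names Inspect Element shows for the cards.

```python
from bs4 import BeautifulSoup

# Hypothetical card markup; the real class names come from Inspect Element.
SAMPLE_HTML = """
<div class="hotel-card"><p>Hotel Alpha</p><ul><li class="price">4500</li></ul></div>
<div class="hotel-card"><p>Hotel Beta</p><ul><li class="price">3200</li></ul></div>
<div class="banner"><p>Sign up for offers</p></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Pass the tag name plus the attributes that identify a card.
cards = soup.find_all("div", attrs={"class": "hotel-card"})
print(len(cards))
```

Only the two elements carrying the hotel-card class are returned; the banner is ignored.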

We have filtered the cards data from the complete source code of the web page and each card here contains the information about a separate hotel. Select only the Hotel Name, perform the Inspect Element step, and do the same with the Room Price:

Now, for each card, we have to find the Hotel Name, which can be extracted from the <p> tag alone because there is only one <p> tag in each card. The Room Price can be extracted from the <li> tag together with its class attribute and class name.
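Continuing with made-up markup, here is one way the per-card extraction could look. The helper name extract_hotel and the class names are assumptions for this sketch, not names taken from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical card markup; each card has one <p> and several <li> tags.
SAMPLE_HTML = """
<div class="hotel-card">
  <p>Hotel Alpha</p>
  <ul><li>Free Wi-Fi</li><li class="price">4500</li></ul>
</div>
<div class="hotel-card">
  <p>Hotel Beta</p>
  <ul><li>Breakfast</li><li class="price">3200</li></ul>
</div>
"""


def extract_hotel(card):
    """Pull the hotel name and room price out of a single card."""
    # There is only one <p> per card, so the tag name alone is enough.
    name = card.find("p").get_text(strip=True)
    # Several <li> tags exist, so the class attribute is needed to pick the price.
    price = card.find("li", attrs={"class": "price"}).get_text(strip=True)
    return {"name": name, "price": price}


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
cards = soup.find_all("div", attrs={"class": "hotel-card"})
hotels = [extract_hotel(card) for card in cards]
print(hotels)
```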

Step 3: Store the Data

The final step is to store the extracted data in a CSV file. Here, for each card, we will extract the Hotel Name and Price and store them in a Python dictionary. We will then append each dictionary to a list.

Next, let’s go ahead and transform this list to a Pandas data frame as it allows us to convert the data frame into CSV or JSON files:
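A small sketch of this storing step is shown here. The hotel names and prices are made-up values standing in for whatever the scraper collected, and the column names are an assumption.

```python
import pandas as pd

# Made-up values standing in for the scraped results.
scraped = [("Hotel Alpha", "4500"), ("Hotel Beta", "3200")]

hotels = []
for name, price in scraped:
    # One dictionary per card, appended to a list.
    hotels.append({"name": name, "price": price})

df = pd.DataFrame(hotels, columns=["name", "price"])

# With no path given, to_csv returns the CSV as a string;
# pass a file name such as "hotels.csv" to write it to disk instead.
csv_text = df.to_csv(index=False)
print(csv_text)
```

The same data frame can be exported as JSON with df.to_json(orient="records").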

Congrats! We have successfully created a basic web scraper. I want you to try out these steps and try to get more data like ratings and address of the hotel. Now let’s see how to perform some common tasks like scraping URLs, Email IDs, Images, and Scrape Data on Page Loads.

Scrape URLs and Email IDs from a Web Page

Two of the most common features we try to scrape using web scraping are website URLs and email IDs. I’m sure you’ve worked on projects or challenges where extracting email IDs in bulk was required (see marketing teams!). So let’s see how to scrape these aspects in Python.

Using the Console of the Web Browser

Let’s say we want to keep track of our Instagram followers and want to know the username of the person who unfollowed our account. First, log in to your Instagram account and click on followers to check the list:
- Scroll down all the way so that all the usernames are loaded in the background in our browser’s memory.
- Right-click on the browser’s window and click ‘Inspect Element’.
- In the Console window, type this command:

urls = $$('a'); for (url in urls) console.log(urls[url].href);

- With just one line of code, we can find all the URLs present on that particular page.
- Next, save this list at two different points in time, and a simple Python program will tell you the difference between the two. That way we can find the username of whoever unfollowed our account!
- There are many ways to use this hack to simplify our tasks. The main idea is that a single line of code gets us all the URLs in one go.
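The comparison between the two saved lists can be sketched with a set difference. The function name and the profile URLs below are invented for illustration.

```python
def unfollowed(earlier_urls, later_urls):
    """Return URLs present in the earlier snapshot but missing from the later one."""
    return sorted(set(earlier_urls) - set(later_urls))


# Made-up snapshots; in practice, paste the console output saved at two points in time.
snapshot_monday = [
    "https://www.instagram.com/alice/",
    "https://www.instagram.com/bob/",
    "https://www.instagram.com/carol/",
]
snapshot_friday = [
    "https://www.instagram.com/alice/",
    "https://www.instagram.com/carol/",
]

print(unfollowed(snapshot_monday, snapshot_friday))
```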

Using the Chrome Extension Email Extractor

Email Extractor is a Chrome plugin that captures the Email IDs present on the page that we are currently browsing.
It even allows us to download the list of Email IDs as a CSV or text file.

BeautifulSoup and Regex

The above solutions are efficient only when we want to scrape data from just one page. But what if we want the same steps to be done on multiple webpages?

There are many websites that can do that for us at some price. But here’s the good news: we can also write our own web scraper in Python, using BeautifulSoup to parse each page and a regular expression to pick out the email IDs.
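Here is a minimal sketch of that idea, run on a made-up HTML snippet so it works offline. The regular expression is deliberately simple and is an assumption of this example; it will not match every valid email address.

```python
import re

from bs4 import BeautifulSoup

# A deliberately simple pattern; it does not cover every valid address.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_links_and_emails(html):
    """Return (list of http(s) links, sorted list of unique email IDs) found in html."""
    soup = BeautifulSoup(html, "html.parser")
    # Keep only anchors that have an href pointing to a web address.
    links = [a["href"] for a in soup.find_all("a", href=True)
             if a["href"].startswith("http")]
    # Search the visible text of the page for email-like strings.
    emails = sorted(set(EMAIL_PATTERN.findall(soup.get_text(" "))))
    return links, emails


SAMPLE_HTML = """
<p>Contact sales@example.com or support@example.org.</p>
<a href="https://www.example.com/about">About</a>
<a href="mailto:sales@example.com">Email us</a>
"""

links, emails = extract_links_and_emails(SAMPLE_HTML)
print(links)
print(emails)
```

To cover multiple web pages, you would download each page in a loop and pass its HTML to the same function.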

Scrape Images in Python

In this section, we will scrape all the images from the same goibibo webpage. The first step is the same as before: navigate to the target website and download the source code. Next, we will find all the images using the <img> tag:

From all the image tags, select only the src part. Also, notice that the hotel images are available in jpg format. So we will select only those:
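These two steps can be sketched as follows on made-up markup. The image URLs are placeholders, not addresses from the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with a mix of image types and one tag without a src.
SAMPLE_HTML = """
<img src="https://cdn.example.com/hotels/alpha.jpg" alt="Hotel Alpha">
<img src="https://cdn.example.com/icons/logo.png" alt="Logo">
<img alt="Placeholder without src">
<img src="https://cdn.example.com/hotels/beta.jpg" alt="Hotel Beta">
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Find every <img> tag on the page.
image_tags = soup.find_all("img")

# Keep only the src values that end in .jpg; the default "" guards
# against <img> tags that have no src attribute at all.
jpg_urls = [tag.get("src") for tag in image_tags
            if tag.get("src", "").endswith(".jpg")]
print(jpg_urls)
```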

Now that we have a list of image URLs, all we have to do is request the image content and write it to a file. Make sure that you open the file in ‘wb’ (write binary) mode.
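A possible version of the download step is sketched below. The function name, the default folder name, and the way the file name is derived from the URL are all assumptions made for this example.

```python
import os

import requests


def download_image(url, folder="images", timeout=10.0):
    """Fetch one image and write its bytes into folder; return the saved path."""
    os.makedirs(folder, exist_ok=True)
    # Derive a file name from the URL, ignoring any query string.
    filename = os.path.basename(url.split("?")[0]) or "image.jpg"
    path = os.path.join(folder, filename)

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()

    # 'wb' opens the file for writing raw bytes, which is what response.content holds.
    with open(path, "wb") as image_file:
        image_file.write(response.content)
    return path


# Example call (placeholder URL):
# download_image("https://cdn.example.com/hotels/alpha.jpg", folder="hotel_images")
```

Passing a folder argument saves the images somewhere other than the current working directory.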

You can also update the initial page URL by page number and request them iteratively to gather data in a large amount.

Scrape Data on Page Load

Let’s have a look at the Steam Community reviews page for Grand Theft Auto V. You will notice that the complete content of the webpage does not get loaded in one go.

We need to scroll down to load more content on the web page (the age of endless scrolling!). This is an optimization technique called lazy loading, used by the developers of the website.

The problem for us is that when we try to scrape the data from this page, we only get a limited portion of its content.

Some websites also create a ‘Load More’ button instead of the endless scrolling idea. This will load more content only when you click that button. The problem of limited content still remains. So let’s see how to scrape these kinds of web pages.

Navigate to the target URL, open ‘Inspect Element’, and switch to the Network tab. Next, click on the reload button and it will record the network activity for you, such as the order of image loads, API requests, POST requests, and so on.

Clear the current records and scroll down. You will notice that as you scroll down, the webpage is sending requests for more data:

Scroll further and you will see the pattern in which the website is making requests. If you compare the request URLs, only some of the parameter values change from one request to the next, so you can easily generate these URLs with a few lines of Python.
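As a sketch, generating such URLs could look like this. The base URL and the parameter names (page, offset, per_page) are hypothetical; the real ones have to be copied from the requests shown in the Network tab.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; copy the real one from the Network tab.
BASE_URL = "https://www.example.com/reviews/render"


def build_page_urls(pages, page_size=10):
    """Build one URL per page, changing only the parameters that vary."""
    urls = []
    for page in range(1, pages + 1):
        params = {
            "page": page,
            "offset": (page - 1) * page_size,
            "per_page": page_size,
        }
        urls.append(f"{BASE_URL}?{urlencode(params)}")
    return urls


for url in build_page_urls(3):
    print(url)
```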

You need to follow the same steps to crawl and store the data by sending requests to each of the pages one by one.

Conclusion

This was a simple and beginner-friendly introduction to web scraping in Python using the powerful BeautifulSoup library. I’ve honestly found web scraping to be super helpful when I’m looking to work on a new project or need information for an existing one.

Note: If you want to learn this in a more structured format, we have a free course where we teach web scraping with BeautifulSoup. You can enroll here – Introduction to Web Scraping using Python

As I mentioned, there are other libraries as well which you can use for performing web scraping. I would love to hear your thoughts on which library you prefer (even if you use R!) and your experience with this topic. Let me know in the comments section below and we’ll connect!

Frequently Asked Questions

Q1. How do you do web scraping in Python step by step?

1. Install BeautifulSoup or Scrapy using pip.
2. Use the requests library to get the web page’s HTML.
3. Parse the HTML with BeautifulSoup for data extraction.
4. Write code to navigate the parsed document and collect the desired information.

Q2. What is the best library for web scraping with Python?

1. Beautiful Soup and Scrapy are both popular.
2. Beautiful Soup suits simple tasks.
3. Scrapy suits complex, structured projects.

Q3. What is the easiest web scraping library for Python?

1. BeautifulSoup is beginner-friendly.
2. It has a simple syntax for HTML parsing.

Q4. Why is an API better than web scraping?

1. APIs provide structured data access.
2. They are designed for developers and offer reliability.
3. Web scraping depends on a page’s HTML, which is prone to breakage when the page changes.
4. APIs are efficient, delivering specific data without parsing entire pages.


Dhruvit Patel

This is really good article. Thank you so much.

Emilia Jazz

All the data available on a website can be saved with merely the click of a button. Bots are involved in the scraping of data. While screen scraping is limited to copying whatever the pixels display on screen, bots have the ability to extract underlying HTML codes as well as the data stored in a database in the background.

Sukhendu Tarafder

Nice article .. thanks for sharing!! I have a query! How can we write those images in a particular folder instead of working directory.. could you please advise?

Raman Singh

Thanks for sharing an interesting article.
