5 Popular Python Libraries to Perform Web Scraping

Abhishek Sharma 28 Apr, 2020

Take the Power of Web Scraping in your Hands

The phrase “we have enough data” does not exist in data science parlance. I have never encountered anyone who willingly said no to collecting more data for their machine learning or deep learning project. And there are often situations when the data you have simply isn’t enough.

That’s when the power of web scraping comes to the fore. It is a powerful technique that every analyst or data scientist should have in their toolkit, and it will hold you in good stead in the industry (and when you’re sitting for interviews!).

There are a whole host of Python libraries available to perform web scraping. But how do you decide which one to choose for your particular project? Which Python library holds the most flexibility? I will aim to answer these questions here, through the lens of five popular Python libraries for web scraping that I feel every enthusiast should know about.

Python Libraries for Web Scraping

Web scraping is the process of extracting structured and unstructured data from the web with the help of programs and exporting it into a useful format.

Alright – let’s see the web scraping libraries in Python!

1. Requests (HTTP for Humans) Library for Web Scraping

Let’s start with the most basic Python library for web scraping. ‘Requests’ lets us make HTTP requests to a website’s server to retrieve the data on its pages. Getting the HTML content of a web page is the first and foremost step of web scraping.

Requests is a Python library used for making various types of HTTP requests like GET, POST, etc. Because of its simplicity and ease of use, it comes with the motto of HTTP for Humans.

I would say this is the most basic yet essential library for web scraping. However, the Requests library does not parse the HTML data it retrieves. If we want to do that, we need libraries like lxml and Beautiful Soup (we’ll cover them further down in this article).
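Here is a minimal sketch of that first step, assuming the requests package is installed. The helper name fetch_html, the example.com URLs, and the User-Agent string are placeholders chosen for illustration.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as text (hypothetical helper)."""
    headers = {"User-Agent": "my-scraper/0.1"}  # placeholder identifier
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text


# Preparing a request without sending it shows how Requests encodes
# query parameters into the final URL; no network access is needed here.
prepared = requests.Request(
    "GET", "https://example.com/search", params={"q": "python"}
).prepare()
print(prepared.method, prepared.url)
```

Calling fetch_html("https://example.com") would return the raw, unparsed HTML of that page as a string, which is exactly what a parser such as lxml or Beautiful Soup takes as input.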

Let’s take a look at the advantages and disadvantages of the Requests Python library.

Advantages:

Simple
Basic/Digest Authentication
International Domains and URLs
Chunked Requests
HTTP(S) Proxy Support

Disadvantages:

Retrieves only static content of a page
Can’t be used for parsing HTML
Can’t handle websites made purely with JavaScript

2. lxml Library for Web Scraping

We know the Requests library cannot parse the HTML retrieved from a web page. For that, we need lxml, a fast, production-quality Python library for parsing HTML and XML.

lxml wraps the C libraries libxml2 and libxslt and exposes them through an ElementTree-style API, so it combines their speed and power with the simplicity of Python. It works well when we’re aiming to scrape large datasets, and the combination of Requests and lxml is very common in web scraping. lxml also lets you extract data from HTML using XPath and CSS selectors.
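The XPath extraction described above can be sketched as follows, assuming lxml is installed. The sample markup, class names, and variable names are made up for illustration; in a real scraper the HTML string would come from a Requests response.

```python
from lxml import html

# Stand-in for HTML downloaded with Requests (invented sample data).
SAMPLE_HTML = """
<html><body>
  <h1>Books</h1>
  <ul>
    <li class="book"><a href="/b/1">Dune</a> <span class="price">9.99</span></li>
    <li class="book"><a href="/b/2">Emma</a> <span class="price">4.50</span></li>
  </ul>
</body></html>
"""

# Parse the string into an element tree.
tree = html.fromstring(SAMPLE_HTML)

# XPath expressions select text nodes and attribute values directly.
titles = tree.xpath('//li[@class="book"]/a/text()')
links = tree.xpath('//li[@class="book"]/a/@href')
prices = [float(p) for p in tree.xpath('//span[@class="price"]/text()')]

for title, link, price in zip(titles, links, prices):
    print(title, link, price)
```

The sketch sticks to XPath because lxml’s CSS selector support relies on the separate cssselect package.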

Let’s take a look at the advantages and disadvantages of the lxml Python library.

Advantages:

Faster than most of the parsers out there
Light-weight
Uses element trees
Pythonic API

Disadvantages:

Does not work well with poorly designed HTML
The official documentation is not very beginner-friendly

3. Beautiful Soup Library for Web Scraping

Beautiful Soup is perhaps the most widely used Python library for web scraping. It creates a parse tree for parsing HTML and XML documents. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.

One of the primary reasons the Beautiful Soup library is so popular is that it is easier to work with and well suited for beginners. We can also combine Beautiful Soup with other parsers like lxml. But all this ease of use comes with a cost – it is slower than lxml. Even while using lxml as a parser, it is slower than pure lxml.

One major advantage of the Beautiful Soup library is that it works very well with poorly designed HTML and has a lot of functions. The combination of Beautiful Soup and Requests is quite common in the industry.
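As a small illustration of that tolerance for messy markup, here is a sketch that assumes the beautifulsoup4 package is installed and uses Python’s built-in html.parser. The markup and variable names are invented for the example.

```python
from bs4 import BeautifulSoup

# Deliberately sloppy markup: the <li>, <p> and <b> tags are never closed.
MESSY_HTML = """
<ul>
  <li><a href="/python">Python</a>
  <li><a href="/scrapy">Scrapy</a>
</ul>
<p>Last updated in <b>2020
"""

soup = BeautifulSoup(MESSY_HTML, "html.parser")

# Map each link's visible text to its href attribute.
links = {a.get_text(strip=True): a["href"] for a in soup.find_all("a")}

# The unclosed <b> tag is still found and its text is still readable.
bold_text = soup.find("b").get_text(strip=True)

print(links)
print(bold_text)
```

Passing "lxml" instead of "html.parser" as the second argument switches Beautiful Soup to the lxml parser, provided lxml is installed.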

Advantages:

Requires a few lines of code
Great documentation
Easy to learn for beginners
Robust
Automatic encoding detection

Disadvantages:

Slower than lxml

If you want to learn how to scrape web pages using Beautiful Soup, this tutorial is for you:

Beginner’s guide to Web Scraping in Python using Beautiful Soup

4. Selenium Library for Web Scraping

There is a limitation to all the Python libraries we have discussed so far – we cannot easily scrape data from dynamically populated websites. It happens because sometimes the data present on the page is loaded through JavaScript. In simple words, if the page is not static, then the Python libraries mentioned earlier struggle to scrape the data from it.

That’s where Selenium comes into play.

Selenium is a browser automation tool, available in Python through its official bindings, that was originally made for automated testing of web applications. Although it wasn’t made for web scraping originally, the data science community turned that around pretty quickly!

It drives a real web browser, and that is what makes it special. Where the other libraries cannot run JavaScript, Selenium excels: it can click on a page, fill in forms, scroll, and much more.

This ability to run JavaScript in a web page gives Selenium the power to scrape dynamically populated web pages. But there is a trade-off here. It loads and runs JavaScript for every page, which makes it slower and not suitable for large scale projects.

If time and speed are not a concern for you, then you can definitely use Selenium.

Advantages:

Beginner-friendly
Automated web scraping
Can scrape dynamically populated web pages
Automates web browsers
Can do anything on a web page similar to a person

Disadvantages:

Very slow
Difficult to set up
High CPU and memory usage
Not ideal for large projects

Here is a wonderful article to learn how Selenium works (including Python code):

Data Science Project: Scraping YouTube Data using Python and Selenium to Classify Videos

5. Scrapy

Now it’s time to introduce you to the BOSS of Python web scraping libraries – Scrapy!

Scrapy is not just a library; it is an entire web scraping framework created by the co-founders of Scrapinghub – Pablo Hoffman and Shane Evans. It is a full-fledged web scraping solution that does all the heavy lifting for you.

Scrapy provides spider bots that can crawl multiple websites and extract the data. With Scrapy, you can create your own spiders and host them on Scrapinghub or expose them as an API. It allows you to create fully functional spiders in a matter of minutes. You can also create pipelines using Scrapy.

The best thing about Scrapy is that it’s asynchronous. It can make multiple HTTP requests simultaneously. This saves us a lot of time and increases our efficiency (and don’t we all strive for that?).

You can also add plugins to Scrapy to enhance its functionality. Although Scrapy cannot handle JavaScript the way Selenium does, you can pair it with Splash, a lightweight web browser that renders JavaScript. With Splash, Scrapy can even extract data from dynamic websites.

Advantages:

Asynchronous
Excellent documentation
Various plugins
Create custom pipelines and middlewares
Low CPU and memory usage
Well designed architecture
A plethora of available online resources

Disadvantages:

Steep learning curve
Overkill for easy jobs
Not beginner-friendly

If you want to learn Scrapy, which I highly recommend you do, you should read this tutorial:

Web Scraping in Python using Scrapy (with multiple examples)

What’s Next?

I personally find these Python libraries extremely useful for my requirements. I would love to hear your thoughts on these libraries or if you use any other Python library – let me know in the comment section below.

If you liked the article, do share it with your network and keep practicing these techniques!

Abhishek Sharma is a data science aficionado who loves diving into data and generating insights from it. He is always ready to make machines learn through code and to write technical blogs. His areas of interest include machine learning and natural language processing, and he remains open to something new and exciting.
