5 Popular Python Libraries to Perform Web Scraping

Abhishek Sharma 28 Apr, 2020 • 6 min read

Take the Power of Web Scraping in your Hands

The phrase “we have enough data” does not exist in data science parlance. I have never encountered anyone who willingly said no to collecting more data for their machine learning or deep learning project. And there are often situations when the data you have simply isn’t enough.

That’s when the power of web scraping comes to the fore. It is a powerful skill that any analyst or data scientist should possess, and it will stand you in good stead in the industry (and when you’re sitting for interviews!).

There are a whole host of Python libraries available to perform web scraping. But how do you decide which one to choose for your particular project? Which Python library holds the most flexibility? I will aim to answer these questions here, through the lens of five popular Python libraries for web scraping that I feel every enthusiast should know about.

 

Python Libraries for Web Scraping

Web scraping is the process of extracting structured and unstructured data from the web with the help of programs and exporting it into a useful format.

Alright – let’s see the web scraping libraries in Python!

 

1. Requests (HTTP for Humans) Library for Web Scraping

Let’s start with the most basic Python library for web scraping. ‘Requests’ lets us make HTTP requests to a website’s server to retrieve the data on its pages. Getting the HTML content of a web page is the first and foremost step of web scraping.

Requests is a Python library used for making various types of HTTP requests like GET, POST, etc. Because of its simplicity and ease of use, it comes with the motto of HTTP for Humans.

I would say this is the most basic yet essential library for web scraping. However, the Requests library does not parse the HTML it retrieves. If we want to do that, we need libraries like lxml and Beautiful Soup (we’ll cover them further down in this article).

Let’s take a look at the advantages and disadvantages of the Requests Python library.

Advantages:

  • Simple
  • Basic/Digest Authentication
  • International Domains and URLs
  • Chunked Requests
  • HTTP(S) Proxy Support

Disadvantages:

  • Retrieves only static content of a page
  • Can’t be used for parsing HTML
  • Can’t handle websites made purely with JavaScript
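As a minimal sketch of that first step, the snippet below wraps requests.get in a small helper and, without touching the network, prepares a GET request to show how query parameters end up in the URL. The fetch_html helper, the example.com URLs and the User-Agent string are illustrative assumptions, not something taken from a real project.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the raw HTML of `url` as text, raising on HTTP errors."""
    response = requests.get(
        url,
        headers={"User-Agent": "example-scraper/0.1"},  # hypothetical UA string
        timeout=timeout,  # avoid hanging forever on a slow server
    )
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text  # HTML as a string; parsing is left to other libraries


# Preparing a request without sending it shows how the final URL is built.
prepared = requests.Request(
    "GET", "https://example.com/search", params={"q": "scraping"}
).prepare()
print(prepared.method, prepared.url)
```

Calling fetch_html("https://example.com") would return the page source as one string, which is exactly the input the parsing libraries in the next sections expect.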

 

2. lxml Library for Web Scraping

We know the Requests library cannot parse the HTML retrieved from a web page. For that we need lxml, a fast, production-quality Python library for parsing HTML and XML.

It combines the speed of the C libraries it wraps (libxml2 and libxslt) with a simple Python API based on element trees. It works well when we’re aiming to scrape large datasets. The combination of Requests and lxml is very common in web scraping. It also allows you to extract data from HTML using XPath and CSS selectors (the latter through the separate cssselect package).

Let’s take a look at the advantages and disadvantages of the lxml Python library.

Advantages:

  • Faster than most of the parsers out there
  • Light-weight
  • Uses element trees
  • Pythonic API

Disadvantages:

  • Does not work well with poorly designed HTML
  • The official documentation is not very beginner-friendly
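To make the XPath part concrete, here is a small sketch that parses an inline HTML string with lxml and pulls out titles, links and prices. The markup, class names and values are made up for illustration; on a real site the HTML would come from a library such as Requests.

```python
from lxml import html

# Hypothetical markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <ul id="books">
    <li class="book"><a href="/b/1">Dune</a><span class="price">9.99</span></li>
    <li class="book"><a href="/b/2">Emma</a><span class="price">4.50</span></li>
  </ul>
</body></html>
"""

tree = html.fromstring(SAMPLE_HTML)  # build an element tree from the string

# XPath expressions select text nodes and attribute values directly.
titles = tree.xpath('//li[@class="book"]/a/text()')
links = tree.xpath('//li[@class="book"]/a/@href')
prices = [float(p) for p in tree.xpath('//span[@class="price"]/text()')]

for title, link, price in zip(titles, links, prices):
    print(title, link, price)
```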

 

3. Beautiful Soup Library for Web Scraping

Beautiful Soup is perhaps the most widely used Python library for web scraping. It creates a parse tree for parsing HTML and XML documents. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.

One of the primary reasons the Beautiful Soup library is so popular is that it is easy to work with and well suited for beginners. We can also combine Beautiful Soup with other parsers such as lxml. But all this ease of use comes at a cost: it is slower than lxml. Even with lxml as its parser, Beautiful Soup is slower than using lxml directly.

One major advantage of the Beautiful Soup library is that it works very well with poorly designed HTML and has a lot of functions. The combination of Beautiful Soup and Requests is quite common in the industry.

Advantages:

  • Requires a few lines of code
  • Great documentation
  • Easy to learn for beginners
  • Robust
  • Automatic encoding detection

Disadvantages:

  • Slower than lxml
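The following sketch shows the typical Beautiful Soup workflow on an inline HTML snippet whose closing ul, body and html tags are missing. It uses the html.parser backend from the standard library; the markup and field names are invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; note the missing </ul>, </body> and </html> tags.
MESSY_HTML = """
<html><body>
  <ul id="books">
    <li class="book"><a href="/b/1">Dune</a> <span class="price">9.99</span></li>
    <li class="book"><a href="/b/2">Emma</a> <span class="price">4.50</span></li>
"""

soup = BeautifulSoup(MESSY_HTML, "html.parser")

rows = []
for item in soup.find_all("li", class_="book"):
    link = item.find("a")
    price = item.find("span", class_="price")
    rows.append(
        {
            "title": link.get_text(strip=True),
            "url": link["href"],  # attributes are read like dictionary keys
            "price": float(price.get_text(strip=True)),
        }
    )

for row in rows:
    print(row)
```

Passing "lxml" instead of "html.parser" as the second argument switches the underlying parser, provided lxml is installed.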


4. Selenium Library for Web Scraping

All the Python libraries we have discussed so far share a limitation: they cannot easily scrape data from dynamically populated websites. This happens because the data on such pages is loaded through JavaScript after the initial HTML arrives, so the HTML these libraries receive does not contain it. In simple words, if the page is not static, the libraries mentioned earlier struggle to scrape the data from it.

That’s where Selenium comes into play.

Selenium is a browser automation tool, available in Python through its bindings, that was originally made for automated testing of web applications. Although it wasn’t made for web scraping originally, the data science community turned that around pretty quickly!

It drives a real web browser, which renders pages just as it would for a human visitor, and that is what makes it special. Where the other libraries are not capable of running JavaScript, Selenium excels. It can click on a page, fill in forms, scroll the page and do many more things.

This ability to run JavaScript in a web page gives Selenium the power to scrape dynamically populated web pages. But there is a trade-off here. Because it loads a browser and runs JavaScript for every page, it is slower and poorly suited to large-scale projects.

If time and speed are not a concern for you, then you can definitely use Selenium.

Advantages:

  • Beginner-friendly
  • Automated web scraping
  • Can scrape dynamically populated web pages
  • Automates web browsers
  • Can do anything on a web page similar to a person

Disadvantages:

  • Very slow
  • Difficult to set up
  • High CPU and memory usage
  • Not ideal for large projects


5. Scrapy

Now it’s time to introduce you to the BOSS of Python web scraping libraries – Scrapy!

Scrapy is not just a library; it is an entire web scraping framework created by the co-founders of Scrapinghub – Pablo Hoffman and Shane Evans. It is a full-fledged web scraping solution that does all the heavy lifting for you.

Scrapy provides spider bots that can crawl multiple websites and extract the data. With Scrapy, you can create your own spiders and host them on Scrapinghub or expose them as an API. It allows you to create fully functional spiders in a matter of minutes. You can also create pipelines using Scrapy.

The best thing about Scrapy is that it’s asynchronous. It can have many HTTP requests in flight at once instead of waiting for each response before sending the next. This saves us a lot of time and increases our efficiency (and don’t we all strive for that?).

You can also add plugins to Scrapy to enhance its functionality. Although Scrapy cannot handle JavaScript the way Selenium does, you can pair it with Splash, a lightweight web browser. With Splash, Scrapy can even extract data from dynamic websites.

 

Advantages:

  • Asynchronous
  • Excellent documentation
  • Various plugins
  • Create custom pipelines and middlewares
  • Low CPU and memory usage
  • Well designed architecture
  • A plethora of available online resources

Disadvantages:

  • Steep learning curve
  • Overkill for easy jobs
  • Not beginner-friendly


What’s Next?

I personally find these Python libraries extremely useful for my requirements. I would love to hear your thoughts on these libraries or if you use any other Python library – let me know in the comment section below.

If you liked the article, do share it with your network and keep practicing these techniques!

Abhishek Sharma 28 Apr 2020

He is a data science aficionado who loves diving into data and generating insights from it. He is always ready to make machines learn through code and to write technical blogs. His areas of interest include machine learning and natural language processing, though he remains open to anything new and exciting.

Responses From Readers

Antonio Marcos 05 Aug, 2020

Great article, it helps beginners a lot! Congratulations!
