Take the Power of Web Scraping in your Hands
The phrase “we have enough data” does not exist in data science parlance. I have never encountered anyone who willingly said no to collecting more data for their machine learning or deep learning project. And there are often situations when the data you have simply isn’t enough.
That’s when web scraping comes to the fore. It is a powerful skill that any analyst or data scientist should possess, and it will stand you in good stead in the industry (and when you’re sitting for interviews!).
There are a whole host of Python libraries available to perform web scraping. But how do you decide which one to choose for your particular project? Which Python library holds the most flexibility? I will aim to answer these questions here, through the lens of five popular Python libraries for web scraping that I feel every enthusiast should know about.
Python Libraries for Web Scraping
Web scraping is the process of extracting structured and unstructured data from the web with the help of programs and exporting it into a useful format. If you want to learn more about web scraping, here are a couple of resources to get you started:
- Hands-On Introduction to Web Scraping in Python: A Powerful Way to Extract Data for your Data Science Project
- FREE Course – Introduction to Web Scraping using Python
Alright – let’s see the web scraping libraries in Python!
1. Requests (HTTP for Humans) Library for Web Scraping
Let’s start with the most basic Python library for web scraping. ‘Requests’ lets us make HTTP requests to a website’s server to retrieve the data on its pages. Getting the HTML content of a web page is the first and foremost step of web scraping.
Requests is a Python library used for making various types of HTTP requests like GET, POST, etc. Because of its simplicity and ease of use, it comes with the motto of HTTP for Humans.
I would say this is the most basic yet essential library for web scraping. However, the Requests library does not parse the HTML data it retrieves. If we want to do that, we need libraries like lxml and Beautiful Soup (we’ll cover them further down in this article).
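To make this concrete, here is a minimal sketch of fetching a page's HTML with Requests. The helper name `fetch_html`, the 10-second timeout, and the optional `session` parameter are my own illustrative choices rather than anything Requests prescribes.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the raw HTML of `url` as text (hypothetical helper)."""
    # Use the caller's session if one is given, else the module-level API.
    http = session if session is not None else requests
    response = http.get(url, timeout=timeout)
    # Turn 4xx/5xx responses into exceptions instead of silently
    # returning an error page.
    response.raise_for_status()
    return response.text
```

Calling `fetch_html("https://example.com")` should hand back the page source as one big string. Notice that nothing here understands the HTML itself; that part is left to a parser.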
Let’s take a look at the advantages and disadvantages of the Requests Python library.
Advantages:
- Basic/Digest Authentication
- International Domains and URLs
- Chunked Requests
- HTTP(S) Proxy Support
Disadvantages:
- Retrieves only static content of a page
- Can’t be used for parsing HTML
2. lxml Library for Web Scraping
We know the Requests library cannot parse the HTML retrieved from a web page. For that, we need lxml, a high-performance, blazingly fast, production-quality HTML and XML parsing library for Python.
It combines the speed and power of element trees with the simplicity of Python, and it works well when we’re aiming to scrape large datasets. lxml also lets you extract data from HTML using XPath and CSS selectors (the latter through the separate cssselect package). The combination of Requests and lxml is very common in web scraping.
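Here is a small sketch of what XPath extraction with lxml can look like. The HTML snippet, the `extract_books` helper, and the field names are all made up for illustration; in a real project the page source would come from Requests.

```python
from lxml import html

# Made-up page source standing in for HTML fetched with Requests.
SAMPLE_HTML = """<html><body>
<ul id="books">
  <li class="book"><a href="/books/1">Python Basics</a><span class="price">10.50</span></li>
  <li class="book"><a href="/books/2">Web Scraping 101</a><span class="price">24.00</span></li>
</ul>
</body></html>"""


def extract_books(page_source):
    """Return one dict per <li class="book"> found in the page."""
    tree = html.fromstring(page_source)
    books = []
    for item in tree.xpath('//li[@class="book"]'):
        books.append({
            # XPath text() and @attr queries return lists, so take the first hit.
            "title": item.xpath("./a/text()")[0],
            "link": item.xpath("./a/@href")[0],
            "price": float(item.xpath('./span[@class="price"]/text()')[0]),
        })
    return books
```

Running `extract_books(SAMPLE_HTML)` should give a list of two dictionaries, one per book. The relative paths (`./a/...`) keep each lookup scoped to its own list item.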
Let’s take a look at the advantages and disadvantages of the lxml Python library.
Advantages:
- Faster than most of the parsers out there
- Uses element trees
- Pythonic API
Disadvantages:
- Does not work well with poorly designed HTML
- The official documentation is not very beginner-friendly
3. Beautiful Soup Library for Web Scraping
Beautiful Soup is perhaps the most widely used Python library for web scraping. It creates a parse tree for parsing HTML and XML documents. Beautiful Soup automatically converts incoming documents to Unicode and outgoing documents to UTF-8.
One of the primary reasons the Beautiful Soup library is so popular is that it is easy to work with and well suited for beginners. We can also combine Beautiful Soup with other parsers like lxml. But all this ease of use comes at a cost – it is slower than lxml. Even when it uses lxml as its parser, it is slower than pure lxml.
One major advantage of the Beautiful Soup library is that it works very well with poorly designed HTML and has a lot of functions. The combination of Beautiful Soup and Requests is quite common in the industry.
Advantages:
- Requires a few lines of code
- Great documentation
- Easy to learn for beginners
- Automatic encoding detection
Disadvantages:
- Slower than lxml
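To show how little code Beautiful Soup needs, and how forgiving it is with messy markup, here is a minimal sketch. The broken HTML string and the `extract_links` helper are invented for this example, and I am using Python's built-in `html.parser` so that nothing beyond Beautiful Soup itself is assumed to be installed.

```python
from bs4 import BeautifulSoup

# Deliberately sloppy HTML: <h1>, <li>, <ul>, <body> and <html> are never closed.
MESSY_HTML = (
    "<html><body><h1>Top Libraries"
    "<ul><li><a href='/requests'>Requests</a>"
    "<li><a href='/lxml'>lxml</a></ul>"
)


def extract_links(markup):
    """Return (link text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(markup, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]
```

Even with the unclosed tags, `extract_links(MESSY_HTML)` should still find both links. Swapping `"html.parser"` for `"lxml"` is how you would plug in lxml as the parser, provided it is installed.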
If you want to learn how to scrape web pages using Beautiful Soup, this tutorial is for you:
4. Selenium Library for Web Scraping
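The libraries we have covered so far only see the HTML that the server sends back, so they cannot easily scrape pages whose content is populated dynamically by JavaScript. That’s where Selenium comes into play.
Selenium is a browser automation tool, usable from Python, that was originally made for automated testing of web applications. Although it wasn’t made for web scraping originally, the data science community turned that around pretty quickly!
If time and speed are not a concern for you, then you can definitely use Selenium.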
Advantages:
- Automated web scraping
- Can scrape dynamically populated web pages
- Automates web browsers
- Can do anything on a web page similar to a person
Disadvantages:
- Very slow
- Difficult to set up
- High CPU and memory usage
- Not ideal for large projects
Here is a wonderful article to learn how Selenium works (including Python code):
5. Scrapy Library for Web Scraping
Now it’s time to introduce you to the BOSS of Python web scraping libraries – Scrapy!
Scrapy is not just a library; it is an entire web scraping framework created by the co-founders of Scrapinghub – Pablo Hoffman and Shane Evans. It is a full-fledged web scraping solution that does all the heavy lifting for you.
Scrapy provides spider bots that can crawl multiple websites and extract the data. With Scrapy, you can create your own spiders and then host them on Scrapinghub or expose them as an API. It allows you to create fully functional spiders in a matter of minutes. You can also create item pipelines to process the scraped data.
The best thing about Scrapy is that it’s asynchronous. It can make multiple HTTP requests simultaneously. This saves us a lot of time and increases our efficiency (and don’t we all strive for that?).
Advantages:
- Excellent documentation
- Various plugins
- Create custom pipelines and middlewares
- Low CPU and memory usage
- Well designed architecture
- A plethora of available online resources
Disadvantages:
- Steep learning curve
- Overkill for easy jobs
- Not beginner-friendly
If you want to learn Scrapy, which I highly recommend you do, you should read this tutorial:
I personally find these Python libraries extremely useful for my requirements. I would love to hear your thoughts on these libraries or if you use any other Python library – let me know in the comment section below.
If you liked the article, do share it along in your network and keep practicing these techniques!