This article was published as a part of the Data Science Blogathon.
Web scraping is a fundamental way of getting data from the web. It automates the extraction of data from a web page, which is quicker and less error-prone than manual copy-pasting. Because the work is done in a programming language, structuring and preprocessing the data is straightforward. While scraping is easy to perform, it raises ethical questions. One should only scrape a website that allows it, which one can find out by checking the website's robots.txt file. Even after fetching the data, it must not be used for commercial purposes without the owner's consent.
In Python, web scraping is done predominantly with two libraries, requests and BeautifulSoup. Gazpacho comes in handy when we want to fetch data quickly. In this article, we will learn how to scrape data using a single yet powerful library named gazpacho.
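As a rough illustration of the robots.txt check, here is a minimal sketch using Python's standard urllib.robotparser module. The rules and the example.com URLs are hypothetical and are parsed from a string so that the snippet runs offline; they are not taken from any real website.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, kept in a string so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical URLs: one outside and one inside the disallowed path.
allowed = is_allowed(ROBOTS_TXT, "https://example.com/products/laptops")
blocked = is_allowed(ROBOTS_TXT, "https://example.com/private/report")
print(allowed, blocked)
```

For a live website, one would typically point the parser at the site's robots.txt with set_url() and call read() instead of parsing a string.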
1. About gazpacho
2. How gazpacho works
3. Comparison of gazpacho with requests and BeautifulSoup
4. Conclusions
According to gazpacho's documentation, gazpacho is a simple, fast, and modern web scraping library. It probably got its name from the Spanish dish. Gazpacho covers the core capabilities of the requests and BeautifulSoup libraries used in this article, and they become available by importing just a few names from it.
One can install the gazpacho library using the pip package manager:
pip install gazpacho
Although gazpacho offers several methods, we will use only a few important ones. Refer to gazpacho's documentation on PyPI to read about the others.
In this article, we will be scraping a dummy laptop listing page from webscraper.io.
To understand how gazpacho works, we will perform a basic set of operations on the webpage specified above.
Let's start by retrieving the webpage's HTML data. Conventionally, we perform this operation using the .get() method of the requests library. To perform the same operation with gazpacho, we import get from gazpacho.
from gazpacho import get
Now we will store the URL in a variable named URL.
URL = 'https://webscraper.io/test-sites/e-commerce/static/computers/laptops'
Next, we will retrieve the HTML data using the get() function and store it in another variable.
html = get(URL)
We will parse the retrieved HTML data using the Soup class of gazpacho. By contrast, the conventional approach performs this task with the BeautifulSoup library, which needs a separate import.
from gazpacho import Soup
Let's parse the HTML data so that the retrieved page can be searched.
soup = Soup(html)
Let's find a few laptop descriptions using the .find() method of the Soup object.
soup.find('p', {'class':'description'})
This gives a list containing all the 'p' elements with the HTML class 'description'. The first argument is the HTML tag we want to retrieve, passed as a string; here, the 'p' tag. The second argument is a dictionary of the attributes to match; here, the class 'description'.
If we check the type of one of the items in the list retrieved above, it is gazpacho.soup.Soup.
To get the text from the gazpacho Soup object, we have to use the .text attribute.
soup.find('p', {'class':'description'})[0].text
We can also find elements when we don't know the exact name of the class. This is done using the partial argument of the .find() method.
For example, let's find the laptop descriptions, which are in the 'p' HTML tag with the class 'description'. Suppose we don't know the exact name of the class; we could write part of the class name for that tag and set the partial keyword to True.
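soup.find('p', {'class':'desc'}, partial = False)
With partial set to False, this looks for an exact match of the class name 'desc' on the 'p' HTML tag. Since no such class is present, it retrieves nothing. Note that recent gazpacho releases appear to match partially by default, so an exact match has to be requested explicitly.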
We will retrieve the elements using the partial class name, setting the partial argument to True.
soup.find('p', {'class':'desc'}, partial = True)
This retrieves a list of all the 'p' elements whose class name contains 'desc'.
To get the same results with requests and BeautifulSoup, we first need to import both.
import requests
from bs4 import BeautifulSoup
Now, let's retrieve the webpage's HTML data using requests first.
html = requests.get(URL).text
We have added the .text attribute to get the HTML as a string from the response object that requests.get() returns.
To parse the HTML data, we will use BeautifulSoup to create the soup object.
soup = BeautifulSoup(html, 'html.parser')
Here, we passed the HTML data to BeautifulSoup as the first argument and 'html.parser' as the second argument to specify the type of parser we want.
Now, let's find elements from the soup object. To get the first laptop description, we will use
soup.find('div', class_='caption')('p')
This finds the first 'div' tag with the class 'caption' and retrieves the list of all the 'p' tags inside it.
To get the first element, index into the list and use the .text attribute at the end.
soup.find('div', class_='caption')('p')[0].text
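The BeautifulSoup steps can also be checked offline. The sketch below uses a hypothetical HTML string that only mimics the product cards. It relies on the fact that calling a BeautifulSoup tag like a function is shorthand for find_all(), and adds find_all() on the whole document for comparison.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics two product cards (not the real page).
HTML = (
    '<div class="caption"><p class="description">14-inch, 8GB RAM</p></div>'
    '<div class="caption"><p class="description">15-inch, 16GB RAM</p></div>'
)

soup = BeautifulSoup(HTML, "html.parser")

# find() returns only the first matching div; calling that tag with ('p')
# is shorthand for .find_all('p') inside it.
first_card_paragraphs = soup.find("div", class_="caption")("p")
first_text = first_card_paragraphs[0].text

# find_all() on the whole document collects every description instead.
all_texts = [p.text for p in soup.find_all("p", class_="description")]
print(first_text)
print(all_texts)
```

The chained form only ever looks inside the first 'caption' div, so find_all() is the closer counterpart to the list that gazpacho's find() returned earlier.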
In this article, we learned about easy and quick web scraping using gazpacho. Its major advantage over the combination of requests and BeautifulSoup is that all the tasks can be done with a single library. As noted earlier, if you don't have the necessary permissions, scrape for educational purposes only. Also, don't forget to check the website's robots.txt file for its permissions. This article was inspired by the website Calm Code. For troubleshooting and ideas, learn more about gazpacho from its official GitHub repository. One can also try retrieving nested HTML tags (one tag inside another) to get the required information.