Quick Web Scraping using Gazpacho

Rahul Shah 19 Apr, 2023

This article was published as a part of the Data Science Blogathon.

Web scraping is a fundamental way of getting data from the web. It automates the extraction of data from a web page, which is quicker and less tedious than conventional copy-pasting. Because the work is done in a programming language, structuring and preprocessing the scraped data is also easy. While scraping is easy to perform, it comes with ethical obligations. One should only scrape a website if it allows it, which can be checked in the website’s robots.txt file. Even after fetching the data, it must not be used for commercial purposes without the owner’s consent.
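As a minimal sketch of the robots.txt check, Python’s standard urllib.robotparser module can evaluate a set of rules against a URL. The rules, the user agent name "my-scraper", and the example.com URLs below are made up for illustration; a real check would load the target site’s own robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Both URLs are placeholders, not real pages.
allowed = is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/test-sites/laptops")
blocked = is_allowed(ROBOTS_TXT, "my-scraper", "https://example.com/private/data")
print(allowed, blocked)
```

For a live site, the same parser can instead be pointed at the site’s robots.txt with set_url() followed by read().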

In Python, web scraping is done predominantly with two libraries, requests and BeautifulSoup. Gazpacho comes in handy when we want to fetch data quickly with less setup. In this article, we will learn how to scrape data using a single yet powerful library named gazpacho.

About Gazpacho

According to gazpacho’s documentation, gazpacho is a simple, fast, and modern web scraping library. It probably got its name from the Spanish dish. Gazpacho combines the core capabilities of the requests and BeautifulSoup libraries, and their common operations can be performed by importing just a few names from it.

One can install the gazpacho library using the pip package manager:

pip install gazpacho

Although gazpacho offers several other methods, we will use only a few important ones. Refer to gazpacho’s documentation on PyPI to read about the rest.

In this article, we will scrape a dummy laptop listing page hosted on webscraper.io.

How Does Gazpacho Work?

To understand how gazpacho works, we will perform a basic set of operations on the webpage mentioned above.

Let’s start by retrieving the webpage’s HTML. Conventionally, we perform this operation using the .get() method of the requests library. With gazpacho, we import the get function from gazpacho itself.

from gazpacho import get

Now we will store the URL in a variable named URL.

URL = 'https://webscraper.io/test-sites/e-commerce/static/computers/laptops'

Next, we will retrieve the HTML using the get() function and store it in another variable. Unlike requests, gazpacho’s get() returns the page content directly as a string.

html = get(URL)

We will parse the retrieved HTML using the Soup class of gazpacho. By contrast, the conventional approach performs this step with the BeautifulSoup library, which requires installing and importing a second package.

from gazpacho import Soup

Let’s parse the HTML so that the retrieved data can be searched.

soup = Soup(html)

Let’s find a few laptop descriptions using the .find() method of the Soup object.

soup.find('p', {'class':'description'})

When several elements match, as they do here, this gives a list of all the ‘p’ elements with the class ‘description’. The first argument is the HTML tag we want to retrieve, passed as a string. Here, we want the ‘p’ tag. The second argument is a dictionary of attributes to match. Here, we want the class ‘description’.

If we check the type of one of the items in this list, it is gazpacho.soup.Soup.

To get the text from the gazpacho Soup object, we have to use the .text attribute.

soup.find('p', {'class':'description'})[0].text

We can also find elements when we don’t know the exact name of the class. This is done using the ‘partial’ argument of the .find() method.

For example, let’s find the laptop descriptions, which are in the ‘p’ HTML tag with the class ‘description’. Suppose we don’t know the exact name of the class; we can pass part of the class name and set the partial keyword to True.

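soup.find('p', {'class':'desc'}, partial=False)

With partial set to False, this looks for an exact match of the class name ‘desc’ on the ‘p’ HTML tag. Since there is no such class, it retrieves nothing. Note that in gazpacho 1.0 and later, partial defaults to True, so the exact-match behaviour has to be requested explicitly.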

We will retrieve the elements using the partial class name, setting the partial argument to True.

soup.find('p', {'class':'desc'}, partial=True)

This retrieves a list of all the ‘p’ elements whose class name contains ‘desc’.

Comparison of Gazpacho with requests and BeautifulSoup

To get the same results with requests and BeautifulSoup, we first need to import both libraries:

import requests
from bs4 import BeautifulSoup

Now, let’s retrieve the webpage’s HTML using requests first.

html = requests.get(URL).text

We have added the .text attribute to get the HTML as a string from the response object returned by requests.get().

To parse the HTML, we will use BeautifulSoup to create the soup object.

soup = BeautifulSoup(html, 'html.parser')

Here, in the BeautifulSoup object, we added the HTML data in the first argument and ‘html.parser’ as the second argument to specify the type of parser we want.

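Now, let’s find elements in the soup object. To get the paragraphs for the first laptop, we will use

soup.find('div', class_='caption')('p')

This finds the first ‘div’ tag with the class ‘caption’ and retrieves a list of all the ‘p’ elements inside it. Calling a BeautifulSoup tag like a function is shorthand for its .find_all() method.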

To get the first element, index into the list with [0], and use the .text attribute at the end.

soup.find('div', class_='caption')('p')[0].text
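The BeautifulSoup steps can likewise be tried without a network call. This is a sketch on invented HTML, assuming the bs4 package is installed; it is not the real markup of the laptop page.

```python
from bs4 import BeautifulSoup

# Invented stand-in for the laptop listing page.
HTML = """
<div class="caption">
  <h4><a class="title" href="/laptop/1">Laptop A</a></h4>
  <p class="description">15.6 inch, 8GB RAM</p>
</div>
<div class="caption">
  <h4><a class="title" href="/laptop/2">Laptop B</a></h4>
  <p class="description">14 inch, 16GB RAM</p>
</div>
"""

soup = BeautifulSoup(HTML, "html.parser")

# .find() returns only the first matching 'div'.
first_caption = soup.find("div", class_="caption")

# Calling a tag is shorthand for .find_all(), so this is a list of 'p' tags.
paragraphs = first_caption("p")
first_text = paragraphs[0].text

# To cover every laptop, search the whole document instead.
all_texts = [p.text for p in soup.find_all("p", class_="description")]
print(first_text, all_texts)
```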

Conclusion

In this article, we learned about quick and easy web scraping using gazpacho. Its major advantage over combining requests and BeautifulSoup is that all the tasks can be done with a single library import. As noted earlier, if you don’t have the necessary permissions, scrape for educational purposes only. Also, don’t forget to check a website’s ‘robots.txt’ file for its permissions. This article was inspired by the website Calm Code. For troubleshooting and ideas, see the official gazpacho GitHub repository. One can also try retrieving nested HTML tags (one tag inside another) to get the required information.