Web Scraping with Python For Your Data Science project!

Sonia Singla 24 Jun, 2021

6 min read

This article was published as a part of the Data Science Blogathon

Web Scraping with Python

It is the path toward get-together information from the Internet. In fact, even copy sticking the sections of your primary tune is a kind of web scratching! Regardless, the words “web scratching” by and large imply a connection that incorporates computerization. A couple of destinations could do without it when customized scrubbers collect their data, while others would not worry.

Expecting you are scratching a page deliberately for informative items, you are presumably not going to have any issues. Considering everything, it is a keen idea to do some assessment isolated and guarantee that you’re not ignoring any Terms of Service before you start a gigantic degree project. To get comfortable with the legal pieces of web scratching, take a gander at Legal Perspectives on Scraping Data from The Modern Web.

Why web scraping?

Right when we scratch the web, we make code that sends a request that is working with the page we decided. The specialist will return the source code — HTML, for the most part — for the page (or pages) we referenced.

As of not long ago, we are fundamentally doing moreover a web program does — sending interest with a specific URL and mentioning that the specialist returns the code for that page.

In any case, as opposed to a web program, our web scratching code won’t translate the page’s source code and show the page ostensibly. Taking everything into account, we will think about some custom code that channels through the page’s source code looking for express parts we have demonstrated and removing whatever substance we have taught it to eliminate.

For example, in case we expected to get the total of the data from inside a table that was appeared on a site page, our code would be formed to go through these methods in gathering:

1 Solicitation the substance (source code) of a specific URL from the laborer
2 Download the substance that is returned.
3 Distinguish the segments of the page that are fundamental for the table we need.
4 Concentrate and (if crucial) reformat those segments into a dataset we can take apart or use in the way we require.

If that all sounds particularly tangled, don’t pressure! Python and Beautiful Soup have natural features proposed to make this for the most part immediate.
One thing that is fundamental for note: from a specialist’s perspective, referencing a page through web scratching is comparable to stacking it in a web program. Exactly when we use code to introduce these requests, we might be “stacking” pages significantly faster than a standard customer, and in this manner quickly eating up the site owner’s laborer resources.

Web Scraping Data for Machine Learning:

In case you will scratch data for AI, promise you have checked the under concentrations before you approach the data extraction.

Data Format:

Simulated intelligence models can simply abrupt spike popularity for data that is in a plain or table-like association. Along these lines scratching unstructured data will, in this way, require greater freedom for taking care of the data before it might be used.

Data List:

Since the key goal is AI, when you have the locales or webpage pages that you mean to scratch, you need to make an overview of the data centers or data sources that you are wanting to scratch from each site page. On the off chance that the case is with the ultimate objective that a huge load of data centers is missing for each site page, then you ought to cut back and pick data centers that are generally present. The reason for this is that such numerous NA or void characteristics will decrease the presentation and precision of the AI (ML) model that you train and test on the data.

Data Labeling:

Data checking can be a cerebral agony. In any case, if you can amass the required metadata while data scratching and store it as an alternate data point, it will benefit the accompanying stages in the data lifecycle.

Cleaning, Preparing, and Storing the Data

While this movement may look fundamental, it is habitually maybe the most tangled and dreary advances. This is a direct result of a clear clarification — not one connection fits all. Dependent upon the data that you have scratched, and where you have scratched it from. You will require express strategies to clean the data.

Most importantly, you should go through the data actually to grasp what degradations lie in the data sources. You can do this using a library like Pandas (available in Python). At the point when your assessment is done, you ought to create a substance to kill the deformities in data sources and normalize the data centers that are not as per the others. You would then perform huge checks to support whether the data centers have all the data in a singular data type. A fragment that should hold numbers can’t have a line of data. For example, one that should hold data in dd/mm/yyyy configuration can’t hold data in some other association. Other than these plan checks, missing characteristics, invalid characteristics, and whatever else that may break treatment of the data, ought to be perceived and fixed.

Why Python for Web Scraping?

Python is a well-known device for executing web scratching. Python programming language is additionally utilized for other valuable activities identified with network safety, entrance testing just as advanced measurable applications. Utilizing the base programming of Python, web scratching can be performed without utilizing some other outsider apparatus.

Python programming language is acquiring immense prevalence and the reasons that make Python a solid match for web scratching projects are as underneath −

Punctuation Simplicity

Python has the most straightforward construction when contrasted with other programming dialects. This element of Python makes the testing simpler and an engineer can zero in additional on programming.

Inbuilt Modules

Another justification for utilizing Python for web scratching is the inbuilt just as outside valuable libraries it has. We can perform numerous executions identified with web scratching by utilizing Python as the base for programming.

Open Source Programming Language

Python has tremendous help from the local area since it is an open-source programming language.

Wide scope of Applications

Python can be utilized for different programming assignments going from little shell contents to big business web applications.

Python Modules for Web Scraping

Web scratching is the way toward developing a specialist who can extricate, parse, download and coordinate valuable data from the web consequently. At the end of the day, rather than physically saving the information from sites, the web scratching programming will consequently load and concentrate information from different sites according to our prerequisite.

Request

It is a straightforward python web scratching library. It is an effective HTTP library utilized for getting to pages. With the assistance of Requests, we can get the crude HTML of site pages which would then be able to be parsed for recovering the information.

Beautiful Soup

Beautiful Soup is a Python library for hauling information out of HTML and XML records. It tends to be utilized with demands since it needs a piece of information (report or URL) to make a soup object as it can’t bring a site page without help from anyone else. You can utilize the accompanying Python content to assemble the title of the page and hyperlinks.

Code –

import urllib.request
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup 
wiki= "https://www.thestar.com.my/search/?q=HIV&qsort=oldest&qrec=10&qstockcode=&pgno=1"
html=urlopen(wiki)
bs= BeautifulSoup(html,'lxml')
bs
Beautiful Soup is used to extract the website page

(image source: Jupyter notebook)

Code-

from bs4 import BeautifulSoup 
base_url = 'https://www.thestar.com.my/search/?q=HIV&qsort=oldest&qrec=10&qstockcode=&pgno='
# Add 1 because Python range.
url_list = ["{}{}".format(base_url, str(page)) for page in range(1, 408)]
s=[]
for url in url_list:
    print (url)
    s.append(url)

Output-

The above code will list the number of web pages in the range of 1 to 407.

(image source: Jupyter notebook)

Code –

data = []
data1= []
import csv
from urllib.request import urlopen, HTTPError
from datetime import datetime, timedelta
for pg in s:
    # query the website and return the html to the variable 'page'
    page = urllib.request.urlopen(pg)
    try:
        search_response = urllib.request.urlopen(pg)
    except urllib.request.HTTPError:
        pass
    # parse the html using beautiful soap and store in variable `soup`
    soup = BeautifulSoup(page, 'html.parser')
    # Take out the <div> of name and get its value
    ls = [x.get_text(strip=True) for x in soup.find_all("h2", {"class": "f18"})]
    ls1=  [x.get_text(strip=True) for x in soup.find_all("span", {"class": "date"})]
    # save the data in tuple
    data.append((ls))
    data1.append(ls1)

Output-

The above code will take all the data with help of beautiful soup and save it in the tuple

(image source: Jupyter notebook)

Code

import pandas as p
df=p.DataFrame(f,columns=['Topic of article'])
df['Date']=f1

Output-

Pandas

Pandas is a quick, incredible, adaptable, and simple to utilize open-source information investigation and control instrument, based on top of the Python programming language.

The above code stores the value in the data frame.

(image source: Jupyter notebook)

I hope you all enjoy the code.

Small Introduction-

I, Sonia Singla have done MSc in Bioinformatics from the University of Leicester, U.K. I have also done few projects on data science from CSIR-CDRI. Currently is an advisory editorial board member at IJPBS.

Linkedin – https://www.linkedin.com/in/soniasinglabio/

The media shown in this article on Deploying Machine Learning Models leveraging CherryPy and Docker are not owned by Analytics Vidhya and are used at the Author’s discretion.