Finding the Best Hotel Based on Reviews Using Web Scraping

Sonia Singla 14 Feb, 2023

Introduction

Suppose you shop online and buy a product, and afterwards the seller emails you to ask for a review. For example, after I bought a product on the Amazon website and received it, I got an email asking: "Do you have a moment? We'd love to know how everything worked out for you. Kindly review the item you recently bought from the website." The main reason for such requests is to learn about the quality of the product and the customer's needs.

The same happens with airline services or Google Maps reviews of recently visited places.

Here is a review given by a guest of the botanical garden in Birmingham.

As with the botanical garden reviews, we will extract hotel reviews by scraping a hotel's website and then run sentiment analysis on them.

Why hotel reviews? How do you find a recommended hotel or cafe? As a data scientist, how would you find the best hotel based on reviews?

Choosing a good hotel, cafe, or theater is always a big problem. People look for the best-recommended hotels, and reviews or comments matter most to both the owner and the customer. With the help of reviews, hotel staff and managers can improve quality, so it is a win-win situation for both.

Detailed data such as hotel reviews is collected by scraping websites.

Learning Objectives

Understand the purpose of hotel reviews.
Understand the tools used for web scraping.
Understand the differences among the various scraping tools.

This article was published as a part of the Data Science Blogathon.

What is Web Scraping?

Web scraping extracts data from websites. The data is later saved in CSV format for further analysis.

But why is it necessary to get a large amount of data from websites?

Web scraping helps a business move forward.

It can compare product prices by collecting data from online shopping websites. Businesses that use email as a marketing medium collect email addresses and send emails. It also collects data from Twitter and other social media websites such as Facebook. ParseHub is one free tool available.

There are various tools to scrape data:

1. Beautiful Soup

2. Scrapy

3. Selenium

Understanding the Difference Between Various Scraping Tools

1. Ease for new learners: For beginners, Beautiful Soup is the simplest of the three libraries to learn.

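from bs4 import BeautifulSoup
import urllib.request as req

url = "https://www.tripadvisor.co.uk"  # example page, the same site as in the Selenium snippet
request = req.Request(url)
res = req.urlopen(request)
soup = BeautifulSoup(res, 'html.parser')
title = soup.find("title").text
print(title)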

Selenium works differently. It uses a Chrome driver to control a browser and extract the contents.

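from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

url = "https://www.tripadvisor.co.uk"
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get(url)
# Read the 'text' property of the <title> element (driver.title is an alternative)
title = driver.find_element(By.TAG_NAME, "title").get_attribute('text')
print(title)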

In Scrapy, you define a spider class inside the project's file structure to get the desired results.

import scrapy

class TitleSpider(scrapy.Spider):
    name = 'title'
    start_urls = ['https://www.tripadvisor.co.uk']

    def parse(self, response):
        # Extract the text inside the <title> tag
        yield {'name': response.css('title::text').get()}

2. Speed: Scrapy is faster than Beautiful Soup and Selenium. It sends requests asynchronously, so many pages are downloaded concurrently instead of one by one.

3. Documentation: Beautiful Soup's documentation is the most beginner-friendly. Selenium and Scrapy have extensive documentation, but the technical jargon can overwhelm many novices.
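
To illustrate the speed point above, here is a minimal sketch that is not Scrapy itself. It simulates page downloads with asyncio.sleep and compares fetching pages one by one with fetching them concurrently. The function names, the example.com URLs, and the 0.1-second delay are all made up for illustration.

```python
import asyncio
import time

async def fake_fetch(url: str, delay: float = 0.1) -> str:
    """Pretend to download a page; the sleep stands in for network latency."""
    await asyncio.sleep(delay)
    return f"<html>{url}</html>"

async def fetch_sequentially(urls):
    # Wait for each page to finish before requesting the next one.
    return [await fake_fetch(u) for u in urls]

async def fetch_concurrently(urls):
    # Start every download at once and wait for all of them together.
    return await asyncio.gather(*(fake_fetch(u) for u in urls))

def timed(coro_func, urls):
    """Run one of the coroutines above and return (pages, elapsed seconds)."""
    start = time.perf_counter()
    pages = asyncio.run(coro_func(urls))
    return pages, time.perf_counter() - start

urls = [f"https://example.com/page/{n}" for n in range(1, 6)]  # placeholder URLs
seq_pages, seq_time = timed(fetch_sequentially, urls)
con_pages, con_time = timed(fetch_concurrently, urls)
print(f"sequential: {seq_time:.2f}s, concurrent: {con_time:.2f}s")
```

Both versions return the same pages, but the concurrent one waits for all simulated downloads at the same time, which is roughly why an asynchronous crawler spends less time idle.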

Beautiful Soup

Beautiful Soup is a Python library for extracting data from HTML and XML documents. It does not fetch pages itself, so it is paired with a library such as urllib that sends the request.

Scraping with Beautiful Soup follows three steps:

Suppose you want to meet a friend or colleague: you go to the house and ring the bell to ask permission to enter. A Hypertext Transfer Protocol (HTTP) request does the same thing; it knocks on the website to gain access and open the contents.

First, we connect to the website with an HTTP request.
Second, we use Beautiful Soup to parse the text.

The response contains HTML or XML. HTML consists of tags such as the headings h1 and h2 (h2 ranks just below h1), the division tag div, and span, which marks up a short piece of text.

The third step is to store the data locally.

After the extraction, the data is stored in a structured format such as CSV.
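
The parsing and storing steps above can be tried offline before touching a real website. The sketch below is only an illustration: the HTML string, the class names, and the reviewer names are invented, and the CSV is written to memory instead of a file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Hypothetical page content; a real page would come from an HTTP response.
html = """
<html><body>
  <h1>Hotel Example</h1>
  <div class="review"><span class="name">Asha</span><p>Clean rooms and friendly staff.</p></div>
  <div class="review"><span class="name">Tom</span><p>Noisy at night.</p></div>
</body></html>
"""

# Step 2: parse the markup and pull out the parts we care about.
soup = BeautifulSoup(html, "html.parser")
rows = [
    (
        div.find("span", class_="name").get_text(strip=True),
        div.find("p").get_text(strip=True),
    )
    for div in soup.find_all("div", class_="review")
]

# Step 3: store the result as CSV (in memory here; use open(...) for a file).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["Reviewer-Name", "Content"])
writer.writerows(rows)
print(buffer.getvalue())
```

Reading the name and the text from the same div keeps each reviewer paired with the right review.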

user_agent='Chrome/108.0.0.0'
headers={'User-Agent':user_agent,}
url = "https://www.booking.com/reviews/gb/hotel/ibis-edinburgh-centre-st-andrew-square.en-gb.html?aid=357028;label=bin859jc-1DEgdyZXZpZXdzKIICOOgHSAlYA2hQiAEBmAEJuAEXyAEM2AED6AEB-AECiAIBqAIDuALorsuNBsACAdICJGJmMjRmNDIwLTMyMTQtNDVjZS05MDNmLTlhY2NjOWM1MTQ0ZdgCBOACAQ;sid=f12de9e50dc512b87c50bcfe5e61d99b;customer_type=total;hp_nav=0;old_page=0;order=featuredreviews;page=4;r_lang=en;rows=75&"
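import urllib.request as req
from bs4 import BeautifulSoup

# Build the request with the custom User-Agent header, then send it
request = req.Request(url, None, headers)
res = req.urlopen(request)
# Parse the response with the lxml parser (requires the lxml package)
bs = BeautifulSoup(res, 'lxml')
bs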
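
Before a request is sent, the request object can be inspected without any network traffic, which is a quick way to check that the header is set as intended. This is a small sketch using only urllib; the example.com URL is a placeholder.

```python
import urllib.request as req

url = "https://example.com/reviews?page=1"  # placeholder URL
headers = {"User-Agent": "Chrome/108.0.0.0"}

request = req.Request(url, None, headers)

# Nothing has been sent yet; these calls only inspect the request object.
method = request.get_method()                 # "GET", because no body data was supplied
host = request.host                           # host part of the URL
agent = request.get_header("User-agent")      # urllib stores header names capitalized
print(method, host, agent)
```

Note that urllib normalizes the header name to "User-agent" internally, so that spelling is the one to use when reading it back.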

base_url = "https://www.booking.com/reviews/gb/hotel/ibis-edinburgh-centre-st-andrew-square.en-gb.html?aid=357028;label=bin859jc-1DEgdyZXZpZXdzKIICOOgHSAlYA2hQiAEBmAEJuAEXyAEM2AED6AEB-AECiAIBqAIDuALorsuNBsACAdICJGJmMjRmNDIwLTMyMTQtNDVjZS05MDNmLTlhY2NjOWM1MTQ0ZdgCBOACAQ;sid=f12de9e50dc512b87c50bcfe5e61d99b;customer_type=total;hp_nav=0;old_page=0;order=featuredreviews;page=1"
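# base_url ends with "page=1"; keep everything up to "page=" so each page number can be appended
page_prefix = base_url.rsplit("=", 1)[0] + "="
url_l = ["{}{};r_lang=en;rows=75&".format(page_prefix, page) for page in range(1, 25)]
s = []
for ul in url_l:
    print(ul)
    s.append(ul)
data = []
data1 = []
data2 = []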
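
As an alternative to string formatting, page URLs can be built with urllib.parse. This is a sketch under assumptions: the helper name page_urls and the example.com address are hypothetical, and it joins parameters with "&" while the Booking.com address above uses ";".

```python
from urllib.parse import urlencode

def page_urls(base, pages, rows=75, lang="en"):
    """Build one URL per page number from a base address and query values."""
    return [
        f"{base}?{urlencode({'page': n, 'r_lang': lang, 'rows': rows})}"
        for n in pages
    ]

urls = page_urls("https://example.com/reviews", range(1, 4))
for u in urls:
    print(u)
```

Keeping the page number as a separate parameter avoids accidentally gluing it onto a number that is already in the base address.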

for pg in s:
    try:
        # Fetch each page once, sending the same User-Agent header as before
        page = req.urlopen(req.Request(pg, None, headers))
    except req.HTTPError:
        # Skip pages that return an HTTP error
        continue

    soup = BeautifulSoup(page, 'html.parser')
    # Review text, stay date, and reviewer name for every review on the page
    ls2 = [x.get_text(strip=True) for x in soup.find_all("div", {"class": "review_item_review_content"})]
    ls3 = [x.get_text(strip=True) for x in soup.find_all("p", {"class": "review_staydate"})]
    ls4 = [x.get_text(strip=True) for x in soup.find_all("p", {"class": "reviewer_name"})]
    data.append(ls2)
    data1.append(ls3)
    data2.append(ls4)

import itertools
import pandas as p

# Flatten the per-page lists into one list per column
f = list(itertools.chain(*data))
f1 = list(itertools.chain(*data1))
f2 = list(itertools.chain(*data2))
df = p.DataFrame(f1, columns=['Date'])
df['Content'] = f
df['Reviewer-Name'] = f2

df.to_csv('HotelReviews.csv', index=False, header=True)
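
The DataFrame assignment above assumes the three lists have the same length; if a review is missing a date, for instance, pandas raises a ValueError. One possible workaround, sketched here with invented data and only the standard library, pads the shorter lists before writing the CSV. Padding avoids the error but cannot guarantee that rows stay aligned, so extracting all fields from each review block is the more robust route.

```python
import csv
import io
from itertools import zip_longest

# Hypothetical scraped lists of unequal length.
dates = ["Stayed in January 2023", "Stayed in December 2022"]
contents = ["Great location", "Small room", "Friendly staff"]
names = ["Asha", "Tom", "Lee"]

# zip_longest pads the shorter lists with an empty string instead of failing.
rows = list(zip_longest(dates, contents, names, fillvalue=""))

out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["Date", "Content", "Reviewer-Name"])
writer.writerows(rows)
print(out.getvalue())
```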

Scrapy

Scrapy is an open-source Python framework for building crawlers that extract data and store what they fetch.

1. pip install scrapy

2. scrapy startproject myfirstscrapy1

3. In the project's spiders directory, create a new Python file for the spider in a text editor such as Notepad.

We will, first of all, import the module scrapy.

import scrapy

We will then define a class CrawlingSpider, send the requests, and get the desired results.

class CrawlingSpider(scrapy.Spider):
    name = "crawling"
    def start_requests(self):
        ul = [
            'https://www.trivago.in/en-IN/lm/hotels-edinburgh-united-kingdom?search=101-2;101-5;101-53;101-6;101-9;200-20533',
            'https://www.trivago.in/en-IN/lm/hotels-edinburgh-united-kingdom?search=101-3;101-5;101-53;101-6;101-9;200-20533', 
        ]
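        for url in ul:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        page = response.url.split("/")[-2]
        filename = 'page-%s.html' % page
        # Save the downloaded page; the original stops after building the filename
        with open(filename, 'wb') as f:
            f.write(response.body)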

scrapy crawl crawling

Selenium

Selenium is an open-source tool for testing web applications. Jason Huggins created it in 2004. While testing applications, he realized that repeating the tests manually in the browser was not productive, so he wrote a JavaScript program to automate the browser. That JavaScript Test Runner was later named Selenium Core. Because of the browser's same-origin policy, Selenium Core had to be installed on the same domain as the application under test.

For example, a script loaded from google cannot access pages on yahoo or any other site, because it is tied to its own origin. As a result, both had to be installed on the same domain. WebDriver is the modern approach: it replaces the injected JavaScript and automates the browser directly.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait

from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
val = input("Enter a url: ")
#wait = WebDriverWait(driver, 10)
driver.get(val)
get_url = driver.current_url
if get_url == val:
    header=driver.find_element(By.TAG_NAME, 'div')
    
    print(header.text)

Sentiment Analysis

Sentiment analysis is the process of determining whether a piece of text is positive, neutral, or negative. It is the contextual mining of text that reveals how customers view a brand and lets the company know whether the product it produces will be in demand in the market. The focus is on emotions (happy, sad, angry, etc.) and polarity (positive, negative, and neutral).

It gives a company feedback to improve its services and helps the business grow.

Natural language processing (NLP) is used to determine whether text is positive, negative, or neutral. We can also search the data for emotions such as sadness and happiness.

import pandas as pd
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Download the VADER lexicon used by the analyzer
nltk.download('vader_lexicon')
sentiments = SentimentIntensityAnalyzer()
s = pd.read_csv("HotelReviews.csv")
print(s.head())

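import matplotlib.pyplot as plt

# Count how many reviews each reviewer name has and keep the top ten
reviews = s["Reviewer-Name"].value_counts()
numbers = reviews[:10].index
quantity = reviews[:10].values
custom_colors = ["skyblue", "yellowgreen", 'tomato', "blue", "red"]

# The original omits the plotting call; a pie chart is assumed here
plt.pie(quantity, labels=numbers, colors=custom_colors)
plt.title("Hotel Reviewers Name", fontsize=20)
plt.show()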

sentiments = SentimentIntensityAnalyzer()
# Score the review text (the "Content" column), not the reviewer names
s["Positive"] = [sentiments.polarity_scores(str(i))["pos"] for i in s["Content"]]
s["Negative"] = [sentiments.polarity_scores(str(i))["neg"] for i in s["Content"]]
s["Neutral"] = [sentiments.polarity_scores(str(i))["neu"] for i in s["Content"]]
print(s.head())

Positive = sum(s["Positive"])
Negative = sum(s["Negative"])
Neutral = sum(s["Neutral"])
def sent_score(a, b, c):
    # a, b, c are the summed positive, negative, and neutral scores
    if (a > b) and (a > c):
        print("Positive 😊 ")
    elif (b > a) and (b > c):
        print("Negative 😠 ")
    else:
        print("Neutral 🙂 ")
sent_score(Positive, Negative, Neutral)

Neutral 🙂
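
Summing the three proportions tends to favor the neutral score, because most words in a sentence carry no sentiment. A common alternative with VADER is to label each review by its compound score, using the thresholds of 0.05 and -0.05 recommended in the VADER documentation. The sketch below assumes that approach; the helper name and the sample score dictionaries are hypothetical stand-ins for what polarity_scores() returns.

```python
from collections import Counter

def label_from_compound(compound: float, threshold: float = 0.05) -> str:
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "Positive"
    if compound <= -threshold:
        return "Negative"
    return "Neutral"

# Hypothetical dictionaries in the shape returned by polarity_scores().
sample_scores = [
    {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.8},
    {"neg": 0.5, "neu": 0.5, "pos": 0.0, "compound": -0.6},
    {"neg": 0.0, "neu": 1.0, "pos": 0.0, "compound": 0.0},
    {"neg": 0.0, "neu": 0.7, "pos": 0.3, "compound": 0.4},
]

labels = [label_from_compound(scores["compound"]) for scores in sample_scores]
counts = Counter(labels)
print(labels)
print(counts.most_common(1)[0][0])  # the most frequent label across the reviews
```

Counting labels per review gives each review one vote, so a few long neutral sentences do not dominate the overall result.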

Conclusion

Reviews make it easier to find hotels and cafes: the comments and critiques left by users help in choosing a good hotel, and web scraping is one of the best ways to collect them.

Key Points

1. We used Beautiful Soup to extract data and discussed other tools. Beautiful Soup is a good option for new learners, and Python makes it simple to begin.

2. If you want to scrape a website that relies on JavaScript to render its content, Selenium is probably your safest option.

3. Whether you want to write a small crawler or a large scraper that repeatedly searches the internet for updated data, Scrapy is the best option.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.