Amrutha K — February 25, 2022

This article was published as a part of the blog.

Table of Contents

  • Introduction
  • Working with Dataset
  • Import CountVectorizer
  • Import Support Vector Classifier(SVC)
  • Using Pipeline
  • Save the model
  • Prediction of new reviews using the model
  • Conclusion

Introduction

In this article, we work with the Restaurant Reviews dataset, which contains customer reviews that are either positive or negative. We will build a machine learning model that combines a Count Vectorizer with a Support Vector Classifier (SVC), and then use it to predict whether a given review is positive or negative.

Working with Dataset

Let’s start by looking into the dataset.

Here is the link for the dataset. You can download it and proceed.

https://drive.google.com/file/d/1TgqU0Q_wyEy250ed5xm3lAggYSKU71wN/view?usp=sharing

The dataset has two columns, Review and Liked. The Review column holds the text of each customer review. The Liked column is either 0 or 1: 1 indicates a positive review and 0 indicates a negative review.

Before working on the machine learning model, we import a few basic libraries.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

Next, we create a data frame. Download the dataset from the link above and load it with pandas.

# Import the Restaurant Reviews dataset (replace the path with the file's location on your computer)
df=pd.read_table(r"C:\Users\Admin\Downloads\Restaurant_Reviews.csv")

Between the quotation marks, paste the path of the Restaurant Reviews dataset on your computer. The resulting data frame is stored in the df variable.

Let’s view it.
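pd.read_table splits fields on tabs by default, so it suits a file with one review and one label per line separated by a tab. The sketch below illustrates this on a made-up two-row sample held in memory; the sample text and the names sample and sample_df are assumptions for illustration, not part of the real dataset. If your copy of the file is comma-separated, pd.read_csv would be the call to try instead.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the same Review/Liked layout,
# with a tab between the two fields.
sample = "Review\tLiked\nLoved the food.\t1\nService was slow.\t0\n"

# read_table uses a tab as its default separator.
sample_df = pd.read_table(io.StringIO(sample))

print(sample_df.shape)
print(list(sample_df.columns))
```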

df
[Figure: the df data frame as displayed in the notebook. Source: Author]

The output shows the first five and last five rows, along with the number of rows and columns in the data frame.

df.info()

The info() method summarizes the data frame. It gives the number of columns, the column labels, the number of non-null entries, the data type of each column, and the memory usage.

[Figure: output of df.info(). Source: Author]

Statistical Description:

The describe() method summarizes the numeric columns. It gives the count, mean, standard deviation, minimum value, the 25th, 50th and 75th percentiles, and the maximum value.

df.describe()

[Figure: output of df.describe(). Source: Author]

Let’s list the columns of df.

df.columns

Index(['Review', 'Liked'], dtype='object')

The nunique() method gives the number of unique values in a particular column.

df['Liked'].nunique()

2

The unique() method returns the unique values themselves.

print(df['Liked'].unique())
[1 0]

The value_counts() method gives the number of times each value occurs in the column.

df['Liked'].value_counts()
[Figure: output of df['Liked'].value_counts(). Source: Author]
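To see how the three methods differ, here is a minimal sketch on a made-up column of five labels. The toy data frame and its values are assumptions for illustration and are not taken from the dataset.

```python
import pandas as pd

# Made-up labels, only to illustrate the three methods.
toy = pd.DataFrame({"Liked": [1, 0, 1, 1, 0]})

n_classes = toy["Liked"].nunique()     # how many distinct values
classes = toy["Liked"].unique()        # the distinct values, in order of appearance
counts = toy["Liked"].value_counts()   # how often each value occurs

print(n_classes)
print(classes)
print(counts.to_dict())
```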

Let’s see the top 5 entries of the data frame.

df.head()

[Figure: output of df.head(). Source: Author]

Similarly, the tail() method is used to view the last 5 entries of the data frame.

Visualizations

plt.figure(figsize=(8,5))
sns.countplot(x=df.Liked);

Here we used the seaborn library to visualize the data. A count plot counts how many rows fall into each value of the column and draws one bar per value.

[Figure: bar graph of the number of reviews per Liked value. Source: Author]

Define X and Y

Here, X is the input feature that we give to the model, and Y is the output that the model should predict. In our dataset, the Review column is the input, and the Liked column is what the model will predict.

x=df['Review'].values
y=df['Liked'].values

Split the Dataset into Training and Testing Sets

For this, we import train_test_split from the scikit-learn library. The data is then divided into four sets: x_train, x_test, y_train and y_test. Both x and y are split into training and test sets. Since we do not pass test_size, scikit-learn uses its default and holds out 25% of the rows for testing.

from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test=train_test_split(x,y,random_state=0)
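As an optional variation, train_test_split accepts a stratify argument that keeps the ratio of 0s and 1s the same in the training and test parts. The sketch below shows it on made-up data; the names x_toy and y_toy and the 20 placeholder reviews are assumptions for illustration, and the article's own split does not use stratify.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 placeholder "reviews" with balanced 0/1 labels.
x_toy = np.array([f"review {i}" for i in range(20)])
y_toy = np.array([0, 1] * 10)

# The default test_size is 0.25; stratify keeps the class ratio
# roughly equal in the training and test parts.
x_tr, x_te, y_tr, y_te = train_test_split(
    x_toy, y_toy, random_state=0, stratify=y_toy
)

print(x_tr.shape, x_te.shape)
```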

View the Shapes of Train Sets and Test Sets

x_train.shape

(750,)

x_test.shape

(250,)

y_train.shape

(750,)

y_test.shape

(250,)

Import CountVectorizer

From the scikit-learn library we import CountVectorizer and store an instance in a variable called vect, setting stop_words to 'english' so that common English words are ignored.

The count vectorizer transforms each text into a vector of word counts, that is, the number of times each word in the vocabulary appears in that text. We fit the vectorizer on the training reviews only and then reuse the same vocabulary to transform the test reviews.

from sklearn.feature_extraction.text import CountVectorizer
vect=CountVectorizer(stop_words='english')
x_train_vect=vect.fit_transform(x_train)
x_test_vect=vect.transform(x_test)
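To make the word counting concrete, here is a minimal sketch on two made-up sentences. The sentences and the names docs and toy_vect are assumptions for illustration. It also shows why the test set is passed to transform rather than fit_transform: words that were not seen during fitting are simply ignored.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up sentences, only for illustration.
docs = ["the pizza was tasty", "tasty pizza tasty crust"]

toy_vect = CountVectorizer(stop_words="english")
toy_counts = toy_vect.fit_transform(docs)

# Learned vocabulary, in column order ("the" and "was" are stop words).
vocab = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(vocab)
print(toy_counts.toarray())

# A new text is encoded with the SAME vocabulary; "burger" is unseen
# and therefore does not get a column.
new_counts = toy_vect.transform(["tasty burger"])
print(new_counts.toarray())
```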

Import Support Vector Classifier(SVC)

Import the Support Vector Classifier (SVC) from scikit-learn's svm module and assign an instance of it to a variable called model.

from sklearn.svm import SVC
model=SVC()

Train the Model

The fit method trains the model. We pass it the vectorized training reviews and the training labels.

model.fit(x_train_vect,y_train)

Predict the Test Results

Use the predict method to predict the test results, passing it the vectorized reviews of the test set.

y_pred=model.predict(x_test_vect)

Evaluate the Model

Machine learning models can be evaluated with various metrics, all of which live in scikit-learn's metrics module. For the support vector classifier (SVC), we use the accuracy score here.

Import accuracy_score from the metrics module and pass it the two sets of labels to compare: the true labels of the test set and the predicted labels.

from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

0.792

For my model, the accuracy is 79.2%.
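Accuracy is a single number, so it does not show which kind of review the model gets wrong. One option, not used in the original walkthrough, is confusion_matrix from the same metrics module. The sketch below uses made-up labels (y_true_toy and y_pred_toy are assumptions for illustration), not the article's results.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels, only for illustration.
y_true_toy = [1, 0, 1, 1, 0, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1]

acc = accuracy_score(y_true_toy, y_pred_toy)

# Rows are the true classes (0, then 1), columns the predicted classes.
cm = confusion_matrix(y_true_toy, y_pred_toy)

print(acc)
print(cm)
```

In this toy case four of the six predictions are correct, and the matrix shows one negative review predicted as positive and one positive review predicted as negative.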

Using Pipeline

Before using a pipeline in our model, let us understand what it does. A pipeline chains several steps, such as transformers and a final model, into a single object, so that fitting and predicting run every step in order. The code below makes this clearer.

First, we will see without using the pipeline.

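    from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
    from sklearn.linear_model import SGDClassifier
    vect = CountVectorizer()
    tfidf = TfidfTransformer()
    clf = SGDClassifier()
    # Fit every step on the training set (ytrain holds the training labels)
    vX = vect.fit_transform(Xtrain)
    tfidfX = tfidf.fit_transform(vX)
    clf.fit(tfidfX, ytrain)
    # Now apply the already fitted steps to the test set
    vX = vect.transform(Xtest)
    tfidfX = tfidf.transform(vX)
    predicted = clf.predict(tfidfX)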

With a pipeline, the same work takes far fewer lines. We pass all the steps we want to use, in order, to the Pipeline class.

from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
# Fit all steps on the training set (ytrain holds the training labels)
pipeline.fit(Xtrain, ytrain)
# Now evaluate all steps on test set
predicted = pipeline.predict(Xtest)

Now let’s use a pipeline for our own model. Import make_pipeline from scikit-learn's pipeline module and pass it a CountVectorizer and an SVC.

from sklearn.pipeline import make_pipeline
text_model=make_pipeline(CountVectorizer(),SVC())
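Unlike Pipeline, make_pipeline does not ask for step names; it derives them from the lowercased class names. A minimal sketch on a tiny made-up dataset (the four texts, their labels and the name toy_pipe are assumptions for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

toy_pipe = make_pipeline(CountVectorizer(), SVC())

# Step names are generated automatically from the class names.
step_names = list(toy_pipe.named_steps)
print(step_names)

# Made-up training data: raw strings go in, the pipeline vectorizes them.
texts = ["great food", "loved it", "terrible service", "awful food"]
labels = [1, 1, 0, 0]
toy_pipe.fit(texts, labels)

toy_preds = toy_pipe.predict(["great service"])
print(toy_preds)
```

Because the vectorizer lives inside the pipeline, raw review strings can be passed straight to fit and predict.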

Train the Model with Training Sets

As before, the fit method trains the model. This time we train the new pipeline model, passing it the raw training reviews.

text_model.fit(x_train,y_train)

Predict the Test Results

Similarly, predict the results using the predict method.

y_pred=text_model.predict(x_test)

Display the predictions:

y_pred
[Figure: array of predicted labels for the test set. Source: Author]

Evaluate the Model

Let’s evaluate our new model using accuracy_score.

accuracy_score(y_pred,y_test)

0.792

The accuracy of the model is 79.2%.

Save the Model

We can save the model with joblib. Import joblib and call its dump method with two arguments: the model and the name of the file to write.

import joblib
joblib.dump(text_model,'Project')

To use the model again, retrieve it with the load method, passing the same file name, and assign the result to a variable.

import joblib
text_model=joblib.load('Project')
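A quick way to check that saving worked is to confirm the reloaded model gives the same predictions as the original. The sketch below does this with a small made-up model and a temporary folder; the texts, labels and the file name toy_model.joblib are assumptions for illustration.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Small made-up model, only to demonstrate the save/load round trip.
texts = ["great food", "loved it", "terrible service", "awful food"]
labels = [1, 1, 0, 0]
small_model = make_pipeline(CountVectorizer(), SVC()).fit(texts, labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.joblib")  # hypothetical file name
    joblib.dump(small_model, path)
    restored = joblib.load(path)

# The restored pipeline should predict exactly like the original.
same = bool((restored.predict(texts) == small_model.predict(texts)).all())
print(same)
```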

Prediction of New Reviews using the Model

Now our model is trained and ready to use. Let us try it with some examples.

text_model.predict(['hello!!Love Your Food'])

array([1], dtype=int64)

This is a positive review, and as expected our model predicted 1 for it, which means positive.

Let’s try with a negative review and see what it will predict.

text_model.predict(["omg!!it was too spice and i asked you don't add too much "])

array([0], dtype=int64)

As expected it gave 0 as output which means negative.
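predict only returns 0 or 1. If you also want a rough sense of how strongly the model leans one way, a pipeline ending in an SVC exposes decision_function, which returns a signed distance from the separating boundary for each review. The sketch below uses a small made-up model (the texts, labels and new reviews are assumptions for illustration), not the model trained above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Small made-up training set, only for illustration.
texts = ["great food", "loved it", "terrible service", "awful food"]
labels = [1, 1, 0, 0]
demo_model = make_pipeline(CountVectorizer(), SVC()).fit(texts, labels)

new_reviews = ["loved the great food", "terrible and awful"]
demo_preds = demo_model.predict(new_reviews)
demo_scores = demo_model.decision_function(new_reviews)

# One predicted label and one score per review; a larger absolute
# score means the review lies further from the boundary.
for review, pred, score in zip(new_reviews, demo_preds, demo_scores):
    print(review, pred, round(float(score), 3))
```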

Conclusion

We have learned how to work with a support vector classifier and a count vectorizer, and how to combine the two in a single model using a pipeline. The resulting model predicts whether a review is positive or negative, and we tried it on a few example reviews. We also saved the model with joblib and loaded it back for reuse.

Hope you guys found this article on restaurant reviews analysis useful. Share your views in the comments sections. Read more articles on our blog.

Connect with me on LinkedIn: https://www.linkedin.com/in/amrutha-k-6335231a6vl/

Thank you!

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion. 
