Rahul Shah — December 1, 2021
Advanced NLP Text
This article was published as a part of the Data Science Blogathon.

Over time, the data have increased exponentially. Digitalization in all areas generates a large number of data per second. Today, this large amount of data is an unstructured type, i.e. there are no predefined formats. This includes the images, videos, audio, text, etc. Among these, the text is the most generated type of unstructured data. Whether it’s a machine’s log file or a customer’s product review, everything is based mainly on text. Companies and enterprises are using these data to make decisions. These decisions may include the modification of existing policies, the creation of new products, the formulation of new offers to the community, etc. Since the generated text is human-readable, it cannot be understood by machines. Therefore, to make machine-friendly, the concept of natural language processing or NLP has been introduced.

sentiment score from text
Image by Ann H from Pexels

In this article, we will learn how to analyze the text in order to find its sentiment score. Since we want to calculate sentiment scores, we would look at text data generated by humans, such as product reviews, restaurants, etc., or tweets from famous personalities. With the help of these scores, we can understand the emotion behind the text and classify them for further decision-making.

Table of Contents

1. Need of Sentiment Scores

2. Method 1: Using Positive and Negative Word Count – With Normalization

3. Method 2: Using Positive and Negative Word Counts – With Semi Normalization

4. Method 3: Using VADER SentimentIntensityAnalyser

5. Conclusions

Need of Sentiment Score

Today, text data is widely used to understand the behavior of customers and users towards services. For example, companies use product reviews to understand customers’ opinions on products. Based on similar analyses, customers can be classified into groups and make specific decisions for customers who are unhappy. The use of texts is endless and data scientists around the world have helped analyze the trends of texts. You can also use star ratings to understand the sentiment of users’ feedback on products or services. However, in order to understand the actual reaction, analyzing text-based feedback is a better solution.

In order to understand the emotion behind a text, we can analyze the text using NLP techniques to find patterns. However, if the same process has to be done in a large number of texts, it will take a lot of time, and new texts will be added until then. For these cases, the calculation of sentiment scores is useful. Thus, a positive score shows a positive text and a negative score shows a negative text. The automation of this sentiment score calculator will help to analyze the text and make quick decisions. For calculating the sentiment scores, we would be using our methods as well using some pre-defined rule-based methods for calculating these scores in a jiffy.

Method One: Using Positive and Negative Word Count – With Normalization for Calculating Sentiment Score

In this method, we will calculate the Sentiment Scores by classifying and counting the Negative and Positive words from the given text and taking the ratio of the difference of Positive and Negative Word Counts and Total Word Count. We will be using the Amazon Cell Phone Reviews dataset from Kaggle.

To calculate the sentiment, here we will use the formula:

Let’s first import the Libraries

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re

Now, we will import the dataset. Out of all the columns present in this dataset, we will use the ‘body’ column which contains the cell phone reviews.

df = pd.read_csv('20191226-reviews.csv', usecols=['body'])

Next, we will create an instance of WordNetLemmatizer, which we will be using in the next step. Also, we will call the Stop Words from the NLTK library.

lemma = WordNetLemmatizer()
stop_words = stopwords.words('english')

Next, we will define a function to preprocess the text. In this function, first, we converted the input string to lowercase. Then, we extracted only the letters from this string. This task is performed using Regular Expressions. Next, we tokenized the string. After tokenizing, we filtered the text of stop words. At last, we lemmatized the remaining words and returned these words.

def text_prep(x):
     corp = str(x).lower() 
     corp = re.sub('[^a-zA-Z]+',' ', corp).strip() 
     tokens = word_tokenize(corp)
     words = [t for t in tokens if t not in stop_words]
     lemmatize = [lemma.lemmatize(w) for w in words]
    
     return lemmatize

 

Now, we will apply this function to every text in the ‘body‘ column.

preprocess_tag = [text_prep(i) for i in df['body']]
df["preprocess_txt"] = preprocess_tag

Next, we will calculate the number of resultant words for each text. This will be helpful later when we calculate the Sentiment Score.

df['total_len'] = df['preprocess_txt'].map(lambda x: len(x))

In the next step, we defined the generic Negative Words and Positive Words list. For this, we have taken the positive-text.txt and negative-text.txt files or generally known as Opinion Lexicon by the respective authors(s). The same can be downloaded from here.

file = open('negative-words.txt', 'r')
neg_words = file.read().split()
file = open('positive-words.txt', 'r')
pos_words = file.read().split()

Next, we will make a count of positive and negative words for the text. For this, we will count all the words in preprocess_txt that are present in positive words list pos_words, and similarly will count all the words in preprocess_txt that are present in the negative words list neg_words. We will add these values into the main Pandas DataFrame.

num_pos = df['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words]))
df['pos_count'] = num_pos
num_neg = df['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words]))
df['neg_count'] = num_neg

Finally, we will calculate the sentiment. We will create a ‘sentiment‘ column and perform the calculation for each text.

df['sentiment'] = round((df['pos_count'] - df['neg_count']) / df['total_len'], 2)

Putting it all together:

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
df = pd.read_csv('20191226-reviews.csv', usecols=['body'])
lemma = WordNetLemmatizer()
stop_words = stopwords.words('english')
def text_prep(x: str) -> list:
     corp = str(x).lower() 
     corp = re.sub('[^a-zA-Z]+',' ', corp).strip() 
     tokens = word_tokenize(corp)
     words = [t for t in tokens if t not in stop_words]
     lemmatize = [lemma.lemmatize(w) for w in words]
     return lemmatize
preprocess_tag = [text_prep(i) for i in df['body']]
df["preprocess_txt"] = preprocess_tag
df['total_len'] = df['preprocess_txt'].map(lambda x: len(x))
file = open('negative-words.txt', 'r')
neg_words = file.read().split()
file = open('positive-words.txt', 'r')
pos_words = file.read().split()
num_pos = df['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words]))
df['pos_count'] = num_pos
num_neg = df['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words]))
df['neg_count'] = num_neg
df['sentiment'] = round((df['pos_count'] - df['neg_count']) / df['total_len'], 2)
df.head()

On executing this code, we get:

sentiment result using method 1

Image Source – Personal Computer

Method 2: Using Positive and Negative Word Counts – With Semi Normalization to calculate Sentiment Score

In this method, we calculate the sentiment score by evaluating the ratio of Count of Positive Words and Count of Negative Words + 1.  Since there is no difference of values involved, the sentiment value will always be more than 0. Also, adding 1 in the denominator would save from Zero Division Error.  Let’s start with the implementation. The implementation code will remain the same till the Negative and Positive Word Count from the previous part with a difference that this time we don’t need the total word count, thus will be omitting that part.

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
df = pd.read_csv('20191226-reviews.csv', usecols=['body'])
lemma = WordNetLemmatizer()
stop_words = stopwords.words('english')
def text_prep(x: str) -> list:
     corp = str(x).lower() 
     corp = re.sub('[^a-zA-Z]+',' ', corp).strip() 
     tokens = word_tokenize(corp)
     words = [t for t in tokens if t not in stop_words]
     lemmatize = [lemma.lemmatize(w) for w in words]
     return lemmatize
preprocess_tag = [text_prep(i) for i in df['body']]
df["preprocess_txt"] = preprocess_tag
file = open('negative-words.txt', 'r')
neg_words = file.read().split()
file = open('positive-words.txt', 'r')
pos_words = file.read().split()
num_pos = df['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words]))
df['pos_count'] = num_pos
num_neg = df['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words]))
df['neg_count'] = num_neg

Next, we will apply the formula and add the formulated value into the main DataFrame.

df['sentiment'] = round(df['pos_count'] / (df['neg_count']+1), 2)

Putting it all together.

import pandas as pd
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import re
df = pd.read_csv('20191226-reviews.csv', usecols=['body'])
lemma = WordNetLemmatizer()
stop_words = stopwords.words('english')
def text_prep(x: str) -> list:
     corp = str(x).lower() 
     corp = re.sub('[^a-zA-Z]+',' ', corp).strip() 
     tokens = word_tokenize(corp)
     words = [t for t in tokens if t not in stop_words]
     lemmatize = [lemma.lemmatize(w) for w in words]
     return lemmatize
preprocess_tag = [text_prep(i) for i in df['body']]
df["preprocess_txt"] = preprocess_tag
file = open('negative-words.txt', 'r')
neg_words = file.read().split()
file = open('positive-words.txt', 'r')
pos_words = file.read().split()
num_pos = df['preprocess_txt'].map(lambda x: len([i for i in x if i in pos_words]))
df['pos_count'] = num_pos
num_neg = df['preprocess_txt'].map(lambda x: len([i for i in x if i in neg_words]))
df['neg_count'] = num_neg
df['sentiment'] = round(df['pos_count'] / (df['neg_count']+1), 2)
df.head()

On executing this code, we get:

using method 2

Image Source – Personal Computer

Method Three: Using VADER SentimentIntensityAnalyser to calculate Sentiment Score

In this method, we will use the Sentiment Intensity Analyser which uses the VADER Lexicon. VADER is a long-form for Valence Aware and sEntiment Reasoner, a rule-based sentiment analysis tool. VADER calculates text emotions and determines whether the text is positive, neutral or, negative. This analyzer calculates text sentiment and produces four different classes of output scores: positive, negative, neutral, and compound.

Here, we will make use of the Compound Score. A compound score is the aggregate of the score of a word, or precisely, the sum of all words in the lexicon, normalized between -1 and 1. It is not recommended to use general text preprocessing techniques prior to calculation because this may affect VADER results. Using these VADER results, you can also classify each text into categories according to your needs.

Now, we will use VADER to implement the calculation of the sentiment score. Since this process is different from the previous two methods, we will be implementing it from scratch.

First, we will import the libraries. We will use the NLTK library for importing the SentimentIntensityAnalyzer.

import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

Next, we will create the instance of SentimentIntensityAnalyzer.

sent = SentimentIntensityAnalyzer()

Next, we will read the Cell Phone Review Data, which we have been using previously.

df = pd.read_csv('', usecols = ['body'])

Finally, we will calculate the Compound Poalrtity Score and add it into the Main DataFrame for each of the texts.

polarity = [round(sent.polarity_scores(i)['compound'], 2) for i in df['body']]
df['sentiment_score'] = polarity

Putting it all together:

import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer
sent = SentimentIntensityAnalyzer()
df = pd.read_csv('', usecols = ['body'])
polarity = [round(sent.polarity_scores(i)['compound'], 2) for i in df['body']]
df['sentiment_score'] = polarity
df.head()

On executing this code, we get:

sentiment score using third method
Image Source – Personal Computer

Conclusions

In this article, we learned about three different ways of calculating the Sentiment Scores for each of the given Cell Phone Reviews. The first two
methods are self-calculated while the third method was rule-based and the score is evaluated by itself. Each of these methods has its own pros and cons. There are numerous ways by which one can calculate the sentiment score. One can even define his own method of evaluating the scores. VADER is more robust and precise in terms of Internet slang, which is widely used today. One can try building a new and stronger method of evaluation that might cover a wider range of words.

About the Author

Connect with me on LinkedIn.

For any suggestions or article requests, you can email me here.

Check out my other Articles Here and on Medium

You can provide your valuable feedback to me on LinkedIn.

Thanks for giving your time!

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion

About the Author

Rahul Shah

IT Engineering Graduate currently pursuing Post Graduate Diploma in Data Science.

Our Top Authors

Download Analytics Vidhya App for the Latest blog/Article

Leave a Reply Your email address will not be published. Required fields are marked *