Rule-Based Sentiment Analysis in Python

This article was published as a part of the Data Science Blogathon

 

 

Image by Author made online on befunky.com

Intro:

According to experts, 80% of the world’s existing data is unstructured (images, videos, text, etc.). This data is generated by social media posts, call transcripts, survey or interview responses, and text across blogs, forums, news sites, and so on.

It is humanly impossible to read all the text across the web and find patterns. Yet, there is definitely a need for the business to analyze this data for better actions.

One such process of drawing insights from textual data is Sentiment Analysis. To obtain the data for sentiment analysis, one can directly scrape the content from the web pages using different web scraping techniques.

If you are new to web scraping, feel free to check out my article “Web scraping with Python: BeautifulSoup“.

What is Sentiment Analysis?

Sentiment Analysis (also known as opinion mining or emotion AI) is a sub-field of NLP that measures the inclination of people’s opinions (Positive/Negative/Neutral) within the unstructured text.

Sentiment Analysis can be performed using two approaches: rule-based and machine-learning-based.

A few applications of Sentiment Analysis

  • Market analysis
  • Social media monitoring
  • Customer feedback analysis – Brand sentiment or reputation analysis
  • Market research

What is Natural Language Processing (NLP)?

Natural language is the way we humans communicate with each other, whether as speech or text. NLP is the automatic manipulation of natural language by software. It is an umbrella term that combines Natural Language Understanding (NLU) and Natural Language Generation (NLG).

NLP = NLU + NLG

Some of the Python Natural Language Processing (NLP) libraries are:

  • Natural Language Toolkit (NLTK)
  • TextBlob
  • SpaCy
  • Gensim
  • CoreNLP

I hope we now have a basic understanding of the terms Sentiment Analysis and NLP.

This article focuses on the rule-based approach to Sentiment Analysis.

 

Rule-based approach

This is a practical approach to analyzing text without training or using machine learning models. It relies on a set of rules by which the text is labeled as positive, negative, or neutral. These rules are drawn from lexicons: lists of words with associated sentiment scores. Hence, the rule-based approach is also called the lexicon-based approach.

Widely used lexicon-based approaches are TextBlob, VADER, SentiWordNet.
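To make the idea concrete, here is a minimal sketch of a lexicon-based labeler in plain Python. The tiny LEXICON dictionary and its scores are made up for illustration; real lexicons such as the ones behind TextBlob, VADER, and SentiWordNet are far larger and apply extra rules.

```python
import re

# Hypothetical mini-lexicon: word -> sentiment score (illustrative values only)
LEXICON = {
    "good": 1.0, "great": 2.0, "love": 2.0,
    "bad": -1.0, "terrible": -2.0, "boring": -1.0,
}

def lexicon_label(text):
    """Label text by summing the lexicon scores of its words."""
    words = re.findall(r"[a-z]+", text.lower())
    score = sum(LEXICON.get(word, 0.0) for word in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

labels = [lexicon_label(t) for t in (
    "A great phone, I love it",
    "Terrible battery and a boring design",
    "It arrived on Monday",
)]
print(labels)
```

No training data is involved: the label depends only on which words appear in the lexicon and on the rule used to combine their scores.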

Data preprocessing steps:

  1. Cleaning the text
  2. Tokenization
  3. Enrichment – POS tagging
  4. Stopwords removal
  5. Obtaining the stem words

Before diving into the above steps, let me import the text data from a txt file.

Importing a text file using Pandas read CSV function

# Import the pandas library (install it first with: pip install pandas)
import pandas as pd

# Create a pandas dataframe from the tab-separated reviews.txt file
data = pd.read_csv('reviews.txt', sep='\t')
data.head()

 

[Image: output of data.head()]

This doesn’t look clean: the “Unnamed: 0” column is not needed. So, we will now drop it using the DataFrame drop function.

mydata = data.drop('Unnamed: 0', axis=1)
mydata.head()

Our dataset has a total of 240 observations (reviews).
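As a side note, an “Unnamed: 0” column usually shows up when the first column of the file has an empty header, which is typical of a saved row index. Assuming that is the case for reviews.txt (I am only guessing at its layout here), the column can also be avoided at load time with index_col=0. The sample string below is a hypothetical stand-in for the file.

```python
import io
import pandas as pd

# Hypothetical stand-in for reviews.txt: tab-separated, first header cell empty
sample = "\treview\n0\tGreat product, works well\n1\tStopped working after a week\n"

# Default load: the unnamed first column becomes 'Unnamed: 0'
with_extra = pd.read_csv(io.StringIO(sample), sep="\t")

# index_col=0 uses the first column as the row index instead
without_extra = pd.read_csv(io.StringIO(sample), sep="\t", index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

Either route ends with a dataframe that has only the review column; dropping the column afterwards, as done above, works just as well.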

Step 1: Cleaning the text

In this step, we need to remove the special characters and numbers from the text. We can use re, Python’s regular expression library.

import re

# Define a function to clean the text
def clean(text):
    # Replace every run of non-alphabetic characters (punctuation, digits) with a space
    text = re.sub('[^A-Za-z]+', ' ', text)
    return text

# Cleaning the text in the review column
mydata['Cleaned Reviews'] = mydata['review'].apply(clean)
mydata.head()

Explanation:  “clean” is the function that takes text as input and returns the text without any punctuation marks or numbers in it. We applied it to the ‘review’ column and created a new column ‘Cleaned Reviews’ with the cleaned text. 

 

[Image: mydata.head() output with the new ‘Cleaned Reviews’ column]

Great, look at the above image: all the special characters and numbers are removed.
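To see what the pattern does on its own, here is the same clean function applied to two made-up sentences (the sentences are my own examples, not rows from the dataset).

```python
import re

def clean(text):
    # Replace every run of non-alphabetic characters with a single space
    return re.sub('[^A-Za-z]+', ' ', text)

first = clean("Loved it!!! 10/10, would buy again.")
second = clean("Don't buy this")

print(repr(first))
print(repr(second))
```

Two side effects are worth knowing: a trailing space is left where the final full stop used to be, and the apostrophe in "Don't" is replaced too, so the contraction is split into "Don" and "t".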

 

Step 2: Tokenization

Tokenization is the process of breaking the text into smaller pieces called tokens. It can be performed at the sentence level (sentence tokenization) or the word level (word tokenization).

I will be performing word-level tokenization using the nltk function word_tokenize().

Note: As our text data is a little large, first I will illustrate steps 2-5 with small example sentences.

Let’s say we have the sentence “This is an article on Sentiment Analysis”. It can be broken down into small pieces (tokens) as shown below.

[Image: the example sentence split into word tokens]
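The two levels of tokenization can be sketched with plain regular expressions. This is only a simplified stand-in for nltk's tokenizers, which handle punctuation, abbreviations, and contractions more carefully; the second sentence in the sample text is added by me for illustration.

```python
import re

text = "This is an article on Sentiment Analysis. It covers three lexicons."

# Sentence level: split at whitespace that follows ., ! or ?
sentences = re.split(r"(?<=[.!?])\s+", text)

# Word level: keep runs of letters, so punctuation is dropped
words = re.findall(r"[A-Za-z]+", sentences[0])

print(sentences)
print(words)
```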

Step 3: Enrichment – POS tagging

Parts of Speech (POS) tagging is the process of converting each token into a tuple of the form (word, tag). POS tagging is essential for preserving the context of the word and for lemmatization.

This can be achieved by using the nltk pos_tag function.

Shown below are the POS tags of the example sentence “This is an article on Sentiment Analysis”.

[Image: POS tags of the example sentence]

Check out the list of possible pos tags from here.

Step 4: Stopwords removal

Stopwords are words that carry very little useful information, such as “is” or “an” in English. We need to remove them as part of text preprocessing. nltk ships stopword lists for a number of languages.

See the stopwords in the English language.

[Image: nltk’s English stopword list]

Example of removing stopwords:

[Image: stopword removal applied to the example sentence]

The stopwords This, is, an, on are removed and the output sentence is ‘article Sentiment Analysis’.
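The removal itself is just a membership test against a set. Here is a small sketch with a hand-picked stopword set (an assumption for illustration; nltk's English list is much longer).

```python
# Hand-picked stopwords for illustration only
STOPWORDS = {"this", "is", "an", "on", "the", "a", "of"}

tokens = ["This", "is", "an", "article", "on", "Sentiment", "Analysis"]

# Compare in lowercase so that "This" matches "this"
kept = [t for t in tokens if t.lower() not in STOPWORDS]
result = " ".join(kept)
print(result)
```

Lowercasing before the comparison matters because stopword lists are normally stored in lowercase.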

Step 5: Obtaining the stem words

A stem is a part of a word responsible for its lexical meaning. The two popular techniques of obtaining the root/stem words are Stemming and Lemmatization.

The key difference is that Stemming often gives meaningless root words, as it simply chops off some characters at the end. Lemmatization gives meaningful root words; however, it requires the POS tags of the words.

Example to illustrate the difference between Stemming and Lemmatization: Click here for code

[Image: Stem and Lemma outputs for example words]

In the above example, Stem is the output of Stemming and Lemma is the output of Lemmatization.

For the word glanced, the stem glanc is meaningless, whereas the lemma glance is a proper word.
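The contrast can be shown with a toy example. Neither function below is a real algorithm: naive_stem is a deliberately crude suffix stripper (not the Porter stemmer), and naive_lemma is a hypothetical lookup table standing in for a dictionary-backed lemmatizer such as WordNetLemmatizer.

```python
def naive_stem(word):
    # Chop a common suffix off the end without checking the result is a word
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical (word, POS) -> lemma table; a real lemmatizer uses a full dictionary
LEMMAS = {("glanced", "v"): "glance", ("studies", "n"): "study"}

def naive_lemma(word, pos):
    # Fall back to the word itself when no entry is known
    return LEMMAS.get((word, pos), word)

for word, pos in [("glanced", "v"), ("studies", "n")]:
    print(word, naive_stem(word), naive_lemma(word, pos))
```

The stemmer needs nothing but the string, which is why it is fast and sometimes wrong; the lemmatizer needs the POS tag to pick the right dictionary entry.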

We have now understood steps 2-5 through simple examples. Without any further delay, let us get back to our actual problem.

Code for Steps 2-4: Tokenization, POS tagging, Stopwords removal

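import nltk
from nltk import pos_tag
from nltk.corpus import stopwords, wordnet
from nltk.tokenize import word_tokenize

# Download the resources used below (recent NLTK releases may ask for
# differently named tokenizer/tagger packages)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('wordnet')

# Map the first letter of a Penn Treebank tag to a WordNet POS constant
pos_dict = {'J': wordnet.ADJ, 'V': wordnet.VERB, 'N': wordnet.NOUN, 'R': wordnet.ADV}

# Build the stopword set once instead of on every loop iteration
stop_words = set(stopwords.words('english'))

def token_stop_pos(text):
    # Tokenize and tag first, so the tagger sees the full sentence
    tags = pos_tag(word_tokenize(text))
    newlist = []
    for word, tag in tags:
        # Keep non-stopwords as (word, wordnet_pos); unmapped tags become None
        if word.lower() not in stop_words:
            newlist.append((word, pos_dict.get(tag[0])))
    return newlist

mydata['POS tagged'] = mydata['Cleaned Reviews'].apply(token_stop_pos)
mydata.head()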

Explanation: token_stop_pos is the function that takes the text and performs tokenization, removes stopwords, and tags the words to their POS. We applied it to the ‘Cleaned Reviews’ column and created a new column for ‘POS tagged’ data.

As mentioned earlier, to obtain an accurate lemma, the WordNetLemmatizer requires POS tags in the form of ‘n’, ‘a’, etc. But the POS tags obtained from pos_tag are Penn Treebank tags such as ‘NN’, ‘JJ’, etc.

To map pos_tag to wordnet tags,  we created a dictionary pos_dict. Any pos_tag that starts with J is mapped to wordnet.ADJ, any pos_tag that starts with R is mapped to wordnet.ADV, and so on.

Our tags of interest are Noun, Adjective, Adverb, Verb. Anything out of these four is mapped to None.
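Here is the mapping in isolation. The wordnet constants are plain single-letter strings (ADJ is 'a', VERB is 'v', NOUN is 'n', ADV is 'r'), so the sketch uses the letters directly and runs without nltk. The tagged list is hand-written for illustration, not actual tagger output.

```python
# Same mapping as pos_dict, written with the literal values of the wordnet constants
pos_dict = {'J': 'a', 'V': 'v', 'N': 'n', 'R': 'r'}

# Hand-written Penn Treebank tags (illustrative, not produced by pos_tag)
tagged = [
    ("quickly", "RB"), ("glanced", "VBD"), ("beautiful", "JJ"),
    ("articles", "NNS"), ("the", "DT"),
]

# Only the first letter of the tag is used; unknown letters give None
mapped = [(word, pos_dict.get(tag[0])) for word, tag in tagged]
print(mapped)
```

Looking at only the first letter is what lets ‘NN’, ‘NNS’, and ‘NNP’ all collapse to noun, and ‘VB’, ‘VBD’, and ‘VBZ’ to verb.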

[Image: mydata.head() output showing the ‘POS tagged’ column]

In the above fig, we can observe that each word of column ‘POS tagged’ is mapped to its POS from pos_dict.

Code for Step 5: Obtaining the stem words – Lemmatization

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def lemmatize(pos_data):
    lemma_rew = " "
    for word, pos in pos_data:
        if not pos:
            # No WordNet POS available: keep the word unchanged
            lemma = word
        else:
            lemma = wordnet_lemmatizer.lemmatize(word, pos=pos)
        lemma_rew = lemma_rew + " " + lemma
    return lemma_rew

mydata['Lemma'] = mydata['POS tagged'].apply(lemmatize)
mydata.head()

Explanation: lemmatize is a function that takes pos_tag tuples, and gives the Lemma for each word in pos_tag based on the pos of that word. We applied it to the ‘POS tagged’ column and created a column ‘Lemma’ to store the output.


Yay, after a long journey, we are done with preprocessing of the text.

Now, take a minute to look at the ‘review’, ‘Lemma’ columns and observe how the text is processed.


As we are done with the data preprocessing, our final data looks clean. Take a short break, and come back to continue with the real task.

Sentiment Analysis using TextBlob:

TextBlob is a Python library for processing textual data. It provides a consistent API for diving into common natural language processing (NLP) tasks such as part-of-speech tagging, noun phrase extraction, sentiment analysis, and more.

The two measures that are used to analyze the sentiment are:

  • Polarity – talks about how positive or negative the opinion is
  • Subjectivity – talks about how subjective the opinion is

TextBlob(text).sentiment gives us the Polarity and Subjectivity values.
Polarity ranges from -1 to 1 (1 is the most positive, 0 is neutral, -1 is the most negative).
Subjectivity ranges from 0 to 1 (0 being very objective and 1 being very subjective).


Python Code:

from textblob import TextBlob

# function to calculate subjectivity
def getSubjectivity(review):
    return TextBlob(review).sentiment.subjectivity

# function to calculate polarity
def getPolarity(review):
    return TextBlob(review).sentiment.polarity

# function to label a review from its polarity score
def analysis(score):
    if score < 0:
        return 'Negative'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Positive'

Explanation: We created functions to obtain the Polarity and Subjectivity values and to label the review based on the Polarity score.
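To see how the labeling rule behaves, here it is applied to a few made-up polarity scores (the numbers are hypothetical, not TextBlob output).

```python
def analysis(score):
    # Negative below zero, Neutral at exactly zero, Positive above zero
    if score < 0:
        return 'Negative'
    elif score == 0:
        return 'Neutral'
    else:
        return 'Positive'

# Hypothetical polarity scores
scores = [-0.35, 0.0, 0.01, 0.6]
labels = [analysis(s) for s in scores]
print(labels)
```

Note that only a score of exactly 0 is treated as Neutral, so even a faintly positive score such as 0.01 is labeled Positive.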

Create a new data frame with the review and Lemma columns and apply the above functions:

fin_data = pd.DataFrame(mydata[['review', 'Lemma']])
# fin_data['Subjectivity'] = fin_data['Lemma'].apply(getSubjectivity) 
fin_data['Polarity'] = fin_data['Lemma'].apply(getPolarity) 
fin_data['Analysis'] = fin_data['Polarity'].apply(analysis)
fin_data.head()

Count the number of positive, negative, neutral reviews.

tb_counts = fin_data.Analysis.value_counts()

tb_counts

Sentiment Analysis using VADER

VADER stands for Valence Aware Dictionary and Sentiment Reasoner. VADER not only tells whether a statement is positive or negative but also gives the intensity of the emotion.


The pos, neg, and neu intensities sum to 1. Compound ranges from -1 to 1 and is the metric used to draw the overall sentiment. In this article:
positive if compound >= 0.5
neutral if -0.5 < compound < 0.5
negative if compound <= -0.5
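The choice of cut-off has a big effect on the labels. The sketch below uses a hypothetical helper, label_compound, with a configurable threshold to compare the ±0.5 used in this article with the ±0.05 that VADER's own documentation suggests as typical thresholds. The compound scores are made up.

```python
def label_compound(compound, threshold=0.5):
    # Scores inside (-threshold, threshold) are treated as Neutral
    if compound >= threshold:
        return 'Positive'
    if compound <= -threshold:
        return 'Negative'
    return 'Neutral'

# Hypothetical compound scores
scores = [0.82, 0.30, -0.10, -0.64]

strict = [label_compound(s) for s in scores]
lenient = [label_compound(s, threshold=0.05) for s in scores]

print(strict)
print(lenient)
```

With the wider ±0.5 band, mildly positive or negative reviews fall into Neutral, so VADER will report more neutral reviews than it would with the narrower band.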

Python Code:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyzer = SentimentIntensityAnalyzer()

# function to calculate vader sentiment
def vadersentimentanalysis(review):
    vs = analyzer.polarity_scores(review)
    return vs['compound']

fin_data['Vader Sentiment'] = fin_data['Lemma'].apply(vadersentimentanalysis)

# function to label a review from its compound score
def vader_analysis(compound):
    if compound >= 0.5:
        return 'Positive'
    elif compound <= -0.5:
        return 'Negative'
    else:
        return 'Neutral'

fin_data['Vader Analysis'] = fin_data['Vader Sentiment'].apply(vader_analysis)
fin_data.head()

Explanation: Created functions to obtain the Vader scores and to label the reviews based on compound scores

Count the number of positive, negative, neutral reviews.

vader_counts = fin_data['Vader Analysis'].value_counts()
vader_counts

Sentiment Analysis using SentiWordNet

SentiWordNet uses the WordNet database. It is important to obtain the POS and lemma of each word. We then use the lemma and POS to obtain the synonym sets (synsets), take the positive, negative, and objective scores either for all the possible synsets or for the very first synset, and label the text.

if positive score > negative score, the sentiment is positive
if positive score < negative score, the sentiment is negative
if positive score = negative score, the sentiment is neutral
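The decision rule can be tried out without the SentiWordNet corpus. In the sketch below, the scores dictionary holds made-up (positive, negative) pairs standing in for the first-synset scores that SentiWordNet would return, and label_from_scores is a hypothetical helper name.

```python
# Hypothetical (positive, negative) scores per word; real values come from SentiWordNet
scores = {"good": (0.75, 0.0), "slow": (0.0, 0.25), "phone": (0.0, 0.0)}

def label_from_scores(words):
    # Sum positive and negative scores separately; unknown words contribute nothing
    pos_total = sum(scores.get(w, (0.0, 0.0))[0] for w in words)
    neg_total = sum(scores.get(w, (0.0, 0.0))[1] for w in words)
    if pos_total > neg_total:
        return "Positive"
    if pos_total < neg_total:
        return "Negative"
    return "Neutral"

examples = [["good", "phone"], ["slow", "phone"], ["phone"], ["good", "slow", "slow", "slow"]]
labels = [label_from_scores(words) for words in examples]
print(labels)
```

The last example shows that one strongly positive word can be cancelled out by several mildly negative ones.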

Python Code:

nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn

def sentiwordnetanalysis(pos_data):
    sentiment = 0
    tokens_count = 0
    for word, pos in pos_data:
        # Skip words whose POS is not noun, adjective, adverb, or verb
        if not pos:
            continue
        lemma = wordnet_lemmatizer.lemmatize(word, pos=pos)
        if not lemma:
            continue
        synsets = wordnet.synsets(lemma, pos=pos)
        if not synsets:
            continue
        # Take the first sense, the most common
        synset = synsets[0]
        swn_synset = swn.senti_synset(synset.name())
        sentiment += swn_synset.pos_score() - swn_synset.neg_score()
        tokens_count += 1
        # print(swn_synset.pos_score(), swn_synset.neg_score(), swn_synset.obj_score())
    # No word could be scored: label as Neutral so the column holds only labels
    if not tokens_count:
        return "Neutral"
    if sentiment > 0:
        return "Positive"
    if sentiment == 0:
        return "Neutral"
    else:
        return "Negative"

fin_data['SWN analysis'] = mydata['POS tagged'].apply(sentiwordnetanalysis)
fin_data.head()

Explanation: We created a function that obtains the positive and negative scores for the first synset of each word, then labels the text by summing the differences between the positive and negative scores.

Count the number of positive, negative, neutral reviews.

swn_counts= fin_data['SWN analysis'].value_counts()
swn_counts

So far, we have seen the implementation of sentiment analysis using some of the popular lexicon-based techniques. Now let us quickly do some visualization and compare the results.

Visual representation of TextBlob, VADER, SentiWordNet results

We will plot the count of positive, negative, and neutral reviews for all three techniques.

import matplotlib.pyplot as plt
# Jupyter magic: render plots inline (remove this line outside a notebook)
%matplotlib inline

# One pie chart per technique, side by side.
# explode has three entries, so each result is assumed to contain exactly three labels.
plt.figure(figsize=(15,7))
plt.subplot(1,3,1)
plt.title("TextBlob results")
plt.pie(tb_counts.values, labels = tb_counts.index, explode = (0, 0, 0.25), autopct='%1.1f%%', shadow=False)
plt.subplot(1,3,2)
plt.title("VADER results")
plt.pie(vader_counts.values, labels = vader_counts.index, explode = (0, 0, 0.25), autopct='%1.1f%%', shadow=False)
plt.subplot(1,3,3)
plt.title("SentiWordNet results")
plt.pie(swn_counts.values, labels = swn_counts.index, explode = (0, 0, 0.25), autopct='%1.1f%%', shadow=False)
plt.show()

If we observe the above image, TextBlob and SentiWordNet results look a little close while the VADER results show a large variation.
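Pie charts only compare the overall label distribution. If you also want to know how often two techniques agree on the same review, a per-review comparison is one option. The sketch below uses made-up labels for five reviews; with the real data, the lists would come from the ‘Analysis’, ‘Vader Analysis’, and ‘SWN analysis’ columns.

```python
from collections import Counter
from itertools import combinations

# Hypothetical labels for five reviews
labels = {
    "TextBlob":     ["Positive", "Positive", "Negative", "Neutral",  "Positive"],
    "VADER":        ["Positive", "Neutral",  "Neutral",  "Neutral",  "Positive"],
    "SentiWordNet": ["Positive", "Positive", "Negative", "Negative", "Positive"],
}

# Label counts per technique (what the pie charts show)
counts = {name: Counter(values) for name, values in labels.items()}

# Share of reviews on which each pair of techniques gives the same label
agreement = {}
for a, b in combinations(labels, 2):
    same = sum(x == y for x, y in zip(labels[a], labels[b]))
    agreement[(a, b)] = same / len(labels[a])

print(counts)
print(agreement)
```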

End Notes:

Congratulations 🎉 to us. In this article, we learned the various steps of data preprocessing and different lexicon-based approaches for Sentiment Analysis. We compared the results of TextBlob, VADER, and SentiWordNet using pie plots.

References:

TextBlob documentation

VADER sentiment analysis

SentiWordNet

Check out the complete Jupyter Notebook here hosted on GitHub.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
