Guide to Sentiment Analysis using Natural Language Processing

Nikhil Raj 13 Mar, 2024 • 15 min read

Introduction

In an era overwhelmed by huge volumes of digital data, understanding public opinion and sentiment has become increasingly crucial. Sentiment analysis, a subfield of natural language processing, offers a way to extract insights from textual data by identifying the emotional tone and attitude expressed within it. This introduction serves as a primer for exploring the intricacies of sentiment analysis, from its fundamental concepts to its practical applications and implementation.

This article was published as a part of the Data Science Blogathon

What is Sentiment Analysis?

Sentiment analysis, as the name suggests, means identifying the view or emotion behind a situation. In essence, it means analyzing and finding the emotion or intent behind a piece of text, speech, or any other mode of communication.

In this article, we will focus on sentiment analysis of text data using NLP.

We, humans, communicate with each other in a variety of languages, and any language is just a mediator or a way in which we try to express ourselves. And, whatever we say has a sentiment associated with it. It might be positive or negative or it might be neutral as well.

Suppose, there is a fast-food chain company and they sell a variety of different food items like burgers, pizza, sandwiches, milkshakes, etc. They have created a website to sell their food and now the customers can order any food item from their website and they can provide reviews as well, like whether they liked the food or hated it.

  • User Review 1: I love this cheese sandwich, it’s so delicious.
  • User Review 2: This chicken burger has a very bad taste.
  • User Review 3: I ordered this pizza today.

Looking at these three reviews:

The first review is definitely a positive one and it signifies that the customer was really happy with the sandwich.

The second review is negative, and hence the company needs to look into their burger department.

And, the third one doesn’t signify whether that customer is happy or not, and hence we can consider this as a neutral statement.

By looking at the above reviews, the company can now conclude, that it needs to focus more on the production and promotion of their sandwiches as well as improve the quality of their burgers if they want to increase their overall sales.


But now a problem arises: there will be hundreds or thousands of user reviews for their products, and after a point it becomes nearly impossible to scan through each review and come to a conclusion.


Nor can they draw a conclusion from just 100 reviews or so, because the first 100-200 customers may have had similar tastes and liked the sandwiches. As the number of reviews grows, the positive reviews might be overtaken by a larger number of negative ones.

Therefore, this is where the Sentiment Analysis Model comes into play, which takes in a huge corpus of data having user reviews and finds a pattern and comes up with a conclusion based on real evidence rather than assumptions made on a small sample of data.

(We will explore the working of a basic sentiment analysis model later in this article.)

We can even break these principal sentiments (positive and negative) into smaller sub-sentiments such as “Happy”, “Love”, “Surprise”, “Sad”, “Fear”, “Angry”, etc., as per the business requirement.

Real-World Example

  1. There was a time when social media services like Facebook had just two emotions associated with each post, i.e. you could like a post or you could leave the post without any reaction, which basically signified that you didn’t like it.
  2. But over time these reactions to posts have changed and grown into the more granular sentiments we see now, such as “like”, “love”, “sad”, “angry”, etc.

[Figure: Facebook post reactions]

And, because of this upgrade, when any company promotes their products on Facebook, they receive more specific reviews which will help them to enhance the customer experience.

And because of that, they now have more granular control on how to handle their consumers, i.e. they can target the customers who are just “sad” in a different way as compared to customers who are “angry”, and come up with a business plan accordingly because nowadays, just doing the bare minimum is not enough.


Now, as we said, we will be creating a sentiment analysis model using NLP, but it’s easier said than done.

We humans communicate with each other in what we call natural language, which is easy for us to interpret but is much more complicated and messy if we really look into it.

There are billions of people, each with their own style of communicating, i.e. a lot of tiny variations are added to the language and a lot of sentiments are attached to it. This is easy for us to interpret, but it becomes a challenge for machines.

This is why we need a process that makes computers understand natural language as we humans do, and this is what we call Natural Language Processing (NLP). Sentiment analysis is a sub-field of NLP that, with the help of machine learning techniques, tries to identify and extract these insights.

Now, let’s get our hands dirty by implementing Sentiment Analysis using NLP, which will predict the sentiment of a given statement.


Types of Sentiment Analysis

Sentiment analysis, also known as opinion mining, is a subfield of Natural Language Processing (NLP) that involves identifying and extracting the subjective information in an input text. This can be an opinion, a judgment, or a feeling about a particular topic or product. Here are the main types of sentiment analysis:

  • Fine-grained Sentiment Analysis: This goes beyond just positive, negative, or neutral. It involves very specific ratings, like a 5-star rating, for example.
  • Emotion detection: This aims to detect emotions like happiness, frustration, anger, sadness, etc. The biggest challenge here is being able to accurately identify these emotions in text.
  • Aspect-based Sentiment Analysis: This is generally used to understand specific aspects of a certain product or service. For example, in a review like “The battery life of this phone is great, but the screen is not very clear”, the sentiment towards the battery life is positive, but it’s negative towards the screen.
  • Multilingual sentiment analysis: This can be particularly challenging because the same word can convey different sentiments in different languages.
  • Intent Analysis: This goes a step further to understand the user’s intention behind a certain statement. For example, a statement like “I would need a car” might indicate a purchasing intent.

Sentiment analysis is a complex task because of the inherent ambiguity of human language. Sarcasm, for example, is especially difficult to detect. Consequently, the accuracy of sentiment analysis largely depends on the complexity of the task and the system’s ability to learn from large amounts of data.


Why Is Sentiment Analysis Important?

Sentiment analysis is important for several reasons:

  1. Business Intelligence: It helps businesses understand how their customers feel about their products or services. This can guide improvements, address customer concerns, and enhance overall customer satisfaction.
  2. Market Research: By analyzing public sentiment towards products, services, or brand mentions on social media, companies can gain insights into market trends and competitors.
  3. Customer Service: Sentiment analysis can help identify negative reviews or feedback in real-time, allowing for quicker responses and problem resolution.
  4. Product Analytics: It can be used to understand user feedback on various aspects of a product, helping drive product strategy and development.
  5. Public Relations: Sentiment analysis can help monitor public sentiment towards a company or individual, enabling proactive management of public relations.
  6. Politics and Public Policy: In politics, sentiment analysis is used to gauge public opinion towards policies or political entities, which can inform strategy and messaging.

Keep in mind that the objective of sentiment analysis isn’t simply to understand sentiment but to use that understanding to achieve specific goals. It’s a useful tool, but like any tool, its value comes from how it’s used.

Sentiment Analysis Challenges

Sentiment analysis, while powerful, comes with its own set of challenges:

  1. Sarcasm and Irony: These linguistic features can completely reverse the sentiment of a statement. Detecting sarcasm and irony is a complex task even for humans, and it’s even more challenging for AI systems.
  2. Contextual Understanding: The sentiment of certain words can change based on the context in which they’re used. For example, the word “sick” can have a negative connotation in a health-related context (“I’m feeling sick”) but can be positive in a different context (“That’s a sick beat!”).
  3. Negations and Double Negatives: Phrases like “not bad” or “not unimpressive” can be difficult to interpret correctly because they require understanding of double negatives and other linguistic nuances.
  4. Emojis and Slang: Text data, especially from social media, often contains emojis and slang. The sentiment of these can be hard to determine as their meanings can be subjective and vary across different cultures and communities.
  5. Multilingual Sentiment Analysis: Sentiment analysis becomes significantly more difficult when applied to multiple languages. Direct translation might not carry the same sentiment, and cultural differences can further complicate the analysis.
  6. Aspect-Based Sentiment Analysis: Determining sentiment towards specific aspects within a text can be challenging. For instance, a restaurant review might have a positive sentiment towards the food, but a negative sentiment towards the service.

These challenges highlight the complexity of human language and communication. Overcoming them requires advanced NLP techniques, deep learning models, and a large amount of diverse and well-labelled training data. Despite these challenges, sentiment analysis continues to be a rapidly evolving field with vast potential.
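To make the negation challenge concrete, here is a toy lexicon-based scorer. The word lists and the one-word negation rule are assumptions made up for this illustration; they are not part of the model built later in this article.

```python
# Toy lexicon scorer; the word lists below are illustrative assumptions.
POSITIVE = {"good", "great", "delicious", "love"}
NEGATIVE = {"bad", "terrible", "hate"}
NEGATORS = {"not", "never", "no"}

def naive_score(text):
    """Count positive minus negative words, ignoring negation."""
    words = text.lower().split()
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

def negation_aware_score(text):
    """Same count, but flip the polarity of a word that directly follows a negator."""
    score = 0
    negate = False
    for w in text.lower().split():
        if w in NEGATORS:
            negate = True
            continue
        polarity = (w in POSITIVE) - (w in NEGATIVE)
        score += -polarity if negate else polarity
        negate = False
    return score

print(naive_score("the burger was not bad"))           # -1
print(negation_aware_score("the burger was not bad"))  # 1
```

Even this small rule changes the verdict on “not bad”, which hints at why plain word counting struggles with negation. Sarcasm and context need far more than such a rule.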

Applications of Sentiment Analysis

Sentiment Analysis has a wide range of applications across various domains. Here are some key applications:

  1. Customer Feedback: Businesses use sentiment analysis to process customer feedback and reviews. This helps them understand customer satisfaction and preferences, and make data-driven decisions.
  2. Social Media Monitoring: Brands monitor social media platforms to understand public sentiment about their products or services. This can help in reputation management and in identifying potential crises before they escalate.
  3. Market Research: Sentiment analysis can be used to understand public opinion about a product or a political event. This can provide valuable insights for market research.
  4. Product Analytics: Companies use sentiment analysis to gather insights from product reviews. This can guide product enhancements and innovations.
  5. Healthcare: In healthcare, sentiment analysis can be used to understand patient experiences and feedback about treatments, doctors, or hospitals.
  6. Finance: In the financial sector, sentiment analysis is used to gauge market sentiment. Traders and investors use this information to make informed decisions.
  7. Politics: In politics, sentiment analysis is used to understand public opinion about certain policies or politicians. This can guide political campaigns and strategies.
  8. Human Resources: HR departments use sentiment analysis to understand employee feedback and improve workplace culture.

Remember, these are just a few examples. The potential applications of sentiment analysis are vast and continue to grow with advancements in AI and machine learning technologies.

Step by Step procedure to Implement Sentiment Analysis

First, let’s import all the Python libraries that we will use throughout the program.

Basic Python Libraries

1. Pandas – library for data analysis and data manipulation
2. Matplotlib – library used for data visualization
3. Seaborn – a library based on matplotlib and it provides a high-level interface for data visualization
4. WordCloud – library to visualize text data
5. re – provides functions to pre-process the strings as per the given regular expression

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud
import re

Natural Language Processing

1. nltk – Natural Language Toolkit is a collection of libraries for natural language processing

2. stopwords – a collection of words that don’t provide any meaning to a sentence

3. WordNetLemmatizer – used to convert different forms of words into a single item but still keeping the context intact.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

Scikit-Learn (Machine Learning Library for Python)

1. CountVectorizer – transform text to vectors

2. GridSearchCV – for hyperparameter tuning

3. RandomForestClassifier – machine learning algorithm for classification

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

Evaluation Metrics

1. Accuracy Score – no. of correctly classified instances/total no. of instances

2. Precision Score – the ratio of correctly predicted positive instances over all instances predicted as positive

3. Recall Score – the ratio of correctly predicted positive instances over all instances that actually belong to the positive class

4. ROC Curve – a plot of true positive rate against false positive rate

5. Classification Report – report of precision, recall and f1 score

6. Confusion Matrix – a table used to describe the classification models
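As a quick check on these definitions, the sketch below computes accuracy, precision and recall by hand for a small set of made-up binary labels. The helper name binary_metrics is hypothetical; in the rest of the article the scikit-learn functions do this work.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall for labels 0 (negative) and 1 (positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Made-up labels, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1]
accuracy, precision, recall = binary_metrics(y_true, y_pred)
print(accuracy, precision, recall)  # 0.625 0.6 0.75
```

Here the model finds 3 of the 4 positives (recall 0.75), but 2 of its 5 positive predictions are wrong (precision 0.6), which is why the two scores are worth reporting separately.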

from sklearn.metrics import accuracy_score,precision_score,recall_score,confusion_matrix,roc_curve,classification_report
from scikitplot.metrics import plot_confusion_matrix

Evaluate Dataset

We will use a dataset available on Kaggle for sentiment analysis, which consists of sentences and their respective sentiment as the target variable. This dataset contains 3 separate files named train.txt, test.txt and val.txt.

You can find the dataset here.

Now, we will read the training data and validation data. As the data is in text format, separated by semicolons and without column names, we will create the data frame with read_csv() and parameters as “delimiter” and “names”.

df_train = pd.read_csv("train.txt",delimiter=';',names=['text','label'])
df_val = pd.read_csv("val.txt",delimiter=';',names=['text','label'])

Now, we will concatenate these two data frames, as we will be using cross-validation and we have a separate test dataset, so we don’t need a separate validation set of data. And, then we will reset the index to avoid duplicate indexes.

df = pd.concat([df_train,df_val])
df.reset_index(inplace=True,drop=True)
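To see why the index needs resetting, here is a small sketch with two made-up frames standing in for the training and validation files. The rows and the names df_a, df_b and combined are assumptions for illustration only.

```python
import pandas as pd

# Two tiny frames standing in for train.txt and val.txt (made-up rows).
df_a = pd.DataFrame({"text": ["i feel great", "i feel awful"], "label": ["joy", "sadness"]})
df_b = pd.DataFrame({"text": ["i am scared"], "label": ["fear"]})

combined = pd.concat([df_a, df_b])
print(combined.index.tolist())   # [0, 1, 0] -> index 0 appears twice

combined = combined.reset_index(drop=True)
print(combined.index.tolist())   # [0, 1, 2]
```

Passing ignore_index=True to pd.concat gives the same clean index in one step.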

We can view a sample of the contents of the dataset using the “sample” method of pandas, and check the number of records and features using the “shape” attribute.

print("Shape of the DataFrame:",df.shape)
print(df.sample(5))

Now, we will check for the various target labels in our dataset using seaborn.
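The plotting call itself is not shown in the article, so the following is only a sketch of one way to do it, assuming seaborn’s countplot and a small made-up frame named demo_df in place of the real data.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in for the real data frame built above.
demo_df = pd.DataFrame(
    {"label": ["joy", "sadness", "joy", "anger", "love", "fear", "surprise", "joy"]}
)

# Numeric counts per label, useful next to the plot.
print(demo_df["label"].value_counts())

# One bar per distinct label; the bar height is the number of rows with that label.
ax = sns.countplot(x="label", data=demo_df)
ax.set_title("Label distribution")
```

With the real data, the same call would take df instead of demo_df. Outside a notebook, the figure can be saved with ax.figure.savefig or displayed with matplotlib’s plt.show().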

[Figure: count plot of the target labels]

As we can see, we have 6 labels or targets in the dataset. We could build a multi-class classifier for sentiment analysis, but for the sake of simplicity we will merge these labels into two classes, i.e. positive and negative sentiment.

1. Positive Sentiment – “joy”, “love”, “surprise”

2. Negative Sentiment – “anger”, “sadness”, “fear”

Now, we will create a custom encoder to convert categorical target labels to numerical form, i.e. (0 and 1)

def custom_encoder(labels):
    # map the six emotion labels onto two classes, in place:
    # 1 = positive sentiment, 0 = negative sentiment
    labels.replace({"surprise": 1, "love": 1, "joy": 1,
                    "fear": 0, "anger": 0, "sadness": 0}, inplace=True)
custom_encoder(df['label'])

Now, we can see that our target has changed to 0 and 1, i.e. 0 for negative and 1 for positive, and the data is more or less balanced.

Data Pre-processing

Now, we will perform some pre-processing on the data before converting it into vectors and passing it to the machine learning model.

We will create a function for pre-processing of data.

1. First, we will iterate through each record and, using a regular expression, get rid of any characters other than letters.

2. Then, we will convert the string to lowercase, because otherwise the word “Good” is treated as different from the word “good”.

Without converting to lowercase, two different vectors would be created for the same word when we vectorize the text, which we don’t want.

3. Then we will check for stopwords in the data and get rid of them. Stopwords are commonly used words in a sentence such as “the”, “an”, “to” etc. which do not add much value.

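4. Then, we will perform lemmatization on each word, i.e. change the different forms of a word into a single item called a lemma.

A lemma is the base form of a word. For example, “run”, “running” and “runs” are all forms of the same lexeme, where “run” is the lemma. Hence, we are converting all occurrences of the same lexeme to their respective lemma. Note that WordNetLemmatizer treats each word as a noun unless a part-of-speech tag is passed, so with the default call used here verb forms such as “running” are left unchanged.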

5. And, then return a corpus of processed data.

But first, we will create an object of WordNetLemmatizer and then we will perform the transformation.

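# download the NLTK resources used below (only needed once)
nltk.download('stopwords')
nltk.download('wordnet')
# object of WordNetLemmatizer
lm = WordNetLemmatizer()
# build the stopword set once instead of once per word
stop_words = set(stopwords.words('english'))
def text_transformation(df_col):
    corpus = []
    for item in df_col:
        # keep letters only, then lowercase and split into words
        new_item = re.sub('[^a-zA-Z]',' ',str(item))
        new_item = new_item.lower()
        new_item = new_item.split()
        # drop stopwords and lemmatize the remaining words
        new_item = [lm.lemmatize(word) for word in new_item if word not in stop_words]
        corpus.append(' '.join(new_item))
    return corpus
corpus = text_transformation(df['text'])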
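To see what the first three steps do in isolation, here is a minimal sketch that needs no NLTK downloads. It assumes a tiny hand-written stopword set (DEMO_STOPWORDS) and skips lemmatization, so it only approximates the function above.

```python
import re

# Tiny illustrative stopword list; the real code uses NLTK's English list.
DEMO_STOPWORDS = {"i", "the", "a", "an", "to", "is", "it", "this", "so"}

def clean_text(text):
    """Keep letters only, lowercase, split into words and drop stopwords."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", text)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in DEMO_STOPWORDS)

print(clean_text("I LOVE this cheese sandwich, it's so delicious!!"))
# love cheese sandwich s delicious
```

The stray “s” comes from “it's”: replacing the apostrophe with a space splits the contraction, which is a side effect of the letters-only regular expression worth keeping in mind.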

Now, we will create a word cloud. It is a data visualization technique that depicts text in such a way that more frequent words appear larger than less frequent words. This gives us a little insight into how the data looks after being processed through all the steps so far.

plt.rcParams['figure.figsize'] = 20,8
# join all processed reviews into one space-separated string
word_cloud = " ".join(corpus)
wordcloud = WordCloud(width = 1000, height = 500,background_color ='white',min_font_size = 10).generate(word_cloud)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
Output:

[Figure: word cloud of the processed corpus]

Bag of Words

Now, we will use the Bag of Words (BOW) model, which represents text as a bag of words, i.e. the grammar and the order of words in a sentence are ignored; instead, multiplicity (the number of times a word occurs in a document) is the main point of concern.

Basically, it describes the total occurrence of words within a document.
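The idea can be sketched in a few lines of plain Python. The two example documents are made up, and this is a hand-rolled illustration of the concept rather than what CountVectorizer does internally.

```python
from collections import Counter

docs = ["good food good service", "bad food"]

# Vocabulary: every distinct word across all documents, in sorted order.
vocabulary = sorted({word for doc in docs for word in doc.split()})

# One count vector per document, aligned with the vocabulary.
vectors = []
for doc in docs:
    counts = Counter(doc.split())
    vectors.append([counts[word] for word in vocabulary])

print(vocabulary)  # ['bad', 'food', 'good', 'service']
print(vectors)     # [[0, 1, 2, 1], [1, 1, 0, 0]]
```

Word order is lost: “good food good service” and “service good food good” would produce the same vector.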

Scikit-Learn provides a neat way of performing the bag of words technique using CountVectorizer.

Now, we will convert the text data into vectors, by fitting and transforming the corpus that we have created.

cv = CountVectorizer(ngram_range=(1,2))
traindata = cv.fit_transform(corpus)
X = traindata
y = df.label

We will take ngram_range as (1,2), which means that both unigrams (single words) and bigrams (pairs of consecutive words) are used as features.

An n-gram is a sequence of ‘n’ consecutive words in a sentence. ‘ngram_range’ is the parameter that lets us give importance to combinations of words, since, for example, “social media” has a different meaning than “social” and “media” separately.

We can experiment with the value of the ngram_range parameter and select the option which gives better results.

Now comes the machine learning model creation part and in this project, I’m going to use Random Forest Classifier, and we will tune the hyperparameters using GridSearchCV.

GridSearchCV() will take the following parameters,

1. Estimator or model – RandomForestClassifier in our case

2. parameters – dictionary of hyperparameter names and their values

3. cv – signifies cross-validation folds

4. return_train_score – returns the training scores of the various models

5. n_jobs – number of jobs to run in parallel (“-1” signifies that all CPU cores will be used, which reduces the training time drastically)

First, we will create a dictionary, “parameters”, which will contain the values of different hyperparameters.

We will pass this as a parameter to GridSearchCV to train our random forest classifier model using all possible combinations of these parameters to find the best model.

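# 'auto' was removed for RandomForestClassifier in scikit-learn 1.3 (it meant 'sqrt'), so 'log2' is tried instead
parameters = {'max_features': ('sqrt','log2'),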
             'n_estimators': [500, 1000, 1500],
             'max_depth': [5, 10, None],
             'min_samples_split': [5, 10, 15],
             'min_samples_leaf': [1, 2, 5, 10],
             'bootstrap': [True, False]}
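Before launching the search it helps to know how much work it implies. The sketch below counts the combinations in this grid; the dictionary grid_sizes simply restates the number of candidate values per hyperparameter, and 5 folds is the cv value used in this article.

```python
from math import prod

# Number of candidate values per hyperparameter, copied from the grid above.
grid_sizes = {
    "max_features": 2,
    "n_estimators": 3,
    "max_depth": 3,
    "min_samples_split": 3,
    "min_samples_leaf": 4,
    "bootstrap": 2,
}

combinations = prod(grid_sizes.values())
cv_folds = 5
total_fits = combinations * cv_folds

print(combinations)  # 432
print(total_fits)    # 2160
```

Each of those fits trains a forest of 500 to 1500 trees, so the full search can take a long time. Trimming the grid or switching to a randomized search is a common way to cut the cost.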

Now, we will fit the data into the grid search and view the best parameter using the “best_params_” attribute of GridSearchCV.

grid_search = GridSearchCV(RandomForestClassifier(),parameters,cv=5,return_train_score=True,n_jobs=-1)
grid_search.fit(X,y)
grid_search.best_params_

Output:

[Figure: best parameters found by the grid search]

And then, we can view all the models and their respective parameters, mean test score and rank, as GridSearchCV stores all the results in the cv_results_ attribute.

for i in range(len(grid_search.cv_results_['params'])):
    print('Parameters: ',grid_search.cv_results_['params'][i])
    print('Mean Test Score: ',grid_search.cv_results_['mean_test_score'][i])
    print('Rank: ',grid_search.cv_results_['rank_test_score'][i])

Output: (a sample of the output)

[Figure: sample of the grid search results]

Now, we will choose the best parameters obtained from GridSearchCV and create a final random forest classifier model and then train our new model.

# unpack the best hyperparameters found by the grid search into a new model
rfc = RandomForestClassifier(**grid_search.best_params_)
rfc.fit(X,y)

Test Data Transformation

Now, we will read the test data and perform the same transformations we did on training data and finally evaluate the model on its predictions.

test_df = pd.read_csv('test.txt',delimiter=';',names=['text','label'])
X_test,y_test = test_df.text,test_df.label
#encode the labels into two classes, 0 and 1 (custom_encoder modifies y_test in place and returns nothing)
custom_encoder(y_test)
#pre-processing of text
test_corpus = text_transformation(X_test)
#convert text data into vectors
testdata = cv.transform(test_corpus)
#predict the target
predictions = rfc.predict(testdata)

Model Evaluation

We will evaluate our model using various metrics such as accuracy, precision, recall and the confusion matrix, and create a ROC curve to visualize how our model performed.

plt.rcParams['figure.figsize'] = 10,5
plot_confusion_matrix(y_test,predictions)
acc_score = accuracy_score(y_test,predictions)
pre_score = precision_score(y_test,predictions)
rec_score = recall_score(y_test,predictions)
print('Accuracy_score: ',acc_score)
print('Precision_score: ',pre_score)
print('Recall_score: ',rec_score)
print("-"*50)
cr = classification_report(y_test,predictions)
print(cr)

Output:

[Figure: accuracy, precision, recall and classification report]

Confusion Matrix:

[Figure: confusion matrix plot]

ROC Curve

We will find the probability of each class using the predict_proba() method of the Random Forest Classifier and then plot the ROC curve.

predictions_probability = rfc.predict_proba(testdata)
fpr,tpr,thresholds = roc_curve(y_test,predictions_probability[:,1])
plt.plot(fpr,tpr)
plt.plot([0,1],[0,1],linestyle='--')  # diagonal = performance of a random classifier
plt.title('ROC Curve')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.show()
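To see where the points on such a curve come from, the sketch below computes them by hand for four made-up samples. The helper roc_points is hypothetical and only mirrors the idea behind roc_curve; it is not scikit-learn’s implementation.

```python
def roc_points(y_true, scores):
    """Return (false positive rate, true positive rate) pairs, one per distinct threshold."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    points = []
    # Lower the threshold step by step; everything scoring at or above it is predicted positive.
    for threshold in sorted(set(scores), reverse=True):
        predicted = [int(s >= threshold) for s in scores]
        tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
        fp = sum(1 for t, p in zip(y_true, predicted) if t == 0 and p == 1)
        points.append((fp / negatives, tp / positives))
    return points

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]   # made-up probabilities of the positive class
print(roc_points(y_true, scores))
# [(0.0, 0.5), (0.5, 0.5), (0.5, 1.0), (1.0, 1.0)]
```

Each threshold yields one point, and the closer the resulting curve hugs the top-left corner, the better the model separates the two classes.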

As we can see, our model performed very well in classifying the sentiments, with accuracy, precision and recall of approximately 96%. The ROC curve and confusion matrix look good as well, which means that our model classifies the labels accurately, with few errors.

Now, we will check for custom input as well and let our model identify the sentiment of the input statement.

Predict for Custom Input:

def expression_check(prediction_input):
    if prediction_input == 0:
        print("Input statement has Negative Sentiment.")
    elif prediction_input == 1:
        print("Input statement has Positive Sentiment.")
    else:
        print("Invalid Statement.")
# function to take the input statement and perform the same transformations we did earlier
def sentiment_predictor(text_list):
    processed = text_transformation(text_list)
    transformed_input = cv.transform(processed)
    prediction = rfc.predict(transformed_input)
    # predict returns an array; pass its single element to the check
    expression_check(prediction[0])
input1 = ["Sometimes I just want to punch someone in the face."]
input2 = ["I bought a new phone and it's so good."]
sentiment_predictor(input1)
sentiment_predictor(input2)

Output:

[Figure: predicted sentiment for the two inputs]

Hurray! As we can see, our model accurately classified the sentiments behind the two sentences.

Conclusion

Sentiment analysis using NLP stands as a powerful tool in deciphering the complex landscape of human emotions embedded within textual data. By leveraging various techniques and methodologies, analysts can extract valuable insights, ranging from consumer preferences to political sentiment, thereby informing decision-making processes across diverse domains. As we conclude this journey through sentiment analysis, it becomes evident that its significance transcends industries, offering a lens through which we can better comprehend and navigate the digital realm.

Frequently Asked Questions

Q1. What is sentiment analysis in NLP?

A. Sentiment analysis in NLP (Natural Language Processing) is the process of determining the sentiment or emotion expressed in a piece of text, such as positive, negative, or neutral. It involves using machine learning algorithms and linguistic techniques to analyze and classify subjective information. Sentiment analysis finds applications in social media monitoring, customer feedback analysis, market research, and other areas where understanding sentiment is crucial.

Q2. Which algorithm is used for sentiment analysis?

A. Several algorithms are commonly used for sentiment analysis, including:
1. Naive Bayes Classifier: Based on Bayes’ theorem, it calculates the probability of a text belonging to a specific sentiment class.
2. Support Vector Machines (SVM): A machine learning algorithm that separates data into different classes using hyperplanes.
3. Recurrent Neural Networks (RNN): Particularly LSTM (Long Short-Term Memory) models, which capture sequential information in text data.
4. Convolutional Neural Networks (CNN): Effective for capturing local patterns in text through convolutional filters.
5. Decision Trees: Constructed based on features of the text to classify sentiments.
The choice of algorithm depends on the specific requirements and characteristics of the sentiment analysis task.

Q3. How is AI used in sentiment analysis?

A. AI is used in sentiment analysis in several ways:
1. AI helps understand human emotions in text.
2. NLP breaks down text into parts and analyzes grammar.
3. ML learns from labeled data to identify patterns.
4. AI can capture not just polarity (positive or negative) but also the intensity of sentiment.
5. Applications include customer feedback analysis, brand reputation monitoring, product development, political analysis, and risk assessment.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
