Sentiment Analysis with LSTM

Koushiki Last Updated : 13 Jun, 2024


Introduction

Sentiment analysis is an application of Natural Language Processing (NLP) that identifies the emotional tone of text. Classifying text as positive, negative, or neutral is useful across industries, from social media monitoring to market research. This article demonstrates how to perform sentiment analysis on IMDB movie reviews using Long Short-Term Memory (LSTM) networks.

Learning Outcomes:

Grasp the fundamentals of sentiment analysis, its applications, and how it classifies text into positive, negative, or neutral categories.
Learn about Long Short-Term Memory (LSTM) networks, their role in handling sequential data, and their advantages over standard RNNs.
Build and train a sentiment analysis model with LSTM using Keras, including tokenization, padding sequences, and setting model hyperparameters.
Learn to assess model performance using accuracy metrics and improve it through hyperparameter tuning and extended training.
Apply the trained model to predict sentiments for new, unseen movie reviews, handling tokenization and input padding.

This article was published as a part of the Data Science Blogathon.

Introduction
What is Sentiment Analysis?
What is LSTM?
Loading the Dataset
Data Preprocessing
Encoding Labels and Making Train-Test Splits
Building the Model
Model Training and Evaluation
Using the Model to Determine the Sentiment of Unseen Movie Reviews
Conclusion

What is Sentiment Analysis?

Sentiment analysis is an NLP application that identifies the emotional tone or opinion expressed in a piece of text. Usually, emotions or attitudes toward a topic can be positive, negative, or neutral. This makes sentiment analysis a text classification task. Examples of positive, negative, and neutral expressions are:

“I enjoyed the movie!” – Positive

“I am not sure if I liked the movie.” – Neutral

“It was the most terrible movie I have ever seen.” – Negative

Sentiment analysis is a potent tool with varied applications across industries. It is helpful for social media and brand monitoring, customer support and feedback analysis, market research, etc. By performing sentiment analysis on initial customer feedback, you can identify a new product’s target audience or demographics and evaluate the success of a marketing campaign. As sentiment analysis grows increasingly useful in the industry, we must learn how to perform it.

What is LSTM?

Recurrent neural networks (RNNs) are a class of artificial neural networks that process sequences of arbitrary length by carrying a hidden state from one step to the next. However, because of the vanishing gradient problem, standard RNNs struggle to learn long-term dependencies in lengthy sequences. Researchers proposed several RNN variants, notably LSTM, to address this issue. LSTM networks are extensions of RNNs designed to learn sequential (temporal) data and their long-term dependencies more reliably than standard RNNs. They are commonly used in deep learning applications such as stock forecasting, speech recognition, and natural language processing.
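To make the idea of a long-term memory more concrete, here is a minimal sketch of one step of a single-unit LSTM cell in plain Python. It follows the standard textbook formulation (forget, input, and output gates plus a candidate value); the weight names and values in `params` are hypothetical and chosen only for illustration, and Keras' own implementation is vectorized and differs in detail.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell (standard formulation)."""
    # every gate looks at the current input x and the previous hidden state
    f = sigmoid(params["wf"] * x + params["uf"] * h_prev + params["bf"])    # forget gate
    i = sigmoid(params["wi"] * x + params["ui"] * h_prev + params["bi"])    # input gate
    o = sigmoid(params["wo"] * x + params["uo"] * h_prev + params["bo"])    # output gate
    g = math.tanh(params["wg"] * x + params["ug"] * h_prev + params["bg"])  # candidate value
    c = f * c_prev + i * g   # cell state: keep part of the old memory, add new information
    h = o * math.tanh(c)     # hidden state passed on to the next step
    return h, c

# hypothetical weights, for illustration only
params = {name: 0.5 for name in ("wf", "uf", "wi", "ui", "wo", "uo", "wg", "ug")}
params.update({"bf": 1.0, "bi": 0.0, "bo": 0.0, "bg": 0.0})

h, c = 0.0, 0.0
for x in [1.0, 0.5, -0.3, 0.8]:   # a toy input sequence
    h, c = lstm_step(x, h, c, params)
print(h, c)
```

Because the cell state `c` is updated additively and scaled by the forget gate, information can survive over many steps when that gate stays close to 1, which is what helps LSTMs with long-range dependencies.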

Loading the Dataset

We will analyze sentiment in 50k IMDB movie reviews, comprising 25k positive and 25k negative reviews, ensuring a balanced dataset. You can download the dataset from here. We start by importing the necessary packages for text manipulation and model building.

import re
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
# with TensorFlow 2.x, the Keras utilities are imported through tensorflow.keras
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
import math
import nltk

We load the dataset into a pandas dataframe with the following code:

data = pd.read_csv('IMDB Dataset.csv')
data

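The dataframe has two columns used throughout this article: review, which holds the raw review text, and sentiment, which holds the label (positive or negative).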

Data Preprocessing

The first step in sentiment analysis with LSTM is to remove HTML tags, URLs, and non-alphanumeric characters from the reviews. We do that with the remove_tags function, which uses regular expressions from Python's re module for the string manipulation.

def remove_tags(string):
    removelist = ""   # extra characters to keep, if any
    result = re.sub(r'<[^>]*>', '', string)          #remove HTML tags
    result = re.sub(r'https?://\S+', '', result)   #remove URLs (up to the next whitespace)
    result = re.sub(r'[^\w'+removelist+']', ' ', result)    #remove non-alphanumeric characters
    result = result.lower()
    return result
data['review']=data['review'].apply(lambda cw : remove_tags(cw))
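To see what these cleaning steps do to a single review, here is a small self-contained sketch. The function name `clean_review` and the sample string are made up for illustration. Unlike the function above, it substitutes a space for each tag so that words on either side are not glued together, and it collapses repeated whitespace at the end.

```python
import re

def clean_review(text):
    text = re.sub(r'<[^>]*>', ' ', text)        # HTML tags such as <br />
    text = re.sub(r'https?://\S+', ' ', text)   # URLs, up to the next whitespace
    text = re.sub(r'[^\w\s]', ' ', text)        # punctuation and other symbols
    text = re.sub(r'\s+', ' ', text).strip()    # collapse runs of whitespace
    return text.lower()

sample = "Great film!<br /><br />See https://example.com for more."
cleaned = clean_review(sample)
print(cleaned)   # great film see for more
```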

We also need to remove stopwords from the corpus. Commonly used words like ‘and’, ‘the’, and ‘at’ are stopwords that do not add any special meaning or significance to a sentence. NLTK provides a list of stopwords, and you can remove them from the corpus using the following code:

nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
data['review'] = data['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
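The filtering logic can be tried without downloading anything by using a tiny hand-picked stopword set. The set below is an assumption made for illustration only; NLTK's English list is much longer.

```python
# toy stopword set, for illustration only (not NLTK's list)
toy_stop_words = {"the", "and", "at", "a", "was", "of"}

def drop_stopwords(text, stop_words):
    # keep only the tokens that are not in the stopword set
    return " ".join(word for word in text.split() if word not in stop_words)

filtered = drop_stopwords("the acting was great and the plot was thin", toy_stop_words)
print(filtered)   # acting great plot thin
```

One design point worth checking for sentiment tasks: NLTK's English list contains negations such as "not" and "no", so removing every stopword can blur the difference between "good" and "not good". A possible variation is to subtract those words from the set before filtering.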

We now perform lemmatization on the text. Lemmatization is a useful NLP technique for reducing words to their root forms, known as lemmas. For example, the words “reading,” “reads,” and “read” all share the lemma “read.” Mapping inflected forms to one lemma shrinks the vocabulary the model has to learn while preserving most of the meaning. We perform lemmatization using the WordNetLemmatizer() from nltk, after first splitting the text into words with the WhitespaceTokenizer() from nltk. Note that lemmatize() treats every token as a noun unless a part-of-speech tag is passed, so the code below leaves verb forms such as “reading” unchanged. We write a function lemmatize_text to perform lemmatization on the individual tokens.

nltk.download('wordnet')   # WordNet data required by the lemmatizer
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()
def lemmatize_text(text):
    # lemmatize each whitespace-separated token and rejoin with single spaces
    return " ".join(lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text))
data['review'] = data.review.apply(lemmatize_text)
data

After preprocessing, each review is a lowercase string of space-separated, lemmatized tokens with stopwords removed.

The next step in sentiment analysis with LSTM is to print some basic statistics about the dataset and check whether the labels are balanced. A balanced dataset is preferable, as a severely imbalanced dataset can be challenging to model and may require specialized techniques.


# average number of words per review
avg_len = data['review'].str.split().str.len().mean()
print("Average length of each review : ", avg_len)
# share of positive and negative labels
pos = int((data['sentiment'] == 'positive').sum())
neg = data.shape[0]-pos
print("Percentage of reviews with positive sentiment is "+str(pos/data.shape[0]*100)+"%")
print("Percentage of reviews with negative sentiment is "+str(neg/data.shape[0]*100)+"%")
>>Average length of each review :  119.57112
>>Percentage of reviews with positive sentiment is 50.0%
>>Percentage of reviews with negative sentiment is 50.0%

Encoding Labels and Making Train-Test Splits

In this step of sentiment analysis using LSTM, we use the LabelEncoder() from sklearn.preprocessing to convert the labels (‘positive’, ‘negative’) into 1’s and 0’s respectively.

reviews = data['review'].values
labels = data['sentiment'].values
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)
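LabelEncoder assigns integers to the class names in sorted order, which is why 'negative' becomes 0 and 'positive' becomes 1. The plain-Python sketch below reproduces that mapping on a made-up list of labels; it is an illustration of the idea, not a replacement for the scikit-learn class.

```python
labels = ["positive", "negative", "negative", "positive"]   # toy labels

classes = sorted(set(labels))                          # ['negative', 'positive']
to_int = {name: idx for idx, name in enumerate(classes)}
encoded = [to_int[name] for name in labels]

print(to_int)    # {'negative': 0, 'positive': 1}
print(encoded)   # [1, 0, 0, 1]
```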

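Finally, we split the dataset into train and test parts using train_test_split from sklearn.model_selection. Because the call below does not pass test_size, scikit-learn falls back to its default of 25% for the test set, so 75% of the dataset is used for training and 25% for testing; pass test_size=0.2 if you want an 80/20 split. The stratify argument keeps the positive/negative ratio the same in both parts.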

train_sentences, test_sentences, train_labels, test_labels = train_test_split(reviews, encoded_labels, stratify = encoded_labels)

Before feeding into the LSTM model for sentiment analysis, we must pad and tokenize the data.

Tokenizing: Keras' built-in Tokenizer is fitted on the training sentences. It splits the sentences into words and builds a dictionary that maps each unique word to an integer. Each sentence is then converted into an array of the integers for its words; words outside the vocab_size most frequent ones are mapped to the out-of-vocabulary token.
Sequence Padding: We fill the array representing each sentence with zeros at the end (padding='post') so that all sequences have the same length, max_length (200 here). Sequences longer than max_length are truncated.
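To make the two steps concrete, here is a plain-Python sketch that mimics them on three toy sentences. The helper names (`fit_word_index`, `to_sequence`, `pad_post`) are hypothetical, and the sketch only approximates the Keras behavior: index 0 is reserved for padding, index 1 for out-of-vocabulary words, and more frequent words get smaller indices. It pads at the end, matching padding='post' in the article's code.

```python
from collections import Counter

def fit_word_index(texts, oov_token="<OOV>"):
    counts = Counter(word for text in texts for word in text.lower().split())
    word_index = {oov_token: 1}                    # 0 is kept free for padding
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        word_index[word] = rank                    # frequent words get small indices
    return word_index

def to_sequence(text, word_index, num_words, oov_token="<OOV>"):
    oov = word_index[oov_token]
    seq = []
    for word in text.lower().split():
        idx = word_index.get(word, oov)
        seq.append(idx if idx < num_words else oov)   # rare words fall back to OOV
    return seq

def pad_post(seq, maxlen):
    seq = seq[:maxlen]                             # simple truncation, a choice made for this sketch
    return seq + [0] * (maxlen - len(seq))         # zeros appended at the end

train = ["good movie", "bad movie", "good acting"]
word_index = fit_word_index(train)
seq = to_sequence("good terrible movie", word_index, num_words=100)
padded = pad_post(seq, maxlen=5)
print(seq)      # [2, 1, 3]
print(padded)   # [2, 1, 3, 0, 0]
```

The unseen word "terrible" maps to index 1, the out-of-vocabulary slot, and the padded array has the fixed length the model expects.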

# Hyperparameters of the model
vocab_size = 3000 # choose based on statistics
oov_tok = '<OOV>' # placeholder token for out-of-vocabulary words
embedding_dim = 100
max_length = 200 # choose based on statistics, for example 150 to 200
padding_type='post'
trunc_type='post' # not passed below, so pad_sequences keeps its default truncating='pre'
# tokenize sentences
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index
# convert train dataset to sequence and pad sequences
train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, padding=padding_type, maxlen=max_length)
# convert Test dataset to sequence and pad sequences
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, padding=padding_type, maxlen=max_length)

Building the Model

The next step in sentiment analysis using LSTM is to build a Keras sequential model. It is a linear stack of the following layers :

An embedding layer of dimension 100 converts each word in the sentence into a fixed-length dense vector of size 100. The input dimension is the vocabulary size, and the output dimension is 100. Hence, each word in the input will be represented by a vector of size 100.
A bidirectional LSTM layer of 64 units.
A dense (fully connected) layer of 24 units with relu activation.
A dense layer of 1 unit with sigmoid activation, which outputs the probability that the review is positive, i.e., that the label is 1.
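As a sanity check on the layer sizes listed above, the number of trainable parameters can be worked out by hand. The sketch below assumes the usual Keras parameter layout (an LSTM has four gate blocks, each with input weights, recurrent weights, and one bias vector; a bidirectional wrapper doubles this), and its result can be compared against the output of model.summary().

```python
vocab_size, embedding_dim, lstm_units = 3000, 100, 64

# embedding: one vector of size embedding_dim per vocabulary entry
embedding = vocab_size * embedding_dim

# LSTM: 4 gate blocks, each with (input + recurrent) weights and a bias
lstm_one_direction = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)
bilstm = 2 * lstm_one_direction          # forward and backward copies

# the bidirectional layer concatenates both directions: 2 * 64 = 128 features
dense_24 = (2 * lstm_units) * 24 + 24
dense_1 = 24 * 1 + 1

total = embedding + bilstm + dense_24 + dense_1
print(embedding, bilstm, dense_24, dense_1, total)
# 300000 84480 3096 25 387601
```

Most of the parameters sit in the embedding layer, so vocab_size and embedding_dim are the first hyperparameters to look at when the model needs to be smaller.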

The code for building the model :

# model initialization
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(24, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
# compile model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
# model summary
model.summary()

We compile the LSTM model for sentiment analysis with binary cross-entropy loss and the Adam optimizer, given that we have a binary classification problem. The binary cross-entropy loss compares the predicted probabilities with the actual class labels (0 or 1), and Adam, an adaptive variant of stochastic gradient descent, updates the weights to reduce that loss. We use accuracy as the primary performance metric. Calling model.summary() prints each layer with its output shape and parameter count.
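To show what the binary cross-entropy loss measures, here is a small plain-Python sketch of the formula applied to made-up labels and probabilities. The function name and the clipping constant `eps` are assumptions for illustration; Keras computes the same quantity internally in a vectorized way.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)] over all examples."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)    # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# toy labels with confident-and-right vs confident-and-wrong predictions
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(round(confident_right, 4), round(confident_wrong, 4))   # 0.1054 2.3026
```

The loss is small when the predicted probability agrees with the label and grows quickly when the model is confidently wrong, which is the signal the optimizer uses to adjust the weights.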

Model Training and Evaluation

Now, let us train the sentiment analysis model using LSTM for five epochs.

num_epochs = 5
history = model.fit(train_padded, train_labels, 
                    epochs=num_epochs, verbose=1, 
                    validation_split=0.1)

We evaluate the LSTM model for sentiment analysis by calculating its accuracy. We determine classification accuracy by dividing the number of correct predictions by the total number of predictions.

prediction = model.predict(test_padded)
# Get labels based on probability 1 if p>= 0.5 else 0
pred_labels = (prediction >= 0.5).astype(int).ravel()
print("Accuracy of prediction on test set : ", accuracy_score(test_labels,pred_labels))
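The thresholding and accuracy calculation can be checked by hand on a few made-up numbers. The probabilities and labels below are invented for illustration and are not outputs of the model.

```python
probs = [0.91, 0.12, 0.50, 0.47]   # hypothetical predicted probabilities
true = [1, 0, 0, 0]                # hypothetical true labels

# label is 1 when the probability is at least 0.5, otherwise 0
pred = [1 if p >= 0.5 else 0 for p in probs]

# accuracy = correct predictions / total predictions
accuracy = sum(p == t for p, t in zip(pred, true)) / len(true)
print(pred, accuracy)   # [1, 0, 1, 0] 0.75
```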

The prediction accuracy on the test set is 87.27%! You can improve the accuracy further by adjusting the model hyperparameters, tuning the model architecture, or changing the train-test split ratio. You can also train the model for more epochs; we stopped at five because of the computational time. Ideally, training would continue until the training and validation losses converge.

Using the Model to Determine the Sentiment of Unseen Movie Reviews

We can use our trained LSTM model to determine the sentiment of new, unseen movie reviews that are not present in the dataset. Before feeding each new text to the model, you must tokenize and pad it. The model.predict() function returns the probability that the review is positive. If the probability is at least 0.5, we consider the review positive; otherwise, it is negative.

# reviews on which we need to predict
sentence = ["The movie was very touching and heart whelming", 
            "I have never seen a terrible movie like this", 
            "the movie plot is terrible but it had good acting"]
# convert to a sequence
sequences = tokenizer.texts_to_sequences(sentence)
# pad the sequence
padded = pad_sequences(sequences, padding='post', maxlen=max_length)
# Get labels based on probability 1 if p>= 0.5 else 0
prediction = model.predict(padded)
pred_labels = []
for i in prediction:
    if i >= 0.5:
        pred_labels.append(1)
    else:
        pred_labels.append(0)
for i in range(len(sentence)):
    print(sentence[i])
    if pred_labels[i] == 1:
        s = 'Positive'
    else:
        s = 'Negative'
    print("Predicted sentiment : ",s)

The output looks very promising!

Conclusion

We demonstrated how to perform sentiment analysis with Long Short-Term Memory (LSTM) networks on IMDB movie reviews. LSTM networks are Recurrent Neural Networks (RNNs) adept at handling sequential data and capturing long-term dependencies. Sentiment analysis, combined with LSTM networks, provides a powerful framework for understanding and leveraging the emotional tones in textual data. This capability is invaluable for making data-driven decisions in business and research contexts.

Key Takeaways

Sentiment analysis categorizes text emotions into positive, negative, or neutral, aiding applications like customer feedback analysis and market research.
LSTMs are advanced RNNs designed mainly to handle long-term dependencies in sequential data. They outperform standard RNNs in various tasks, including sentiment analysis.
Effective text preprocessing involves removing unwanted elements like HTML tags and stopwords and converting text to root forms using lemmatization.
Keras can construct an LSTM model with embedding, bidirectional LSTM, and dense layers, followed by training on labeled data.
The trained model can predict the sentiment of new reviews, providing a practical tool for automated sentiment detection in various domains.

The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.

Randell Berry

Thanks for the paper. Good read .

Bhavin Sutaria

This is an amazing article, but a few updates are now needed: 1) the pad_sequences import has changed from "from keras.preprocessing.sequence import pad_sequences" to "from tensorflow.keras.preprocessing.sequence import pad_sequences"; 2) the regular expression for removing non-alphanumeric characters needs to change from "result = re.sub(r'[^w'+removelist+']', ' ',result)" to "result = re.sub(r'[^ \w'+removelist+']', ' ', result)".
