Sentiment Analysis with LSTM


Introduction

Sentiment Analysis is a powerful application of Natural Language Processing (NLP) that identifies the emotional tone of text. Classifying text into positive, negative, or neutral sentiments serves various industries, from social media monitoring to market research. This article demonstrates how to perform sentiment analysis on IMDB movie reviews using Long Short-Term Memory (LSTM) networks.

Learning Outcomes: 

  • Grasp the fundamentals of sentiment analysis, its applications, and how it classifies text into positive, negative, or neutral categories.
  • Learn about Long Short-Term Memory (LSTM) networks, their role in handling sequential data, and their advantages over standard RNNs.
  • Build and train a sentiment analysis model with LSTM using Keras, including tokenization, padding sequences, and setting model hyperparameters.
  • Learn to assess model performance using accuracy metrics and improve it through hyperparameter tuning and extended training.
  • Apply the trained model to predict sentiments for new, unseen movie reviews, handling tokenization and input padding.

This article was published as a part of the Data Science Blogathon.

What is Sentiment Analysis?

Sentiment Analysis is an NLP application that identifies the emotional tone or opinion expressed in a piece of text. Usually, emotions or attitudes toward a topic can be positive, negative, or neutral. This makes sentiment analysis a text classification task. Examples of positive, negative, and neutral expressions are:

“I enjoyed the movie!” – Positive

“I am not sure if I liked the movie.” – Neutral

“It was the most terrible movie I have ever seen.” – Negative


Sentiment analysis is a potent tool with varied applications across industries. It is helpful for social media and brand monitoring, customer support and feedback analysis, market research, etc. By performing sentiment analysis on initial customer feedback, you can identify a new product’s target audience or demographics and evaluate the success of a marketing campaign. As sentiment analysis grows increasingly useful in the industry, we must learn how to perform it.

Sentiment Analysis using LSTM

What is LSTM? 

Recurrent neural networks (RNNs) are a class of artificial neural networks that process sequences of arbitrary length by carrying a hidden state from one step to the next, which lets them capture relationships between elements of sequential data. However, because of the vanishing gradient problem, standard RNNs struggle to learn long-term dependencies in lengthy sequences. Researchers proposed several RNN variants, notably LSTM, to address this issue. LSTM networks are extensions of RNNs designed to learn sequential (temporal) data and their long-term dependencies more reliably than standard RNNs. They are commonly used in deep learning applications such as stock forecasting, speech recognition, and natural language processing.
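To make the gating idea concrete, here is a minimal sketch of one LSTM time step in plain Python, assuming the standard formulation with input, forget, and output gates plus a candidate cell value. The function name lstm_step, the weight dictionaries, and the random weights are all hypothetical and only for illustration; Keras implements this far more efficiently.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step for a single example.

    x is the input vector; h_prev and c_prev are the previous hidden and
    cell states. W, U, b are dicts keyed by gate name: 'i', 'f', 'o', 'g'.
    """
    units = len(h_prev)

    def affine(gate):
        # W[gate] @ x + U[gate] @ h_prev + b[gate], written out with loops
        return [
            sum(W[gate][j][k] * x[k] for k in range(len(x)))
            + sum(U[gate][j][k] * h_prev[k] for k in range(units))
            + b[gate][j]
            for j in range(units)
        ]

    i = [sigmoid(v) for v in affine('i')]      # input gate: how much new information to write
    f = [sigmoid(v) for v in affine('f')]      # forget gate: how much old cell state to keep
    o = [sigmoid(v) for v in affine('o')]      # output gate: how much of the cell to expose
    g = [math.tanh(v) for v in affine('g')]    # candidate cell values
    c = [f[j] * c_prev[j] + i[j] * g[j] for j in range(units)]
    h = [o[j] * math.tanh(c[j]) for j in range(units)]
    return h, c

# Hypothetical toy sizes and random weights, for illustration only
rng = random.Random(0)
input_dim, units = 3, 2

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

W = {gate: rand_matrix(units, input_dim) for gate in 'ifog'}
U = {gate: rand_matrix(units, units) for gate in 'ifog'}
b = {gate: [0.0] * units for gate in 'ifog'}

h, c = [0.0] * units, [0.0] * units
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
for x in sequence:
    h, c = lstm_step(x, h, c, W, U, b)   # state is carried across time steps
print("final hidden state:", h)
```

The cell state c is updated additively (old state scaled by the forget gate plus gated new content), which is the property usually credited with helping gradients survive across many time steps.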

Loading the Dataset 

We will analyze sentiment in 50k IMDB movie reviews, comprising 25k positive and 25k negative reviews, ensuring a balanced dataset. You can download the dataset from here. We start by importing the necessary packages for text manipulation and model building.

import re
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
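# Import Keras through TensorFlow; importing pad_sequences from
# keras.preprocessing.sequence fails on some newer Keras versions
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow import keras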
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score
import math
import nltk

We load the dataset into a pandas dataframe with the help of the following code :

data = pd.read_csv('IMDB Dataset.csv')
data

The dataset has two columns: review, which holds the raw review text, and sentiment, which holds the label (positive or negative).

Data Preprocessing

The first step in sentiment analysis with LSTM is to remove HTML tags, URLs, and non-alphanumeric characters from the reviews and to convert the text to lowercase. The remove_tags function does this using regular expressions from Python's re module.

def remove_tags(string):
    removelist = ""   # extra characters to keep, if any
    result = re.sub(r'<.*?>', '', string)          # remove HTML tags such as <br />
    result = re.sub(r'https?://\S+', '', result)   # remove URLs (only up to the next whitespace)
    result = re.sub(r'[^\w' + removelist + ']', ' ', result)    # replace non-alphanumeric characters with spaces
    result = result.lower()
    return result
data['review'] = data['review'].apply(remove_tags)
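As a quick check of what this kind of cleaning does to a single review, here is a small self-contained sketch. The function name clean_review and the sample text are hypothetical; unlike remove_tags, this variant replaces tags with a space and collapses repeated whitespace, which is an assumption made here to keep the output easy to read.

```python
import re

def clean_review(text):
    text = re.sub(r'<.*?>', ' ', text)          # drop HTML tags such as <br />
    text = re.sub(r'https?://\S+', ' ', text)   # drop URLs up to the next whitespace
    text = re.sub(r'[^\w]', ' ', text)          # replace punctuation with spaces
    return ' '.join(text.lower().split())       # lowercase and collapse whitespace

sample = "Great film!<br /><br />Full review at https://example.com/r/1 - a MUST see."
cleaned = clean_review(sample)
print(cleaned)   # great film full review at a must see
```

Note that the URL pattern stops at the first whitespace character, so the words after a link are kept rather than discarded with it.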

We also need to remove stopwords from the corpus. Commonly used words like ‘and’, ‘the’, and ‘at’ are stopwords that do not add any special meaning or significance to a sentence. NLTK provides a list of stopwords, and you can remove them from the corpus using the following code:

nltk.download('stopwords')
from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))
data['review'] = data['review'].apply(lambda x: ' '.join([word for word in x.split() if word not in (stop_words)]))
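The effect of stopword filtering is easy to see on one sentence. The sketch below uses a tiny hand-picked stopword set (an assumption for illustration; NLTK's English list is much longer) and a hypothetical helper named drop_stopwords.

```python
# A tiny hand-picked stopword set, for illustration only
demo_stop_words = {"the", "and", "at", "was", "a", "not"}

def drop_stopwords(text, stop_words):
    # keep only the words that are not in the stopword set
    return " ".join(word for word in text.split() if word not in stop_words)

sentence = "the acting was good and the plot was not boring"
filtered = drop_stopwords(sentence, demo_stop_words)
print(filtered)        # acting good plot boring

# Same filter, but keeping the negation word
keep_negation = demo_stop_words - {"not"}
filtered_with_not = drop_stopwords(sentence, keep_negation)
print(filtered_with_not)   # acting good plot not boring
```

NLTK's English stopword list includes negations such as "not", so filtering with it can change the apparent polarity of a phrase. Whether to exclude those words from the stopword set is a design choice worth testing on your own data.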

We now perform lemmatization on the text. Lemmatization is a useful technique in NLP to obtain the root form of words, known as lemmas. For example, the words “reading,” “reads,” and “read” all share the lemma “read.” Working with lemmas reduces the vocabulary the model has to learn, since the meanings of the inflected forms are well expressed by their lemmas. We perform lemmatization using the WordNetLemmatizer() from nltk. The text is first broken into words using the WhitespaceTokenizer() from nltk. We write a function lemmatize_text to perform lemmatization on the individual tokens. Note that WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is passed, so verb forms such as “reading” may be left unchanged by this code.

nltk.download('wordnet')   # WordNet data required by WordNetLemmatizer
w_tokenizer = nltk.tokenize.WhitespaceTokenizer()
lemmatizer = nltk.stem.WordNetLemmatizer()
def lemmatize_text(text):
    # lemmatize each whitespace-separated token and rejoin with single spaces
    return " ".join(lemmatizer.lemmatize(w) for w in w_tokenizer.tokenize(text))
data['review'] = data.review.apply(lemmatize_text)
data

After preprocessing, each review is a lowercase string of lemmatized words, free of HTML tags, URLs, punctuation, and stopwords.

The next step in sentiment analysis with LSTM is to print some basic statistics about the dataset and check if it has an equal number of all labels to ensure balance. Ideally, a balanced dataset is preferable, as a severely imbalanced dataset can be challenging to model and require specialized techniques.


s = 0.0
for i in data['review']:
    word_list = i.split()
    s = s + len(word_list)
print("Average length of each review : ",s/data.shape[0])
pos = 0
for i in range(data.shape[0]):
    if data.iloc[i]['sentiment'] == 'positive':
        pos = pos + 1
neg = data.shape[0]-pos
print("Percentage of reviews with positive sentiment is "+str(pos/data.shape[0]*100)+"%")
print("Percentage of reviews with negative sentiment is "+str(neg/data.shape[0]*100)+"%")

>>Average length of each review :  119.57112
>>Percentage of reviews with positive sentiment is 50.0%
>>Percentage of reviews with negative sentiment is 50.0%
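The same statistics can be computed more compactly, and they also help with choosing a sequence length later on. The sketch below uses only the standard library on a hypothetical four-review toy dataset; the variable candidate_max_length is an assumption introduced here to show how many reviews would fit without truncation.

```python
from collections import Counter
from statistics import mean, median

# Hypothetical toy data standing in for the cleaned reviews and their labels
reviews = [
    "great movie loved every minute",
    "terrible plot",
    "good acting weak story overall fine",
    "boring",
]
labels = ["positive", "negative", "positive", "negative"]

# Review lengths in words
lengths = [len(review.split()) for review in reviews]
print("Average length:", mean(lengths))
print("Median length :", median(lengths))

# Share of each label, in percent
counts = Counter(labels)
shares = {label: 100 * n / len(labels) for label, n in counts.items()}
print("Label shares  :", shares)

# Fraction of reviews that fit within a candidate maximum length
candidate_max_length = 5
coverage = sum(1 for n in lengths if n <= candidate_max_length) / len(lengths)
print("Fit within", candidate_max_length, "words:", coverage)
```

On the real dataset, a coverage figure like this is one reasonable way to pick a padding length that truncates only a small share of reviews.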

Encoding Labels and Making Train-Test Splits

In this step of sentiment analysis using LSTM, we use the LabelEncoder() from sklearn.preprocessing to convert the labels (‘positive’, ‘negative’) into 1’s and 0’s respectively.

reviews = data['review'].values
labels = data['sentiment'].values
encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)
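To see why ‘negative’ becomes 0 and ‘positive’ becomes 1, here is a simplified sketch of the mapping idea: the distinct labels are sorted and numbered in that order. The helper fit_label_mapping is hypothetical and is not scikit-learn's implementation.

```python
def fit_label_mapping(labels):
    # sort the distinct labels and number them in that order
    classes = sorted(set(labels))
    return {label: index for index, label in enumerate(classes)}

sample_labels = ["positive", "negative", "negative", "positive"]
mapping = fit_label_mapping(sample_labels)
encoded = [mapping[label] for label in sample_labels]

print(mapping)   # {'negative': 0, 'positive': 1}
print(encoded)   # [1, 0, 0, 1]
```

With the fitted LabelEncoder itself, encoder.classes_ shows the same ordering, and encoder.inverse_transform converts integers back to the original strings.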

Finally, we split the dataset into train and test parts using train_test_split from sklearn.model_selection. We use 80% of the dataset for training and 20% for testing.

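# test_size=0.2 gives the 80/20 split; without it, train_test_split holds out 25% by default
train_sentences, test_sentences, train_labels, test_labels = train_test_split(reviews, encoded_labels, test_size = 0.2, stratify = encoded_labels)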
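The stratify argument keeps the share of each label the same in both parts. A rough sketch of that idea in plain Python follows; the function stratified_split and its parameters are hypothetical, and scikit-learn's own implementation differs in its details.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_indices, test_indices) with per-label proportions preserved."""
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for indices in by_label.values():
        rng.shuffle(indices)                           # shuffle within each label
        n_test = round(len(indices) * test_fraction)   # same fraction from every label
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

toy_labels = [1] * 10 + [0] * 10          # balanced toy labels
train_idx, test_idx = stratified_split(toy_labels)
print(len(train_idx), len(test_idx))      # 16 4
print(sum(toy_labels[i] for i in test_idx), "positives in the test part")
```

Because the split is done label by label, a balanced dataset stays balanced in both the training and the test part.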

Before feeding the data into the LSTM model for sentiment analysis, we must tokenize and pad it.

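  • Tokenizing: Keras's built-in Tokenizer is fitted on the training sentences. It splits the sentences into words and builds a dictionary that maps each unique word to an integer. Each sentence is then converted into an array of integers representing its words; words that fall outside the vocab_size limit or were never seen during fitting are mapped to the out-of-vocabulary token.
  • Sequence Padding: We fill the array representing each sentence with zeroes on the right (padding='post') so that all sequences have the same length, max_length (200 here). Sequences longer than max_length are truncated to that length.

The code below sets the hyperparameters, then tokenizes and pads the training and test sets: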
# Hyperparameters of the model
vocab_size = 3000 # choose based on statistics
oov_tok = '<OOV>' # placeholder token for out-of-vocabulary words
embedding_dim = 100
max_length = 200 # choose based on statistics, for example 150 to 200
padding_type='post'
trunc_type='post'
# tokenize sentences (fit on the training set only)
tokenizer = Tokenizer(num_words = vocab_size, oov_token=oov_tok)
tokenizer.fit_on_texts(train_sentences)
word_index = tokenizer.word_index
# convert train dataset to sequence and pad sequences
train_sequences = tokenizer.texts_to_sequences(train_sentences)
train_padded = pad_sequences(train_sequences, padding=padding_type, truncating=trunc_type, maxlen=max_length)
# convert Test dataset to sequence and pad sequences
test_sequences = tokenizer.texts_to_sequences(test_sentences)
test_padded = pad_sequences(test_sequences, padding=padding_type, truncating=trunc_type, maxlen=max_length)
Building the Model

The next step in sentiment analysis using LSTM is to build a Keras sequential model. It is a linear stack of the following layers :

  • An embedding layer of dimension 100 converts each word in the sentence into a fixed-length dense vector of size 100. The input dimension is the vocabulary size, and the output dimension is 100. Hence, each word in the input will be represented by a vector of size 100.
  • A bidirectional LSTM layer of 64 units.
  • A dense (fully connected) layer of 24 units with relu activation.
  • A dense layer of 1 unit with sigmoid activation outputs the probability that the review is positive, i.e., that the label is 1.

The code for building the model :

# model initialization
model = keras.Sequential([
    keras.layers.Embedding(vocab_size, embedding_dim, input_length=max_length),
    keras.layers.Bidirectional(keras.layers.LSTM(64)),
    keras.layers.Dense(24, activation='relu'),
    keras.layers.Dense(1, activation='sigmoid')
])
# compile model
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
# model summary
model.summary()
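The parameter counts that model.summary() reports can be worked out by hand. The arithmetic below is a sketch that assumes the standard Keras parameterization (four gates per LSTM direction, each with input weights, recurrent weights, and a bias) and the hyperparameter values used in this article.

```python
vocab_size, embedding_dim = 3000, 100
lstm_units, dense_units = 64, 24

# Embedding: one vector of size embedding_dim per vocabulary entry
embedding_params = vocab_size * embedding_dim

# LSTM: 4 gates, each with input weights, recurrent weights, and a bias vector
lstm_one_direction = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)
bilstm_params = 2 * lstm_one_direction          # forward and backward directions

# Dense layers: weights plus biases; the bidirectional output has 2 * lstm_units features
dense_params = (2 * lstm_units) * dense_units + dense_units
output_params = dense_units * 1 + 1

total_params = embedding_params + bilstm_params + dense_params + output_params
print("Embedding        :", embedding_params)
print("Bidirectional LSTM:", bilstm_params)
print("Dense (24)       :", dense_params)
print("Dense (1)        :", output_params)
print("Total            :", total_params)
```

Most of the parameters sit in the embedding layer, so vocab_size and embedding_dim are the first values to revisit if the model needs to be smaller.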

We compile the LSTM model for sentiment analysis with binary cross-entropy loss and the Adam optimizer, given that we have a binary classification problem. Binary cross-entropy compares the predicted probabilities to the actual class labels (0 or 1), while Adam, an adaptive variant of stochastic gradient descent, updates the model weights to minimize that loss. We use accuracy as the primary performance metric. The call to model.summary() prints each layer with its output shape and parameter count.


Model Training and Evaluation

Now, let us train the sentiment analysis model using LSTM for five epochs.

num_epochs = 5
history = model.fit(train_padded, train_labels, 
                    epochs=num_epochs, verbose=1, 
                    validation_split=0.1)
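The loss value printed at each epoch is the binary cross-entropy the model was compiled with. As a rough sketch of what that number measures, the function below computes the mean binary cross-entropy for a handful of made-up predictions; the function name and the clipping constant eps are assumptions chosen for illustration.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)     # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Made-up predictions for one positive and one negative review
confident_right = binary_cross_entropy([1, 0], [0.9, 0.1])
confident_wrong = binary_cross_entropy([1, 0], [0.1, 0.9])
print(round(confident_right, 4))   # 0.1054
print(round(confident_wrong, 4))   # 2.3026
```

Confident wrong predictions are penalized much more heavily than confident correct ones are rewarded, which is why the loss can rise on the validation set even while accuracy stays flat.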

We evaluate the LSTM model for sentiment analysis by calculating its accuracy. We determine classification accuracy by dividing the number of correct predictions by the total number of predictions.


prediction = model.predict(test_padded)
# Get labels based on probability 1 if p>= 0.5 else 0
pred_labels = []
for i in prediction:
    if i >= 0.5:
        pred_labels.append(1)
    else:
        pred_labels.append(0)
print("Accuracy of prediction on test set : ", accuracy_score(test_labels,pred_labels))
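The thresholding and accuracy calculation can also be written out by hand, which makes the definition explicit. The sketch below uses made-up probabilities and labels; the helper names threshold and accuracy are hypothetical.

```python
def threshold(probabilities, cutoff=0.5):
    # label 1 if the predicted probability reaches the cutoff, else 0
    return [1 if p >= cutoff else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    # correct predictions divided by total predictions
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

toy_probabilities = [0.91, 0.40, 0.55, 0.08, 0.50]
toy_true = [1, 1, 0, 0, 1]
toy_pred = threshold(toy_probabilities)

print(toy_pred)                       # [1, 0, 1, 0, 1]
print(accuracy(toy_true, toy_pred))   # 0.6
```

Since classification_report is already imported at the top of the article, calling it on test_labels and pred_labels would additionally show precision and recall per class.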

The prediction accuracy on the test set is 87.27%! You can improve the accuracy further by playing around with the model hyperparameters, tuning the model architecture, or changing the train-test split ratio. You could also train the model for a larger number of epochs; we stopped at five because of the computational time. Ideally, the model should be trained until the training and validation losses converge.

Using the Model to Determine the Sentiment of Unseen Movie Reviews

We can use our trained LSTM model for sentiment analysis to determine the sentiment of new, unseen movie reviews that are not present in the dataset. Before feeding each new text as input to the model, you must tokenize and pad it. The model.predict() function returns the probability that the review is positive. If the probability is at least 0.5, we consider the review positive; otherwise, it is negative.

# reviews on which we need to predict
sentence = ["The movie was very touching and heart whelming", 
            "I have never seen a terrible movie like this", 
            "the movie plot is terrible but it had good acting"]
# convert to a sequence
sequences = tokenizer.texts_to_sequences(sentence)
# pad the sequence
padded = pad_sequences(sequences, padding='post', maxlen=max_length)
# Get labels based on probability 1 if p>= 0.5 else 0
prediction = model.predict(padded)
pred_labels = []
for i in prediction:
    if i >= 0.5:
        pred_labels.append(1)
    else:
        pred_labels.append(0)
for i in range(len(sentence)):
    print(sentence[i])
    if pred_labels[i] == 1:
        s = 'Positive'
    else:
        s = 'Negative'
    print("Predicted sentiment : ",s)

The output looks very promising!

Conclusion

We demonstrated how to perform sentiment analysis with Long Short-Term Memory (LSTM) networks on IMDB movie reviews. LSTM networks are Recurrent Neural Networks (RNNs) adept at handling sequential data and capturing long-term dependencies. Sentiment analysis, combined with LSTM networks, provides a powerful framework for understanding and leveraging the emotional tones in textual data. This capability is invaluable for making data-driven decisions in business and research contexts.

Key Takeaways

  • Sentiment analysis categorizes text emotions into positive, negative, or neutral, aiding applications like customer feedback analysis and market research.
  • LSTMs are advanced RNNs designed mainly to handle long-term dependencies in sequential data. They often outperform standard RNNs in various tasks, including sentiment analysis.
  • Effective text preprocessing involves removing unwanted elements like HTML tags and stopwords and converting text to root forms using lemmatization.
  • Keras can construct an LSTM model with embedding, bidirectional LSTM, and dense layers, followed by training on labeled data.
  • The trained model can predict the sentiment of new reviews, providing a practical tool for automated sentiment detection in various domains.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Responses From Readers

Randell Berry 13 Feb, 2022

Thanks for the paper. Good read .

Bhavin Sutaria 15 Nov, 2023

This is an amazing article, but a few updates are needed now: 1) the import for pad_sequences has changed from "from keras.preprocessing.sequence import pad_sequences" to "from tensorflow.keras.preprocessing.sequence import pad_sequences"; 2) the regular expression for removing non-alphanumeric characters needs to be updated from "result = re.sub(r'[^w'+removelist+']', ' ',result)" to "result = re.sub(r'[^ \w'+removelist+']', ' ', result)".