An NLP Approach to Mining Online Reviews using Topic Modeling (with Python codes)

Prateek Joshi 26 Jul, 2022


Introduction

E-commerce has revolutionized the way we shop. That phone you’ve been saving up to buy for months? It’s just a search and a few clicks away. Items are delivered within a matter of days (sometimes even the next day!).

Online retailers face far fewer constraints on inventory and shelf space. They can list as many different products as they want, whereas brick-and-mortar stores can stock only a limited number of products because of the finite space they have available.

I remember when I used to place orders for books at my local bookstore, and it used to take over a week for the book to arrive. It seems like a story from ancient times now!

Source: http://www.yeebaplay.com.br

But online shopping comes with its own caveats. One of the biggest challenges is verifying the authenticity of a product. Is it as good as advertised on the e-commerce site? Will the product last more than a year? Are the reviews given by other customers really true or are they false advertising? These are important questions customers need to ask before splurging their money.

This is a great place to experiment and apply Natural Language Processing (NLP) techniques. This article will help you understand the significance of harnessing online product reviews with the help of Topic Modeling.

Please go through the below articles in case you need a quick refresher on Topic Modeling:

Importance of Online Reviews

A few days back, I took the e-commerce plunge and purchased a smartphone online. It was well within my budget, and it had a more-than-decent rating of 4.5 out of 5.

Unfortunately, it turned out to be a bad decision as the battery backup was well below par. I didn’t go through the reviews of the product and made a hasty decision to buy it based on its ratings alone. And I know I’m not the only one out there who made this mistake!

Ratings alone do not give a complete picture of the products we wish to purchase, as I found to my detriment. So, as a precautionary measure, I always advise people to read a product’s reviews before deciding whether to buy it or not.

But then an interesting problem comes up. What if the number of reviews is in the hundreds or thousands? It’s just not feasible to go through all those reviews, right? And this is where natural language processing comes up trumps.

Setting the Problem Statement

A problem statement is the seed from which your analysis blooms. Therefore, it is really important to have a solid, clear and well-defined problem statement.

How can we analyze a large number of online reviews using Natural Language Processing (NLP)? Let’s define this problem.

Online product reviews are a great source of information for consumers. From the sellers’ point of view, online reviews can be used to gauge the consumers’ feedback on the products or services they are selling. However, since these online reviews are quite often overwhelming in terms of numbers and information, an intelligent system, capable of finding key insights (topics) from these reviews, will be of great help for both the consumers and the sellers. This system will serve two purposes:

Enable consumers to quickly extract the key topics covered by the reviews without having to go through all of them
Help the sellers/retailers get consumer feedback in the form of topics (extracted from the consumer reviews)

To solve this task, we will use the concept of Topic Modeling (LDA) on Amazon Automotive Review data. You can download it from this link. Similar datasets for other categories of products can be found here.

Why Should You Use Topic Modeling for This Task?

As the name suggests, Topic Modeling is a process to automatically identify topics present in a text object and to derive hidden patterns exhibited by a text corpus. Topic Models are very useful for multiple purposes, including:

Document clustering
Organizing large blocks of textual data
Information retrieval from unstructured text
Feature selection

A good topic model, when trained on text about the stock market, should produce topics built around words like “bid”, “trading”, “dividend”, and “exchange”. The image below illustrates how a typical topic model works:

In our case, instead of text documents, we have thousands of online product reviews for the items listed under the ‘Automotive’ category. Our aim here is to extract a certain number of groups of important words from the reviews. These groups of words are basically the topics which would help in ascertaining what the consumers are actually talking about in the reviews.

Python Implementation

In this section, we’ll power up our Jupyter notebooks (or any other IDE you use for Python!). Here we’ll work on the problem statement defined above to extract useful topics from our online reviews dataset using the concept of Latent Dirichlet Allocation (LDA).

Note: As I mentioned in the introduction, I highly recommend going through this article to understand what LDA is and how it works.

Let’s first load all the necessary libraries:
Python Code:
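Here is a minimal, dependency-free sketch of the loading step. It assumes the download is a JSON-lines file (one review per line); the sample records, the helper name load_reviews, and the file name in the closing comment are placeholders and assumptions, not the actual data.

```python
import json
import os
import tempfile

def load_reviews(path):
    """Read a JSON-lines file: one JSON object (review) per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records

# Two made-up records that only mimic the column layout of the dataset.
sample = [
    {"reviewerID": "A1", "asin": "B0001", "reviewText": "Works fine on my truck.", "overall": 5.0},
    {"reviewerID": "A2", "asin": "B0002", "reviewText": "Cables feel a bit cheap.", "overall": 3.0},
]

# Write the sample to a temporary file so the sketch runs without the real download.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False, encoding="utf-8") as tmp:
    for rec in sample:
        tmp.write(json.dumps(rec) + "\n")
    sample_path = tmp.name

records = load_reviews(sample_path)
os.remove(sample_path)

review_texts = [rec["reviewText"] for rec in records]
print(len(records), "reviews loaded")
print(sorted(records[0].keys()))

# With pandas installed, the same kind of file can be read into the DataFrame `df`
# used in the rest of the article (the file name here is an assumption):
#   import pandas as pd
#   df = pd.read_json("Automotive_5.json", lines=True)
```

The later snippets refer to pd, FreqDist, plt, sns, stopwords, spacy, gensim, corpora and pyLDAvis, so pandas, NLTK, matplotlib, seaborn, spaCy, gensim and pyLDAvis need to be imported under those names before running them.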

As you can see, the data contains the following columns:

reviewerID – ID of the reviewer
asin – ID of the product
reviewerName – name of the reviewer
helpful – helpfulness rating of the review, e.g. 2/3
reviewText – text of the review
overall – rating of the product
summary – summary of the review
unixReviewTime – time of the review (unix time)
reviewTime – time of the review (raw)

For the scope of our analysis and this article, we will be using only the reviews column, i.e., reviewText.

Data Preprocessing

Data preprocessing and cleaning is an important step before any text mining task. In this step, we will remove punctuation and stopwords and normalize the reviews as much as possible. After every preprocessing step, it is good practice to check the most frequent words in the data. Therefore, let’s define a function that plots a bar graph of the n most frequent words in the data.

# function to plot most frequent terms
def freq_words(x, terms = 30):
  all_words = ' '.join([text for text in x])
  all_words = all_words.split()

  fdist = FreqDist(all_words)
  words_df = pd.DataFrame({'word':list(fdist.keys()), 'count':list(fdist.values())})

  # select the `terms` most frequent words (30 by default)
  d = words_df.nlargest(columns="count", n = terms) 
  plt.figure(figsize=(20,5))
  ax = sns.barplot(data=d, x= "word", y = "count")
  ax.set(ylabel = 'Count')
  plt.show()

Let’s try this function and find out which are the most common words in our reviews dataset.

freq_words(df['reviewText'])
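If you only want the numbers behind the bar chart, the same count can be sketched without any plotting. The helper top_terms and the toy reviews below are illustrative assumptions, not part of the original notebook.

```python
from collections import Counter

def top_terms(texts, n=10):
    """Return the n most frequent whitespace-separated tokens as (word, count) pairs."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts.most_common(n)

# Made-up reviews, only to show the shape of the result.
toy_reviews = [
    "the battery died and the charger failed",
    "the battery works",
]

for word, count in top_terms(toy_reviews, n=2):
    print(word, count)
```

Even in this tiny example, a function word like ‘the’ dominates the counts, which is the same pattern the plot shows on the real reviews.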

The most common words are ‘the’, ‘and’, ‘to’, and so on. These words are not important for our task because they tell us nothing about the products. We have to get rid of these kinds of words. Before that, let’s remove the punctuation and numbers from our text data.

# remove unwanted characters, numbers and symbols
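# regex=True is required in pandas 2.0+, where str.replace treats the pattern as a literal string by default
df['reviewText'] = df['reviewText'].str.replace("[^a-zA-Z#]", " ", regex=True)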
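To see what this pattern does to a single review, here is a small sketch that applies the same character class with Python's re module. The function name and the sample sentence are made up for illustration.

```python
import re

def strip_non_letters(text):
    """Replace every character that is not an ASCII letter or '#' with a space."""
    return re.sub(r"[^a-zA-Z#]", " ", text)

raw = "Fits my 2009 Jeep; 5 stars!"
cleaned = strip_non_letters(raw)

# split() also collapses the runs of spaces left behind by the substitution
tokens = cleaned.split()
print(tokens)
```

Because each removed character becomes a space, a contraction such as "don't" turns into "don t". The short-word filter in the next step then drops leftover fragments like the stray "t".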

Let’s try to remove the stopwords and short words (fewer than 3 letters) from the reviews.

from nltk.corpus import stopwords
stop_words = stopwords.words('english')

# function to remove stopwords
def remove_stopwords(rev):
    rev_new = " ".join([i for i in rev if i not in stop_words])
    return rev_new

# remove short words (length < 3)
df['reviewText'] = df['reviewText'].apply(lambda x: ' '.join([w for w in x.split() if len(w)>2]))

# remove stopwords from the text
reviews = [remove_stopwords(r.split()) for r in df['reviewText']]

# make entire text lowercase
reviews = [r.lower() for r in reviews]
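One detail worth checking in the steps above is their order: the stopword filter runs before the text is lowercased, and NLTK's English stopword list contains only lowercase words. The sketch below uses a tiny hand-written stopword set as a stand-in for the NLTK list (an assumption made to keep the example self-contained) to compare the two orders.

```python
# A tiny stand-in for NLTK's English stopword list, which is all lowercase.
STOP_WORDS = {"the", "this", "they", "and", "to", "for"}

def drop_stopwords(tokens):
    """Keep only the tokens that are not in the stopword set (case-sensitive)."""
    return [t for t in tokens if t not in STOP_WORDS]

review = "The cables work and This truck starts"

# Order used above: filter first, lowercase afterwards.
filter_then_lower = [t.lower() for t in drop_stopwords(review.split())]

# Alternative order: lowercase first, then filter.
lower_then_filter = drop_stopwords(review.lower().split())

print(filter_then_lower)
print(lower_then_filter)
```

With the first order, capitalized stopwords such as "The" slip past the filter and reappear as "the" after lowercasing. Lowercasing first avoids that, at the cost of producing slightly different word counts than the ones shown in this article.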

Let’s again plot the most frequent words and see if the more significant words have come out.

freq_words(reviews, 35)

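We can see some improvement here. Terms like ‘battery’, ‘price’, ‘product’, ‘oil’ have come up which are quite relevant for the Automotive category. However, we still have neutral terms like ‘the’, ‘this’, ‘much’, ‘they’ which are not that relevant. Words such as ‘the’ and ‘this’ survive because the stopword filter ran before lowercasing, so capitalized forms like ‘The’ did not match the lowercase entries in the stopword list.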

To further remove noise from the text we can use lemmatization from the spaCy library. It reduces any given word to its base form thereby reducing multiple forms of a word to a single word.

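# spaCy 3.x no longer supports the 'en' shortcut, so use the full model name
!python -m spacy download en_core_web_sm # one time run

nlp = spacy.load('en_core_web_sm', disable=['parser', 'ner'])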

def lemmatization(texts, tags=['NOUN', 'ADJ']): # filter noun and adjective
    output = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        output.append([token.lemma_ for token in doc if token.pos_ in tags])
    return output

Let’s tokenize the reviews and then lemmatize the tokenized reviews.

tokenized_reviews = pd.Series(reviews).apply(lambda x: x.split())
print(tokenized_reviews[1])

['these', 'long', 'cables', 'work', 'fine', 'truck', 'quality', 'seems', 'little', 'shabby', 
'side', 'for', 'money', 'expecting', 'dollar', 'snap', 'jumper', 'cables', 'seem', 'like', 
'would', 'see', 'chinese', 'knock', 'shop', 'like', 'harbor', 'freight', 'bucks']

reviews_2 = lemmatization(tokenized_reviews)
print(reviews_2[1]) # print lemmatized review

['long', 'cable', 'fine', 'truck', 'quality', 'little', 'shabby', 'side', 'money', 'dollar', 
'jumper', 'cable', 'chinese', 'shop', 'harbor', 'freight', 'buck']

As you can see, we have not just lemmatized the words but also filtered only nouns and adjectives. Let’s de-tokenize the lemmatized reviews and plot the most common words.

reviews_3 = []
for i in range(len(reviews_2)):
    reviews_3.append(' '.join(reviews_2[i]))

df['reviews'] = reviews_3

freq_words(df['reviews'], 35)

It seems that the most frequent terms in our data are now relevant. We can go ahead and start building our topic model.

Building an LDA model

We will start by creating the term dictionary of our corpus, where every unique term is assigned an index.

dictionary = corpora.Dictionary(reviews_2)

Then we will convert the list of reviews (reviews_2) into a Document Term Matrix using the dictionary prepared above.

doc_term_matrix = [dictionary.doc2bow(rev) for rev in reviews_2]
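To make the shape of this matrix concrete, here is a pure-Python stand-in for the two steps. It is only a sketch of the idea: the helper names are hypothetical, and gensim's Dictionary may assign different ids to the same tokens.

```python
from collections import Counter

def build_vocabulary(docs):
    """Assign an integer id to every unique token, in order of first appearance."""
    token2id = {}
    for doc in docs:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id

def to_bag_of_words(doc, token2id):
    """Return sorted (token_id, count) pairs for the tokens known to the vocabulary."""
    counts = Counter(token2id[t] for t in doc if t in token2id)
    return sorted(counts.items())

# Two made-up tokenized reviews.
toy_docs = [["long", "cable", "fine", "cable"], ["cable", "quality"]]

vocab = build_vocabulary(toy_docs)
bows = [to_bag_of_words(doc, vocab) for doc in toy_docs]

print(vocab)
print(bows)
```

Each review becomes a short list of (term id, count) pairs, so terms that do not occur in a review take up no space. This sparse form is what the LDA model consumes in the next step.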

# Creating the object for LDA model using gensim library
LDA = gensim.models.ldamodel.LdaModel

# Build LDA model
lda_model = LDA(corpus=doc_term_matrix, id2word=dictionary, num_topics=7, random_state=100,
                chunksize=1000, passes=50)

The code above will take a while. Please note that I have specified the number of topics as 7 for this model using the num_topics parameter. You can specify any number of topics using the same parameter.

Let’s print out the topics that our LDA model has learned.

lda_model.print_topics()

[(0, '0.030*"car" + 0.026*"oil" + 0.020*"filter" + 0.018*"engine" + 0.016*"device" + 0.013*"code" 
+ 0.012*"vehicle" + 0.011*"app" + 0.011*"change" + 0.008*"bosch"'), 
(1, '0.017*"easy" + 0.014*"install" + 0.014*"door" + 0.013*"tape" + 0.013*"jeep" + 0.011*"front" + 
0.011*"mat" + 0.010*"side" + 0.010*"headlight" + 0.008*"fit"'), 
(2, '0.054*"blade" + 0.045*"wiper" + 0.019*"windshield" + 0.014*"rain" + 0.012*"snow" + 
0.012*"good" + 0.011*"year" + 0.011*"old" + 0.011*"car" + 0.009*"time"'), 
(3, '0.044*"car" + 0.024*"towel" + 0.020*"product" + 0.018*"clean" + 0.017*"good" + 0.016*"wax" + 
0.014*"water" + 0.013*"use" + 0.011*"time" + 0.011*"wash"'), 
(4, '0.051*"light" + 0.039*"battery" + 0.021*"bulb" + 0.019*"power" + 0.018*"car" + 0.014*"bright" 
+ 0.013*"unit" + 0.011*"charger" + 0.010*"phone" + 0.010*"charge"'), 
(5, '0.022*"tire" + 0.015*"hose" + 0.013*"use" + 0.012*"good" + 0.010*"easy" + 0.010*"pressure" + 
0.009*"small" + 0.009*"trailer" + 0.008*"nice" + 0.008*"water"'), 
(6, '0.048*"product" + 0.038*"good" + 0.027*"price" + 0.020*"great" + 0.020*"leather" + 
0.019*"quality" + 0.010*"work" + 0.010*"review" + 0.009*"amazon" + 0.009*"worth"')]

The fourth topic, Topic 3, has terms like ‘towel’, ‘clean’, ‘wax’, ‘water’, indicating that the topic is very much related to car-wash. Similarly, Topic 6 seems to be about the overall value of the product as it has terms like ‘price’, ‘quality’, and ‘worth’.
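The topic strings printed above are handy to read but awkward to process further. As a rough sketch, a small regular expression can turn one of them into (word, weight) pairs; the helper parse_topic is hypothetical, and the string below is a shortened copy of Topic 3 from the output above.

```python
import re

def parse_topic(topic_string):
    """Split a printed topic string into a list of (word, weight) pairs."""
    pairs = re.findall(r'([0-9.]+)\*"([^"]+)"', topic_string)
    return [(word, float(weight)) for weight, word in pairs]

topic_3 = '0.044*"car" + 0.024*"towel" + 0.020*"product" + 0.018*"clean"'
terms = parse_topic(topic_3)

for word, weight in terms:
    print(f"{word:10s}{weight:.3f}")
```

If you have the trained model at hand, gensim's show_topic method returns the same kind of (word, weight) pairs directly, so parsing is only needed when all you have is the printed output.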

Topics Visualization

To visualize our topics in a 2-dimensional space we will use the pyLDAvis library. This visualization is interactive in nature and displays topics along with the most relevant words.

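# Visualize the topics
# (pyLDAvis 3.3+ renamed the gensim helper module from pyLDAvis.gensim to pyLDAvis.gensim_models)
import pyLDAvis.gensim_models
pyLDAvis.enable_notebook()
vis = pyLDAvis.gensim_models.prepare(lda_model, doc_term_matrix, dictionary)
vis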

Full code is available here.

Other Methods to Leverage Online Reviews

Apart from topic modeling, there are many other NLP methods as well which are used for analyzing and understanding online reviews. Some of them are listed below:

Text Summarization: Summarize the reviews into a paragraph or a few bullet points.
Entity Recognition: Extract entities from the reviews and identify which products are most popular (or unpopular) among the consumers.
Identify Emerging Trends: Based on the timestamp of the reviews, new and emerging topics or entities can be identified. It would enable us to figure out which products are becoming popular and which are losing their grip on the market.
Sentiment Analysis: For retailers, understanding the sentiment of the reviews can be helpful in improving their products and services.
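As a rough illustration of the emerging-trends idea, the sketch below groups reviews by the year of their unix timestamp (like the unixReviewTime column) and counts terms per year. The function name and the three reviews are made up, and a real analysis would count the cleaned, lemmatized tokens rather than raw words.

```python
from collections import Counter, defaultdict
from datetime import datetime, timezone

def term_counts_by_year(reviews):
    """Count tokens per calendar year; `reviews` is a list of (unix_time, text) pairs."""
    per_year = defaultdict(Counter)
    for unix_time, text in reviews:
        year = datetime.fromtimestamp(unix_time, tz=timezone.utc).year
        per_year[year].update(text.lower().split())
    return per_year

# Made-up reviews with unix timestamps.
toy = [
    (1365638400, "wiper blade streaks"),    # April 2013
    (1397174400, "led bulb very bright"),   # April 2014
    (1397260800, "led bulb easy install"),  # April 2014
]

counts = term_counts_by_year(toy)
for year in sorted(counts):
    print(year, counts[year].most_common(2))
```

Comparing a term's count from one year to the next gives a first, crude signal of which products or features are gaining or losing attention.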

What’s Next?

Information retrieval saves us from the labor of going through product reviews one by one. It gives us a fair idea of what other consumers are saying about the product.

However, it does not tell us whether the reviews are positive, neutral, or negative. This becomes an extension of the problem of information retrieval where we don’t just have to extract the topics, but also determine the sentiment. This is an interesting task which we will cover in the next article.

End Notes

Topic modeling is one of the most popular NLP techniques, with several real-world applications such as dimensionality reduction, text summarization, and recommendation engines. The purpose of this article was to demonstrate the application of LDA on raw, crowd-generated text data. I encourage you to implement the code on other datasets and share your findings.

If you have any suggestions, questions, or anything else that you wish to share regarding topic modeling, please feel free to use the comments section below.

If you are looking to get into the field of Natural Language Processing, then we have a video course designed for you covering Text Preprocessing, Topic Modeling, Named Entity Recognition, Deep Learning for NLP and many more topics.


Data Scientist at Analytics Vidhya with multidisciplinary academic background. Experienced in machine learning, NLP, graphs & networks. Passionate about learning and applying data science to solve real world problems.



Responses From Readers

Unathi 16 Oct, 2018

Thanks man


Gokul Raj 16 Oct, 2018

Thanks pratik. That was a nice article to read. Is there any techniques to find out the synonyms and antonyms in NLP?

Vidyush Bakshi 16 Oct, 2018

A simple and effective approach .. keep up the good work!!

Sarra 18 Oct, 2018

very clear and important article ! Thank you

Deniz 01 Nov, 2018

Very clear steps thank you. Is there a link to download the dataset?


Prateek Joshi 01 Nov, 2018

Hi Deniz, the link to download the dataset is given in the "Setting the Problem Statement" section. Regards, Prateek

Jingmiao Shen 07 Nov, 2018

Such a great blog! With both concept and code, easy to follow! Nice work, man

DC 06 Dec, 2018

Getting the following warning repeatedly score += np.sum(cnt * logsumexp(Elogthetad + Elogbeta[:, int(id)]) for id, cnt in doc) /usr/local/lib/python2.7/dist-packages/gensim/models/ldamodel.py:1077: DeprecationWarning: Calling np.sum(generator) is deprecated, and in the future will give a different result. Use np.sum(np.from_iter(generator)) or the python sum builtin instead.

Arihant 18 Apr, 2019

What is the importance of x and y value of the cluster and what to imply from it?

Mj 27 Apr, 2019

Thanks for the thorough and clearly explained article. Helps anyone quickly get started on such an important technique.

Don 07 May, 2019

Great article! Thanks Prateek! I was wondering if you've ever done the follow-up article you mention here or another article for extracting features with a score (or how to make a list of pros and cons from review tags).