Analysis of Brazilian E-commerce Text Review Dataset Using NLP and Google Translate

John Last Updated : 02 Dec, 2020


This article was published as a part of the Data Science Blogathon.

Introduction

Understanding customer reviews is crucial for a business to be successful. Analyzing reviews helps a business discern its customers’ preferences, likes, and dislikes. These insights can then be used to improve customer service and experience.

In this article, we will work with a Brazilian e-commerce reviews dataset. We will perform some exploratory data analysis (EDA) on the review text, derive insights from that analysis, and then translate our outputs from Portuguese to English so they are easier to understand.

Dataset Description

This dataset was taken from Kaggle and was generously provided by Olist online stores in Brazil. Here’s the link to the dataset.

This Kaggle project has multiple datasets containing different fields such as orders, payments, geolocation, products, products_category, etc., but we will focus solely on the text reviews dataset. The reviews dataset has 100,000 data points; after removing NaN values, 40,000 reviews were left, which is sufficient for analysis. Some of the reviews had titles and all of them had review scores, so we take these into consideration as well.

Methodology

Installing and Importing libraries

First, we import the core libraries (pandas, numpy, and matplotlib) and the NLTK tools needed for natural language processing.

For language translation, we need to install a Google Translate API package. I initially ran into errors with the original API, but a few days ago I found a working version named ‘Google_trans_new’ that solved the problem. You can visit here to learn more.

If you have not yet installed it, go ahead and do so, because we will use it for translation later. From the image above, we can see the list of languages supported by this translate API. For now, we are only interested in Portuguese and English. Let’s run a quick example:
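A minimal sketch of such an example, assuming the google_trans_new package exposes google_translator and its translate method with lang_src and lang_tgt arguments, as its README describes. The helper names translate_terms and make_pt_to_en_translator are hypothetical and only used for illustration. The translator is passed in as a plain callable so the pairing logic can be tried without network access.

```python
def translate_terms(terms, translate):
    """Pair each Portuguese term with whatever `translate` returns for it."""
    return [(term, translate(term)) for term in terms]


def make_pt_to_en_translator():
    """Build a Portuguese-to-English callable backed by google_trans_new."""
    # Imported here so translate_terms can be used without the package installed.
    # Assumption: installed beforehand with `pip install google_trans_new`.
    from google_trans_new import google_translator

    translator = google_translator()
    return lambda text: translator.translate(text, lang_src="pt", lang_tgt="en")
```

With the package installed and a network connection, calling translate_terms(["ótimo produto"], make_pt_to_en_translator()) should return the phrase alongside its English translation. The exact wording depends on the translation service.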


We can see how the Portuguese text was translated, and we could translate it into any other language supported by this API.

Data Preprocessing

Next, we read in our data and carry out some initial EDA and dataset modification.

The review dataset has a significant number of NaN values in the review text and review title columns, so we drop those missing values and reset the index.

As in most day-to-day NLP tasks, we need a few preprocessing steps: removing stop words, using the regular expression module to keep only letters, tokenizing the text, and lowercasing all words for consistency. Since the reviews are in Portuguese, we remove Portuguese stop words.
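The dropping step could look roughly like the sketch below. It uses a toy DataFrame in place of the real file, and the column names review_comment_message and review_comment_title are assumed from the Kaggle reviews CSV. Filtering the text and title columns separately is also an assumption, made because titles are much sparser than messages.

```python
import pandas as pd

# Toy stand-in for the reviews CSV; column names are assumed from the Kaggle file.
reviews = pd.DataFrame({
    "review_score": [5, 1, 4, 2],
    "review_comment_title": [None, "Atraso", None, None],
    "review_comment_message": [
        "Produto ótimo",
        "Não recebi o produto",
        None,
        "Veio com defeito",
    ],
})

# Keep only rows that contain review text, then renumber the rows from 0.
text_reviews = (
    reviews.dropna(subset=["review_comment_message"])
    .reset_index(drop=True)
)

# Titles are filtered on their own so that untitled reviews are not lost above.
titled_reviews = (
    reviews.dropna(subset=["review_comment_title"])
    .reset_index(drop=True)
)
```

Passing drop=True to reset_index discards the old row labels instead of keeping them as an extra column.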
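These preprocessing steps can be sketched with the standard library alone. The stop word set below is a short hypothetical sample, not the list used in the original analysis; NLTK's stopwords.words("portuguese") provides a fuller list once the stopwords corpus has been downloaded. The function name preprocess is also just illustrative.

```python
import re

# Hypothetical short sample; swap in NLTK's Portuguese stop word list for real use.
PORTUGUESE_STOP_WORDS = {
    "a", "o", "e", "de", "do", "da", "que", "em", "um", "uma", "para", "com",
}

# Runs of letters only: word characters minus digits and underscores.
# In Python 3 this is Unicode-aware, so accented letters such as "ó" are kept.
LETTERS_ONLY = re.compile(r"[^\W\d_]+")


def preprocess(text, stop_words=PORTUGUESE_STOP_WORDS):
    """Lowercase, tokenize into letter-only words, and drop stop words."""
    tokens = LETTERS_ONLY.findall(text.lower())
    return [token for token in tokens if token not in stop_words]


print(preprocess("O produto chegou antes do prazo!!! 10/10"))
```

Lowercasing before matching means the stop word lookup only has to handle lowercase forms.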

Data Visualization

After preprocessing, it is time to visualize the review text using word clouds. A word cloud is a visual representation of text data in which the font size of each word reflects how often it appears.

The word cloud above shows the most frequent words in Portuguese, which will not mean much if you don’t understand Portuguese, so we need to translate these words to English. At this point, CountVectorizer was used to get the most frequent unigrams (one word), bigrams (two words), and trigrams (three words). Below are the code and images showing the trigrams and their English translations. To get the unigrams and bigrams, set ngram_range to (1, 1) and (2, 2) respectively.
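As a rough sketch of how such a plot can be produced, the snippet below counts word frequencies with the standard library and hands them to the wordcloud package. It assumes the wordcloud and matplotlib packages are installed; the function names and the size and colour settings are illustrative choices, not the ones used for the original figure.

```python
from collections import Counter


def word_frequencies(token_lists):
    """Count how often each word appears across all tokenized reviews."""
    counts = Counter()
    for tokens in token_lists:
        counts.update(tokens)
    return counts


def draw_wordcloud(frequencies):
    """Render a word cloud in which font size follows word frequency."""
    # Imported here so word_frequencies works without the plotting packages.
    import matplotlib.pyplot as plt
    from wordcloud import WordCloud

    cloud = WordCloud(width=800, height=400, background_color="white")
    cloud.generate_from_frequencies(frequencies)
    plt.imshow(cloud, interpolation="bilinear")
    plt.axis("off")
    plt.show()
```

Calling draw_wordcloud(word_frequencies(tokenized_reviews)) on the preprocessed reviews, where tokenized_reviews is a list of token lists, would display the cloud.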
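The counting behind this step can be illustrated in plain Python. This is a sketch of the idea rather than the CountVectorizer call itself: it counts n-word windows within each review, which is roughly what CountVectorizer does for ngram_range=(n, n). The function name top_ngrams and the sample reviews are made up for illustration.

```python
from collections import Counter


def top_ngrams(token_lists, n, k):
    """Return the k most frequent n-grams across tokenized reviews."""
    counts = Counter()
    for tokens in token_lists:
        # Zipping n shifted copies of the list yields consecutive n-word windows.
        windows = zip(*(tokens[i:] for i in range(n)))
        counts.update(" ".join(window) for window in windows)
    return counts.most_common(k)


# Made-up tokenized reviews, already lowercased and stripped of stop words.
sample_reviews = [
    ["chegou", "antes", "prazo"],
    ["produto", "chegou", "antes", "prazo"],
    ["entrega", "rápida"],
]

print(top_ngrams(sample_reviews, n=3, k=1))
```

Changing n to 1 or 2 gives unigrams or bigrams, mirroring the ngram_range change described above. The resulting n-grams can then be passed to the translator.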

Figure: most frequently used words, translated to English.

From the unigrams, bigrams, and trigrams obtained, we can reasonably deduce that most customers were satisfied with the delivery service, and some others were satisfied with the product quality.

Similar preprocessing steps were also applied to the review titles column, and a word cloud was used to visualize these titles.

Carrying out similar operations on the review titles column gave us a better understanding of the complaints of displeased customers. These include incomplete deliveries, ordered goods that never arrived, delayed deliveries, low-quality goods, and wrong or defective products. These have been the major complaints of unhappy customers, but we have also seen a high degree of satisfaction among other customers.

To further understand the relationship between the review score a customer gave and whether they wrote a review, we make a count plot showing these figures:

We can see that about 36% of 5-star reviewers wrote a review while 79% of 1-star reviewers did, so a customer was more likely to write a review when displeased.
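Percentages like these can be computed along the following lines. The sketch uses made-up rows, so its numbers do not match the figures above, and it assumes the column names review_score and review_comment_message from the Kaggle file. It has to run on the data before rows with missing text are dropped.

```python
import pandas as pd

# Made-up rows; the real data would be the full reviews table, NaN rows included.
reviews = pd.DataFrame({
    "review_score": [5, 5, 5, 1, 1],
    "review_comment_message": ["Ótimo", None, None, "Não recebi", "Atrasou"],
})

# True where the customer wrote review text, False where the field is missing.
has_text = reviews["review_comment_message"].notna()

# The mean of a boolean column is the share of True values within each score.
share_with_text = has_text.groupby(reviews["review_score"]).mean()
print(share_with_text)
```

Multiplying the result by 100 gives the percentage of customers at each score who wrote a review.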

Let’s also have a look at the average number of words per review:
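One simple way to compute this is sketched below, splitting each review on whitespace. The function name average_word_count is illustrative, and the original notebook may have counted words differently, for example after removing stop words.

```python
from statistics import mean


def average_word_count(reviews):
    """Mean number of whitespace-separated words per review."""
    return mean(len(review.split()) for review in reviews)


# Made-up reviews containing 2 and 4 words.
sample = ["Produto ótimo", "Não recebi o produto"]
print(average_word_count(sample))
```

Note that mean raises StatisticsError on an empty list, so this should be applied after the rows without review text have been dropped.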

Conclusion

So far, we have analyzed the text reviews of customers and seen the complaints made by some of them. It is now up to the data scientist or analyst to figure out the best ways to make sure these problems are solved.

By the end of this article, I hope you know how to carry out basic text visualization and how to translate text into your preferred language using the Google Translate API. There are numerous other ideas that can be implemented on this dataset, such as sentiment analysis and topic modeling; these are also covered in my notebook. The complete Github Notebook can be found here.

Happy Learning!

John
