Saikat Das — Published On October 14, 2022 and Last Modified On November 4th, 2022

This article was published as a part of the Data Science Blogathon.


A vast amount of textual data is generated daily through posts, likes, and tweets on social networking sites such as Facebook, Instagram, and Twitter. This data contains a great deal of information we can harness to generate insights, yet most of it is unstructured and not ready for statistical analysis. Managing unstructured data for business benefit matters because around 80% of business data is unstructured, and with the exponential growth of social media, that share will keep increasing over time.

This information can yield valuable insights, but it is highly unstructured and must be processed before analysis. This article looks at the results of a use case built on data extracted from Twitter to generate insights about a brand’s image after a scandal involving the brand became public.

Analyzing this unstructured text can help marketers in customer experience management, brand monitoring, and similar tasks by transforming a massive amount of unstructured customer feedback into actionable insights. One common issue with analyzing such an enormous amount of free-form text is that no human can read it in a reasonable amount of time. Text mining is the answer to dealing with this unstructured data and unlocking the value of customer feedback. As a use case, this article investigates how Tweets made on a single topic unlocked valuable insights.

Text Mining Application


One of the most significant controversies to rock the car industry in recent years occurred when Volkswagen (VW) cheated on pollution emissions tests in the US. The VW scandal raised eyebrows among customers worldwide. Dubbed the “diesel dupe”, the German car giant admitted to cheating emissions tests in the US. According to the Environmental Protection Agency (EPA), some cars sold in America had devices in their diesel engines that could detect when they were being tested and change the performance accordingly to improve results.

The EPA’s findings cover 482,000 cars in the US alone, but VW has admitted that about 11 million vehicles worldwide are fitted with the so-called “defeat device”. Under such circumstances, it is interesting to analyze customers’ Tweets to see what they are saying about the company. To create this use case, tweets were extracted using the search criterion “Volkswagen” just after the VW emission scandal became public. The aim of analyzing these tweets was to understand consumers’ current perception of VW and its cars in light of the scandal.


A large number of tools and technologies are available for performing text analytics, but open-source text mining packages in the Python and R programming languages are probably the most popular. Data scientists prefer packages in these two languages for extracting data from Twitter and analyzing it because both Python and R offer advanced graphical capabilities, and, thanks to their open-source nature, both have large and supportive communities.


The approach taken is broadly classified into three steps, as depicted in the figure.

Figure 1. Steps in Text Analytics

In the first step, the data is extracted from Twitter using the search criterion “Volkswagen”. This involves creating a Twitter application in the Developer section of Twitter and writing code in Python or R that uses a credential object to establish a secure connection and extract Tweets on the desired topic. For example, the R libraries “twitteR” and “ROAuth” can be used to extract the raw data and store it in a Comma Separated Values (CSV) file.
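The original R snippet for this step is not reproduced here. As an illustration only, the Python sketch below shows how a request to Twitter’s (legacy) v1.1 search endpoint could be assembled with the standard library; the bearer token is a placeholder, the request is never actually sent, and in practice a library such as tweepy (Python) or twitteR (R) would handle authentication and paging for you.

```python
from urllib.parse import urlencode

# Placeholder credential; a real token comes from the Twitter Developer portal.
BEARER_TOKEN = "YOUR_BEARER_TOKEN"

def build_search_request(query: str, count: int = 100):
    """Assemble the URL and auth header for Twitter's legacy v1.1 search endpoint.

    Nothing is sent over the network here; this only shows the shape of
    the request that an extraction library issues on your behalf.
    """
    base = "https://api.twitter.com/1.1/search/tweets.json"
    url = base + "?" + urlencode({"q": query, "count": count, "lang": "en"})
    headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}
    return url, headers

url, headers = build_search_request("Volkswagen", count=2000)
```

The returned JSON would then be flattened into rows and written to a CSV file, which is the format the next step assumes.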

After the Tweets are extracted, we perform the second step: preprocessing. The CSV file containing the tweets has multiple columns, such as “text”, “favorite”, “created”, “screenName”, “retweetCount”, etc. Since we dealt only with the data in the “text” column, we separated this information and stored it in a text file. A sample Tweet from the extracted data looked like: “Volkswagen: German prosecutors launch investigation into former boss…”.
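Separating the “text” column can be sketched as follows; the two-row CSV below is an invented stand-in for the real export, used only to make the example self-contained.

```python
import csv
import io

# Tiny stand-in for the extracted CSV; a real export has many more rows
# and columns ("text", "created", "screenName", "retweetCount", ...).
raw_csv = io.StringIO(
    'text,screenName,retweetCount\n'
    '"Volkswagen: German prosecutors launch investigation into former boss...",user1,5\n'
    '"Another tweet about Volkswagen",user2,0\n'
)

# Keep only the "text" column, one tweet per line.
texts = [row["text"] for row in csv.DictReader(raw_csv)]
with open("tweets.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(texts))
```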

We can observe that the Tweets follow a definite pattern, ending with a URL starting with “http://” or “https://”. As the first preprocessing step, we clean the data by removing such URLs from the extracted text. For this task, R’s “gsub” function with the regular expression “(f|ht)(tp)(s?)(://)(.*)[.|/](.*)” can be used.
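The same cleaning step can be sketched in Python with `re.sub`; the pattern below is a slightly simplified equivalent of the article’s R regex (it strips any http://, https://, or ftp:// URL up to the next whitespace), not a literal translation.

```python
import re

tweet = ("Volkswagen: German prosecutors launch investigation "
         "into former boss... https://t.co/abc123")

# Equivalent of R's gsub call: delete URLs beginning with
# http://, https://, or ftp://, matching up to the next space.
clean = re.sub(r"(f|ht)tps?://\S+", "", tweet).strip()
```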

Next, we remove line breaks and join all the lines into one long string using the “paste” function. The string, stored in a vector object, is converted to lowercase. To finish cleaning the text, we also remove blank spaces, usernames, punctuation, and stop words. Finally, we split the string using the regular expression “\W”, which detects word boundaries, giving us a list of words from the Tweets. With the list of words in hand, we are ready to analyze the data. To begin, we calculate the number of unique words and then build a table of word types and their corresponding frequencies.
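The whole pipeline above can be sketched in a few lines of Python; the three sample tweets and the tiny stop-word list are illustrative stand-ins (a real run would use a full stop-word list such as the one shipped with R’s tm package).

```python
import re
from collections import Counter

tweets = [
    "Volkswagen cheated on emissions tests",
    "@user1 Deception by Volkswagen is annoying",
    "Volkswagen emissions scandal: deception and cheating",
]

# Collapse all tweets into one string (R's paste with collapse) and lowercase it.
text = " ".join(tweets).lower()

# Remove usernames; punctuation falls away below when we split on non-word chars.
text = re.sub(r"@\w+", " ", text)

# Tiny illustrative stop-word list, standing in for a full one.
stop_words = {"on", "by", "is", "and", "the", "a"}

# Split on non-word characters (the "\W" word-boundary trick) and drop stop words.
words = [w for w in re.split(r"\W+", text) if w and w not in stop_words]

unique_words = set(words)       # vocabulary
freq = Counter(words)           # word type -> frequency table
```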

Finally, we created a corpus of frequent words and generated the following word cloud.
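The actual figure was rendered with a word-cloud package (R’s wordcloud, or Python’s wordcloud library, would both work). As a text-only stand-in, the sketch below maps invented illustrative frequencies to relative size buckets, which is all a word cloud does before choosing font sizes.

```python
from collections import Counter

# Illustrative frequencies, as they might look after preprocessing.
freq = Counter({"volkswagen": 120, "emissions": 80, "cheated": 55,
                "deception": 40, "annoying": 25, "deaths": 20,
                "elon": 15, "musk": 15})

# Map each word's frequency to a size bucket from 1 to 5; a real word
# cloud renders these buckets as font sizes at random positions.
max_n = max(freq.values())
sizes = {w: 1 + round(4 * n / max_n) for w, n in freq.items()}

for word, _ in freq.most_common():
    print(f"{word:>12}  {'#' * sizes[word]}")
```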

This word cloud, generated from the Tweets, helps us visually represent the words that appear most frequently and aids in understanding their prominence in the analyzed text.


One challenge we faced in this exercise is that information extracted from Twitter is highly unstructured, so the data must be preprocessed and cleaned before statistical techniques can be applied. In addition, the amount of data that can be extracted from Twitter and processed for further analysis is limited by the platform’s restrictions.

Text Mining Analysis

In the use case, our journey started with two thousand Tweets, accounting for 27,157 words. After preprocessing and data cleaning, we arrived at 2,919 unique words. We created a frequency table of these unique words along with their occurrence counts and, finally, generated a word cloud from it. Looking at the frequently occurring words in the word cloud, we found:

“cheated, deception, annoying, emissions, deaths”

The prevalence of words indicating negative sentiment among customers is hardly surprising, since the Tweets we analyzed were made just after one of the biggest scandals in the auto industry was exposed. At that point, perception of the brand’s image, and of the industry in general, was expected to be negative. But if we look closely, we find some more interesting words that lead to valuable insights into the topic. One such example from the word cloud is:

“elon, musk”

The auto industry, especially the diesel car makers, lost a considerable amount of credibility and power to influence customers and governments, as illustrated by the rise of electric car makers such as Tesla. While the Volkswagen emissions scandal prompted investigations into other car brands in both Europe and the US, Tesla Motors CEO Elon Musk said it might be time for customers to seriously consider giving up fossil fuels and embracing new technology. Since the scandal broke in 2015, we have indeed witnessed a phenomenal rise in Tesla, if not in the electric car industry as a whole. Another such set of words is:

“cars, deception, cheated”

Germans are generally considered the best at engineering, and their cars are associated with performance, quality, and reliability. Still, as our analysis of the Twitter data makes evident, the VW scandal dented the reputation not only of VW but of German car manufacturers in general.

This use case demonstrates the power of social media analytics through a simple example: generating a word cloud from Tweets without resorting to fancy algorithms or advanced analytics techniques. It shows how text mining, even at a basic level, can yield useful insights into how customers perceive a brand, and that is not all. We can take it further by performing sentiment analysis and comparing Tweets about competitors’ products and brands. We can also enrich the study by considering other fields available in the data extracted from Twitter, such as retweet count, longitude, and latitude.


As depicted in this article’s use case, a text mining application using straightforward techniques can indicate a fundamental shift in the direction and pace at which an industry will move. It shows how, using text mining, we can measure customers’ perceptions of a company or its brand and understand why such a perception occurred. This method can be beneficial to marketers in analyzing the gap between brand identity and brand image. Here we used text analytics descriptively, to understand customers’ perception of a particular company and its brands after an event occurred; the same approach can be repeated for other brands and other events. Similarly, we can use text analytics in a predictive manner to anticipate the future outcome of events.

Thus, we can safely conclude that applying text analytics to social media content is a way for businesses across industries to understand consumer sentiment about a brand and decide their future course of action. The reader can start with simple techniques like generating a word cloud before leaping into the wider ocean of text mining that remains to be explored.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
