Build a word cloud using text mining tools in R

Tavish Srivastava 27 Jul, 2015 • 5 min read


This is what a word cloud of our entire website looks like!

A word cloud is a graphical representation of frequently used words in a collection of text files. The size of each word in the picture indicates how often that word occurs in the entire text. By the end of this article, you will be able to make a word cloud using R on any given set of text files. Such diagrams are very useful when doing text analytics.

[stextbox id=”section”] Why do we need text analytics? [/stextbox]

Analytics is the science of processing raw information to bring out meaningful insights. This raw information can come from a variety of sources. For instance, consider a modern multinational bank that wants to use all the available information to drive the best strategy. What are the sources of information available to the bank?

1. The bank issues different kinds of products to different customers. This information is fed into the system and can be used for targeting new customers, servicing existing customers and forming customer-level strategies.

2. The bank's customers make millions of transactions every day. Information about where these transactions are made, when they are made and for what amounts helps the bank understand its customers.

There can be other behavioral variables (e.g. cash withdrawal patterns) which provide the bank with valuable data and help it build an optimal strategy. This analysis gives the bank a competitive edge over other market players by targeting the right customer, with the right product, at the right time. But given that every competitor at present uses similar tools and data, analytics has become more of a hygiene factor than a competitive edge. To regain the edge, the bank has to find more sources of data and more sophisticated tools to handle this data. All the data we have discussed to this point is structured data. There are two other types of data the bank can use to derive insightful information.

1. System data : Consider a teller carrying out a transaction at one of the counters. Every time he completes a transaction, a log is created in the system. This type of data is called system data. It is obviously enormous in volume, but still not utilized to a considerable extent in a lot of banks. If we analyze this data, we can optimize the number of tellers in a branch or measure the efficiency of each branch.

2. Unstructured data : Feedback forms with free-text comments, comments on the bank's Facebook page, Twitter page, etc. are all examples of unstructured data. This data carries unique information about customer sentiment. Say the bank launches a product and finds that it is very profitable in the first 3 months. But customers who bought the product found it was really a bad choice and started spreading negative word of mouth about it on all social networks and through feedback channels. If the bank has no way to decode this information, it faces a huge loss because it will never make a proactive effort to stop the negative wave against its image. Imagine the kind of power analyzing such data hands over to the bank.

[stextbox id=”section”] Installing required packages on R  [/stextbox]

Text mining needs some special packages, which might not be pre-installed on your R software. You need to install the Natural Language Processing task view to load the libraries tm and SnowballC. Follow the instructions written in the box to install the required packages.

[stextbox id=”grey”]

> install.packages("ctv")

> library("ctv")

> install.views("NaturalLanguageProcessing")

[/stextbox]
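The task view above pulls in a large number of packages. If you prefer a lighter install, the three packages this article actually uses are all available directly from CRAN:

```r
# Install only the packages used in this article
install.packages(c("tm", "SnowballC", "wordcloud"))

# Load them to confirm the installation worked
library(tm)
library(SnowballC)
library(wordcloud)
```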

[stextbox id=”section”] Step by step coding on R [/stextbox]

Following is the step-by-step procedure for creating a word cloud from a bunch of text files. For simplicity, we use files in .txt format.

Step 1 : Identify & create text files to turn into a cloud

The first step is to identify & create the text files from which you want to build the word cloud. Store these files in the location "./corpus/target". Make sure that you do not have any other files in this location. You can use any location for this exercise, but for simplicity, try this location the first time.
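If you do not have text files to hand, the folder and a couple of sample files can be created from within R itself. The file names and contents below are just placeholders; substitute your own documents:

```r
# Create the ./corpus/target directory (recursive = TRUE builds both levels)
dir.create(file.path(".", "corpus", "target"),
           recursive = TRUE, showWarnings = FALSE)

# Write two small sample .txt files into it
writeLines("Analytics is the science of processing raw information.",
           file.path(".", "corpus", "target", "sample1.txt"))
writeLines("Text mining helps decode unstructured data.",
           file.path(".", "corpus", "target", "sample2.txt"))

# Verify the files are in place
list.files(file.path(".", "corpus", "target"))
```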

Step 2 : Create a corpus from the collection of text files

The second step is to transform these text files into an R-readable format. The tm package and other text mining packages operate on a format called a corpus. A corpus is simply a way to store a collection of documents in a format that R can read.

[stextbox id=”grey”]

 

> cname <- file.path(".","corpus","target")
> library (tm)
> docs <- Corpus(DirSource(cname))
[/stextbox]

Step 3 : Data processing on the text files

This is the most critical step in the entire process. Here, we will decode the text files by keeping only the keywords that build up the meaning of the sentences or the sentiments of the author. R makes this step really easy. We will make 3 important transformations.

i. Replace symbols like “/” or “@” with a blank space

ii. Remove words like “a”, “an”, “the”, “I”, “He” and numbers. This is done to remove any skewness caused by these commonly occurring words.

iii.  Remove punctuation and finally strip extra whitespace. Note that punctuation is removed rather than replaced with blanks, because grammatically the surrounding text already contains a space.

[stextbox id=”grey”]

 

> library (SnowballC)
> toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
> docs <- tm_map(docs, toSpace, "/")
> docs <- tm_map(docs, toSpace, "@")
> docs <- tm_map(docs, content_transformer(tolower))
> docs <- tm_map(docs, removeWords, stopwords("english"))
> docs <- tm_map(docs, removeNumbers)
> docs <- tm_map(docs, removePunctuation)
> docs <- tm_map(docs, stripWhitespace)
[/stextbox]

Step 4 : Create structured data from the text file

Now is the time to convert the entire corpus into a structured dataset. Note that we have already removed all filler words. The conversion is done by the DocumentTermMatrix command in R. Execute the following line in your R session to make this conversion.

[stextbox id=”grey”]

 

> dtm <- DocumentTermMatrix(docs)
[/stextbox]
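To see what a document-term matrix actually contains, here is a toy version built with base R alone (the two mini-documents are made up for illustration): each row is a document, each column a term, and each cell counts how often that term appears in that document.

```r
# Two tiny "documents"
docs_txt <- c("the cat sat on the mat", "the dog sat")

# Tokenize on spaces and collect the vocabulary
tokens <- strsplit(docs_txt, " ")
terms  <- sort(unique(unlist(tokens)))

# Count each term in each document: rows = documents, columns = terms
dtm_toy <- t(sapply(tokens, function(x) table(factor(x, levels = terms))))
colnames(dtm_toy) <- terms
dtm_toy
#      cat dog mat on sat the
# [1,]   1   0   1  1   1   2
# [2,]   0   1   0  0   1   1
```

DocumentTermMatrix builds exactly this kind of table for the whole corpus, stored in a sparse format since most cells are zero.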

 

Step 5 : Making the word cloud using the structured form of the data

Once we have the structured form of the text content, we make a matrix of words and their frequencies. This matrix is finally passed to the function that builds the word cloud.

 [stextbox id=”grey”]

 

> library(wordcloud)
> m <- as.matrix(dtm)
> v <- sort(colSums(m),decreasing=TRUE)
> head(v,14)
> words <- names(v)
> d <- data.frame(word=words, freq=v)
> wordcloud(d$word,d$freq,min.freq=50)
[/stextbox]

Running this code will give you the required output. The order of the words is completely random, but the size of each word is directly proportional to its frequency of occurrence in the text files. Our website has the words "Analytics Vidhya" repeated many times, and hence these words appear the largest. This diagram directly helps us identify the most frequently used words in the text files.
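The call above uses wordcloud's defaults for everything except min.freq. The function accepts a few other useful arguments; the ones below are documented parameters of the wordcloud package, though the specific threshold values are just examples to experiment with:

```r
# Assumes d (the word/freq data frame from Step 5) already exists, and that
# the wordcloud and RColorBrewer packages are installed
library(wordcloud)
library(RColorBrewer)

set.seed(142)                            # fix the layout so the cloud is reproducible
wordcloud(d$word, d$freq,
          min.freq     = 25,             # drop words occurring fewer than 25 times
          max.words    = 100,            # cap the number of words plotted
          random.order = FALSE,          # place the most frequent words in the centre
          rot.per      = 0.2,            # fraction of words rotated 90 degrees
          colors       = brewer.pal(8, "Dark2"))  # colour words by frequency band
```

Setting random.order = FALSE usually gives a more readable cloud, since the dominant words cluster in the middle instead of being scattered.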

[stextbox id=”section”] End Notes [/stextbox]

Text mining deals with the relationships between words and with analyzing the sentiment conveyed by the combination of these words. Structured data has a defined number of variables, and the analysis is done by finding correlations between these variables. In text mining, however, relationships must be found between all the words present in the text. This is the reason text mining is still rarely used in the industry today. But R offers tools that make this analysis much simpler. This article covers only the tip of the iceberg. In one of the coming articles, we will cover a framework for analyzing unstructured data.

Did you find the article interesting? Have you worked on text mining before? Did you use R to do text mining, or some other software? Let us know any other interesting features of R used for text mining.

If you like what you just read & want to continue your analytics learning, subscribe to our emails, follow us on Twitter or like our Facebook page.

 


Tavish Srivastava, co-founder and Chief Strategy Officer of Analytics Vidhya, is an IIT Madras graduate and a passionate data-science professional with 8+ years of diverse experience in markets including the US, India and Singapore, domains including Digital Acquisitions, Customer Servicing and Customer Management, and industry including Retail Banking, Credit Cards and Insurance. He is fascinated by the idea of artificial intelligence inspired by human intelligence and enjoys every discussion, theory or even movie related to this idea.


Responses From Readers


Eric J. Christeson 08 May, 2014

This is really neat. Thanks for writing about this. One issue I noticed is the ordering of the normalization. I would switch tolower and stopwords like so: > docs <- tm_map(docs, tolower) > docs <- tm_map(docs, removeWords, stopwords("english")) otherwise capitalized stop words such as 'The' are not removed. I ran it on a collection of books; many of the sentences must start with 'The' :-)

Avinash 01 Sep, 2014

Hi Tavish, I am planning to do word cloud on a data set, which has the word and respective word count. Is it feasible/ appropriate to do word cloud on such data? If so can you please guide me. Thank you

Chung Vanwyngaarden 09 Sep, 2014

whoah this blog is excellent i love reading your articles. Keep up the good work! You know, lots of people are searching around for this information, you can help them greatly.

Ankit Mundada 14 Oct, 2014

I think the article needs an update. The new version of tm library forces you to use content_transformer() method to apply any other external processing of the corpus than the ones available in tm_map. I got an error saying "Error: inherits(doc, "TextDocument") is not TRUE". From the following link I got to know that the library is updated. http://stackoverflow.com/questions/24191728/documenttermmatrix-error-on-corpus-argument

Bob 13 Nov, 2014

Hi, I'm currently working with a heavy data set (a .txt file of about 100Mb) and I'm trying to develop a wordcloud as a graphical output of the analysis. But here's where I continue to get problems: when I get to the TermDocumentMatrix command, it takes hours, and it seems like it might have stopped (the little stop sign in the top right corner of the console view doesn't disappear and the blue pointer signaling the next row doesn't appear either). Is this hardware related? Or is it due to the size of the file? Thanks. Cheers.

Nissanka 15 Dec, 2014

Hi Avinash, I am exploring the possibility of creating a wordcloud using characters that are not from the latin alphabet. Do you have any ideas about this?

Riza 29 Apr, 2015

Brilliant work. Is there any possibility for a sentiment analysis to incorporate in this process? It would be very useful if the same corpus can be used to fetch the opinion bias.

Vinod Patidar 03 Jun, 2015

Hi, I search same meaning word count in single word in Corpus library in R data, can any one help. Thanks

nandhini 31 Jul, 2015

Hi, I want to manually assign the number of occurrences of words. I have a text file with 200 distinct words (each word on a new line). I have another file with 200 numbers. How can I assign these numbers as the number of occurrences of the words? The words are a mix of European languages and some of the words are breaking apart. What should I do to get the complete word?

ganesh 15 Nov, 2017

Hi, I have installed the tm library and related packages but "Corpus" is not working. Can anyone help me solve this issue?