Tavish Srivastava — August 27, 2014

The key to using an unstructured data set is to identify the hidden structures in it. This enables us to convert it to a structured and more usable format. In the previous article on text mining, we discussed the framework for using unstructured data sets in predictive or descriptive modelling. In this article we will look in more detail at how to understand the data structure and clean unstructured text to make it usable for the modelling exercise. We will use the same business problem as in the last article to walk through these procedures.

Business Problem

You are the owner of Metrro cash n carry. Metrro has a tie-up with Barcllays bank to launch co-branded cards. Metrro and Barcllays have recently entered into an agreement to share transaction data. Barcllays will share all transactions made with their credit cards at any retail store. Metrro will share all transactions made with any credit card at their stores. You wish to use this data to track where your high value customers are shopping other than Metrro.

To do this you need to extract information from the free text available in Barcllays transaction data. For instance, a transaction with the free text “Payment made to Messy” should be tagged as a transaction made to the retail store “Messy”. Once you have the retail store tags and the frequency of transactions at these stores for Metrro high value customers, you can analyze the reasons for this customer outflow by comparing services between Metrro and the other retail stores.

Understanding the dataset

Let us first look at the raw data to build a framework for data cleaning. The following are the sample transactions we need to work on:

  1. Paymt made to : Messy 230023929#21 Barcllay
  2. Transactn made to : Big Bazaar 42323#2322 Barcllay
  3. Pay to messy : 342343#2434 Barcllay
  4. Messy bill pay 32344#24324 Barcllay

Let us observe the data carefully to understand what information can be derived out of this data set.

  1. An action word like “payment”, “Paymt” or “Transactn” is present in every transaction. It is possible that the words “pay” and “transact” refer to different modes of payment, such as credit card payment or cash card payment.
  2. The last word of every transaction is the same. This should be the name of the card used.
  3. Every transaction has the name of a vendor. However, this name appears in both lowercase and capitalized forms.
  4. There is a numeric code in every transaction. We can either ignore this code or try to derive meaningful information from it. The code could be the name of the area where the store is located, some combination involving the date of purchase, or the customer code. If we are able to decode these numbers, we can take the analysis to the next level. For instance, if we can find the area of the transaction, we can do an area level analysis. Or, if these codes correspond to product families, they can be used to optimize our services.
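The observations above can be sketched in base R. This is only an illustration, under the assumption that every transaction follows the pattern seen in the four samples: free text, then a code of the form digits#digits, then the card name as the last word. The names `txns`, `parsed`, `code_prefix` and `code_suffix` are made up for this sketch.

```r
# Sample transactions, copied from the list above.
txns <- c(
  "Paymt made to : Messy 230023929#21 Barcllay",
  "Transactn made to : Big Bazaar 42323#2322 Barcllay",
  "Pay to messy : 342343#2434 Barcllay",
  "Messy bill pay 32344#24324 Barcllay"
)

# Observation 2: the last whitespace-separated word is the card name.
card <- sub("^.*\\s(\\S+)$", "\\1", txns)

# Observation 4: pull out the numeric code (digits, "#", digits).
code <- regmatches(txns, regexpr("[0-9]+#[0-9]+", txns))

# Split each code into the part before and the part after "#".
code_parts <- do.call(rbind, strsplit(code, "#", fixed = TRUE))

# Observations 1 and 3: what is left before the code holds the
# action word and the vendor name; drop the ":" separator.
free_text <- sub("\\s*[0-9]+#[0-9]+.*$", "", txns)
free_text <- trimws(gsub("\\s*:\\s*", " ", free_text))

parsed <- data.frame(
  free_text   = free_text,
  code_prefix = code_parts[, 1],
  code_suffix = code_parts[, 2],
  card        = card,
  stringsAsFactors = FALSE
)
print(parsed)
```

Keeping the two halves of the code in separate columns makes it easier to test later whether either half lines up with an area, a date or a customer identifier.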

Cleaning the dataset

Cleaning text data in R is straightforward. For this analysis we will not use the numeric codes in the transactions. However, if you want to build a more thorough analysis, they are definitely worth exploring. In this data set, we need to make the following adjustments:

  1. Remove the numbers
  2. Remove the special character “#”
  3. Remove common words like “to”, “is”, etc.
  4. Remove the common term “Barcllay” from the end of every sentence
  5. Remove punctuation marks
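To make these adjustments concrete, here is a minimal base R sketch that applies them with plain regular expressions. It is not the tm-based approach used in this article, only an illustration of what each step does. The short word list `drop_words` is an assumption made for the example (a real stopword list is much longer), and the lowercasing step is an addition so that “Barcllay” and “barcllay” are treated alike.

```r
txns <- c(
  "Paymt made to Messy 230023929#21 Barcllay",
  "Transactn made to Big Bazaar 42323#2322 Barcllay",
  "Pay to messy 342343#2434 Barcllay",
  "messy bill pay 32344#24324 Barcllay"
)

# Illustrative word list (assumption): common words plus the card name.
drop_words <- c("to", "is", "barcllay")

clean <- tolower(txns)                       # extra step: ignore case
clean <- gsub("[[:punct:]]", "", clean)      # steps 2 and 5: "#" is punctuation
clean <- gsub("[[:digit:]]", "", clean)      # step 1: remove the numbers

# Steps 3 and 4: remove whole words only, hence the \\b word boundaries.
pattern <- paste0("\\b(", paste(drop_words, collapse = "|"), ")\\b")
clean <- gsub(pattern, "", clean, perl = TRUE)

# Collapse the whitespace left behind by the removals.
clean <- trimws(gsub("\\s+", " ", clean))
print(clean)
```

The word boundaries matter: without them, removing “to” would also cut letters out of longer words that happen to contain it.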

Given our understanding of the data, steps 2 and 5 can be combined, as can steps 3 and 4, to avoid extra effort. In step 2 we only need to remove the single character “#”, which R removes automatically along with other punctuation. We will combine the words in steps 3 and 4 and remove them together. Once we have the clean data set, we will convert it into a term-document matrix. You can use the following code for this exercise:


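> library(tm)
> # the four sample transactions, as echoed by inspect() below
> a <- c("Paymt made to Messy 230023929#21 Barcllay",
+        "Transactn made to Big Bazaar 42323#2322 Barcllay",
+        "Pay to messy 342343#2434 Barcllay",
+        "messy bill pay 32344#24324 Barcllay")
> myCorpus <- Corpus(VectorSource(a))
> inspect(myCorpus)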

<<VCorpus (documents: 4, metadata (corpus/indexed): 0/0)>>

[[1]] <<PlainTextDocument (metadata: 7)>>

Paymt made to Messy 230023929#21 Barcllay

[[2]] <<PlainTextDocument (metadata: 7)>>

Transactn made to Big Bazaar 42323#2322 Barcllay

[[3]] <<PlainTextDocument (metadata: 7)>>

Pay to messy 342343#2434 Barcllay

[[4]] <<PlainTextDocument (metadata: 7)>>

messy bill pay 32344#24324 Barcllay


> # remove punctuation

> myCorpus <- tm_map(myCorpus, removePunctuation)

> inspect(myCorpus)

<<VCorpus (documents: 4, metadata (corpus/indexed): 0/0)>>

[[1]] <<PlainTextDocument (metadata: 7)>>

Paymt made to Messy 23002392921 Barcllay

[[2]] <<PlainTextDocument (metadata: 7)>>

Transactn made to Big Bazaar 423232322 Barcllay

[[3]] <<PlainTextDocument (metadata: 7)>>

Pay to messy 3423432434 Barcllay

[[4]] <<PlainTextDocument (metadata: 7)>>

messy bill pay 3234424324 Barcllay

> # remove numbers

> myCorpus <- tm_map(myCorpus, removeNumbers)

> inspect(myCorpus)

<<VCorpus (documents: 4, metadata (corpus/indexed): 0/0)>>

[[1]] <<PlainTextDocument (metadata: 7)>>

Paymt made to Messy Barcllay

[[2]] <<PlainTextDocument (metadata: 7)>>

Transactn made to Big Bazaar Barcllay

[[3]] <<PlainTextDocument (metadata: 7)>>

Pay to messy Barcllay

[[4]] <<PlainTextDocument (metadata: 7)>>

messy bill pay Barcllay

> # remove stopwords

> # Add required words to the list

> myStopwords <- c(stopwords("english"), "Barcllay")

> myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

> inspect(myCorpus)

<<VCorpus (documents: 4, metadata (corpus/indexed): 0/0)>>

[[1]] <<PlainTextDocument (metadata: 7)>>

Paymt made Messy

[[2]] <<PlainTextDocument (metadata: 7)>>

Transactn made Big Bazaar

[[3]] <<PlainTextDocument (metadata: 7)>>

Pay messy

[[4]] <<PlainTextDocument (metadata: 7)>>

messy bill pay

> # keep terms of any length (by default tm drops terms shorter than 3 characters)
> myDtm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(1, Inf)))

> inspect(myDtm)

<<TermDocumentMatrix (terms: 8, documents: 4)>>
Non-/sparse entries: 12/20
Sparsity           : 62%
Maximal term length: 9
Weighting          : term frequency (tf)

Terms     1 2 3 4
bazaar    0 1 0 0
big       0 1 0 0
bill      0 0 0 1
made      1 1 0 0
messy     1 0 1 1
pay       0 0 1 1
paymt     1 0 0 0
transactn 0 1 0 0
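As a small follow-up, the matrix printed above can be summarised to see which terms occur most often and which transactions mention a given vendor. The sketch below rebuilds the matrix by hand from the printed output so that it runs without tm. With a real tm object, `as.matrix(myDtm)` should give a comparable matrix. The names `tdm`, `term_freq` and `docs_with_messy` are made up for this example.

```r
# Term-document matrix, typed in from the output above (rows = terms).
tdm <- matrix(
  c(0, 1, 0, 0,   # bazaar
    0, 1, 0, 0,   # big
    0, 0, 0, 1,   # bill
    1, 1, 0, 0,   # made
    1, 0, 1, 1,   # messy
    0, 0, 1, 1,   # pay
    1, 0, 0, 0,   # paymt
    0, 1, 0, 0),  # transactn
  nrow = 8, byrow = TRUE,
  dimnames = list(
    Terms = c("bazaar", "big", "bill", "made",
              "messy", "pay", "paymt", "transactn"),
    Docs  = as.character(1:4)
  )
)

# Total number of occurrences of each term across all transactions.
term_freq <- sort(rowSums(tdm), decreasing = TRUE)
print(term_freq)

# Transactions that mention the vendor term "messy".
docs_with_messy <- names(which(tdm["messy", ] > 0))
print(docs_with_messy)
```

Here “messy” is the most frequent term, which is the kind of signal the business problem is after: a count of transactions per retail store.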


End Notes

Cleaning the data is a crucial step in any kind of data mining, and it matters even more when dealing with unstructured data sets. Understanding and cleaning the data take up most of the time in any text mining analysis. In the next article we will talk about creating a dictionary manually. This becomes important when we are doing a niche analysis for which a ready-made dictionary is either unavailable or very expensive.

Have you done text mining before? If so, what other cleaning steps did you use? Which tool do you think is most suitable for niche text mining such as transaction analysis or behavioral analysis? Did you find the article useful? Did it resolve any of your existing dilemmas?





10 thoughts on "Understanding and analyzing the hidden structures of unstructured dataset"

Rick Green says: August 28, 2014 at 4:53 am
Have you used a different 4th example in your analysis? In your problem description it started as Mart America paymt : 32344#24324 Barcllay, but in the exercise messy bill pay 32344#24324 Barcllay was used.
Tavish Srivastava says: August 28, 2014 at 7:01 am
Rick, Thanks for the input. This has been changed. Regards, Tavish
Gurpreet Singh says: August 29, 2014 at 12:38 pm
Hi Tavish, Thanks for this article. I have around 9 yrs of experience in PL/SQL and moved to a Technical BA role last year. My work involves data analysis, data mining & data quality improvements. I am solving problems like these in Oracle for SWIFT message processing. However, it seems very easy in R. I want to move into a Data Analytics/Science profile and have been exploring this space ever since I became aware of these terms on LinkedIn a few months back. Could you please suggest some good books to learn & master R for data analysis? Would love to be a part of the Apprenticeship program. Thanks, Gurpreet
Naveen Yadav says: August 31, 2014 at 9:40 am
Hi Tavish, I liked your blog. However, is there a place from where I can learn business analytics from ground zero, as I am also very keen to get into this industry of Analytics? Thanks & Regards, Naveen Yadav
Tavish Srivastava says: August 31, 2014 at 1:13 pm
Naveen, We have many such articles which will give you a kick start in analytics. You can have a look at the entire list and let me know in case you are looking for something specific which is not covered. Tavish
Tavish Srivastava says: August 31, 2014 at 1:15 pm
Gurpreet, The Apprentice program for this year has already started and you can see many guest posts coming out from this group. You can try for the same next year. For R, you can refer to courses offered by Coursera. Books are generally for very specific topics and do not cover the entire ground. Hope this helps. Tavish
Vaibhav Jain says: September 02, 2014 at 8:34 am
Hi Naveen, For starting your career in the analytics field, a lot of online & classroom platforms are available today. In case you are interested in a Full Time 9 months program in Business Analytics, you can apply for the PGPBA program at Praxis Business School, Kolkatta. It's the only institute in India which is offering FT classroom sessions. Alternatively, you can explore online portals similar to analyticsvidhya which share learnings on different aspects of analytics. Vaibhav
Anoop says: September 11, 2014 at 10:11 am
Hats off to your Blog efforts. I really appreciate the knowledge, and the material you people upload is very resourceful. I will appreciate it if you can perform the complete analysis so that a newcomer can understand it in a better way. Analysis should be done in a step by step manner. I applaud your efforts. Anoop Gandhi
hemanth says: July 31, 2015 at 7:43 am
Your effort in explaining and making us understand the loops of analytics in real world scenarios is really appreciable. Cheers to Analytics vidhya :)
Harshal Parkhe says: February 04, 2016 at 10:46 am
Hi Tavish, I have one problem regarding information extraction from unstructured data. I have a dataset of address strings, e.g. Apartment No: 506 A, Floor No: 5th Floor, Building Name: Panchvati co operative housing Society limited, Block No.: Andheri East, Mumbai 400059, Road No: maroshi Road, Marol, other information: the monthly rent for a period of 24 months and 10% of 31000 increase the amount of the deposit and 100000. I want to extract information like building name, city name, pincode and area. Can you please help me do the same? I use Python or R for it. Thanks
