The days when data arrived in neatly tabulated spreadsheets are truly behind us. A moment of silence for the data residing in the spreadsheet pockets. Today, more than 80% of data is unstructured: it is either locked in data silos or scattered across digital archives. Data is being produced as we speak, from every conversation we have on social media to every piece of content generated by news sources. To produce any meaningful, actionable insight from data, it is important to know how to work with it in its unstructured form. As a Data Scientist at one of the fastest growing Decision Sciences firms, I earn my bread and butter by deriving meaningful insights from unstructured text information.
One of the first steps in working with text data is to pre-process it. It is an essential step before the data is ready for analysis. The majority of available text data is highly unstructured and noisy, so to achieve better insights or to build better algorithms it is necessary to work with clean data. Social media data is a good example: it is informal communication, and typos, bad grammar, slang, and unwanted content such as URLs, stopwords, and expressions are the usual suspects.
In this blog post, therefore, I discuss these possible noise elements and how you could clean them step by step, using Python.
As a typical business problem, assume you are interested in finding which features of an iPhone are most popular among its fans. You have extracted consumer opinions related to the iPhone, and here is one of the tweets you extracted:
[stextbox id = “grey”]“I luv my <3 iphone & you’re awsm apple. DisplayIsAwesome, sooo happppppy 🙂 http://www.apple.com”
[/stextbox]
Here is what you do:
[stextbox id = “grey”]
Snippet:
tweet = original_tweet.decode("utf8").encode("ascii", "ignore")  # original_tweet is a UTF-8 byte string
Output:
>> “I luv my <3 iphone & you’re awsm apple. DisplayIsAwesome, sooo happppppy 🙂 http://www.apple.com”
[/stextbox]
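The snippet above follows the older Python 2 idiom. A minimal Python 3 sketch of the same decoding step might look like the following. The function name `decode_tweet` is hypothetical, and the `html.unescape` call is an assumption about the pipeline, included because reader comments on this post mention `HTMLParser` for unescaping HTML entities.

```python
import html

def decode_tweet(raw_bytes):
    """Turn a UTF-8 byte string into plain ASCII text (Python 3)."""
    text = raw_bytes.decode("utf8")   # bytes -> str
    text = html.unescape(text)        # "&lt;3" -> "<3", "&amp;" -> "&"
    # Drop every character that has no ASCII representation (accents, emoji, ...).
    return text.encode("ascii", "ignore").decode("ascii")
```

Note that dropping non-ASCII characters also drops typographic apostrophes, so it may be worth converting them to plain `'` first if contractions are expanded later.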
For example, "it's" is a contraction for "it is" or "it has".
Every apostrophe contraction should be expanded into its standard form. A lookup table of all possible contractions can be used to resolve the ambiguity.
[stextbox id = “grey”]
Snippet:
APPOSTOPHES = {"'s": " is", "'re": " are"}  # ... and so on; need a huge dictionary
tweet = tweet.replace("’", "'")  # normalise curly apostrophes first
for contraction, expansion in APPOSTOPHES.items():
    tweet = tweet.replace(contraction, expansion)  # e.g. "you're" -> "you are"
Outcome:
>> “I luv my <3 iphone & you are awsm apple. DisplayIsAwesome, sooo happppppy 🙂 http://www.apple.com”
[/stextbox]
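A slightly more careful sketch of the apostrophe lookup is shown here. It only expands a contraction when it sits at the end of a word, so the table entries never fire in the middle of a token. The table `CONTRACTION_SUFFIXES` and the function `expand_contractions` are illustrative names, and the four entries are a tiny sample rather than a complete dictionary.

```python
import re

# Illustrative subset only; a real table needs many more entries.
CONTRACTION_SUFFIXES = {"'s": " is", "'re": " are", "'ll": " will", "'ve": " have"}

# Match any known suffix, but only when it ends a word.
_SUFFIX_RE = re.compile(
    "(" + "|".join(re.escape(s) for s in CONTRACTION_SUFFIXES) + r")\b"
)

def expand_contractions(text):
    text = text.replace("\u2019", "'")  # normalise curly apostrophes
    return _SUFFIX_RE.sub(lambda m: CONTRACTION_SUFFIXES[m.group(1)], text)
```

As the "it's" example shows, a plain lookup cannot tell "it is" from "it has" (or from a possessive); this sketch always picks the single expansion stored in the table.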
[stextbox id = “grey”]
Snippet:
import re  # used to split attached words such as "DisplayIsAwesome" at each capital letter
tweet = " ".join(re.findall("[A-Z][^A-Z]*", tweet))
Outcome:
>> “I luv my <3 iphone & you are awsm apple. Display Is Awesome, sooo happppppy 🙂 http://www.apple.com”
[/stextbox]
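The `findall` pattern above breaks the text at every capital letter, which can leave doubled spaces and silently drops any text before the first capital. One possible alternative, sketched here with the hypothetical helper `split_attached_words`, inserts a space only where a lowercase letter is immediately followed by an uppercase one.

```python
import re

def split_attached_words(text):
    # Zero-width match: the position between a lowercase and an uppercase letter.
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
```

This rule also splits brand spellings such as "iPhone" into "i Phone", so a list of protected words may be needed in practice.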
[stextbox id = “grey”]
Snippet:
tweet = _slang_loopup(tweet)  # user-defined helper: replaces slang such as "luv" and "awsm" with standard words
Outcome:
>> “I love my <3 iphone & you are awesome apple. Display Is Awesome, sooo happppppy 🙂 http://www.apple.com”
[/stextbox]
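The post does not show the body of `_slang_loopup`, so the following is only a guess at what such a helper could look like: a dictionary lookup applied to each alphabetic token. The table `SLANG`, its two entries, and the function name `slang_lookup` are assumptions made for illustration.

```python
import re

# Hypothetical sample table; a real one would hold many more entries.
SLANG = {"luv": "love", "awsm": "awesome"}

def slang_lookup(text, table=SLANG):
    def replace(match):
        word = match.group(0)
        # Look the token up case-insensitively; keep it unchanged if unknown.
        return table.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", replace, text)
```

For example, `slang_lookup("I luv my iphone & you are awsm apple.")` returns `"I love my iphone & you are awesome apple."`.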
[stextbox id = “grey”]
Snippet:
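import itertools  # groupby yields (character, run) pairs; "_" is a throwaway name for the character
tweet = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet))  # cap each run of repeated characters at two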
Outcome:
>> “I love my <3 iphone & you are awesome apple. Display Is Awesome, so happy 🙂 http://www.apple.com”
[/stextbox]
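The `groupby` one-liner caps every run of identical characters at two, so on its own it turns "sooo" into "soo" rather than "so", and it would also shorten "www" inside a URL to "ww". A hedged sketch that applies the same rule token by token and leaves URLs alone is given here; `squeeze_repeats` is a hypothetical name, and a spelling or dictionary pass would still be needed to reach "so".

```python
import re

def squeeze_repeats(text):
    cleaned = []
    for token in text.split():
        if token.startswith(("http://", "https://", "www.")):
            cleaned.append(token)  # leave URLs untouched
        else:
            # Replace any run of three or more identical characters with two.
            cleaned.append(re.sub(r"(.)\1{2,}", r"\1\1", token))
    return " ".join(cleaned)
```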
[stextbox id = “grey”]
Final cleaned tweet:
>> “I love my iphone & you are awesome apple. Display Is Awesome, so happy!” , <3 , 🙂
[/stextbox]
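The final tweet no longer contains the URL. One simple way to get there, assuming URLs always start with `http://`, `https://`, or `www.`, is a regular expression like the one sketched here. The pattern and the function name `remove_urls` are illustrative and will not catch every URL form.

```python
import re

# Assumption: a URL runs from its prefix up to the next whitespace character.
URL_RE = re.compile(r"https?://\S+|www\.\S+")

def remove_urls(text):
    without_urls = URL_RE.sub("", text)
    return " ".join(without_urls.split())  # collapse the leftover whitespace
```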
I hope you found this article helpful. These were some tips and tricks I have learnt while working with a lot of text data. If you follow the above steps to clean the data, you can improve the accuracy of your results and draw better insights. Do share your views or doubts in the comments section and I would be happy to participate.
Go Hack 🙂
Can all this be done in R?
Yes! Packages like "tm" (text mining) provide support for many of these functions. The rest can be written explicitly in R.
Can you explain word standardization in more detail? What are the improper formats?
Hey Shivam, I want to tokenize 1000 tweets stored in a text file using a single loop in Python. Can you please help me with that? Please reply ASAP. Thanks
Can you please explain the code in Standardizing words (Point 9)? Also, what other approaches can I take for the standardization of words?
How do I remove repetitive dots in tweets? For example, "i love....".
Is there any downloadable dictionary for APPOSTOPHES ("APPOSTOPHES = {"'s" : " is", "'re" : " are", ...} ## Need a huge dictionary")?
Hi Aneesh, Share your email, I will forward you the list.
Hey Shivam, I am new to Python and I am getting the following error when I try to use the HTMLParser() method: AttributeError: type object 'HTMLParser' has no attribute 'HTMLParser'. Kindly help.
Hi, you need to import from the html package; I think this page uses an old Python implementation: from html.parser import HTMLParser
Hi Shivam, can you please provide the dictionary for APPOSTOPHES, _slang_loopup and the set of rules for word standardization? It will be really helpful. My mail id: [email protected]. Thanks.
Can I have what Partho asked for too? Thanks! I really need that for my school assignment. @[email protected]
Apologies for a basic question. I am very new to Python and was trying to test this code, but it does not work. Can you help me with how I should start working on this?
Would you tell me what "for _" means in "tweet = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet))"? Thank you.
Thank you, this is useful. Do you have code to implement slang_lookup in Python?
Please provide the dictionary for APPOSTOPHES, _slang_loopup and the set of rules for word standardization. My mail id is [email protected]
Please provide the dictionary for APPOSTOPHES. My mail id is [email protected]
Hi, Please download the apostrophes dictionary from this link - https://drive.google.com/file/d/0B1yuv8YaUVlZZ1RzMFJmc1ZsQmM/view?usp=sharing
Amazing article, thanks for sharing.
I have a large number of tweets in different languages, but I only want the English tweets. Please send me Python code that will help me.
Can I get the dictionary of apostrophes and the slang lookup? Please, it's really urgent. Email id: [email protected]