Steps for effective text data cleaning (with case study using Python)
Introduction
The days when one would get data in tabulated spreadsheets are truly behind us. A moment of silence for the data residing in spreadsheet pockets. Today, more than 80% of data is unstructured – it is either present in data silos or scattered around digital archives. Data is being produced as we speak – from every conversation we have on social media to every piece of content generated by news sources. In order to produce any meaningful, actionable insight from data, it is important to know how to work with it in its unstructured form. As a Data Scientist at one of the fastest-growing decision sciences firms, my bread and butter comes from deriving meaningful insights from unstructured text information.
One of the first steps in working with text data is to pre-process it. It is an essential step before the data is ready for analysis. The majority of available text data is highly unstructured and noisy in nature – to achieve better insights or to build better algorithms, it is necessary to work with clean data. For example, social media data is highly unstructured – it is informal communication – typos, bad grammar, slang, and unwanted content like URLs, stop-words, expressions etc. are the usual suspects.
In this blog, I therefore discuss these possible noise elements and how you can clean them step by step, using Python.
As a typical business problem, assume you are interested in finding out which features of the iPhone are most popular among fans. You have extracted consumer opinions related to the iPhone, and here is a tweet you extracted:
“I luv my <3 iphone & you’re awsm apple. DisplayIsAwesome, sooo happppppy 🙂 http://www.apple.com”

Steps for data cleaning:
Here is what you do:
- Escaping HTML characters: Data obtained from the web usually contains a lot of HTML entities like &lt;, &gt; and &amp; which get embedded in the original data. It is thus necessary to get rid of these entities. One approach is to remove them directly with specific regular expressions. Another approach is to use appropriate packages and modules (for example Python’s html module), which can convert these entities back to standard characters. For example: &lt; is converted to “<” and &amp; is converted to “&”.
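For example, a minimal sketch using Python 3’s built-in html module (in older Python versions the same conversion was done via HTMLParser’s unescape method):

Snippet:
import html

# convert HTML entities back to their literal characters
tweet = html.unescape("I luv my &lt;3 iphone &amp; you're awsm apple")
Output:
>> "I luv my <3 iphone & you're awsm apple"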
- Decoding data: This is the process of transforming information from complex symbols into simple, easier-to-understand characters. Text data may come in different encodings like Latin-1, UTF-8 etc. Therefore, for better analysis, it is necessary to keep the complete data in a standard encoding format. UTF-8 is widely accepted and is recommended.
Snippet:
# assuming original_tweet arrives as a bytes object; decode it to a Unicode string
tweet = original_tweet.decode("utf8")
# to force plain ASCII instead (dropping emoji like 🙂):
# tweet = tweet.encode("ascii", "ignore").decode("ascii")
Output:
>> “I luv my <3 iphone & you’re awsm apple. DisplayIsAwesome, sooo happppppy 🙂 http://www.apple.com”
- Apostrophe Lookup: To avoid word-sense ambiguity in text, it is recommended to maintain proper structure in it and to abide by the rules of context-free grammar. When apostrophes are used, the chances of ambiguity increase.
For example, “it’s” is a contraction for “it is” or “it has”.
All apostrophes should be converted into standard lexicons. One can use a lookup table of all possible contractions to resolve these ambiguities.
Snippet:
## keys must be full tokens, since the lookup below matches whole words;
## a huge dictionary is needed in practice
APPOSTOPHES = {"you're": "you are", "it's": "it is"}
words = tweet.split()
reformed = [APPOSTOPHES[word] if word in APPOSTOPHES else word for word in words]
reformed = " ".join(reformed)
Outcome:
>> “I luv my <3 iphone & you are awsm apple. DisplayIsAwesome, sooo happppppy 🙂 http://www.apple.com”
- Removal of Stop-words: When the analysis needs to be driven by word-level content, the commonly occurring words (stop-words) should be removed. One can either create a long list of stop-words or use a predefined, language-specific library.
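For example, a minimal sketch using NLTK’s predefined English stop-word list (this assumes nltk is installed and its stopwords corpus has been downloaded):

Snippet:
from nltk.corpus import stopwords

# import nltk; nltk.download("stopwords")  ## one-time corpus download
stop = set(stopwords.words("english"))
tweet = " ".join(word for word in tweet.split() if word.lower() not in stop)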
- Removal of Punctuations: All punctuation marks should be dealt with according to their priority. For example: “.”, “,” and “?” are important punctuation marks that may be worth retaining, while others can be removed.
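A minimal sketch with the standard library (the set of marks kept here is just an illustration; adjust it to your own priorities):

Snippet:
import string

keep = {".", ",", "?"}  ## punctuation marks to retain
drop = "".join(c for c in string.punctuation if c not in keep)
tweet = tweet.translate(str.maketrans("", "", drop))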
- Removal of Expressions: Textual data (usually speech transcripts) may contain human expressions like [laughing], [Crying], [Audience paused]. These expressions are usually not relevant to the content of the speech and hence need to be removed. A simple regular expression can be useful in this case, as sketched below.
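For instance, a sketch that strips such expressions (it assumes, as in the examples above, that expressions are enclosed in square brackets; the variable name transcript is illustrative):

Snippet:
import re

# drop anything enclosed in square brackets, e.g. [laughing] or [Audience paused]
transcript = re.sub(r"\[[^\]]*\]", "", transcript)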
- Split Attached Words: Humans in social forums generate text data that is completely informal in nature. Many tweets contain attached words like RainyDay, PlayingInTheCold etc. These can be split into their normal forms using simple rules and regex.
Snippet:
import re

# split camel-case words by starting a new token at each capital letter
cleaned = " ".join(re.findall("[A-Z][^A-Z]*", original_tweet))
Outcome:
>> “I luv my <3 iphone & you are awsm apple. Display Is Awesome, sooo happppppy 🙂 http://www.apple.com”
- Slang lookup: Again, social media text contains a large proportion of slang words. These words should be transformed into standard words to produce free text. Words like luv will be converted to love, helo to hello. The same apostrophe-lookup approach can be used to convert slang to standard words. A number of sources available on the web provide lists of possible slang terms; these can serve as your lookup dictionaries for conversion purposes.
Snippet:
tweet = _slang_loopup(tweet)  ## user-defined helper, sketched below
Outcome:
>> “I love my <3 iphone & you are awesome apple. Display Is Awesome, sooo happppppy 🙂 http://www.apple.com”
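Note that _slang_loopup is a user-defined helper, not a library function; several readers ask about it in the comments below. A minimal sketch of how it could be implemented, assuming a slang-to-standard lookup dictionary (the entries shown are illustrative; a real dictionary would be much larger):

Snippet:
SLANG = {"luv": "love", "awsm": "awesome", "helo": "hello"}  ## illustrative entries

def _slang_loopup(text):
    # replace whole tokens found in the SLANG dictionary, leave others untouched
    return " ".join(SLANG.get(word.lower(), word) for word in text.split())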
- Standardizing words: Sometimes words are not in their proper form. For example, “I looooveee you” should be “I love you”. Simple rules and regular expressions can help solve such cases.
Snippet:
import itertools

# itertools.groupby clusters consecutive identical characters: "_" is the
# repeated character (the group key) and "s" iterates over the run.
# Keeping at most two of each run turns "happppppy" into "happy".
tweet = "".join("".join(s)[:2] for _, s in itertools.groupby(tweet))
Outcome:
>> “I love my <3 iphone & you are awesome apple. Display Is Awesome, soo happy 🙂 http://www.apple.com”
(“sooo” becomes “soo”, not “so”; a dictionary lookup is still needed to finish the job.)
- Removal of URLs: URLs and hyperlinks in text data like comments, reviews, and tweets should be removed.
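A minimal regex-based sketch (the pattern assumes URLs begin with http(s):// or www. and run until the next whitespace):

Snippet:
import re

# strip anything that looks like a URL
tweet = re.sub(r"(?:https?://|www\.)\S+", "", tweet).strip()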
Final cleaned tweet:
>> “I love my iphone & you are awesome apple. Display Is Awesome, so happy!” (with the emoticons <3 and 🙂 extracted and handled separately)
Advanced data cleaning:
- Grammar checking: Grammar checking is largely learning-based: models are trained on huge amounts of well-formed text for the purpose of grammar correction. Many online tools are available for grammar correction purposes.
- Spelling correction: Misspellings are commonly encountered in natural language text. Companies like Google and Microsoft have achieved a decent accuracy level in automated spell correction. One can use algorithms like Levenshtein distance, dictionary lookup etc., or other modules and packages, to fix these errors.
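For illustration, a minimal sketch of Levenshtein distance, which counts the single-character edits (insertions, deletions, substitutions) separating a misspelling from a candidate correction:

Snippet:
def levenshtein(a, b):
    # classic dynamic-programming edit distance, computed row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("awsm", "awesome"))  ## >> 3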
End Notes:
Hope you found this article helpful. These were some tips and tricks I have learnt while working with a lot of text data. If you follow the above steps to clean your data, you can drastically improve the accuracy of your results and draw better insights. Do share your views/doubts in the comments section and I would be happy to participate.
Go Hack 🙂
26 thoughts on "Steps for effective text data cleaning (with case study using Python)"
Manikandan T V says: November 17, 2014 at 10:47 am
Can all this be done in R?

Shivam Bansal says: November 17, 2014 at 5:28 pm
Yes! Packages like "tm" (text mining) provide support for many of these functions. The rest can be written explicitly in R.

sreesha says: January 18, 2015 at 4:39 pm
Can you explain word standardization in more detail? What are the improper formats?

Saif Ali says: January 20, 2016 at 2:35 pm
Hey Shivam, I want to tokenize 1000 tweets stored in a text file using a single loop in Python. Can you help me with that? Please reply asap. Thanks.

himani says: April 06, 2016 at 9:19 pm
Hey, I am getting this error, can you help me? tweet = _slang_loopup(tweet) NameError: name '_slang_loopup' is not defined

Anurag Arora says: May 05, 2016 at 8:13 am
Can you please explain the code in Standardizing words (Point 9)? Also, what other approaches can I take for the standardization of words?

Sofea says: August 05, 2016 at 11:08 am
How do I remove repetitive dots in tweets? For example "i love...."

Aneesh says: November 10, 2016 at 4:35 am
Is there any downloadable dictionary for APPOSTOPHES? "APPOSTOPHES = {"'s" : " is", "'re" : " are", ...} ## Need a huge dictionary"

Shivam Bansal says: November 10, 2016 at 5:46 am
Hi Aneesh, share your email, I will forward you the list.

Aneesh says: November 10, 2016 at 6:10 am
[email protected]

WAL says: November 17, 2016 at 1:01 pm
Hey Shivam, I am new to Python and I am getting the following error when I try to use the HTMLParser() method: AttributeError: type object 'HTMLParser' has no attribute 'HTMLParser'. Kindly help.

Partho says: November 21, 2016 at 10:22 am
Hi Shivam, can you please provide the dictionary for APPOSTOPHES, _slang_loopup and the set of rules for word standardization? It will be really helpful. My mail id: [email protected]. Thanks.

Mayank says: November 23, 2016 at 7:07 am
Apologies for a basic question. I am very new to Python and was trying to test this code, but it does not work. Can you help me with how I should start working on this?

Kevin Mahendra says: December 16, 2016 at 3:05 am
Can I have what Partho asked for too? Thanks! I really need that for my school assignment. @[email protected]

Hunter Red says: March 01, 2017 at 9:12 am
Would you tell me what "for _, s" in "tweet = ''.join(''.join(s)[:2] for _, s in itertools.groupby(tweet))" means? Thank you.

rajeswari says: May 20, 2017 at 10:02 am
Thank you, this is useful. Do you have code to implement slang_lookup in Python?

rajeswari says: May 20, 2017 at 11:59 am
Please provide the dictionary for APPOSTOPHES, _slang_loopup and the set of rules for word standardization. My mail id is [email protected]
Cao Tri DO says: August 22, 2017 at 9:06 pm
Please provide dictionary for APPOSTOPHES, _slang_loopup and set of rules for word standardization. My mail id is [email protected]

Abhishek mamidi says: August 30, 2017 at 5:11 pm
Please provide the dictionary for APPOSTOPHES. My mail id is [email protected]

Rushat says: August 30, 2017 at 9:48 pm
I know I'm a year late :P, but could I please receive this dataset too?

Shivam Bansal says: September 26, 2017 at 6:40 pm
Hi, please download the apostrophes dictionary from this link - https://drive.google.com/file/d/0B1yuv8YaUVlZZ1RzMFJmc1ZsQmM/view?usp=sharing

Mohit says: September 30, 2017 at 11:21 pm
Hi, you need to import from the html package. I think this page has the old Python implementation. from html.parser import HTMLParser