Text Cleaning Methods in NLP
This article was published as a part of the Data Science Blogathon.
In any machine learning or data analysis task, the first and foremost step is to clean and process the data. Cleaning is important for model building. How data is cleaned depends on the type of data, and cleaning is even more vital when the data is textual.
There are various text processing techniques that we can apply to text data, but we need to be careful when choosing and applying them, because the right processing steps depend on the use case.
For example, in sentiment analysis we don’t need to remove emojis or emoticons from the text, as they convey the sentiment of the text. In this article, we will see some common methods for cleaning textual data, along with their code.
Text cleaning is task-specific: you need a clear idea of what you want the end result to be, and you should review the data to see what exactly you can achieve.
Take a couple of minutes and explore the data. What do you notice at a first glance?
Here’s what a trained eye sees:
- Too many typos or spelling mistakes in the text
- Too many numbers and punctuation marks (e.g. Love!!!!)
- Text full of emojis, emoticons, usernames, and links (if the text is from Twitter or Facebook)
- Some parts of the text are not in English; the data is a mixture of more than one language
- Some words are joined with a hyphen, or the data contains contractions (e.g. text-processing)
- Repetitions of words (e.g. Data)
Honestly, there are many more things that a trained eye can see, but this article sticks to the general ones to give an overview.
Most common methods for Cleaning the Data
We will see how to code each of the following methods for cleaning textual data.
- Lowercasing the data
- Removing punctuations
- Removing numbers
- Removing extra space
- Replacing the repetitions of punctuations
- Removing emojis
- Removing emoticons
- Removing contractions
Importing the library
import pandas as pd
Let’s read the sample data
Lower Casing the Data
At first glance, the simplest step is to lower case the data. The idea is to convert the input text into a single casing format, so that ‘DATA’, ‘Data’, ‘DaTa’, and ‘DATa’ all become ‘data’.
In some use cases, lower casing is already done as part of the tokenizer or vectorization process. But apply lower casing with care: in sentiment analysis, lower casing the text can hide what a word is actually conveying. For example, a word written entirely in upper case may signal anger.
Here, we will use the lower() method to convert all the text into one common lower case format.
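As a minimal sketch of this step (the helper name to_lower and the sample strings are only illustrative), lower casing with the built-in str.lower() could look like this:

```python
def to_lower(text: str) -> str:
    """Return the text with every cased character converted to lower case."""
    return text.lower()


variants = ["DATA", "Data", "DaTa", "DATa"]
lowered = [to_lower(v) for v in variants]
print(lowered)  # ['data', 'data', 'data', 'data']
```

When the text lives in a pandas column, the same conversion is available through the column's .str.lower() accessor.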
Removing Punctuations
The second most common text processing technique is removing punctuation from the textual data. Punctuation removal helps treat equivalent tokens the same way. For example, the words data and data! are treated as the same word once the punctuation is removed.
We need to take care while removing punctuation, because contractions lose their form in the process. For example, ‘don’t’ becomes ‘dont’ or ‘don t’, depending on what you replace the punctuation with.
We also need to be extra careful when choosing the list of punctuation marks to exclude, depending on the use case. Python's string.punctuation contains these symbols: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
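One possible way to remove punctuation while keeping control over the excluded symbols is sketched below. The helper name remove_punctuation and its keep parameter are assumptions made for this example, not part of any library.

```python
import string


def remove_punctuation(text: str, keep: str = "") -> str:
    """Delete every character in string.punctuation except those listed in `keep`."""
    to_remove = "".join(ch for ch in string.punctuation if ch not in keep)
    # str.maketrans with a third argument maps each listed character to None.
    return text.translate(str.maketrans("", "", to_remove))


sample = "I don't like this dress!"
print(remove_punctuation(sample))            # I dont like this dress
print(remove_punctuation(sample, keep="'"))  # I don't like this dress
```

Keeping the apostrophe, as in the second call, is one way to leave contractions readable when they have not been expanded yet.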
Removing Numbers
Depending on the use case, numbers sometimes don’t hold any vital information in the text, so it is better to remove them than to keep them.
For example, in sentiment analysis numbers usually don’t carry any specific meaning, but if the task is NER (Named Entity Recognition) or POS (Part of Speech) tagging, then apply number removal carefully.
Here, we use the isdigit() method to check each character, and every digit we encounter is dropped from the text.
# text holds one review string from the sample data
ans = ''.join([i for i in text if not i.isdigit()])
ans
#Output
'I had such high hopes for this dress size or (my usual size) to work for me.'
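For tasks where numbers inside tokens matter, a gentler variant is to remove only standalone numbers. The sketch below contrasts the two approaches; the function names and the sample sentence are made up for illustration.

```python
import re


def remove_digits(text: str) -> str:
    """Drop every digit character, as in the isdigit() approach."""
    return "".join(ch for ch in text if not ch.isdigit())


def remove_number_tokens(text: str) -> str:
    """Drop only standalone runs of digits, leaving tokens such as 'B2B' intact."""
    return re.sub(r"\b\d+\b", "", text)


sample = "Ordered 2 dresses from a B2B store"
print(repr(remove_digits(sample)))         # 'Ordered  dresses from a BB store'
print(repr(remove_number_tokens(sample)))  # 'Ordered  dresses from a B2B store'
```

Both variants leave a double space where the number used to be, which the extra-space step takes care of.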
Removing Extra Space
Removing extra spaces is good practice: it saves memory and makes the data easier to read.
# split() breaks the text on any run of whitespace; join() puts single spaces back
ans = " ".join(text.split())
ans
#Output
'I had such high hopes for this dress 15 size or (my usual size) to work for me.'
Replacing the Repetitions of Punctuations
Knowing regular expressions will help you code this faster and more easily. Removing repeated punctuation is helpful because a run of the same punctuation mark doesn’t hold any vital information. For example, data!!! should be converted to data.
Let’s first see how to handle a single repeated punctuation mark. Here, we turn dress!!!! into dress by targeting only the exclamation mark.
import re
text1 = "I had such... high hopes for this dress!!!!"
# \1 refers back to the captured character, so (!)\1+ matches two or more '!' in a row
ans = re.sub(r'(!)\1+', '', text1)
ans
#Output
'I had such... high hopes for this dress'
What if the text contains more than one kind of repeated punctuation? Let’s look at the example below.
import re
text1 = "I had such... high hopes for this dress!!!!"
# the dot is escaped so that it matches a literal '.' rather than any character
ans = re.sub(r'(!|\.)\1+', '', text1)
ans
#Output
'I had such high hopes for this dress'
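If you would rather keep one punctuation mark than drop the whole run, the same backreference idea applies. The following is only a sketch: the helper name and the choice of the three marks !, ? and . are assumptions.

```python
import re


def collapse_repeated_punctuation(text: str) -> str:
    """Replace runs of the same punctuation mark (!, ? or .) with a single mark."""
    # ([!?.]) captures one mark, \1+ matches its repeats, r"\1" keeps one copy.
    return re.sub(r"([!?.])\1+", r"\1", text)


print(collapse_repeated_punctuation("I had such... high hopes for this dress!!!!"))
# I had such. high hopes for this dress!
```

This keeps the signal that a sentence ended or was emphasized, which can matter for sentiment analysis.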
Removing Emojis
With more and more people on social media platforms, there has been a significant explosion in the use of emojis in day-to-day life. When we perform text analysis, removing emojis is in some cases the right choice, as they sometimes don’t hold any useful information.
Below is a helper function that replaces emojis with an empty string.
import re

def remove_emoji(string):
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"  # note: this wide range also covers many non-emoji characters, such as CJK text
        "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)

remove_emoji("game is on 🔥🔥")
#Output
'game is on '
The code of Removal of emojis is taken from: here
Removing Emoticons
When analyzing Twitter and Instagram data we often come across emoticons, and nowadays there is hardly any text that doesn’t contain them.
The helper function below removes emoticons from the text. The EMOTICONS dictionary consists of the symbols and names of the emoticons; you can customize EMOTICONS as per your need.
The code of Removal of emoticons is taken from: here
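A minimal sketch of such a helper might look like the following. It assumes a tiny, hypothetical EMOTICONS dictionary with plain-text keys (a real one would be much larger) and is not the referenced implementation.

```python
import re

# Hypothetical, heavily shortened stand-in for the EMOTICONS dictionary.
EMOTICONS = {
    ":-)": "Happy face smiley",
    ":)": "Happy face smiley",
    ":-(": "Frown sad",
    ":(": "Frown sad",
}

# Escape each key so characters such as ')' are matched literally.
# Longer emoticons go first so a long emoticon is never cut short by a
# shorter one that is its prefix.
_emoticon_pattern = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOTICONS, key=len, reverse=True))
)


def remove_emoticons(text: str) -> str:
    """Delete every emoticon listed in EMOTICONS from the text."""
    return _emoticon_pattern.sub("", text)


print(repr(remove_emoticons("Hello :-) this fits :(")))  # 'Hello  this fits '
```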
Sometimes we don’t want the emoticons, so we remove them. But there is a way around it: instead of removing emoticons, we can replace them with alternative words. For example, the “:-)” emoticon can be replaced with text such as Happy face smiley or any custom name you like.
# EMOTICONS maps each emoticon to its name; the keys are used as regex patterns here
def convert_emoticons(text):
    for emot in EMOTICONS:
        text = re.sub(u'('+emot+')', "_".join(EMOTICONS[emot].replace(",","").split()), text)
    return text

text = "Hello :-)"
convert_emoticons(text)
#Output
'Hello Happy_face_smiley'
The reference of code is taken from here
Removing Contractions
There are many contractions in the text we type, so to expand them we will use the contractions library.
Twitter and Instagram data contains many contractions, and if we remove the punctuation from that text, the contractions are left broken.
For example, take the text “I’ll eat pizza”. If we remove the punctuation, the text becomes “I ll eat pizza”. Here, “I ll” doesn’t carry any information, which is why we expand the contractions first.
Installing the library
!pip install contractions
Let’s see how it’s done.
import contractions
text = "She'd like to know how I'd do that!"
contractions.fix(text)
#Output
she would like to know how I would do that!
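To illustrate why the order of the steps matters, here is a small self-contained sketch. It uses a hypothetical three-entry mapping in place of the contractions library, and the helper names are made up for this example.

```python
import re
import string

# Hypothetical three-entry mapping standing in for the contractions library.
CONTRACTIONS = {"i'll": "I will", "don't": "do not", "she'd": "she would"}

_contraction_pattern = re.compile(
    r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
    flags=re.IGNORECASE,
)


def expand_contractions(text: str) -> str:
    """Replace each known contraction with its expanded form."""
    return _contraction_pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


def strip_punctuation(text: str) -> str:
    """Replace every ASCII punctuation mark with a space."""
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    return text.translate(table)


sample = "I'll eat pizza"
print(strip_punctuation(sample))                       # I ll eat pizza
print(strip_punctuation(expand_contractions(sample)))  # I will eat pizza
```

Expanding first and stripping punctuation second keeps the meaning of the sentence, while the reverse order leaves fragments such as "ll" behind.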
We saw the most common techniques for cleaning and processing text data. In each subsection, we saw how to apply a technique and when to apply it, depending on the use case. We also saw which situations to avoid when applying these techniques to clean data for text analysis. Following this article with its code and examples will help you gain a working knowledge of text cleaning.
The media shown in this article is not owned by Analytics Vidhya and are used at the Author’s discretion.