Tokenization and Text Normalization
- Text data is a type of unstructured data used in natural language processing.
- Understand how to preprocess text data before feeding it to machine learning algorithms.
Text data is a form of unstructured data. The most prominent examples of text data on the internet are social media data such as tweets, posts, and comments; conversation data such as messages, emails, and chats; and article data such as news articles and blogs.
Text data is essentially the written form of a natural language such as Hindi, English, or Russian. It consists of characters and words arranged in a meaningful, ordered manner, which means that text data is governed by grammar rules and defined structures.
In order to work with text data, we must first transform the raw text into a form that machine learning algorithms can understand and use. This is called text preprocessing.
In this article, we are going to discuss different terms and techniques related to natural language processing.
First, let’s discuss the meaning of some important terms: corpus, token, and n-gram.
A corpus is a collection of text documents. For example, a dataset of news articles is a corpus; similarly, a collection of tweets from Twitter is a corpus.
A corpus consists of documents, documents contain paragraphs, paragraphs consist of sentences, and finally, sentences are made up of tokens.
Tokens are the basic meaningful units of a sentence or document. They can be words, phrases, subwords such as n-grams, or characters.
An n-gram is a combination of N consecutive words or characters. For example, the sentence “I Love My Phone” can be broken into n-grams of different sizes.
Unigrams contain a single token, such as “I” or “Love”. Bigrams contain two consecutive tokens, for example “I Love” or “Love My”. Similarly, trigrams contain three consecutive tokens, such as “I Love My”.
N-grams are very useful in text classification tasks.
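The n-gram breakdown above can be sketched in a few lines of Python; the `ngrams` helper here is our own illustration, not a library function:

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "I Love My Phone".split()
print(ngrams(tokens, 1))  # unigrams: [('I',), ('Love',), ('My',), ('Phone',)]
print(ngrams(tokens, 2))  # bigrams:  [('I', 'Love'), ('Love', 'My'), ('My', 'Phone')]
print(ngrams(tokens, 3))  # trigrams: [('I', 'Love', 'My'), ('Love', 'My', 'Phone')]
```

Libraries such as NLTK provide an equivalent `ngrams` utility, but the idea is just a sliding window over the token list.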
Now that we have a clear idea of the basic terms, let’s look at a few techniques used in text preprocessing.
Tokenization is the process of splitting a text object into smaller units known as tokens. Examples of tokens can be words, characters, numbers, symbols, or n-grams.
The most common tokenization process is whitespace (or unigram) tokenization, in which the entire text is split into words at whitespace characters. For example, the sentence “I Went To New-York” is split into the unigrams “I”, “Went”, “To”, and “New-York”.
Notice that “New-York” is not split further because the tokenization is based on whitespace only.
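In Python, whitespace tokenization is simply the built-in `str.split`, as this minimal sketch shows:

```python
sentence = "I Went To New-York"
tokens = sentence.split()  # splits on runs of whitespace only
print(tokens)  # ['I', 'Went', 'To', 'New-York'] -- "New-York" stays one token
```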
Another type of tokenization is regular expression tokenization, in which a regular expression pattern is used to extract the tokens. For example, a string containing multiple delimiters can be split into tokens by passing a suitable splitting pattern.
Tokenization can be performed at the sentence level, the word level, or even the character level.
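Regular expression tokenization can be sketched with Python’s standard `re` module; the input string and delimiter pattern below are hypothetical examples of our own:

```python
import re

# Hypothetical string with mixed delimiters: comma, semicolon, and space
text = "Football,Cricket;Golf Tennis"
tokens = re.split(r"[,;\s]+", text)  # split on any run of the delimiters
print(tokens)  # ['Football', 'Cricket', 'Golf', 'Tennis']
```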
In the field of linguistics and NLP, a morpheme is defined as the base form of a word. A token is basically made up of two components: a morpheme and an inflectional form such as a prefix or suffix.
For example, the word antinationalist (anti + national + ist) is made up of the morpheme national and the inflectional forms anti and ist.
Normalization is the process of converting a token into its base form. In the normalization process, the inflectional form of a word is removed so that the base form is obtained. So in our example above, the normal form of antinationalist is national.
Normalization helps reduce the number of unique tokens in the text, removes variations, and cleans the text by removing redundant information.
Two popular methods used for normalization are stemming and lemmatization. Let’s discuss them in detail.
Stemming is an elementary rule-based process for removing inflectional forms from a given token. The output of stemming is the stem of a word. For example, laughing, laughed, laughs, and laugh all become laugh after stemming.
Stemming is not always a good normalization process, since it can produce non-meaningful words that are not present in the dictionary. Consider the sentence “His teams are not winning”. After stemming we get “hi team are not winn”. Notice that “winn” is not a regular word, and “hi” has changed the context of the entire sentence.
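A toy rule-based stemmer that strips common suffixes shows both the idea and the failure mode. This is purely illustrative; real stemmers such as NLTK’s PorterStemmer apply far more elaborate rules:

```python
def naive_stem(word):
    """Strip one common inflectional suffix if the remainder is long enough.
    Illustrative only -- a real stemmer has many more rules."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

print([naive_stem(w) for w in ["laughing", "laughed", "laughs", "laugh"]])
# ['laugh', 'laugh', 'laugh', 'laugh']
print(naive_stem("winning"))  # 'winn' -- not a dictionary word
```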
Lemmatization is a systematic process of removing the inflectional form of a token and transforming it into its lemma. It makes use of word structure, vocabulary, part-of-speech tags, and grammar relations.
The output of lemmatization is a root word called the lemma. For example, “am”, “are”, and “is” are converted to “be”. Similarly, “running”, “runs”, and “ran” are replaced by “run”.
Also, since lemmatization is a systematic process, one can specify the part-of-speech tag for the desired term.
Further, lemmatization works correctly only if the given word has the proper part-of-speech tag. For instance, if we try to lemmatize the word “running” as a verb, it is converted to “run”. But if we try to lemmatize the same word as a noun, it is not transformed.
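The part-of-speech dependence can be illustrated with a minimal lookup-based lemmatizer. The lemma table below is our own toy example; real lemmatizers such as NLTK’s WordNetLemmatizer use a full vocabulary and morphological analysis rather than a hand-written dictionary:

```python
# Toy lemma table keyed by part-of-speech tag -- illustrative only.
LEMMAS = {
    "verb": {"am": "be", "are": "be", "is": "be",
             "running": "run", "runs": "run", "ran": "run"},
    "noun": {},  # "running" as a noun is already its own lemma
}

def lemmatize(word, pos="noun"):
    """Look up the lemma for (word, pos); fall back to the word itself."""
    return LEMMAS.get(pos, {}).get(word, word)

print(lemmatize("running", pos="verb"))  # run
print(lemmatize("running", pos="noun"))  # running
```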
That is all about tokenization and text normalization. To summarize, in this article we saw the techniques used for text preprocessing, looked at text normalization using stemming and lemmatization, and saw how the two differ from each other.
If you have any questions, let me know in the comments section!