Text Preprocessing in Python – Getting Started with NLP

Shilpijs 11 Aug, 2021 • 12 min read
This article was published as a part of the Data Science Blogathon.


Introduction

NLP, or Natural Language Processing, is the science of processing, understanding, and generating human language by machines. Using NLP, information can be extracted from unstructured data, systems can be trained to generate responses to human queries, and text can be classified into appropriate categories. News articles, social media posts, and online reviews are some of the publicly available sources that are rich in information. NLP is used to derive meaningful insights from these sources, but training NLP algorithms directly on text in its free form can introduce a lot of noise and add unnecessary complexity. To derive meaningful insights from such unstructured data, it needs to be cleansed and brought to an appropriate level for analysis.

This article covers some of the widely used preprocessing steps, providing an understanding of the structure and vocabulary of the text, along with their code in Python. The exact list of steps depends on the quality of the text, the objective of the study, and the NLP task to be performed.

  • Sentence Segmentation
  • Part of Speech Tagging
  • Removal of Special Characters
  • Removal of Stop Words
  • Removal of White Spaces
  • Document Term Matrix

Below is a corpus on NLP:

Document 1 – Natural Language Processing (NLP) is a field within Artificial Intelligence (AI) that is concerned with how computers deal with human language. We are already interacting with such machines in our day-to-day life in the form of IVRs & chat-bots. But do Machines really understand human language, context, syntax, semantics, etc.? Yes, they can be trained to do so!

Document 2 – Google has trained its search engine to make autofill recommendations as text is typed using NLP. Google’s search engine has the capability of understanding the meaning of words depending on the context in the search. Google’s “state-of-the-art” search engine is one of the most sophisticated examples of NLP.

Document 3 – Origination of Natural Language Processing dates back to the II world war when there was a need for machine translation between Russian & English. Today, NLP has expanded beyond these two languages and can deal with most languages, including sign language.

Document 4 – NLP is actively used for a variety of day-to-day activities like spam detection, recruitment, smart assistants, understanding customer behaviour & so on…… Usage and impact of NLP are growing exponentially, across a wide range of industries.

Document 5 – Acronym NLP is used for both Natural Language Processing & Neuro-Linguistic Programming, but, these are completely different fields of science. Neuro-Linguistic Programming deals with human-to-human interaction, whereas Natural Language Processing deals with human-to-computer interaction.

Loading the above documents into a pandas DataFrame in Python:
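
A minimal sketch, assuming the five documents are held as plain strings (abbreviated here with "..."). The column names 'doc_id' and 'text' are chosen to match the attribute access ('text_doc.doc_id', 'text_doc.text') in the segmentation code further below; the frame shown underneath uses the display headings "Document ID" and "Text".

import pandas as pd

# the five documents from the corpus above, abbreviated here
documents = [
    "Natural Language Processing (NLP) is a field within Artificial Intelligence (AI) ...",
    "Google has trained its search engine to make autofill recommendations ...",
    "Origination of Natural Language Processing dates back to the II world war ...",
    "NLP is actively used for a variety of day-to-day activities ...",
    "Acronym NLP is used for both Natural Language Processing & Neuro-Linguistic Programming ...",
]
text_doc = pd.DataFrame({'doc_id': range(1, len(documents) + 1), 'text': documents})
print(text_doc)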

   Document ID                                               Text
0            1  Natural Language Processing (NLP) is a field w...
1            2  Google has trained it's search engine to make ...
2            3          Origination of Natural Language Proces...
3            4  NLP is actively used for a variety of day-to-d...
4            5   Acronym  NLP is used for both Natural Languag...

 

Sentence Segmentation

Breaking paragraphs into sentences makes them more manageable, and the context within each sentence can be better understood.

In the first document, the first couple of sentences are informative, followed by a question in a rising tone. The tone of an individual sentence gets masked when looking at the complete extract, and cannot be deciphered once the text is further broken into tokens.
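
Before running the code below for the first time, NLTK's model files need a one-time download (resource names as per current NLTK releases; very recent versions may additionally ask for 'punkt_tab'):

import nltk
nltk.download('punkt')                       # sentence tokenizer models used by sent_tokenize
nltk.download('averaged_perceptron_tagger')  # POS tagger models, used further below
nltk.download('stopwords')                   # stop word lists, used further below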

 

import pandas as pd
from nltk.tokenize import sent_tokenize
# breaking every document into sentences
doc_w_sent = [sent_tokenize(text) for text in text_doc.text]
# creating document ID & sentence ID for reference
doc_num_list = [[x] * y for x, y in zip(text_doc.doc_id, [len(doc) for doc in doc_w_sent])]
sentence_num_list = [list(range(1, len(doc)+1)) for doc in doc_w_sent]
# un-nesting lists
doc_w_sent = [x for element in doc_w_sent for x in element]
doc_num_list = [x for element in doc_num_list for x in element]
sentence_num_list = [x for element in sentence_num_list for x in element]
# creating dataframe
text_data = pd.DataFrame({'Document ID' : doc_num_list, 'Sentence ID' : sentence_num_list, 'Text' : doc_w_sent})
print(text_data)

NLTK is a Python package for NLP tasks. As the name suggests, ‘sent_tokenize’ breaks paragraphs into sentences based on end-of-sentence punctuation marks – the period, question mark, and exclamation mark.

Output:
    Document ID  Sentence ID                                               Text
0             1            1  Natural Language Processing (NLP) is a field w...
1             1            2  We are already interacting with such machines ...
2             1            3  But do Machines really understand human langua...
3             1            4                 Yes, they can be trained to do so!
4             2            1  Google has trained it's search engine to make ...
5             2            2  Google's search engine has the capability of ...
6             2            3  Google's "state-of-the-art" search engine is o...
7             3            1          Origination of Natural Language Proces...
8             3            2  Today, NLP has expanded beyond these two langu...
9             4            1  NLP is actively used for a variety of day-to-d...
10            4            2  Usage and impact of NLP is growing exponential...
11            5            1   Acronym  NLP is used for both Natural Languag...
12            5            2  Neuro Linguistic Programming deals with human-...

A list is created with each sentence as an individual list element. Through “Document ID” and “Sentence ID”, it can be inferred that documents 1 through 5 contain 4, 3, 2, 2, 2 sentences respectively.
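
The same counts can be read off programmatically:

# number of sentences per document
print(text_data.groupby('Document ID').size())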

 

Part of Speech Tagging

Part of Speech tagging is the process of assigning a label (part of speech) to each word in a sentence, based on the context in which it is used and on the meaning of the word. It’s critical for Named Entity Recognition (NER), understanding the relationships between words, developing linguistic rules, and lemmatization.

import nltk

# tag every word in every sentence with its part of speech
pos_tags = [nltk.pos_tag(nltk.word_tokenize(sent)) for sent in text_data.Text]
print(pos_tags)

NLTK’s part-of-speech tagger is run on every word in each sentence.

[[('Natural', 'JJ'), ('Language', 'NNP'), ('Processing', 'NNP'), ('(', '('), ('NLP', 'NNP'), (')', ')'), ('is', 'VBZ'), ('a', 'DT'), ('field', 'NN'), ('within', 'IN'), ('Artificial', 'JJ'), ('Intelligence', 'NNP'), ('(', '('), ('AI', 'NNP'), (')', ')'), ('that', 'WDT'), ('is', 'VBZ'), ('concerned', 'VBN'), ('with', 'IN'), ('how', 'WRB'), ('computers', 'NNS'), ('deal', 'VBP'), ('with', 'IN'), ('human', 'JJ'), ('language', 'NN'), ('.', '.')], [('We', 'PRP'), ('are', 'VBP'), ('already', 'RB'), ('interacting', 'VBG'), ('with', 'IN'), ('such', 'JJ'), ('machines', 'NNS'), ('in', 'IN'), ('our', 'PRP$'), ('day-to-day', 'JJ'), ('life', 'NN'), ('in', 'IN'), ('the', 'DT'), ('form', 'NN'), ('of', 'IN'), ('IVRs', 'NNP'), ('&', 'CC'), ('chat-bots', 'NNS'), ('.', '.')], [('But', 'CC'), ('do', 'VBP'), ('Machines', 'NNS'), ('really', 'RB'), ('understand', 'VBP'), ('human', 'JJ'), ('language', 'NN'), (',', ','), ('context', 'NN'), (',', ','), ('syntax', 'NN'), (',', ','), ('semantics', 'NNS'), (',', ','), ('etc', 'FW'), ('.', '.'), ('?', '.')], [('Yes', 'UH'), (',', ','), ('they', 'PRP'), ('can', 'MD'), ('be', 'VB'), ('trained', 'VBN'), ('to', 'TO'), ('do', 'VB'), ('so', 'RB'), ('!', '.')], [('Google', 'NNP'), ('has', 'VBZ'), ('trained', 'VBN'), ('it', 'PRP'), ("'s", 'VBZ'), ('search', 'JJ'), ('engine', 'NN'), ('to', 'TO'), ('make', 'VB'), ('autofill', 'JJ'), ('recommendations', 'NNS'), ('as', 'IN'), ('text', 'NN'), ('is', 'VBZ'), ('typed', 'VBN'), ('using', 'VBG'), ('NLP', 'NNP'), ('.', '.')], [('Google', 'NNP'), ("'s", 'POS'), ('search', 'NN'), ('engine', 'NN'), ('has', 'VBZ'), ('the', 'DT'), ('capability', 'NN'), ('of', 'IN'), ('understanding', 'VBG'), ('meaning', 'NN'), ('of', 'IN'), ('words', 'NNS'), ('depending', 'VBG'), ('on', 'IN'), ('the', 'DT'), ('context', 'NN'), ('in', 'IN'), ('the', 'DT'), ('search', 'NN'), ('.', '.')], [('Google', 'NNP'), ("'s", 'POS'), ('``', '``'), ('state-of-the-art', 'JJ'), ("''", "''"), ('search', 'NN'), ('engine', 'NN'), ('is', 'VBZ'), ('one', 'CD'), ('of', 'IN'), ('the', 'DT'), ('most', 'RBS'), ('sophisticated', 'JJ'), ('examples', 'NNS'), ('of', 'IN'), ('NLP', 'NNP'), ('.', '.')], [('Origination', 'NN'), ('of', 'IN'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('dates', 'VBZ'), ('back', 
'RB'), ('to', 'TO'), ('II', 'NNP'), ('world', 'NN'), ('war', 'NN'), (',', ','), ('when', 'WRB'), ('there', 'EX'), ('was', 'VBD'), ('a', 'DT'), ('need', 'NN'), ('for', 'IN'), ('machine', 'NN'), ('translation', 'NN'), ('between', 'IN'), ('Russian', 'NNP'), ('&', 'CC'), ('English', 'NNP'), ('.', '.')], [('Today', 'NN'), (',', ','), ('NLP', 'NNP'), ('has', 'VBZ'), ('expanded', 'VBN'), ('beyond', 'IN'), ('these', 'DT'), ('two', 'CD'), ('languages', 'NNS'), ('and', 'CC'), ('can', 'MD'), ('deal', 'VB'), ('with', 'IN'), ('most', 'JJS'), ('languages', 'NNS'), (',', ','), ('including', 'VBG'), ('sign', 'JJ'), ('language', 'NN'), ('.', '.')], [('NLP', 'NNP'), ('is', 'VBZ'), ('actively', 'RB'), ('used', 'VBN'), ('for', 'IN'), ('a', 'DT'), ('variety', 'NN'), 
('of', 'IN'), ('day-to-day', 'JJ'), ('activities', 'NNS'), ('like', 'IN'), ('spam', 'NN'), ('detection', 'NN'), (',', ','), ('recruitment', 'NN'), (',', ','), ('smart', 'JJ'), ('assistants', 'NNS'), (',', ','), ('understanding', 'VBG'), ('customer', 'NN'), ('behavior', 'NN'), ('&', 'CC'), ('so', 'RB'), ('on', 'IN'), ('.', '.')], [('Usage', 'NN'), ('and', 'CC'), ('impact', 'NN'), ('of', 'IN'), ('NLP', 'NNP'), ('is', 'VBZ'), ('growing', 'VBG'), ('exponentially', 'RB'), (',', ','), ('across', 'IN'), ('a', 'DT'), ('wide', 'JJ'), ('range', 'NN'), ('of', 'IN'), ('industries', 'NNS'), ('.', '.')], [('Acronym', 'NNP'), ('NLP', 'NNP'), ('is', 'VBZ'), ('used', 'VBN'), ('for', 'IN'), ('both', 'DT'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('&', 'CC'), ('Neuro', 'NNP'), ('Linguistic', 'NNP'), ('Programming', 'NNP'), (',', ','), ('but', 'CC'), (',', ','), ('these', 'DT'), ('are', 'VBP'), ('completely', 'RB'), ('different', 'JJ'), ('fields', 'NNS'), ('of', 'IN'), ('science', 'NN'), ('.', '.')], [('Neuro', 'NNP'), ('Linguistic', 'NNP'), ('Programming', 'NNP'), ('deals', 'NNS'), ('with', 'IN'), ('human-to-human', 'JJ'), ('interaction', 'NN'), (',', ','), ('where', 'WRB'), ('as', 'IN'), ('Natural', 'NNP'), ('Language', 'NNP'), ('Processing', 'NNP'), ('deals', 'NNS'), ('with', 'IN'), ('human-to-computer', 'JJ'), ('interaction', 'NN'), ('.', '.')]]

The output is a list of (token, tag) tuples for every element in each sentence. In the first sentence of the first document, the word “Processing” is tagged as a proper noun (NNP). But the word “processing” can also be used as a verb, as in “I am processing the data.” For machines to fully understand the meaning of the text, identifying the correct part of speech is important.
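
To see the context dependence directly, tagging that example sentence shows “processing” labelled as a verb form rather than a proper noun:

import nltk
print(nltk.pos_tag(nltk.word_tokenize("I am processing the data.")))
# expected: 'processing' tagged as VBG (verb, present participle)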

 

Removal of Special Characters

Special characters usually do not add value to the text. They can be utilized to parse text on the occurrence of a particular special character, or to indicate the need for word expansions. Once they have served their purpose, they can be removed so that no redundant information is passed through the NLP algorithms.

from string import punctuation
import re

# replace every punctuation mark with a space; re.escape guards regex metacharacters
text_data.Text = [re.sub('[' + re.escape(punctuation) + ']', ' ', sent) for sent in text_data.Text]
[print(sent) for sent in text_data.Text]

The ‘string’ module’s ‘punctuation’ constant contains the following punctuation marks: !”#$%&'()*+,-./:;<=>?@[\]^_`{|}~. This list can be modified by adding/removing any character, as shown in the sketch at the end of this section.

Output:

Natural Language Processing  NLP  is a field within Artificial Intelligence  AI     that is concerned with how computers deal with human language
We are already interacting with such machines in our day to day     life in the form of IVRs   chat bots
But do Machines really understand human language  context  syntax  semantics  etc
Yes  they can be trained to do so
Google has trained it s search engine to make autofill recommendations as text is typed using NLP
Google s search engine has the capability of
understanding meaning of words depending on the context in the search
Google s  state of the art  search engine is one of the most sophisticated examples of NLP
        Origination of Natural Language Processing dates back to II world war  when there was a need for machine translation between Russian   English
Today  NLP has expanded beyond these two languages and can deal with most languages  including sign language
NLP is actively used for a variety of day to day activities like spam detection  recruitment  smart assistants  understanding customer behavior   so on
Usage and impact of NLP is growing exponentially  across a wide range of industries
 Acronym  NLP is used for both Natural Language Processing   Neuro Linguistic Programming  but  these are completely different fields of science
Neuro Linguistic Programming deals with human to human interaction  where as Natural Language Processing deals with human to computer interaction

Punctuation marks are replaced by a single space.
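
A small sketch of such a modification: dropping ‘-’ from the list (a hypothetical choice) keeps compounds like “day-to-day” and “state-of-the-art” intact.

from string import punctuation
import re

custom_punctuation = punctuation.replace('-', '')  # keep hyphens
print(re.sub('[' + re.escape(custom_punctuation) + ']', ' ', 'Google\'s "state-of-the-art" search engine'))
# Google s  state-of-the-art  search engine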

 

Removal of Stop Words

Like special characters, certain words do not add any value to the text. These are called stop words. They can belong to any part of speech. Usually, there is a general list of stop words that can be used for any NLP task but it can be modified depending on the text.

For instance, finding “with” as the most frequent token, or even finding the noun “language” as the most common token in the above paragraphs, is not useful information. These redundant words should be removed to generate truly valuable insights.
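
A quick frequency check over the sentences cleansed so far (a minimal sketch using collections.Counter) shows how such function words crowd the top of the list. The code after it then removes these stop words.

from collections import Counter

# tokenize naively on whitespace and count, before stop word removal
tokens = [word.lower() for sent in text_data.Text for word in sent.split()]
print(Counter(tokens).most_common(10))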

from nltk.corpus import stopwords
import re

stop_words = set(stopwords.words('english'))
# build one alternation pattern covering all stop words
stopwords_all = "(" + ') | ('.join([s for s in stop_words]) + ")"
# note: matching is case-sensitive, so sentence-initial words like "But" and "We" survive
text_data.Text = [re.sub(stopwords_all, ' ', sent) for sent in text_data.Text]
[print(sent) for sent in text_data.Text]

NLTK contains a predefined set of stop words for various languages. This set can be altered to add/remove any word, depending on the context and quality of the text; a sketch of such an alteration appears at the end of this section.

Natural Language Processing  NLP      field within Artificial Intelligence  AI         concerned     computers deal   human language
We   already interacting     machines     day   day     life     form   IVRs   chat bots
But   Machines really understand human language  context  syntax  semantics  etc
Yes        trained
Google   trained     search engine   make autofill recommendations   text   typed using NLP
Google   search engine     capability
understanding meaning   words depending     context     search
Google    state     art  search engine   one       sophisticated examples   NLP
        Origination   Natural Language Processing dates back   II world war          need   machine translation   Russian   English
Today  NLP   expanded beyond   two languages     deal     languages  including sign language
NLP   actively used     variety   day   day activities like spam detection  recruitment  smart assistants  understanding customer behavior
Usage   impact   NLP   growing exponentially  across   wide range   industries
 Acronym  NLP   used     Natural Language Processing   Neuro Linguistic Programming         completely different fields   science
Neuro Linguistic Programming deals   human   human interaction      Natural Language Processing deals   human   computer interaction

Helping verbs such as “is”/“are”, pronouns such as “they”, determiners such as “that”/“these”, and prepositions such as “with” have been removed. Comparing with the first sentence of the first document, “is a”, “that is”, “with”, and “how” have been dropped.
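
A sketch of altering the stop word set; the words added and removed here are illustrative choices, not part of the original pipeline:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
stop_words.update({'etc', 'would'})  # illustrative additions for this corpus
stop_words.discard('not')            # illustrative removal, e.g. to keep negations for sentiment tasks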

 

Removal of White Spaces

White space characters are commonly found in any text, and extra ones can also be introduced as part of other cleansing exercises (as seen in the output above). They are unnecessary and irrelevant. The code below can be used to remove any such white space characters.

import re

# collapse runs of whitespace characters into a single space
text_data.Text = [re.sub(r'\s+|\t+|\n+|\r+|\f+', ' ', sent).strip() for sent in text_data.Text]
[print(sent) for sent in text_data.Text]

\s, \t, \n, \r, and \f represent, respectively, any whitespace character, tab, line feed (newline), carriage return, and form feed. The ‘strip()’ function removes any leading and trailing spaces.

Output:

Natural Language Processing NLP field within Artificial Intelligence AI concerned computers deal human language
We already interacting machines day day life form IVRs chat bots
But Machines really understand human language context syntax semantics etc
Yes trained
Google trained search engine make autofill recommendations text typed using NLP
Google search engine capability understanding meaning words depending context search
Google state art search engine one sophisticated examples NLP
Origination Natural Language Processing dates back II world war need machine translation Russian English
Today NLP expanded beyond two languages deal languages including sign language
NLP actively used variety day day activities like spam detection recruitment smart assistants understanding customer behavior
Usage impact NLP growing exponentially across wide range industries
Acronym NLP used Natural Language Processing Neuro Linguistic Programming completely different fields science
Neuro Linguistic Programming deals human human interaction Natural Language Processing deals human computer interaction

Multiple white space characters are replaced by a single space and any leading and trailing blanks are removed.
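
Since \s already matches every one of these characters, the same cleanup can be written more compactly:

import re
text_data.Text = [re.sub(r'\s+', ' ', sent).strip() for sent in text_data.Text]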

 

Document Term Matrix

A Document Term Matrix provides structure to the unstructured data. It’s a basic way of creating fixed-length input for machine learning algorithms. Every document is represented as a row and every token as a column. The two most common value sets are count and TF-IDF. As the name suggests, the count is the total number of occurrences of every token in a document; it reflects how common or rare a word is within that document. TF-IDF (Term Frequency – Inverse Document Frequency) is the Term Frequency (the count of a token in a document divided by the count of all tokens in that document) weighted by the Inverse Document Frequency (the logarithm of the total number of documents divided by the number of documents containing the token). It’s used to find tokens that help distinguish and classify documents.
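
As a worked illustration of the formula, here is a minimal sketch of the textbook computation for a single token (scikit-learn’s TfidfVectorizer, used below, applies a smoothed IDF and L2 normalization, so its values will differ):

import math

def tf_idf(token, document, corpus):
    """Textbook TF-IDF for one token; assumes the token occurs in at least one document."""
    words = document.split()
    tf = words.count(token) / len(words)                   # term frequency in this document
    df = sum(1 for doc in corpus if token in doc.split())  # documents containing the token
    idf = math.log(len(corpus) / df)                       # inverse document frequency
    return tf * idf

Once 'text_doc_cleansed' is built below, a call like tf_idf('language', text_doc_cleansed[0], text_doc_cleansed) illustrates the computation.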

# roll up cleansed sentences at document level
text_doc_cleansed = text_data.groupby('Document ID')['Text'].apply(list)
text_doc_cleansed = [' '.join(doc) for doc in text_doc_cleansed]
print(text_doc_cleansed)

Output: sentences rolled up at the document level.

['Natural Language Processing NLP field within Artificial Intelligence AI concerned computers deal human language We already interacting machines day day life form IVRs chat bots But Machines really understand human language context syntax semantics etc Yes trained', 'Google trained search engine make autofill recommendations text typed using NLP Google search engine capability understanding meaning words depending context search Google state art search engine one sophisticated examples NLP', 'Origination Natural Language Processing dates back II world war need machine translation Russian English Today NLP expanded beyond two languages deal languages including sign language', 'NLP actively used variety day day activities like spam detection recruitment smart assistants understanding customer behavior Usage impact NLP growing exponentially across wide range industries', 'Acronym NLP used Natural Language Processing Neuro Linguistic 
Programming completely different fields science Neuro Linguistic Programming deals human human interaction Natural Language Processing deals human computer interaction']

Count

from sklearn.feature_extraction.text import CountVectorizer

countvectorizer = CountVectorizer()
countvectors = countvectorizer.fit_transform(text_doc_cleansed)
# on scikit-learn versions older than 1.0, use get_feature_names() instead
countfeature_names = countvectorizer.get_feature_names_out()
countdense = countvectors.todense()
countdenselist = countdense.tolist()
count_df = pd.DataFrame(countdenselist, columns=countfeature_names)
print(count_df)

Output:

   acronym  across  actively  activities  ai  already  art  artificial  assistants  autofill  back  ...  usage  used  using  variety  war  we  wide  within  words  world  yes
0        0       0         0           0   1        1    0           1           0         0     0  ...      0     0      0        0    0   1     0       1      0      0    1
1        0       0         0           0   0        0    1           0           0         1     0  ...      0     0      1        0    0   0     0       0      1      0    0
2        0       0         0           0   0        0    0           0           0         0     1  ...      0     0      0        0    1   0     0       0      0      1    0
3        0       1         1           1   0        0    0           0           1         0     0  ...      1     1      0        1    0   0     1       0      0      0    0
4        1       0         0           0   0        0    0           0           0         0     0  ...      0     1      0        0    0   0     0       0      0      0    0

The output is a dataframe with 5 rows, one per document, and 113 columns, one per token.
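
The dimensions can be confirmed directly:

print(count_df.shape)  # (5, 113): 5 documents by 113 unique tokens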

TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

tfidfvectorizer = TfidfVectorizer()
tfidfvectors = tfidfvectorizer.fit_transform(text_doc_cleansed)
# on scikit-learn versions older than 1.0, use get_feature_names() instead
tfidffeature_names = tfidfvectorizer.get_feature_names_out()
tfidfdense = tfidfvectors.todense()
tfidfdenselist = tfidfdense.tolist()
tfidf_df = pd.DataFrame(tfidfdenselist, columns=tfidffeature_names)
print(tfidf_df)

Output:

    acronym    across  actively  activities        ai   already       art  artificial  assistants  ...     using   variety       war        we      wide    within     words     world       yes 
0  0.000000  0.000000  0.000000    0.000000  0.161541  0.161541  0.000000    0.161541    0.000000  ...  0.000000  0.000000  0.000000  0.161541  0.000000  0.161541  0.000000  0.000000  0.161541
1  0.000000  0.000000  0.000000    0.000000  0.000000  0.000000  0.138861    0.000000    0.000000  ...  0.138861  0.000000  0.000000  0.000000  0.000000  0.000000  0.138861  0.000000  0.000000
2  0.000000  0.000000  0.000000    0.000000  0.000000  0.000000  0.000000    0.000000    0.000000  ...  0.000000  0.000000  0.201746  0.000000  0.000000  0.000000  0.000000  0.201746  0.000000
3  0.000000  0.204921  0.204921    0.204921  0.000000  0.000000  0.000000    0.000000    0.204921  ...  0.000000  0.204921  0.000000  0.000000  0.204921  0.000000  0.000000  0.000000  0.000000
4  0.161969  0.000000  0.000000    0.000000  0.000000  0.000000  0.000000    0.000000    0.000000  ...  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000  0.000000

The dimensions of the TF-IDF data frame are the same as those of the count data frame. Comparing the two vectorizations: with count as the value set, tokens with the same number of appearances within a document are represented by the same value, regardless of document. With the TF-IDF value set, tokens with the same number of occurrences can be represented by different values depending on the length of the document and, as mentioned earlier, on the number of documents they appear in.
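
To see this concretely, the two value sets can be compared column by column; the token ‘nlp’ here is an illustrative choice (the vectorizers lowercase tokens by default):

# side-by-side view of one token across both matrices
print(pd.DataFrame({'count': count_df['nlp'], 'tfidf': tfidf_df['nlp']}))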

Word Cloud

A word cloud is a text visualization technique in which tokens are displayed with their size representing their frequency or importance in the text.

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# size each token by its total TF-IDF weight across all documents
wordcloud = WordCloud(background_color="white", width=3000, height=2000, max_words=500).generate_from_frequencies(tfidf_df.T.sum(axis=1))
plt.imshow(wordcloud)
plt.axis('off')
plt.show()
[Output: a word cloud with tokens sized by their TF-IDF weights]

Conclusion

The above steps are not an exhaustive list for cleansing text completely and gaining a full understanding of its structure, syntax, and semantics. There are more tasks, such as changing case, expanding contractions, harmonizing text (where the same entity is represented in more than one way), spell checking, dependency parsing, and so on. The steps to be performed depend on the NLP objective. For instance, in the case of text summarization/classification, the text would be studied in its entirety, whereas for Named Entity Recognition / Part of Speech tagging, paragraphs might be broken into sentences or tokens. The quality of the text and the objective of the study play a huge role in determining the level of preprocessing.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
