NLP Preprocessing Steps in Easy Way

Shankar297 31 May, 2022 • 5 min read

This article was published as a part of the Data Science Blogathon.

Introduction to NLP Preprocessing

Hello friends. In this article, we will discuss the text preprocessing techniques used in NLP. In any machine learning task, cleaning or preprocessing the data is as important as building the model. Text is one of the most unstructured forms of data, and human language makes it especially complex to handle. Have you ever wondered how Alexa, Siri, and Google Assistant can understand, process, and respond to human language? NLP is the technology behind them, and a lot of text preprocessing takes place before any response is produced. This tutorial covers the main text preprocessing techniques you should know to work with text data.


What is Natural Language, and What is NLP (Natural Language Processing)?

Natural Language: Any language that humans use to communicate with each other is a natural language.

NLP: Natural language processing (NLP) is a field of artificial intelligence in which computers analyze, understand, and derive meaningful information from human language.

What is Text Analytics/ Text Mining?

Text Analytics is the process of deriving meaningful information from natural language text. A large part of that work is the preprocessing required before the data is given to a model.

Why do we need to do text preprocessing?

Preprocessing removes unhelpful parts of the data, or noise: it converts all characters to lowercase and removes stop words, punctuation marks, and typos present in the data. Cleaner input often improves the accuracy of the model.

Text Preprocessing Techniques

1. Lowercasing

String comparison in Python is case-sensitive, so if both Ram and ram appear in our data, they are treated as two different words. That is why we convert all text to lowercase.

df['text'] = df['text'].str.lower()
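To see why this matters, here is a minimal sketch in plain Python (no pandas). The sample sentence is made up for illustration; the sketch simply counts words before and after lowercasing.

```python
from collections import Counter

# Hypothetical sample text, used only for illustration.
text = "Ram went home and ram slept"

# Without lowercasing, 'Ram' and 'ram' are counted as two different words.
raw_counts = Counter(text.split())

# After lowercasing, both spellings collapse into a single entry.
lower_counts = Counter(text.lower().split())

print(raw_counts["Ram"], raw_counts["ram"])  # 1 1
print(lower_counts["ram"])                   # 2
```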

2. Remove HTML Tags

HTML tags carry no useful information for the model, so we remove them.

A simple way to remove HTML tags is with a regex.

import re

def remove_html_tags(text):
    # Match '<', then as few characters as possible, then '>'
    pattern = re.compile(r'<.*?>')
    return pattern.sub('', text)
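As a quick check of the regex approach, the sketch below runs it on a made-up snippet and also decodes HTML entities such as &amp;amp;. The pattern `<.*?>`, the helper name `strip_html`, and the sample string are assumptions for illustration. A regex is usually fine for simple markup, while messy real-world HTML is often safer to handle with a proper parser.

```python
import html
import re

# Non-greedy match: '<', then as few characters as possible, then '>'.
TAG_PATTERN = re.compile(r'<.*?>')

def strip_html(text):
    """Drop tags, then turn entities such as &amp; back into characters."""
    no_tags = TAG_PATTERN.sub('', text)
    return html.unescape(no_tags)

sample = '<p>Movie was <b>great</b> &amp; fun</p>'
print(strip_html(sample))  # Movie was great & fun
```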



3. Remove URLs

URLs rarely help the model either, so we remove them as well.

Again, we can use a regex for this.

import re

def remove_url(text):
    # \S+ matches a run of non-whitespace characters; the dot after www is escaped
    pattern = re.compile(r'https?://\S+|www\.\S+')
    return pattern.sub('', text)
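Here is a small usage sketch of the same idea on a made-up sentence. The helper name `strip_urls` and the example links are assumptions; the extra step that collapses leftover spaces is an addition for readability, not part of the original function.

```python
import re

# \S+ means "one or more non-whitespace characters".
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')

def strip_urls(text):
    """Remove anything that looks like an http(s) or www link."""
    cleaned = URL_PATTERN.sub('', text)
    # Removing a URL leaves double spaces behind, so normalise whitespace.
    return ' '.join(cleaned.split())

sample = 'Read more at https://example.com/post?id=1 or www.example.org today'
print(strip_urls(sample))  # Read more at or today
```

Note that `\S+` also swallows punctuation glued to the end of a link, which is usually acceptable at this stage.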



4. Removing Punctuation

If we do not remove punctuation, punctuation marks are treated as words of their own, so we remove them.

import string

exclude = string.punctuation

def remove_punct(text):
    # Remove each punctuation character in turn
    for char in exclude:
        text = text.replace(char, '')
    return text

A hand-written loop like this can be slow on large datasets. Using str.translate with a table built by str.maketrans is usually faster.

def remove_punct(text):
    return text.translate(str.maketrans('', '', exclude))
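The sketch below puts the two versions side by side so you can check that they agree and time them on your own machine. The function names `remove_punct_loop` and `remove_punct_translate` and the sample string are assumptions for illustration, and timings vary between machines, so no numbers are claimed here.

```python
import string
import timeit

exclude = string.punctuation

def remove_punct_loop(text):
    # One full pass over the text per punctuation character.
    for char in exclude:
        text = text.replace(char, '')
    return text

# Build the translation table once; it maps every punctuation mark to None.
PUNCT_TABLE = str.maketrans('', '', exclude)

def remove_punct_translate(text):
    # A single pass over the text using the prebuilt table.
    return text.translate(PUNCT_TABLE)

sample = 'Hello!!! How are you, friend? (Fine, thanks.)' * 100

# Both versions should produce the same cleaned text.
assert remove_punct_loop(sample) == remove_punct_translate(sample)

# Timings depend on the machine, so no expected numbers are shown.
print('loop     :', timeit.timeit(lambda: remove_punct_loop(sample), number=200))
print('translate:', timeit.timeit(lambda: remove_punct_translate(sample), number=200))
```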



5. Chat word Treatment

In casual chat we often use abbreviations. We expand these short forms into their full forms.

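# Add chat words in the form of a dictionary, e.g. {'ASAP': 'as soon as possible'}
chat_words = {}

def chat_conversion(text):
    new_text = []
    for w in text.split():
        if w.upper() in chat_words:
            # Known abbreviation: use its full form
            new_text.append(chat_words[w.upper()])
        else:
            new_text.append(w)
    return ' '.join(new_text)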
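To make the lookup concrete, here is a minimal sketch with a tiny dictionary. The three entries and the helper name `expand_chat_words` are hypothetical; a real chat-word list would be much longer.

```python
# Tiny hypothetical dictionary; a real one would hold many more entries.
chat_words = {
    'ASAP': 'as soon as possible',
    'FYI': 'for your information',
    'IMO': 'in my opinion',
}

def expand_chat_words(text):
    """Replace each known abbreviation with its full form."""
    new_text = []
    for w in text.split():
        # Keys are uppercase, so the lookup is case-insensitive.
        new_text.append(chat_words.get(w.upper(), w))
    return ' '.join(new_text)

print(expand_chat_words('fyi the report is needed asap'))
# for your information the report is needed as soon as possible
```

A word with punctuation attached, such as "asap!", would not match, so it helps to run this step after punctuation has been removed.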


6. Spelling Correction

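from textblob import TextBlob
incorrect_text = 'any tezt with for checing'
textblob = TextBlob(incorrect_text)
# correct() returns a new TextBlob with the most likely spelling of each word
print(textblob.correct())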



7. Removing Stop Words

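Stop words such as 'the', 'is', and 'and' help to form a sentence but add little to its meaning, so they are usually removed.

from nltk.corpus import stopwords
stop_words = set(stopwords.words('english'))  # needs nltk.download('stopwords') once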
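The filtering step itself is simple. The sketch below uses a small hand-written stop word set so that it runs without downloading anything; the set and the helper name `remove_stopwords` are assumptions for illustration, and the list that NLTK returns from `stopwords.words('english')` is much longer.

```python
# Small hand-picked set, assumed here only for illustration.
STOP_WORDS = {'a', 'an', 'the', 'is', 'are', 'am', 'to', 'of',
              'in', 'for', 'and', 'i', 'my', 'with'}

def remove_stopwords(text):
    """Keep only the words that are not in the stop word set."""
    kept = [w for w in text.split() if w.lower() not in STOP_WORDS]
    return ' '.join(kept)

print(remove_stopwords('History is the best subject for teaching'))
# History best subject teaching
```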



8. Handling Emojis

Machine learning models do not understand emojis directly.

We can handle emojis in two ways:

  1. Remove the emojis
  2. Replace each emoji with its meaning
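Both approaches can be sketched with the standard library alone. The character ranges below are a simplified assumption that covers the most common emoji blocks, not a complete definition of what counts as an emoji, and the helper names are hypothetical. For the second approach, the third-party emoji package offers a `demojize` function that does a more thorough job.

```python
import re
import unicodedata

# Simplified pattern covering the most common emoji blocks.
# It is an assumption for this sketch, not a complete emoji definition.
EMOJI_PATTERN = re.compile(
    '['
    '\U0001F300-\U0001FAFF'  # pictographs, emoticons, transport, newer symbols
    '\u2600-\u27BF'          # miscellaneous symbols and dingbats
    ']'
)

def remove_emojis(text):
    """Approach 1: delete emoji characters."""
    return EMOJI_PATTERN.sub('', text)

def emojis_to_text(text):
    """Approach 2: replace each emoji with its Unicode name."""
    def name_of(match):
        name = unicodedata.name(match.group(), 'emoji')
        return ':' + name.lower().replace(' ', '_') + ':'
    return EMOJI_PATTERN.sub(name_of, text)

sample = 'I love this movie \U0001F600'   # ends with the grinning face emoji
print(remove_emojis(sample).strip())      # I love this movie
print(emojis_to_text(sample))             # I love this movie :grinning_face:
```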

9. Tokenization

In tokenization, we break text into smaller units called tokens. We can tokenize text into words or into sentences.

sent = 'I am going to Delhi. With my parents.'

1. Using the split function
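A minimal sketch of both kinds of tokenization with `str.split`, using the sample sentence from above. The variable names `words` and `sentences` are chosen here for illustration.

```python
sent = 'I am going to Delhi. With my parents.'

# Word tokenization: split on whitespace.
# Punctuation stays attached to the neighbouring word ('Delhi.').
words = sent.split()
print(words)
# ['I', 'am', 'going', 'to', 'Delhi.', 'With', 'my', 'parents.']

# Sentence tokenization: split on the full stop and drop empty pieces.
sentences = [s.strip() for s in sent.split('.') if s.strip()]
print(sentences)
# ['I am going to Delhi', 'With my parents']
```

This works for simple text, but it breaks on abbreviations such as "Dr." and leaves punctuation glued to words, which is why library tokenizers are usually preferred.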


2. Using NLTK Library

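import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
# Both calls need NLTK's punkt tokenizer data to be downloaded first
print(word_tokenize(sent))
print(sent_tokenize(sent))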

3. Using Spacy Library

import spacy
nlp = spacy.load('en_core_web_sm')
doc1 = nlp(sent)
for token in doc1:
    print(token)

10. Stemming/Lemmatization

Stemming: Stemming maps a group of related words to the same stem by removing prefixes or suffixes, without regard to whether the resulting stem is a grammatically valid word.

In simple words, stemming removes suffixes and prefixes from a word.
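The idea of suffix stripping can be sketched in a few lines. This is a toy illustration with a hand-picked suffix list and a hypothetical helper name `naive_stem`; it handles suffixes only and is not the Porter algorithm, which applies many more rules.

```python
# Toy suffix list, chosen for illustration only.
# This is NOT the Porter algorithm, which applies many more rules.
SUFFIXES = ('ing', 'ed', 'es', 's')

def naive_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

for w in ['teaching', 'walked', 'boxes', 'studies']:
    print(w, '->', naive_stem(w))
# teaching -> teach
# walked -> walk
# boxes -> box
# studies -> studi
```

The last result, "studi", shows that a stem does not have to be a real word.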

Stemmer: an algorithm that performs stemming.
1. Porter Stemmer: designed for the English language
2. Snowball Stemmer: supports multiple languages
3. Lancaster Stemmer

# import the Porter stemmer from the nltk library
import nltk
from nltk.stem.porter import PorterStemmer
# create the stemmer object
porter_stemmer = PorterStemmer()
sent = 'History is the best subject for teaching'
tokens = nltk.word_tokenize(sent)
for word in tokens:
    print(word, '--->', porter_stemmer.stem(word))



Lemmatization: Like stemming, lemmatization reduces a word to its base form, but unlike stemming it takes the actual meaning of the word into account, so the result is a valid dictionary word.

In lemmatization, we look up words in WordNet.
import nltk
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
sent = 'History is the best subject for teaching'
tokens = nltk.word_tokenize(sent)
for word in tokens:
    print(word, '--->', lemmatizer.lemmatize(word, pos='v'))


In this article, we learned different NLP preprocessing steps. The key takeaways from the article are:

  • We learned different preprocessing steps such as lowercasing, removing HTML tags, tokenization, and removing stop words and punctuation.
  • After cleaning the data with the preprocessing steps above, model accuracy may increase.
  • Various Python libraries such as NLTK, spaCy, and TextBlob can be used.
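Several of these steps can be chained into a single function. The sketch below is one possible ordering using only the standard library; the function name `preprocess`, the small stop word set, and the sample review are assumptions for illustration.

```python
import re
import string

TAG_PATTERN = re.compile(r'<.*?>')
URL_PATTERN = re.compile(r'https?://\S+|www\.\S+')
PUNCT_TABLE = str.maketrans('', '', string.punctuation)
# Hand-picked stop words, assumed for illustration only.
STOP_WORDS = {'a', 'an', 'the', 'is', 'at', 'this', 'it'}

def preprocess(text):
    """Return a list of cleaned tokens."""
    text = text.lower()                   # lowercasing
    text = TAG_PATTERN.sub(' ', text)     # remove HTML tags
    text = URL_PATTERN.sub(' ', text)     # remove URLs
    text = text.translate(PUNCT_TABLE)    # remove punctuation
    tokens = text.split()                 # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]  # remove stop words

print(preprocess('<p>This movie is GREAT!!! Watch it at https://example.com</p>'))
# ['movie', 'great', 'watch']
```

The order matters: URLs are removed before punctuation, because stripping punctuation first would break the links into pieces that the URL pattern no longer matches.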

So, this was all about Preprocessing Steps. Hope you liked the article.

