Text Preprocessing in NLP with Python Codes

Deepanshi 12 Jun, 2024 • 7 min read

Text preprocessing is an essential step in natural language processing (NLP) that involves cleaning and transforming unstructured text data to prepare it for analysis. It includes tokenization, stemming, lemmatization, stop-word removal, and part-of-speech tagging. In this article, we will introduce the basics of text preprocessing and provide Python code examples to illustrate how to implement these tasks using the NLTK library. By the end of the article, readers will better understand how to prepare text data for NLP tasks.

Learning Outcomes

  • Learn about the essential steps in text preprocessing using Python, including tokenization, stemming, lemmatization, and stop-word removal.
  • Discover the importance of text preprocessing in improving data quality and reducing noise for effective NLP analysis.
  • Learn how to clean and prepare text data using Python and the NLTK library with practical code examples.
  • Explore the differences between stemming and lemmatization and their impact on word meaning and context.
  • Understand the application of preprocessing techniques on SMS spam data to prepare it for model building.

This article was published as a part of the Data Science Blogathon

What is Text Preprocessing in NLP?

Natural Language Processing (NLP) is a branch of data science that deals with text data. Alongside numerical data, text data is available in abundance and is widely used to analyze and solve business problems. However, the data must be processed before it can be used for analysis or prediction.

We perform text preprocessing to prepare the text data for the model building. It is the very first step of NLP projects. Some of the preprocessing steps are:

  • Removing punctuation such as . , ! $ ( ) * % @
  • Removing URLs
  • Removing Stop words
  • Lower casing
  • Tokenization
  • Stemming
  • Lemmatization
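URL removal appears in the list above, so here is a minimal sketch of one way to do it with a regular expression. The pattern and the remove_urls helper are illustrative assumptions, and the pattern only covers links that start with http://, https://, or www.

```python
import re

# Assumed pattern: http(s):// or www. followed by everything up to the next whitespace.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")

def remove_urls(text):
    """Replace each URL with a space, then collapse repeated whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return " ".join(without_urls.split())

print(remove_urls("Claim your prize at http://example.com/win now"))
# Claim your prize at now
```

If URLs matter for your dataset, it is usually safer to remove them before removing punctuation, because stripping the dots and slashes first leaves fragments that no longer look like a link.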

Why is Text Preprocessing important?

Text preprocessing is crucial in natural language processing (NLP) for several reasons:

Preprocessing Task | Reasons
Noise Reduction | Text data often contains noise such as punctuation, special characters, and irrelevant symbols. Preprocessing helps remove these elements, making the text cleaner and easier to analyze.
Normalization | Different forms of a word (e.g., “run,” “running,” “ran”) convey the same core meaning. Preprocessing techniques like stemming and lemmatization help standardize these variations.
Tokenization | Text data needs to be broken down into smaller units, such as words or phrases, for analysis. Tokenization divides text into meaningful units, facilitating subsequent processing steps like feature extraction.
Stopword Removal | Stopwords are common words like “the,” “is,” and “and” that occur frequently but convey little semantic meaning. Removing stopwords can improve the efficiency of text analysis by reducing noise.
Feature Extraction | Preprocessing can involve extracting features from text, such as word frequencies, n-grams, or word embeddings, which are essential for building machine learning models.
Dimensionality Reduction | Text data often has high dimensionality because of its large vocabulary. Techniques like term frequency-inverse document frequency (TF-IDF) weighting or dimensionality reduction methods can help keep the feature space manageable.

Text preprocessing is crucial in preparing text data for NLP tasks. It improves data quality, reduces noise, and facilitates effective analysis and modeling.

SMS Spam Data for Text Preprocessing

Which steps to apply depends on the dataset. This article uses SMS spam data, loaded with Python’s Pandas library, to walk through the steps of text preprocessing in NLP.

Let’s start by importing the Pandas library and reading the data.
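The article does not include the loading code or the file itself, so the snippet below is only a sketch of what that step might look like. The load_sms_data helper and the inline sample rows are invented for illustration; with the real dataset you would pass your own file path instead of the in-memory sample.

```python
import io
import pandas as pd

# Made-up rows standing in for the real file, which is not shipped with the article.
SAMPLE_CSV = io.StringIO(
    "v1,v2,extra\n"
    "ham,See you at lunch,\n"
    "spam,WIN a FREE prize now!!!,\n"
    "ham,Running late sorry,\n"
)

def load_sms_data(source, encoding=None):
    """Read the SMS data and keep only the label (v1) and message (v2) columns."""
    frame = pd.read_csv(source, encoding=encoding)
    return frame[["v1", "v2"]]

data = load_sms_data(SAMPLE_CSV)
print(data.shape)
# (3, 2)
```

If your copy of the file is not UTF-8 encoded, the encoding argument can be set accordingly when calling the helper.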

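import pandas as pd
#`data` is the SMS spam DataFrame read with pandas
#expanding the display of the text sms column (None means no truncation)
pd.set_option('display.max_colwidth', None)
#using only the v1 (label) and v2 (message) columns
data = data[['v1','v2']]
data.head()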

The data has 5572 rows and 2 columns. You can check the shape of the data using the data.shape attribute. Let’s check the distribution of the dependent variable between spam and ham.

#checking the count of the dependent variable
data['v1'].value_counts()

Steps to Clean the Data

Punctuation Removal

This step involves removing all the punctuation from the text. Python’s string library provides a predefined string of punctuation characters, string.punctuation: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

#library that contains punctuation
import string
string.punctuation
#defining the function to remove punctuation
def remove_punctuation(text):
    #keep every character that is not a punctuation mark
    punctuationfree="".join([i for i in text if i not in string.punctuation])
    return punctuationfree
#storing the punctuation-free text
data['clean_msg']= data['v2'].apply(lambda x:remove_punctuation(x))
data.head()

We remove all punctuation from v2 and store the cleaned text in the clean_msg column.
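As an alternative to filtering character by character, the same result can be sketched with str.translate, which deletes every punctuation character in one pass. The helper name remove_punctuation_fast is an assumption made for this example.

```python
import string

# Translation table that deletes every character found in string.punctuation.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def remove_punctuation_fast(text):
    """Return text with all ASCII punctuation characters removed."""
    return text.translate(PUNCT_TABLE)

print(remove_punctuation_fast("Free entry!!! Text WIN to 80086, now."))
# Free entry Text WIN to 80086 now
```

One side effect worth keeping in mind: contractions lose their apostrophe (don't becomes dont), so they may no longer match entries in a stopword list that stores the apostrophe form.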

Lowering the Text

Converting the text to a single case, preferably lowercase, is one of the most common text preprocessing steps in Python. However, this step is not required for every NLP problem, as lowercasing can lead to a loss of information for some problems.

For example, when dealing with a person’s emotions in any project, words written in upper case can signify frustration or excitement.
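If case carries signal in your problem, one option is to record it as a feature before lowercasing. The sketch below computes the share of uppercase letters in a message; the uppercase_ratio helper is a hypothetical name used only for illustration.

```python
def uppercase_ratio(text):
    """Fraction of alphabetic characters that are uppercase (0.0 if there are none)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(ch.isupper() for ch in letters) / len(letters)

print(uppercase_ratio("WIN now"))
# 0.5
```

Applied to the raw message column before lowercasing, a value like this could be stored in its own column and used later alongside the text features.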

data['msg_lower']= data['clean_msg'].apply(lambda x: x.lower())

Output: All the text of clean_msg column is converted into lowercase and stored in the msg_lower column


Tokenization

In this step, the text is split into smaller units. Based on our problem statement, we can use sentence or word tokenization.

#defining function for tokenization
import re
def tokenization(text):
    #split on runs of non-word characters (the raw string keeps the backslash intact)
    tokens = re.split(r'\W+',text)
    return tokens
#applying function to the column
data['msg_tokenied']= data['msg_lower'].apply(lambda x: tokenization(x))

Output: Sentences are tokenized into words.
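To make the difference between sentence and word tokenization concrete, here is a rough regex-based sketch. Both helper names are assumptions, and the sentence splitter is deliberately naive: it breaks after ., ! or ? followed by whitespace, so abbreviations such as "Dr." will be split incorrectly.

```python
import re

def simple_sentence_tokenize(text):
    """Naive sentence split: break after ., ! or ? when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

def simple_word_tokenize(text):
    """Naive word split: every run of letters, digits, or underscores is a token."""
    return re.findall(r"\w+", text)

message = "Call me now! I am free. Are you?"
print(simple_sentence_tokenize(message))
# ['Call me now!', 'I am free.', 'Are you?']
print(simple_word_tokenize("Call me now!"))
# ['Call', 'me', 'now']
```

For anything beyond a quick experiment, a trained tokenizer such as the ones shipped with NLTK or spaCy will generally handle edge cases better than these patterns.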


Stop Word Removal

We remove commonly used stopwords from the text because they do not add value to the analysis and carry little or no meaning.

The NLTK library provides a list of English stopwords. Some of them are: [i, me, my, myself, we, our, ours, ourselves, you, you’re, you’ve, you’ll, you’d, your, yours, yourself, yourselves, he, …, most, other, some, such, no, nor, not, only, own, same, so, then, too, very, s, t, can, will, just, don, don’t, should, should’ve, now, d, ll, m, o, re, ve, y, ain, aren, aren’t, couldn, couldn’t, didn, didn’t, …]

However, using the provided list of stopwords is not mandatory; the stopwords should be chosen carefully based on the project. For example, ‘how’ can be a stopword for one model but important for another problem, such as one that works on customers’ queries. We can create a customized list of stopwords for different problems.

#importing the NLP library
import nltk
#downloading the stopword corpus (needed once per environment)
nltk.download('stopwords')
#stop words present in the library
stopwords = nltk.corpus.stopwords.words('english')
stopwords[0:10]
#output: ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
#defining the function to remove stopwords from tokenized text
def remove_stopwords(text):
    output= [i for i in text if i not in stopwords]
    return output
#applying the function
data['no_stopwords']= data['msg_tokenied'].apply(lambda x:remove_stopwords(x))

Output: Stop words in the nltk library, such as in, until, to, I, and here, are removed from the tokenized text, and the rest are stored in the no_stopwords column.
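The idea of a customized stopword list mentioned above can be sketched as follows. The small base set here is only a stand-in so the example runs on its own (in practice it could be the NLTK list), and the helper names and the words chosen for keep and extra are assumptions.

```python
# Small illustrative base list; in a real project this could be NLTK's English list.
BASE_STOPWORDS = {"i", "me", "my", "the", "is", "how", "not", "to"}

def build_stopwords(base, keep=(), extra=()):
    """Return the base set minus the words to keep, plus domain-specific extras."""
    return (set(base) - set(keep)) | set(extra)

def remove_stopwords_custom(tokens, stopword_set):
    """Drop every token that appears in the stopword set."""
    return [token for token in tokens if token not in stopword_set]

# Keep 'how' and 'not' because they may matter for queries; add SMS shorthand.
custom_stopwords = build_stopwords(BASE_STOPWORDS, keep={"how", "not"}, extra={"u", "ur"})
tokens = ["how", "is", "the", "offer", "not", "free", "u"]
print(remove_stopwords_custom(tokens, custom_stopwords))
# ['how', 'offer', 'not', 'free']
```

Storing the stopwords in a set rather than a list also makes each membership check faster, which adds up on larger datasets.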


Stemming

This step, also known as text standardization, reduces words to their root or base form. For example, the Porter stemmer reduces ‘programs,’ ‘programmed,’ and ‘programming’ to ‘program.’

However, stemming can cause the root form to lose its meaning or not reduce to a proper English word. We will see this in the steps below.

#importing the stemming class from the nltk library
from nltk.stem.porter import PorterStemmer
#defining the object for stemming
porter_stemmer = PorterStemmer()
#defining a function for stemming
def stemming(text):
    stem_text = [porter_stemmer.stem(word) for word in text]
    return stem_text
#applying the function to the stopword-free tokens
data['msg_stemmed']=data['no_stopwords'].apply(lambda x: stemming(x))

Output: Stemming reduces some words to stems that are not proper English words:

crazy-> crazi

available-> avail

entry-> entri

early-> earli


Now let’s see how Lemmatization is different from Stemming.


Lemmatization

Lemmatization also reduces a word to its base form, but ensures that the result keeps its meaning. It relies on a predefined dictionary (WordNet, in the NLTK lemmatizer used here) and looks each word up in that dictionary while reducing it.

Let us now compare the results of stemming and lemmatization:

Original Word | After Stemming | After Lemmatization
goose | goos | goose
geese | gees | goose

import nltk
from nltk.stem import WordNetLemmatizer
#downloading the WordNet data used by the lemmatizer (needed once per environment)
nltk.download('wordnet')
#defining the object for Lemmatization
wordnet_lemmatizer = WordNetLemmatizer()
#defining the function for lemmatization
def lemmatizer(text):
    lemm_text = [wordnet_lemmatizer.lemmatize(word) for word in text]
    return lemm_text
data['msg_lemmatized']=data['no_stopwords'].apply(lambda x:lemmatizer(x))

Output: The difference between stemming and lemmatization shows up when the stemmed and lemmatized columns are compared.

In the first row, stemming changes crazy to crazi, which has no meaning, whereas lemmatization leaves it as crazy.

In the last row, stemming changes goes to goe, whereas lemmatization converts it into go, which is meaningful.


After performing all the text processing steps, we convert the final acquired data into numeric form using Bag of Words or TF-IDF.
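To show what that conversion means, here is a from-scratch sketch of Bag of Words and a plain TF-IDF weighting over already tokenized messages. It assumes tf = count / message length and idf = log(N / document frequency); library implementations such as scikit-learn use smoothed variants by default, so their numbers will differ. The function names are assumptions.

```python
import math
from collections import Counter

def bag_of_words(docs):
    """docs is a list of token lists. Returns (sorted vocabulary, count vectors)."""
    vocab = sorted({token for doc in docs for token in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([counts[token] for token in vocab])
    return vocab, vectors

def tf_idf(docs):
    """Plain TF-IDF: tf = count / doc length, idf = log(N / document frequency)."""
    vocab, counts = bag_of_words(docs)
    n_docs = len(docs)
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
    weights = []
    for doc, row in zip(docs, counts):
        length = len(doc) or 1  # guard against empty token lists
        weights.append([
            (count / length) * math.log(n_docs / doc_freq[j])
            for j, count in enumerate(row)
        ])
    return vocab, weights

docs = [["free", "prize", "free"], ["call", "free"]]
vocab, vectors = bag_of_words(docs)
print(vocab)
# ['call', 'free', 'prize']
print(vectors)
# [[0, 2, 1], [1, 1, 0]]
```

Note how a token that appears in every message (free in this toy input) gets an idf of zero under this formula, which is exactly the down-weighting of uninformative words that TF-IDF is meant to provide.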

Conclusion

Apart from the steps shown in this article, many other steps are a part of preprocessing. Some of them are URL removal, HTML tags removal, Rare words removal, Frequent words removal, Spelling checking, and many more. You must choose the steps based on the dataset you are working on and what is necessary for the project.

Frequently Asked Questions

Q1. What is text preprocessing in Python?

A. Text preprocessing in Python involves cleaning and transforming raw text data to make it suitable for analysis or machine learning tasks. It includes steps like removing punctuation, tokenization (splitting text into words or phrases), converting text to lowercase, removing stop words (common words that add little value), and stemming or lemmatization (reducing words to their base forms). Python libraries such as NLTK, SpaCy, and pandas are commonly used for these tasks.

Q2. How do you preprocess large text data in Python?

A. Preprocessing large text data in Python requires efficient handling of datasets using libraries like pandas for data manipulation and NLTK or SpaCy for text operations. Key steps include reading the data in chunks to manage memory, using multiprocessing to parallelize tasks, and employing optimized methods like vectorized operations. Distributed computing frameworks like Apache Spark (through PySpark) can further improve performance on extremely large datasets.
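The chunking idea can be sketched with the standard library alone: pull a fixed number of messages at a time from any iterable (for example an open file) and preprocess them lazily. The helper names and the placeholder preprocess step are assumptions for illustration; pandas offers a similar mechanism through the chunksize argument of read_csv.

```python
from itertools import islice

def iter_batches(items, batch_size):
    """Yield lists of at most batch_size items from any iterable."""
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def preprocess(text):
    """Placeholder cleaning step: lowercase and split on whitespace."""
    return text.lower().split()

def preprocess_stream(lines, batch_size=1000):
    """Preprocess messages batch by batch without loading everything at once."""
    for batch in iter_batches(lines, batch_size):
        for text in batch:
            yield preprocess(text)

print(list(iter_batches(range(5), 2)))
# [[0, 1], [2, 3], [4]]
```

Because everything here is a generator, only one batch is held in memory at a time, and each batch could also be handed to a worker pool if the cleaning step is expensive.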

Q3. What is lemmatization in data preprocessing?

A. Lemmatization in data preprocessing reduces words to their base or root form (lemma) by considering the context and part of speech. Unlike stemming, which often cuts off word endings, lemmatization uses a dictionary to transform words into meaningful forms. For example, “running” becomes “run,” and “better” becomes “good.” It helps maintain meaningful word variants, improving the quality of text analysis and model performance.

Q4. What are the challenges of text preprocessing in NLP?

A. Text preprocessing in NLP faces challenges such as context sensitivity, scalability, language diversity, and data quality.


