Shivani Sharma — June 27, 2021

This article was published as a part of the Data Science Blogathon

Machines understanding language fascinates me, and I often ponder which algorithms Aristotle would have used to build a rhetorical analysis machine if he had the chance. If you’re new to data science, getting into NLP can seem complicated, especially since there are so many recent advancements in the field. It’s hard to know where to begin.

Table of Contents

1. What can Machines Understand?

2. Project 1: Word Cloud

3. Project 2: Spam Detection

4. Project 3: Sentiment Analysis


What can Machines Understand?

While a computer can be quite good at finding patterns and summarizing documents, it must transform words into numbers before it can make sense of them. This transformation is necessary because machines “learn” through mathematics, and math doesn’t work well on raw words. Before the words are transformed into numbers, the data must be cleaned. Data cleaning includes removing special characters and punctuation and converting the text into forms that are more uniform and interpretable, such as lower case.

Project 1: Word Cloud

1. Importing Dependencies and Data

Start by importing the dependencies and the data. The data is stored as a comma-separated values (CSV) file, so I will use pandas’ read_csv() function to open it into a DataFrame.

import re
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
#create dataframe from csv
df = pd.read_csv('emails.csv')


2. Exploratory Analysis

To eliminate duplicate rows and establish some baseline counts, it is best to do a quick analysis of the data. Here we use pandas drop_duplicates to drop the duplicate rows.

print("spam count: " +str(len(df.loc[df.spam==1])))
print("not spam count: " +str(len(df.loc[df.spam==0])))
df['spam'] = df['spam'].astype(int)
df = df.drop_duplicates()
df = df.reset_index(inplace = False)[['text','spam']]
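
As a minimal sketch of the same steps on a made-up DataFrame (the three rows below are invented for illustration and are not taken from emails.csv), value_counts() gives both class counts in one call, and comparing the row count before and after drop_duplicates() shows how many rows were removed:

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "text": ["win a prize now", "meeting at noon", "win a prize now"],
    "spam": [1, 0, 1],
})

rows_before = toy.shape[0]
# Drop exact duplicate rows and rebuild a clean 0..n-1 index
toy = toy.drop_duplicates().reset_index(drop=True)
rows_after = toy.shape[0]

print("rows before:", rows_before, "rows after:", rows_after)
# Count of each label (1 = spam, 0 = not spam)
print(toy["spam"].value_counts())
```

Passing drop=True to reset_index() discards the old index instead of keeping it as a column, so no column selection is needed afterwards.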


Counts and shape before/after duplicate removal

What is a Word Cloud?

Word clouds are a useful way to visualize text data because they make word frequencies easier to understand. Words that appear larger in the cloud are those that appear more frequently in the email text. Word clouds make it easy to identify “keywords.”


Word Cloud Examples

All the text in the word cloud image is lower case, and it contains no punctuation marks or special characters. Text in this state is called cleaned and is ready for analysis. With the help of regular expressions, it is easy to clean the text using a loop:

clean_desc = []
for w in range(len(df.text)):
    desc = df['text'][w].lower()
    #remove tags first, while the angle brackets are still present
    desc = re.sub(r"</?.*?>", " ", desc)
    #remove punctuation
    desc = re.sub(r'[^a-zA-Z]', ' ', desc)
    #remove digits and special chars, collapsing them into single spaces
    desc = re.sub(r"(\d|\W)+", " ", desc)
    #append the cleaned text to the list
    clean_desc.append(desc)
#assign the cleaned descriptions to the data frame
df['text'] = clean_desc


Notice here we create an empty list clean_desc, then use a for loop to go through the text line by line, setting it to lower case, removing punctuation and special chars, and appending it to the list. Then we replace the text column with the data in the clean_desc list.
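
The same cleaning steps can also be wrapped in a small function, which makes them easy to try on a single string. This is only a sketch: clean_text is a name I made up for illustration, and the sample string is invented.

```python
import re

def clean_text(raw):
    """Lower-case a string, strip tags, and keep only letters and single spaces."""
    text = raw.lower()
    # Remove anything that looks like an HTML tag first, while < and > still exist
    text = re.sub(r"</?.*?>", " ", text)
    # Replace every character that is not a-z with a space
    text = re.sub(r"[^a-z]", " ", text)
    # Collapse runs of whitespace and trim the ends
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("Subject: <b>WIN</b> $1000 now!!!"))
```

A function like this could be applied to the whole column with df['text'].apply(clean_text) instead of looping over the index.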

Stop Words

Stop words are the most common words, like “the” and “of.” Removing them from the email text allows the more relevant frequent words to stand out. Removing stop words is a common technique! Some Python libraries like NLTK come pre-loaded with a list of stop words, but it’s easy to create one from scratch.

stop_words = ['is','you','your','and', 'the', 'to', 'from', 'or', 'I', 'for', 'do', 'get', 'not', 'here', 'in', 'im', 'have', 'on', 're', 'new', 'subject']

Notice I include some email-related words like “re” and “subject.” It’s up to the analyst to decide which words should be included or excluded. Sometimes it’s beneficial to include all words!
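
To see what stop word removal does to a piece of text, here is a minimal sketch in plain Python. The helper name remove_stop_words, the shortened stop word set, and the sample sentence are all my own assumptions for illustration; the word cloud library handles this step internally when it is given a stop word list.

```python
# A shortened stop word list, for illustration only
stop_words = {"is", "you", "the", "to", "re", "subject"}

def remove_stop_words(text, stop_words):
    """Return the words of `text` that are not in `stop_words`."""
    return [word for word in text.split() if word.lower() not in stop_words]

print(remove_stop_words("subject re the invoice is attached to you", stop_words))
```

Using a set rather than a list keeps each membership check fast, which starts to matter once the stop word list and the number of emails grow.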

Construct the Word Cloud

Conveniently, there’s a Python library for creating word clouds. It can be installed using pip.

pip install wordcloud

When constructing the word cloud, it’s possible to set several parameters like height and width, stop words, and max words. It’s even possible to give the cloud a custom shape rather than displaying the default rectangle.

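#join all emails into one string (str() on a column would only give a truncated preview)
wordcloud = WordCloud(width = 800, height = 800, background_color = 'black', stopwords = stop_words,
                      max_words = 1000, min_font_size = 20).generate(' '.join(df['text']))
#plot the word cloud
fig = plt.figure(figsize = (8,8), facecolor = None)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()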

To save and display the word cloud, matplotlib and show() are used. This is the result for all records, whether or not they are spam.


Push the exercise further by splitting the data frame and making two word clouds to help analyze the difference between keywords used in spam email and not spam email.
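
As a starting point for that exercise, the sketch below counts word frequencies separately for spam and not spam using only the standard library. The four emails are invented, and word_counts is a hypothetical helper name. To my knowledge, the wordcloud library can build a cloud directly from such counts through its generate_from_frequencies() method, which would give one cloud per class.

```python
from collections import Counter

# Toy, already-cleaned emails invented for illustration: (text, spam label)
emails = [
    ("free money click now", 1),
    ("claim free prize now", 1),
    ("meeting notes attached", 0),
    ("lunch meeting tomorrow", 0),
]

def word_counts(rows, label):
    """Count word frequencies over all texts that carry the given label."""
    counts = Counter()
    for text, spam in rows:
        if spam == label:
            counts.update(text.split())
    return counts

spam_counts = word_counts(emails, 1)
ham_counts = word_counts(emails, 0)
print(spam_counts.most_common(2))
print(ham_counts.most_common(1))
```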

Project 2: Spam Detection

Consider this a binary classification problem, since an email is either spam, denoted by “1”, or not spam, denoted by “0”. I would like to create a machine learning model that can identify whether or not an email is spam. I’m going to use the Python library Scikit-Learn to explore tokenization, vectorization, and statistical classification algorithms.


Import Dependencies

Import the Scikit-Learn functionality we need to transform and model the data. I’ll use CountVectorizer, train_test_split, ensemble models, and a couple of metrics.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn import ensemble
from sklearn.metrics import classification_report, accuracy_score

Transforming Text to Numbers

In project 1, the text was cleaned. When you look at a word cloud, notice it’s primarily single words. The larger the word, the higher its frequency. To stop the word cloud from outputting sentences, the text goes through a process called tokenization. It’s the process of breaking down a sentence into individual words. The individual words are called tokens.
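
A minimal tokenization sketch using only the standard library is shown here. The tokenize function is my own illustration, not part of any library. Its pattern keeps word tokens of two or more characters, which, as far as I know, mirrors the default token pattern that CountVectorizer uses.

```python
import re

def tokenize(sentence):
    """Split a sentence into lower-case word tokens of two or more characters."""
    return re.findall(r"\b\w\w+\b", sentence.lower())

print(tokenize("The dog is white, and the cat is black!"))
```

Punctuation disappears because it never matches the word-character pattern, and single-letter words such as “a” are dropped for the same reason.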

Using Scikit-Learn’s CountVectorizer(), it’s easy to transform the body of text into a sparse matrix of numbers that the computer can pass to machine learning algorithms. To simplify the concept of count vectorization, imagine you have two sentences:

The dog is white

The cat is black

Converting the sentences to a vector space model first collects the words across all sentences into a vocabulary, and then represents each sentence by how many times each vocabulary word appears in it.

Vocabulary: The, dog, cat, is, white, black

The dog is white = [1,1,0,1,1,0]
The cat is black = [1,0,1,1,0,1]

We can show this using code as well. I’ll add a third sentence to show that it counts the tokens.

#list of sentences
text = ["the dog is white", "the cat is black", "the cat and the dog are friends"]
#instantiate the class
cv = CountVectorizer()
#tokenize and build vocab
cv.fit(text)
print(cv.vocabulary_)
#transform the text
vector = cv.transform(text)
print(vector.toarray())

The sparse matrix of word counts.

Notice that in the last vector you can see a 2, since the word “the” appears twice. The CountVectorizer counts the tokens and allows me to construct the sparse matrix containing the words transformed to numbers.
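
To make the counting step less of a black box, here is a plain-Python sketch of the same idea. It is not how Scikit-Learn implements it; count_vectors is a hypothetical helper that splits on whitespace, sorts the vocabulary alphabetically, and counts each word per sentence.

```python
def count_vectors(sentences):
    """Return (vocabulary, vectors) for a list of whitespace-separated sentences."""
    tokenized = [s.lower().split() for s in sentences]
    # Alphabetical vocabulary built from every word in every sentence
    vocabulary = sorted({word for words in tokenized for word in words})
    # One row per sentence, one count per vocabulary word
    vectors = [[words.count(term) for term in vocabulary] for words in tokenized]
    return vocabulary, vectors

text = ["the dog is white", "the cat is black", "the cat and the dog are friends"]
vocabulary, vectors = count_vectors(text)
print(vocabulary)
for row in vectors:
    print(row)
```

Each row has one position per vocabulary word, so the third row contains a 2 in the position that belongs to “the”.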

Bag of Words Method

Because the model doesn’t take word placement into account and instead mixes the words up as if they were tiles in a Scrabble game, this is called the bag of words method. I’m going to create the sparse matrix, then split the data using Scikit-Learn’s train_test_split().

text_vec = CountVectorizer().fit_transform(df['text'])
X_train, X_test, y_train, y_test = train_test_split(text_vec, df['spam'], test_size = 0.45, random_state = 42, shuffle = True)

Notice I set the sparse matrix text_vec to X and the df[‘spam’] column to Y. I shuffle and take a test size of 45%.

The Classifier

It is highly recommended to experiment with several classifiers and determine which one works best for this scenario. In this example, I’m using the GradientBoostingClassifier() model from the Scikit-Learn ensemble collection.

classifier = ensemble.GradientBoostingClassifier(
    n_estimators = 100, #how many decision trees to build
    learning_rate = 0.5, #learning rate
    max_depth = 6 #maximum depth of each tree
)

Each algorithm has its own set of parameters you can tweak. That’s called hyper-parameter tuning. Go through the documentation to learn more about each of the parameters used in the models.
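
One common way to tune hyper-parameters is a grid search with cross-validation. The sketch below uses Scikit-Learn’s GridSearchCV on a tiny invented dataset. The texts, labels, and grid values are assumptions chosen only to keep the example small and fast, not recommended settings.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV

# Toy data invented for illustration; 1 = spam, 0 = not spam
texts = [
    "free money now", "win a free prize", "claim your free cash", "free prize inside",
    "meeting at noon", "see you at lunch", "project notes attached", "call me tomorrow",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

X = CountVectorizer().fit_transform(texts)

# Hypothetical grid: the values are arbitrary starting points, not recommendations
param_grid = {"n_estimators": [10, 50], "learning_rate": [0.1, 0.5]}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    cv=2,                 # two folds only because the toy data is tiny
    scoring="accuracy",
)
search.fit(X, labels)
print(search.best_params_)
```

On real data you would search on the training split only, so that the test split stays untouched for the final evaluation.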

Generate Predictions

Finally, we fit the data, call predict, and generate the classification report. Using classification_report(), it’s easy to create a text report showing the main classification metrics.

classifier.fit(X_train, y_train)
predictions = classifier.predict(X_test)
print(classification_report(y_test, predictions))


Classification Report

Notice our model achieved 97% accuracy.
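
Accuracy alone can hide problems when one class is much rarer than the other, which is why the report also lists precision and recall. The sketch below computes the three numbers by hand for the spam class. The labels and predictions are made up for illustration and are unrelated to the result above, and binary_metrics is a hypothetical helper name.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision and recall for the positive class (1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # spam caught
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # good mail flagged as spam
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # spam missed
    correct = sum(1 for t, p in pairs if t == p)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Made-up labels and predictions, for illustration only
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision answers “of the emails flagged as spam, how many really were spam?”, while recall answers “of the real spam emails, how many were caught?”.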

Project 3: Sentiment Analysis

Sentiment Analysis is also a classification problem of sorts. The text is essentially going to reflect a positive, neutral, or negative sentiment. That is referred to as the polarity of the text. It’s also possible to determine and account for the subjectivity of the text! There are a lot of great resources that cover the theory behind sentiment analysis.

Instead of building another model, this project uses a simple, out-of-the-box tool called TextBlob to analyze sentiment. I’ll use TextBlob to add sentiment columns to the DataFrame so they can be analyzed.



What is TextBlob?

Built on top of NLTK and pattern, the TextBlob library for Python 2 and 3 tries to simplify several text processing tasks. It provides tools for classification, part-of-speech tagging, phrase extraction, sentiment analysis, and more. Install it using pip.

pip install -U textblob
python -m textblob.download_corpora

TextBlob Sentiment

Using the sentiment property, TextBlob returns a named tuple of the form Sentiment(polarity, subjectivity). Polarity is a float within the range [-1.0, 1.0], where -1.0 is the most negative and 1.0 is the most positive. Subjectivity is a float within the range [0.0, 1.0], where 0.0 is very objective and 1.0 is very subjective.

from textblob import TextBlob
blob = TextBlob("This is a good example of a TextBlob")
print(blob.sentiment)
#Sentiment(polarity=0.7, subjectivity=0.6000000000000001)
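
If a label is easier to work with than a raw score, the polarity can be bucketed into positive, neutral, or negative. This is only a sketch: polarity_label is a made-up helper, and the 0.1 threshold is an arbitrary assumption of mine, not something TextBlob defines.

```python
def polarity_label(polarity, threshold=0.1):
    """Map a polarity score in [-1.0, 1.0] to a coarse sentiment label."""
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"

print(polarity_label(0.7))
print(polarity_label(-0.4))
print(polarity_label(0.05))
```

The width of the neutral band is a design choice, so it is worth checking a few real examples before settling on a threshold.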

Applying TextBlob

Using list comprehensions, it’s easy to load the text column as TextBlob objects and then create two new columns to store the polarity and subjectivity.

#load the descriptions into textblob
email_blob = [TextBlob(text) for text in df['text']]
#add the sentiment metrics to the dataframe
df['tb_Pol'] = [b.sentiment.polarity for b in email_blob]
df['tb_Subj'] = [b.sentiment.subjectivity for b in email_blob]
#show dataframe
df.head()

TextBlob makes it super simple to come up with a baseline sentiment score for polarity and subjectivity. To push this exercise further, see if you can add these new features to the spam detection model to improve the accuracy!
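
One possible way to combine the two kinds of features is to append the sentiment columns to the right of the sparse word-count matrix with scipy.sparse.hstack. The matrices below are small made-up stand-ins for the CountVectorizer output and for the tb_Pol and tb_Subj columns, so treat this as a sketch of the mechanics rather than a tested improvement to the model.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Stand-in for the CountVectorizer output: 3 emails x 4 vocabulary terms (made-up counts)
text_vec = csr_matrix(np.array([
    [1, 0, 2, 0],
    [0, 1, 0, 1],
    [1, 1, 0, 0],
]))

# Stand-in for the tb_Pol and tb_Subj columns (made-up scores)
sentiment = np.array([
    [0.7, 0.6],
    [-0.2, 0.4],
    [0.0, 0.0],
])

# Append the two sentiment columns to the right of the word counts
features = hstack([text_vec, csr_matrix(sentiment)]).tocsr()
print(features.shape)
```

The combined matrix can then be passed to train_test_split() in place of the word counts alone.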


Even though Natural Language Processing can look like an intimidating topic, the foundational pieces are not that hard to understand. Many libraries make it easy to start exploring data science and NLP. This article walked through three projects:

Word Cloud

Spam Detection

Sentiment Analysis

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
