Kajal Kumari — August 27, 2021

This article was published as a part of the Data Science Blogathon

Overview

In today’s world, text data is one of the largest sources of information, and it is unstructured by nature. Finding customer sentiment from product reviews or feedback and extracting opinions from social media data are a few examples of text analytics.

Extracting insights from text data is not as straightforward as it is for structured data, and it requires extensive pre-processing. Familiar algorithms such as regression, classification, or clustering can be applied to text data only after the data has been cleaned and prepared.

In this article, we will use a dataset that is available at

https://www.kaggle.com/amitkumardas/sentiment-train

for building a classification model to classify sentiment.

The data consists of sentiments expressed by users about various movies. Each comment is a record, labelled as either positive or negative.

Sentiment Classification

The dataset described above contains review comments on several movies. The comments are already labelled as either positive or negative. The dataset contains the following two fields, separated by a tab character.

1. text: the actual review comment

2. sentiment: positive sentiments are labelled as 1 and negative sentiments are labelled as 0.
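As a minimal sketch, a tab-separated file with these two fields could be loaded with pandas as shown below. The in-memory sample, its second row, and the lowercase column headers are assumptions made for illustration; the real file's header may differ, so it is worth checking train_data.columns after loading.

```python
import io

import pandas as pd

# Stand-in for the downloaded file; in practice, pass the path of the
# downloaded dataset file to read_csv instead (path not shown here).
# The second row is made up for illustration.
sample_file = io.StringIO(
    "sentiment\ttext\n"
    "0\tOk brokeback mountain is such a horrible movie\n"
    "1\tI really like this movie\n"
)

# The fields are separated by a tab character.
train_data = pd.read_csv(sample_file, delimiter="\t")

print(train_data.columns.tolist())
print(train_data.shape)
```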

In the rest of this article, we will discuss a few pre-processing steps for this text dataset.

Text Pre-Processing

Unlike structured data, text data has no explicitly available features. We therefore need a process to extract features from the text. One way is to treat each word as a feature and find a measure that captures whether a word is present in a sentence. This is called the bag-of-words (BoW) model: each sentence is treated as a bag of words. Each sentence is called a document, and the collection of all documents is called the corpus.

The pre-processing steps that can be performed on text data include:

  1. Bag-of-Words (BoW) Model

  2. Creating Count Vectors for the Dataset

  3. Displaying Document Vectors

  4. Removing Low-Frequency Words

  5. Removing Stop Words

  6. Distribution of Words Across Different Sentiments

We will discuss these pre-processing steps in the following subsections.

Bag-of-Words (BoW) Model

The first step in creating a BoW model is to create a dictionary of all the words used in the corpus. At this stage, we do not worry about grammar; only the occurrence of words is captured. We then convert each document to a vector that represents the words present in that document. There are three ways to identify the importance of words in the BoW model:

  • Count Vector Model

  • Term Frequency Vector Model

  • Term Frequency-Inverse Document Frequency(TF-IDF) Model

We will discuss these vector models in the following subsections.

a) Count Vector Model

Consider the following two documents:

  • Document 1 (positive sentiments): I really really like IPL.

  • Document 2 (negative sentiments): I never like IPL.

Note: IPL stands for Indian Premier League.

The complete vocabulary for these two documents consists of the words I, really, never, like, and IPL. These words can be treated as features (x1 through x5). To create count vectors, we count the occurrences of each word in each document, as shown in the table below. The y column in the table indicates the sentiment of the sentence: 1 for positive and 0 for negative.

Figure: Count vectors for the two documents.
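The counting step can be sketched in a few lines of plain Python. This is only an illustration for the two example documents: the whitespace tokenization and the fixed vocabulary order are simplifying assumptions, and a library vectorizer may tokenize differently.

```python
from collections import Counter

documents = [
    "I really really like IPL",  # Document 1 (positive sentiment)
    "I never like IPL",          # Document 2 (negative sentiment)
]
vocabulary = ["I", "really", "never", "like", "IPL"]  # features x1..x5
labels = [1, 0]  # y: 1 = positive, 0 = negative


def count_vector(document, vocabulary):
    """Return how many times each vocabulary word occurs in the document."""
    counts = Counter(document.split())
    return [counts[word] for word in vocabulary]


count_vectors = [count_vector(doc, vocabulary) for doc in documents]

for vector, label in zip(count_vectors, labels):
    print(vector, "y =", label)
# [1, 2, 0, 1, 1] y = 1
# [1, 0, 1, 1, 1] y = 0
```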

b) Term Frequency Vector Model

The term frequency (TF) vector is calculated for each document in the corpus and gives the frequency of each term in the document. It is given by

Term Frequency (TFi) = (Number of occurrences of word i in the document) / (Total number of words in the document)

where TFi is the term frequency for word i. The TF representation for the two documents is shown in the table below.

Figure: TF vectors for the two documents.
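As a rough sketch, the TF formula can be applied to the two example documents as follows. Splitting on whitespace is an assumption made to keep the example small.

```python
documents = [
    "I really really like IPL",  # Document 1: 5 words
    "I never like IPL",          # Document 2: 4 words
]
vocabulary = ["I", "really", "never", "like", "IPL"]


def term_frequency(document, vocabulary):
    """TF_i = occurrences of word i in the document / total words in the document."""
    words = document.split()
    return [words.count(term) / len(words) for term in vocabulary]


tf_vectors = [term_frequency(doc, vocabulary) for doc in documents]

for vector in tf_vectors:
    print(vector)
# [0.2, 0.4, 0.0, 0.2, 0.2]
# [0.25, 0.0, 0.25, 0.25, 0.25]
```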

c) Term Frequency – Inverse Document Frequency(TF-IDF)

TF-IDF measures how important a word is to a document in the corpus. The importance of a word increases in proportion to the number of times it appears in the document, but is offset by how frequently the word appears across the corpus. TF-IDF for word i in a document is given by

TF-IDFi = TFi*ln(1+(N/Ni))

where N is the total number of documents in the corpus and Ni is the number of documents that contain word i.

The IDF value of each word for the above two documents is given in the table below.

Figure: IDF values for each word.

The TF-IDF values for the two documents are shown in the table below.

Figure: TF-IDF vectors for the two documents.
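The formula above can be worked through for the two example documents with a short sketch. It follows the article's definition, TFi*ln(1+(N/Ni)), with whitespace tokenization as a simplifying assumption. Note that scikit-learn's TfidfVectorizer uses a different IDF formula and normalizes the vectors by default, so its numbers would not match these.

```python
import math

documents = [
    "I really really like IPL",
    "I never like IPL",
]
vocabulary = ["I", "really", "never", "like", "IPL"]

tokenized = [doc.split() for doc in documents]
n_documents = len(tokenized)  # N


def idf(term):
    """ln(1 + N / N_i), where N_i is the number of documents containing the term."""
    n_containing = sum(1 for words in tokenized if term in words)
    return math.log(1 + n_documents / n_containing)


def tf_idf(words):
    """TF-IDF value of every vocabulary term for one tokenized document."""
    return [words.count(term) / len(words) * idf(term) for term in vocabulary]


idf_values = {term: idf(term) for term in vocabulary}
tfidf_vectors = [tf_idf(words) for words in tokenized]

print({term: round(value, 4) for term, value in idf_values.items()})
for vector in tfidf_vectors:
    print([round(value, 4) for value in vector])
```

Words that occur in both documents (I, like, IPL) get the lower IDF value ln(2), while really and never, which each occur in only one document, get ln(3).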

Creating Count Vectors for Dataset

Each document in the dataset needs to be transformed into TF or TF-IDF vectors. The sklearn.feature_extraction.text module provides classes for creating both TF and TF-IDF vectors from text data. We will use CountVectorizer to create count vectors.

We use the following code to process the documents and create a dictionary of all the words present across them. The dictionary contains every unique word in the corpus, and each word in the dictionary is treated as a feature.

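from sklearn.feature_extraction.text import CountVectorizer

# train_data is assumed to be a pandas DataFrame holding the dataset,
# with the review comments in its Text column.
count_vectorize = CountVectorizer()
feature_vector = count_vectorize.fit(train_data.Text)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
features = list(feature_vector.get_feature_names_out())
print("total number of features: ", len(features))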

Figure: Output showing the total number of features.

The corpus contains 1903 unique words, so there are 1903 features. A random sample of the features can be obtained using the random.sample() method.

import random
random.sample(features,10)

Figure: Ten randomly sampled features.

Using the above dictionary, we can convert all the documents in the dataset to count vectors using the transform() method of the count vectorizer:

train_ds_features = count_vectorize.transform(train_data.Text)
type(train_ds_features)

The dimensions of the sparse matrix train_ds_features, which contains the count vectors of all the documents, are given by its shape attribute.

train_ds_features.shape

Figure: Shape of train_ds_features.

After converting the documents into vectors, we have a sparse matrix with 1903 features, or dimensions. A sparse matrix representation stores only the non-zero values and their indices. To find out how many non-zero values are actually present in the matrix, we can use the getnnz() method of the matrix.

train_ds_features.getnnz()

The output is 53028. The proportion of non-zero values in the matrix can be obtained by dividing the number of non-zero values by the total number of entries in the matrix (rows × columns).

print("Density of the matrix: ", train_ds_features.getnnz()*100/
               (train_ds_features.shape[0]*train_ds_features.shape[1]))

Output:

Density of the matrix:  0.4916280092607186

The matrix has less than 1% non-zero values; that is, more than 99% of the values are zero. This is a very sparse representation.
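To make the idea of sparsity and density concrete, here is a small plain-Python sketch that stores only the non-zero counts of each document and computes the density in the same way as above. The three short documents are made up for illustration, and the dictionary-per-row layout is a simplification, not how SciPy stores its sparse matrices internally.

```python
# Toy corpus, made up for illustration.
documents = [
    "ok brokeback mountain is such horrible movie",
    "really like the movie",
    "hate the movie",
]

vocabulary = sorted({word for doc in documents for word in doc.split()})
column_of = {word: index for index, word in enumerate(vocabulary)}

# Sparse rows: one dict per document, mapping column index -> non-zero count.
sparse_rows = []
for doc in documents:
    row = {}
    for word in doc.split():
        column = column_of[word]
        row[column] = row.get(column, 0) + 1
    sparse_rows.append(row)

n_nonzero = sum(len(row) for row in sparse_rows)  # analogous to getnnz()
n_cells = len(documents) * len(vocabulary)        # rows x columns
density = 100 * n_nonzero / n_cells

print("non-zero values:", n_nonzero)
print("matrix shape:", (len(documents), len(vocabulary)))
print("density (%):", round(density, 2))
```

With only three short documents the density is high; as the vocabulary grows into the thousands while each document still uses only a handful of words, the density drops sharply.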

Displaying Document Vectors

To visualize the count vectors, we convert this matrix into a DataFrame and set the column names to the actual feature names. The following commands build that DataFrame and display the first record of the training data.

import pandas as pd

train_ds_df = pd.DataFrame(train_ds_features.todense())
train_ds_df.columns = features
train_data[0:1]

Figure: First record of the training data.

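Now select the columns that correspond to the words in the sentence and print them.

train_ds_df[['ok','brokeback', 'mountain', 'is', 'such', 'horrible', 'movie']][0:1]

Figure: Count vector entries for the words of the first record.

Yes, the features in the count vector are appropriately set to 1. The vector represents the sentence “Ok brokeback mountain is such a horrible movie”. The word “a” does not appear as a feature because CountVectorizer, by default, ignores tokens that are only one character long.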
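The columns selected above can be reproduced with a short sketch of the default tokenization. CountVectorizer lowercases the text and, by default, uses the token pattern shown below, which only matches tokens of two or more word characters; applying that pattern directly with the re module is a simplification of what the vectorizer does.

```python
import re

# Default token_pattern of CountVectorizer: tokens of 2+ word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

sentence = "Ok brokeback mountain is such a horrible movie"

# Lowercase first, as CountVectorizer does by default, then extract tokens.
tokens = TOKEN_PATTERN.findall(sentence.lower())

print(tokens)
# ['ok', 'brokeback', 'mountain', 'is', 'such', 'horrible', 'movie']
```

The single-character word "a" is dropped, which is why it has no column in the document vector.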

Removing Low-Frequency Words

One of the challenges of dealing with text is that the number of words, or features, in the corpus can be very large. The number of features can easily exceed tens of thousands. The frequency of each feature or word can be analyzed using a histogram. To calculate the total number of occurrences of each feature or word, we use the np.sum() method. The histogram shows that a large number of features occur very rarely.

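import numpy as np
import matplotlib.pyplot as plt

# Total number of occurrences of each word across all documents
features_counts = np.sum(train_ds_features.toarray(), axis=0)
features_counts_df = pd.DataFrame(dict(features=features, counts=features_counts))
plt.figure(figsize=(8, 6))
plt.hist(features_counts_df.counts, bins=50, range=(0, 2000))
plt.xlabel("Frequency of words")
# The histogram is not normalized, so the y-axis shows a count of words
plt.ylabel("Number of words")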

Figure: Histogram of word frequencies.

To find the rare words in the dictionary, for example words that occur only once in the entire corpus, we can filter the features by a count equal to 1.

len(features_counts_df[features_counts_df.counts==1])

The output is 233.

There are 233 words that occur only once across all documents in the corpus. These words can be ignored. We can restrict the number of features by setting the max_features parameter to 1000 when creating the count vectors, so that only the 1000 most frequent words are kept. The code below also prints the 15 most frequent words and their counts in descending order.

# Keep only the 1000 most frequent words as features
count_vectorizer = CountVectorizer(max_features=1000)
feature_vector = count_vectorizer.fit(train_data.Text)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
features = list(feature_vector.get_feature_names_out())
train_ds_features = count_vectorizer.transform(train_data.Text)
features_counts = np.sum(train_ds_features.toarray(), axis=0)
features_counts = pd.DataFrame(dict(features=features, counts=features_counts))
features_counts.sort_values('counts', ascending=False)[0:15]

Figure: The 15 most frequent features and their counts.
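The two ideas in this section, finding words that occur only once and keeping only the most frequent words, can be sketched on a toy corpus with collections.Counter. The four documents and the limit of 3 features are made up for illustration, and the way ties are broken here is not guaranteed to match CountVectorizer.

```python
from collections import Counter

# Toy corpus, made up for illustration.
documents = [
    "the movie is awesome",
    "the movie is horrible",
    "i hate the movie",
    "awesome awesome film",
]

# Total number of occurrences of each word across the corpus.
word_counts = Counter(word for doc in documents for word in doc.split())

# Words that occur exactly once in the whole corpus.
rare_words = sorted(word for word, count in word_counts.items() if count == 1)

# Rough analogue of max_features: keep only the most frequent words.
max_features = 3
top_words = [word for word, _ in word_counts.most_common(max_features)]

print("rare words:", rare_words)
print("top words:", top_words)
```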

Notice that the selected list of features contains words like the, is, was, and so on. These words are irrelevant in determining the sentiment of a document. They are called stop words and can be removed from the dictionary, which reduces the number of features further.

Removing Stop Words

sklearn.feature_extraction.text provides a pre-defined list of English stop words, which can be used as a reference for removing stop words from the dictionary, that is, from the feature set.

from sklearn.feature_extraction import text
my_stop_words = text.ENGLISH_STOP_WORDS
print("Few stop words: ", list(my_stop_words)[0:10])

Output (the order may vary, because the stop words are stored in an unordered set):

Few stop words:  ['nowhere', 'after', 'whether', 'full', 'back', 'eg', 'itself', 'seem', 'interest', 'upon']

Additional stop words can also be added to the list. For example, the movie names and the word “movie” itself can be treated as stop words in this case. They can be added to the existing list of stop words as follows:

my_stop_words = text.ENGLISH_STOP_WORDS.union(['harry', 'potter', 'code', 'vinci',
                                               'da', 'mountain', 'movie', 'movies'])

Distribution of Words Across Different Sentiments

Words with a positive or negative meaning are distributed differently across documents of different sentiments. This gives an initial idea of how good these words can be as features for predicting the sentiment of documents. For example, let us consider the word awesome.

import seaborn as sn

train_ds_df = pd.DataFrame(train_ds_features.todense())
train_ds_df.columns = features
train_ds_df['Sentiment'] = train_data.Sentiment
# Total occurrences of the word "awesome" for each sentiment class
sn.barplot(x='Sentiment', y='awesome', data=train_ds_df, estimator=sum)

As shown in the figure the word awesome appears mostly in positive sentiment documents.

Figure: Occurrences of the word awesome by sentiment.

How about a neutral word like really?

sn.barplot(x = 'Sentiment',y = 'really', data = train_ds_df, estimator= sum)

As shown in the figure, the word really appears in both positive and negative sentiment documents.

Figure: Occurrences of the word really by sentiment.

How about the word hate?

sn.barplot(x = 'Sentiment',y = 'hate', data = train_ds_df, estimator= sum)

As shown in the figure, the word hate occurs mostly in negative sentiment documents rather than positive ones. This makes sense.

Figure: Occurrences of the word hate by sentiment.

This gives us an initial idea that the words awesome and hate could be good features for determining the sentiment of a document.
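The totals behind such bar plots can be sketched without any plotting library by summing the occurrences of a word per sentiment class. The four labelled reviews below are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict

# (review text, sentiment) pairs, made up for illustration: 1 = positive, 0 = negative.
reviews = [
    ("this movie is awesome", 1),
    ("awesome acting really awesome", 1),
    ("i really hate this movie", 0),
    ("i hate it", 0),
]


def counts_by_sentiment(word, reviews):
    """Total number of occurrences of the word in each sentiment class."""
    totals = defaultdict(int)
    for review_text, sentiment in reviews:
        totals[sentiment] += review_text.split().count(word)
    return dict(totals)


for word in ["awesome", "really", "hate"]:
    print(word, counts_by_sentiment(word, reviews))
```

In this toy sample, awesome occurs only in positive reviews, hate only in negative ones, and really is split evenly, which is the kind of pattern the bar plots are meant to reveal.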

Conclusion

In this article, we saw that text data is unstructured and needs extensive pre-processing before models can be applied. Documents or sentences can be tokenized into unigrams or n-grams for building features. The documents can then be represented as vectors with words or n-grams as features. The vectors can be created using simple counts, TF (term frequency), or TF-IDF values. A robust set of features can be created by removing stop words and applying stemming or lemmatization. The number of features can also be limited by selecting only the features with higher frequencies.

About the Author

Hi, I am Kajal Kumari. I have completed my Master’s from IIT(ISM) Dhanbad in Computer Science & Engineering. As of now, I am working as a Machine Learning Engineer in Hyderabad. You can also check out a few other blogs that I have written here.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
