Shivani Sharma — September 13, 2021

This article was published as a part of the Data Science Blogathon

Introduction

Let’s look at a practical application of the supervised NLP model fastText for detecting sarcasm in news headlines. About 80% of all information is unstructured, and text is one of the most common types of unstructured data. Because of its chaotic nature, analyzing, understanding, organizing, and sorting textual information are complex and time-consuming tasks. This is where NLP and text classification come in.

Text classification is a machine learning technique used to sort texts into categories. Using classifier models, companies can automatically structure all kinds of text, from emails and legal documents to social media posts, chatbot messages, and survey results. This saves time spent analyzing information, automates business processes, and supports data-driven business decisions.

fastText is a popular open-source text classification library released in 2016 by Facebook's AI Research (FAIR) lab. The company also provides pre-trained models: English word vectors (trained on English web crawl and Wikipedia data) and multilingual word vectors (trained models for 157 languages). The library supports both supervised learning (text classification) and unsupervised learning (obtaining vector representations of words). In this article, we’ll look at how it can be used to categorize news headlines.

Loading the fastText Library

import pandas as pd
import fasttext
from sklearn.model_selection import train_test_split
import re
from gensim.parsing.preprocessing import STOPWORDS
from gensim.parsing.preprocessing import remove_stopwords
pd.options.display.max_colwidth = 1000

Data for the Project

The dataset is a collection of news headlines annotated as sarcastic (articles from the news outlet The Onion) or non-sarcastic (articles from HuffPost).

Datalink: https://www.kaggle.com/rmisra/news-headlines-dataset-for-sarcasm-detection

Variables of the Data

  • is_sarcastic: 1 if the title is sarcastic, otherwise 0

  • headline: news article title

  • article_link: link to the original article

# Load the data (file name assumed from the Kaggle download, which is in JSON Lines format)
df_headline = pd.read_json('Sarcasm_Headlines_Dataset.json', lines=True)
# Check the number of observations and variables
df_headline.shape
(26709, 3)
# Display the first three rows
df_headline.head(3)
[Image: first three rows of the dataset]

# Display the number of sarcastic and non-sarcastic headlines in the dataset and their share
df_headline.is_sarcastic.value_counts()
0    14985
1    11724
df_headline.is_sarcastic.value_counts(normalize=True)
0    0.561047
1    0.438953

Here are some examples of sarcastic and non-sarcastic headlines:

df_headline[df_headline['is_sarcastic']==1].head(3)
[Image: first three sarcastic headlines]

df_headline[df_headline['is_sarcastic']==0].head(3)
[Image: first three non-sarcastic headlines]

Text preprocessing

One of the first steps to improve model performance is simple text preprocessing. Before we start building the classifier, we need to prepare the text: convert all words to lower case and remove punctuation, special characters, and numbers. To do this, let’s create a cleanup function and apply it to the headline variable.

# Create a text cleanup function

def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^\sa-zA-Z0-9@\[\]]', '', text)  # Remove punctuation and special characters
    text = re.sub(r'\w*\d+\w*', '', text)  # Remove tokens that contain digits
    text = re.sub(r'\s{2,}', ' ', text)  # Collapse repeated whitespace into a single space
    return text

# Apply it to the headline

df_headline['headline'] = df_headline['headline'].apply(clean_text)
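
To see what the cleanup does in practice, here is a minimal, self-contained sketch that applies the same three substitutions to a single string. The regular expressions are a reconstruction of the cleanup function described above, and the sample headline is made up for illustration rather than taken from the dataset.

```python
import re

def clean_text(text):
    """Lowercase, drop punctuation, drop tokens containing digits, squeeze spaces."""
    text = text.lower()
    text = re.sub(r'[^\sa-zA-Z0-9@\[\]]', '', text)  # keep letters, digits, whitespace, @, [ and ]
    text = re.sub(r'\w*\d+\w*', '', text)            # remove any token that contains a digit
    text = re.sub(r'\s{2,}', ' ', text)              # collapse runs of whitespace into one space
    return text

sample = "Area Man, 34, Wins 2nd Prize!"  # hypothetical headline
cleaned = clean_text(sample)
print(cleaned)  # area man wins prize
```

Note that the last substitution replaces repeated whitespace with a single space; replacing it with an empty string would glue neighbouring words together.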

Separation of data into training and test sets

Before we start training the model, we need to split the data into a training set and a test set. Most often, 80% of the data is used for training the model (the share can vary depending on the amount of data) and 20% for testing (accuracy verification).

# Divide data into training and test

train, test = train_test_split(df_headline, test_size = 0.2)
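
As a quick sanity check on the resulting set sizes, the arithmetic can be sketched as follows. It assumes that scikit-learn rounds the test share up when test_size is given as a fraction, and it uses the 26,709 rows reported for the dataset.

```python
import math

n_samples = 26709  # rows in the headline dataset
test_size = 0.2    # fraction passed to train_test_split

# With a fractional test_size, the number of test rows is rounded up
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 21367 5342
```

The test size of 5,342 matches the sample_size that fastText reports when the models are evaluated. Passing a fixed random_state to train_test_split makes the split reproducible between runs.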

Creating a text file

Next, we need to prepare the files in txt format. By default, fastText expects each line to start with the prefix __label__ followed directly by the class label, then a space and the text.

# Create text files for training the model with label and text
with open('train.txt', 'w') as f:
    for every_text, every_lbl in zip(train['headline'], train['is_sarcastic']):
        f.write(f'__label__{every_lbl} {every_text}\n')
with open('test.txt', 'w') as f:
    for every_text, every_lbl in zip(test['headline'], test['is_sarcastic']):
        f.write(f'__label__{every_lbl} {every_text}\n')
# Display what our training data now looks like
!head -n 10 train.txt
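
For readers who cannot run the shell command, here is a minimal sketch of the expected file layout. The two headlines, the temporary directory, and the file name sample.txt are illustrative assumptions, not part of the dataset.

```python
import os
import tempfile

# Hypothetical (label, text) pairs standing in for rows of the training data
rows = [(1, "area man wins prize"), (0, "senate passes budget bill")]

path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "w", encoding="utf-8") as f:
    for label, text in rows:
        # One example per line: the __label__ prefix, the class, a space, then the text
        f.write(f"__label__{label} {text}\n")

with open(path, encoding="utf-8") as f:
    lines = f.read().splitlines()

print(lines)
# ['__label__1 area man wins prize', '__label__0 senate passes budget bill']
```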

Building the model using fastText

To train the model, pass the path of the training file to fastText:

# First model without hyperparameter optimization
model1 = fasttext.train_supervised('train.txt')
# Create a function to display the training results of the model
def print_results(sample_size, precision, recall):
    precision = round(precision, 2)
    recall = round(recall, 2)
    print(f'{sample_size=}')
    print(f'{precision=}')
    print(f'{recall=}')
# Apply the function
print_results(*model1.test('test.txt'))
sample_size=5342
precision=0.85
recall=0.85

The results, while not perfect, look promising.
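
The two scores returned by model.test are precision and recall at k, with k = 1 by default. The sketch below shows how they are computed under the assumption that every example has exactly one true label, as in this dataset. The label lists are invented for illustration, and the helper name precision_recall_at_1 is hypothetical.

```python
def precision_recall_at_1(true_labels, predicted_labels):
    """Precision and recall at k=1 for single-label examples."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    n_predictions = len(predicted_labels)  # one prediction per example (k=1)
    n_true = len(true_labels)              # one true label per example
    return correct / n_predictions, correct / n_true

# Invented labels, for illustration only
y_true = ["1", "0", "0", "1", "0"]
y_pred = ["1", "0", "1", "1", "0"]

precision, recall = precision_recall_at_1(y_true, y_pred)
print(precision, recall)  # 0.8 0.8
```

Because both denominators equal the number of examples in this setting, the two scores coincide with plain accuracy, which is why precision and recall are identical in the reported results.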

Optimizing hyperparameters of fastText

Finding the best hyperparameters manually can be time-consuming. By default, fastText sees each training example only five times during training, which is fairly few given that our training set contains only about 21,000 examples. The number of passes over each example (also known as the number of epochs) can be increased with the epoch argument:

# Second model with 25 epochs
model2 = fasttext.train_supervised('train.txt', epoch=25)
print_results(*model2.test('test.txt'))
sample_size=5342
precision=0.83
recall=0.83

As you can see, the accuracy of the model has not increased (it dropped slightly, from 0.85 to 0.83). Another way to influence training is to increase (or decrease) the learning rate of the algorithm. This corresponds to how much the model changes after each example is processed. A learning rate of 0 would mean that the model does not change at all and therefore does not learn anything. Good learning rates are in the range of 0.1 – 1.0. We can also manually set this hyperparameter with the lr argument:

# Third model with 10 epochs and a learning rate of 1.0

model3 = fasttext.train_supervised('train.txt', epoch=10, lr=1.0)
print_results(*model3.test('test.txt'))
sample_size=5342
precision=0.83
recall=0.83
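
The role of the learning rate can be illustrated with a generic gradient step. This is a simplified sketch with invented numbers, not fastText's exact update rule, and the helper name sgd_step is hypothetical.

```python
def sgd_step(weight, gradient, lr):
    """One plain gradient-descent update: move against the gradient, scaled by lr."""
    return weight - lr * gradient

weight, gradient = 0.5, 0.2  # invented numbers, for illustration only

for lr in (0.0, 0.1, 1.0):
    # lr = 0.0 leaves the weight unchanged; larger values move it further per example
    print(lr, sgd_step(weight, gradient, lr))
```

With lr = 0.0 the weight stays at its starting value, which is the "does not learn anything" case; the larger the learning rate, the bigger the change after each example.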

Finally, we can improve the performance of the model by using word bigrams rather than just unigrams. This is especially important for classification tasks where word order matters, for example, sentiment analysis or detecting criticism or sarcasm. For this, we will set the argument wordNgrams to 2 in the model.

model4 = fasttext.train_supervised('train.txt', epoch=10, lr=1.0, wordNgrams=2)
print_results(*model4.test('test.txt'))
sample_size=5342
precision=0.86
recall=0.86
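
To make the idea of word bigrams concrete, here is a small sketch that lists the unigrams and bigrams of a made-up headline. It only illustrates which word sequences become available as features; fastText's internal handling of n-grams is not reproduced here, and the helper name word_ngrams is hypothetical.

```python
def word_ngrams(text, n):
    """Return all contiguous word sequences of length 1 to n."""
    tokens = text.split()
    grams = []
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            grams.append(" ".join(tokens[start:start + size]))
    return grams

features = word_ngrams("what a great day", 2)
print(features)
# ['what', 'a', 'great', 'day', 'what a', 'a great', 'great day']
```

Pairs such as "great day" keep a little of the word order that is lost when each word is treated on its own.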

Thanks to this sequence of steps, we were able to go from 85% to 86% precision:

  • preprocessing the text;

  • changing the number of epochs (using the epoch argument, standard range [5 – 50]);

  • changing the learning rate (using the lr argument, standard range [0.1 – 1.0]);

  • using word n-grams (using the wordNgrams argument, standard range [1 – 5]).

The fastText auto-tuning feature searches for the hyperparameters that give the highest F1 score. To use it, pass a validation file through the autotuneValidationFile argument (here we reuse the test dataset; with enough data, a separate validation file gives a more honest estimate):

model5 = fasttext.train_supervised('train.txt', autotuneValidationFile='test.txt')
print_results(*model5.test('test.txt'))
sample_size=5342
precision=0.87
recall=0.87

You can also direct the hyperparameter search towards the score of a specific label by adding the autotuneMetric argument:

model6 = fasttext.train_supervised('train.txt', autotuneValidationFile='test.txt', autotuneMetric="f1:__label__1")
print_results(*model6.test('test.txt'))
sample_size=5342
precision=0.87
recall=0.87
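
For reference, the F1 score that auto-tuning optimizes is the harmonic mean of precision and recall. A minimal sketch, using invented counts for the sarcastic class (label 1); the helper name f1_from_counts is hypothetical:

```python
def f1_from_counts(true_positives, false_positives, false_negatives):
    """F1 for one label, computed from its confusion counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return 2 * precision * recall / (precision + recall)

# Invented counts for label 1, for illustration only
f1 = f1_from_counts(true_positives=80, false_positives=20, false_negatives=40)
print(round(f1, 2))  # 0.73
```

Here precision is 0.8 and recall is about 0.67, so the F1 score lands between the two but closer to the lower value.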

Let’s save the model results and create a function to classify the new data:

# Save the model with optimized hyperparameters and the highest accuracy

model6.save_model('optimized.model')

fastText can also compress the model into a much smaller file through quantization, at the cost of only a small loss in performance.

model6.quantize(input='train.txt', retrain=True)

Results for the model using fastText Library

We can also test the model on new data with real headlines. For this, we will use the News Aggregator Dataset (https://www.kaggle.com/uciml/news-aggregator-dataset) from Kaggle:

# Loading data
df_headline_test = pd.read_csv('uci-news-aggregator.csv')
# Display the first three headlines
df_headline_test.TITLE.head(3)
[Image: first three titles of the News Aggregator Dataset]

 

Let’s apply the text classification function to the new headlines and create variables with the predicted label and its probability:

# Prepare new data for classification
df_headline_test['TITLE'] = df_headline_test['TITLE'].apply(clean_text)
# Create a function to classify text
def predict_sarcasm(text):
    return model6.predict(text, k=1)
# Transform variables into a convenient format
df_headline_test['predict_score'] = df_headline_test.TITLE.apply(predict_sarcasm)
df_headline_test['predict_score'] = df_headline_test['predict_score'].astype(str)
df_headline_test[['label','probability']] = df_headline_test.predict_score.str.split(" ", expand=True)
# regex=False treats the brackets as literal characters in every pandas version
df_headline_test['label'] = df_headline_test['label'].str.replace("(", '', regex=False)
df_headline_test['label'] = df_headline_test['label'].str.replace(")", '', regex=False)
df_headline_test['label'] = df_headline_test['label'].str.replace("__", ' ', regex=False)
df_headline_test['label'] = df_headline_test['label'].str.replace(",", '', regex=False)
df_headline_test['label'] = df_headline_test['label'].str.replace("'", '', regex=False)
df_headline_test['label'] = df_headline_test['label'].str.replace("label", '', regex=False)
df_headline_test['probability'] = df_headline_test['probability'].str.replace("array", '', regex=False)
df_headline_test['probability'] = df_headline_test['probability'].str.replace("(", '', regex=False)
df_headline_test['probability'] = df_headline_test['probability'].str.replace(")", '', regex=False)
df_headline_test['probability'] = df_headline_test['probability'].str.replace("[", '', regex=False)
df_headline_test['probability'] = df_headline_test['probability'].str.replace("]", '', regex=False)
# Remove unnecessary variable
df_headline_test = df_headline_test.drop(columns=['predict_score'])
# Display the number of predicted sarcastic and non-sarcastic headlines
df_headline_test.label.value_counts(normalize=True)

OUTPUT

0 0.710827

1 0.289173

We can see that about 29% of the headlines were classified as sarcasm.
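
The chain of string replacements used to split each prediction into a label and a probability can be avoided by unpacking the prediction directly. The sketch below assumes that a single prediction is a pair of labels and probabilities; it uses a hand-written stand-in instead of a real model call, and the helper name split_prediction is hypothetical.

```python
def split_prediction(prediction):
    """Turn a (labels, probabilities) pair into a clean (label, probability) tuple."""
    labels, probabilities = prediction
    label = labels[0].replace("__label__", "")  # strip the fastText label prefix
    return label, float(probabilities[0])

# Stand-in for the value returned by predict(text, k=1); the real call
# returns a tuple of labels and a NumPy array of probabilities
fake_prediction = (("__label__0",), [0.98])

label, probability = split_prediction(fake_prediction)
print(label, probability)  # 0 0.98
```

Applied to each title, this yields the label and the probability as separate values, with the probability kept as a number rather than a string.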

Conclusion

In conclusion, it should be noted that fastText is not one of the most recent developments in text classification (the library was released in 2016). At the same time, it is a good starting point for beginners: for NLP text classification tasks of any complexity, the model has significant advantages in its ease of use, training speed, and automatic tuning of hyperparameters.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.
