End-to-End NLP Project on Quora Duplicate Questions Pair Identification

Raghav Agrawal 08 Aug, 2023 • 29 min read

Introduction

Automation is reaching every sector today, and industries adopt new technologies and products to stay ahead in the market. One of the leading technologies attracting attention is NLP (Natural Language Processing). It helps computers understand, interpret, and respond to human language, and many products on the market rely on it to let people interact with machines intelligently across a range of tasks. Let’s get started with our NLP project on Quora questions!

To understand how the industry uses NLP to create fantastic applications to tackle real-world problems, we will walk through one excellent application of NLP in this post. The project will teach us various things, and the learning objectives are outlined below.

Complete NLP Life-cycle with Machine learning models
How to analyze and preprocess text using NLP techniques
Text Data visualization to find hidden patterns
Advanced Feature Engineering for NLP
The necessity of Dimensionality Reduction

This article was published as a part of the Data Science Blogathon.

Problem Statement and Objective
Prerequisites for Project Development
Dataset Description
Project Development Overflow
Basic Data Analysis
Feature Engineering
Exploratory Analysis of Newly Added Features
Machine Learning Model Creation
Optimizing the Current Model Performance
Text preprocessing
Advance Feature Engineering
EDA of Newly created Features
Dimensionality Reduction
Machine Learning Modeling Part
Deploy the Machine Learning model over Cloud
Deploy to Heroku
Conclusion
Resources

Problem Statement and Objective

Human language is messy, and the same sentence can be phrased in many different ways. There is a lot of online content today, but much of it repeats the same information. This confuses readers and makes answers harder to interpret and remember. If you search for anything on a search engine, you get several FAQs that answer the same thing in different ways. The FAQs and the block of similar searches show the many ways people phrase what you are looking for, but in the end the resulting articles and blogs are the same.

So our problem statement for the article is on Quora’s question-answering platform. It is a popular online platform where people worldwide can ask questions and receive answers from a diverse community of users. The topics range from science, technology, and history to personal development, health, and entertainment. Quora was founded in 2009 and has become a go-to destination for knowledge seekers and experts. The website features a voting system that allows users to upvote or downvote answers based on their quality, which helps to ensure that the most helpful and accurate responses are prominently displayed.

Objective: Businesses like Quora are looking for a solution that can be integrated into their systems to detect similar queries, saving human time by reusing an existing answer for a repeated question. So, the objective of this article is to develop a generalized model and understand how NLP and machine learning can be used to detect similar queries.

Prerequisites for Project Development

Python – You should be familiar with Python programming with its syntax and indentations.
Pandas – Data analysis is essential before building a model. Pandas is a Python library that helps to analyze a high volume of data with straightforward functions and methods.
Matplotlib – Graphs help us to understand the data in a better way, so if you hold knowledge of Python visualizing libraries, then it will help to get the solution quickly.
Sklearn – You should be familiar with machine learning and feature engineering because we will use them to extract different features and train machine learning models.

Dataset Description

The dataset we will use is a very popular one that Quora hosted in a Kaggle competition with prize money of $25,000. The training file contains 6 columns: a row id, two columns with the ids of the two questions (qid1, qid2), two columns with the question text (question1, question2), and the target column is_duplicate, whose value is binary (1 means duplicate, and 0 means non-duplicate).

You can easily download the dataset from Kaggle, or for practice purposes, you can create a new Kaggle notebook for the corresponding dataset.

Project Development Overflow

It is good practice to be clear about the general project flow before starting, so below are the simple steps we will follow in the article to complete the project. By this time, I hope you have downloaded the dataset and created an empty Python notebook.

Basic Data Analysis
Feature Engineering
Model Development
Optimize the model to increase performance
Web application creation
Deployment over cloud

Basic Data Analysis

EDA (exploratory data analysis) is a process that helps to understand the given dataset better and more practically. It includes multiple techniques and methods to dig deep into the data and extract the hidden patterns and relationships between the different entities in the dataset. We will perform the EDA in some easy steps.

Before performing data preprocessing and exploratory analysis, it’s essential to know the irregularities of our data. This helps us make decisions for further analysis and modeling.

1. Load the Dataset

The first task is to load the data into the notebook and import all necessary libraries.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings('ignore')

data = pd.read_csv("train.csv")

# Data is too large, so take a small sample for analysis
new_df = data.sample(30000,random_state=2)

data.head()

2. Check for Null Values

We have a large corpus of data, but most importantly, it should not be empty. So NULL values must be checked before moving forward.

new_df.isnull().sum()

3. Check for Duplicate Values

It is essential to find out if we have a data redundancy problem. Are the same entries present multiple times or not?

new_df.duplicated().sum()

So the data is non-redundant, but if you plot the distribution of the target variable, you will find the data leans towards 0, which means we have a slight class imbalance problem with our data.

# Distribution of duplicate and non-duplicate questions

print(new_df['is_duplicate'].value_counts(normalize = True))
#print((new_df['is_duplicate'].value_counts()/new_df['is_duplicate'].count())*100)
new_df['is_duplicate'].value_counts().plot(kind='bar')
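As a small illustration of the null and duplicate checks from steps 2 and 3, here is a sketch on a made-up toy frame with the same column layout as train.csv. One thing to keep in mind is that `id` is unique per row, so comparing all columns can never flag a repeated pair. Restricting the comparison with `subset` is an assumption about what should count as a duplicate, not something the steps above do.

```python
import pandas as pd

# Toy frame with the same columns as train.csv (all values are made up).
toy = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "qid1": [1, 3, 1, 5],
    "qid2": [2, 4, 2, 6],
    "question1": ["What is NLP?", "How do I learn Python?", "What is NLP?", None],
    "question2": ["What does NLP mean?", "What is the best way to learn Python?",
                  "What does NLP mean?", "Is Quora free?"],
    "is_duplicate": [1, 1, 1, 0],
})

# Missing values per column.
null_counts = toy.isnull().sum()

# Rows whose question pair already appeared earlier in the frame.
# `id` is left out of the comparison because it is unique for every row.
dup_rows = toy.duplicated(subset=["qid1", "qid2", "question1", "question2"]).sum()

print(null_counts["question1"], dup_rows)  # 1 1
```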

4. Check for Repeated Questions

There are no similar entries in a single question column, but there may be repetition from the total number of questions (question-1, question-2). To determine how many questions are unique and how many are repeated, we will merge the questions from both columns into a single list.

# Repeated questions

qid = pd.Series(data['qid1'].tolist() + data['qid2'].tolist())
print('Number of unique questions',np.unique(qid).shape[0])
x = qid.value_counts()>1
print('Number of questions getting repeated',x[x].shape[0])

Histogram for Repeated Questions

We can create a histogram of how often questions are repeated. In simple words, how many questions occur 1 time, 2 times, and so on? The number of occurrences is on the x-axis, and the count of questions (on a log scale) is on the y-axis.

# Repeated questions histogram

plt.hist(qid.value_counts().values,bins=160)
plt.yscale('log')
plt.show()
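To make it clearer what the histogram counts, here is a minimal sketch with a handful of made-up question ids. It first counts how often each id occurs and then counts how many ids share each occurrence count, which is the quantity on the y-axis.

```python
from collections import Counter

# Hypothetical question ids, pooled from both id columns as in the code above.
qids = [1, 2, 3, 1, 4, 2, 1, 5]

# How many times each question id occurs.
occurrences = Counter(qids)

# How many questions occur exactly k times: this is what the histogram plots.
histogram = Counter(occurrences.values())

unique_questions = len(occurrences)
repeated_questions = sum(1 for c in occurrences.values() if c > 1)

print(unique_questions, repeated_questions, sorted(histogram.items()))
# 5 2 [(1, 3), (2, 1), (3, 1)]
```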

Now we have some idea about the data, how it looks and what irregularities it holds that we must overcome. At this point, if you train a machine learning model like Random Forest or XGBoost on the vectorized questions alone, you will get approximately 74 percent accuracy. So we aim to use feature engineering techniques to improve the model performance.

Feature Engineering

Feature engineering is a classic way of adding new features to the data that help predict the output variable and improve the model’s accuracy. A good feature has a direct impact on the model. Feature engineering consists of transformation, scaling, feature extraction, feature encoding, EDA, etc.

We will add 7 more features to our existing dataset. The bag-of-words model applied to question 1 and question 2 will produce further features, and all of them will be passed to the machine learning model after exploratory analysis.

1. Question Length

The size of the question is a critical feature because when we vectorize it, the question gets split by words, so having the length feature is good. The length we are having is the character-wise length. So it will create 2 new features for the length of questions 1 and 2.

new_df['q1_len'] = new_df['question1'].str.len() 
new_df['q2_len'] = new_df['question2'].str.len()

2. Number of Words

The number of words in both questions is another feature that should impact the model performance. So, it will add 2 new features for questions 1 and 2. To add the feature, split the sentence with space and extract the length of the list.

new_df['q1_num_words'] = new_df['question1'].apply(lambda row: len(row.split(" ")))
new_df['q2_num_words'] = new_df['question2'].apply(lambda row: len(row.split(" ")))
new_df.head()

3. Common Words

Another feature is the number of common words in both questions. It helps identify the similarity between them. The calculation is simple: find the set of unique words in each question and take the length of the intersection of the two sets.

def common_words(row):
    w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
    w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
    return len(w1 & w2)

new_df['word_common'] = new_df.apply(common_words, axis=1)
new_df.head()

4. Total Words

The sum of the total number of unique words in each question. In simple terms, find the number of unique words in both questions and return their sum.

def total_words(row):
    w1 = set(map(lambda word: word.lower().strip(), row['question1'].split(" ")))
    w2 = set(map(lambda word: word.lower().strip(), row['question2'].split(" ")))    
    return (len(w1) + len(w2))

new_df['word_total'] = new_df.apply(total_words, axis=1)
new_df.head()

5. Words Share

It is an interesting feature and simple to add. To calculate it, divide the number of common words by the total number of words.

new_df['word_share'] = round(new_df['word_common']/new_df['word_total'],2)
new_df.head()
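As a worked example of the three word-overlap features, here is a minimal sketch for a single made-up question pair. The function name `word_stats` is only for illustration. It follows the same set logic as `common_words` and `total_words` above.

```python
def word_stats(q1, q2):
    """Return (common, total, share) for one question pair."""
    w1 = {w.lower().strip() for w in q1.split(" ")}
    w2 = {w.lower().strip() for w in q2.split(" ")}
    common = len(w1 & w2)          # words present in both questions
    total = len(w1) + len(w2)      # unique words of q1 plus unique words of q2
    return common, total, round(common / total, 2)

stats = word_stats("How do I learn Python", "How can I learn Python quickly")
print(stats)  # (4, 11, 0.36)
```

Because a shared word is counted once in each set, the share tops out at 0.5 when both questions contain exactly the same words.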

Exploratory Analysis of Newly Added Features

We have introduced some new features in the dataset, and it is an excellent time to analyze the relationship and their spread with the target variable.

1. Distribution of Questions

We will analyze the length of the questions: the average, minimum, and maximum number of characters in questions 1 and 2. So plot the distribution for question 1 and repeat it for question 2.

sns.displot(new_df['q1_len'])
print('minimum characters',new_df['q1_len'].min())
print('maximum characters',new_df['q1_len'].max())
print('average num of characters',int(new_df['q1_len'].mean()))

2. Distribution of the Number of Words

Generate the same graphical analysis for the average, minimum, and maximum number of words in each question.

sns.displot(new_df['q1_num_words'])
print('minimum words',new_df['q1_num_words'].min())
print('maximum words',new_df['q1_num_words'].max())
print('average num of words',int(new_df['q1_num_words'].mean()))

3. Common Words Analysis

We will plot the distribution of common words in questions 1 and 2. In this distribution plot, we will have separate curves for non-duplicate and duplicate pairs.

# common words
sns.distplot(new_df[new_df['is_duplicate'] == 0]['word_common'],label='non duplicate')
sns.distplot(new_df[new_df['is_duplicate'] == 1]['word_common'],label='duplicate')
plt.legend()
plt.show()

The blue curve is the non-duplicate class, and the orange curve is the duplicate class. According to the graph, non-duplicate predictions are more likely if the number of common terms exceeds 4.

4. Total Words Analysis

The same analysis for total words against target variable unique entries.

# total words
sns.distplot(new_df[new_df['is_duplicate'] == 0]['word_total'],label='non duplicate')
sns.distplot(new_df[new_df['is_duplicate'] == 1]['word_total'],label='duplicate')
plt.legend()
plt.show()

Pairs are more likely to be duplicates if the total word count is between 0 and 20, but if it is greater than 40, non-duplicates become much more likely.

5. Words Share Analysis

Plot the distribution plot for duplicate and non-duplicate against the words share column.

# word share
sns.distplot(new_df[new_df['is_duplicate'] == 0]['word_share'],label='non duplicate')
sns.distplot(new_df[new_df['is_duplicate'] == 1]['word_share'],label='duplicate')
plt.legend()
plt.show()

A non-duplicate is likely if the word share value is less than 0.2, while a duplicate is more likely if the word share value is greater than 0.2.

Machine Learning Model Creation

After performing the above EDA, we gain the confidence to keep the features in our dataset and move to the modelling part.

1. Separate the Independent and Dependent features

First, we need to drop the unnecessary columns and pick the columns needed for training and the one target feature (dependent). So we will put the questions in one dataframe for vectorizing and the other features, together with the target variable, in another dataframe, and concatenate them after vectorizing.

ques_df = new_df[['question1','question2']]
ques_df.head()

final_df = new_df.drop(columns=['id','qid1','qid2','question1','question2'])
print(final_df.shape)
final_df.head()

2. Vectorizing the Feature

We need to turn the questions into numbers because we can’t provide strings to the model. There are a variety of text vectorizing techniques; for the moment, we’ll use bag of words (BoW). BoW is a technique for extracting features from text for machine learning algorithms. It represents text by the occurrence of words in the corpus, which entails two things. The first is a vocabulary of known words (each unique word in the corpus becomes a new feature), and the second is a measure of the presence of those words (here, CountVectorizer stores how many times each vocabulary word appears in a question).

from sklearn.feature_extraction.text import CountVectorizer
# merge texts
questions = list(ques_df['question1']) + list(ques_df['question2'])

#Apply Bag of words model
cv = CountVectorizer(max_features=3000)
q1_arr, q2_arr = np.vsplit(cv.fit_transform(questions).toarray(),2)

#Create dataFrame for both the questions Feature
temp_df1 = pd.DataFrame(q1_arr, index= ques_df.index)
temp_df2 = pd.DataFrame(q2_arr, index= ques_df.index)
temp_df = pd.concat([temp_df1, temp_df2], axis=1)
temp_df.shape

#Concat vectorize dataframe with our newly added feature Df
final_df = pd.concat([final_df, temp_df], axis=1)
print(final_df.shape)
final_df.head()

The bag of words has generated 3000 features for each question, so 6000 for both. With our 7 newly added features, that makes 6007 independent variables and one dependent target variable.
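To show the mechanics behind these numbers, here is a hand-rolled sketch of the bag-of-words idea on a tiny made-up corpus. It is a simplified stand-in, not CountVectorizer itself: the helper names `fit_vocabulary` and `to_vector` are hypothetical, and the real CountVectorizer tokenizes differently (for example, its default token pattern ignores one-letter tokens).

```python
from collections import Counter

def fit_vocabulary(texts, max_features):
    """Keep the most frequent words across all texts, in a fixed order."""
    counts = Counter(word for text in texts for word in text.lower().split())
    return [word for word, _ in counts.most_common(max_features)]

def to_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

q1 = ["how do i learn python", "what is nlp"]
q2 = ["how can i learn python fast", "what does nlp mean"]

# One shared vocabulary is fitted on the questions from both columns.
vocab = fit_vocabulary(q1 + q2, max_features=5)

# One row per pair: question-1 counts followed by question-2 counts.
rows = [to_vector(a, vocab) + to_vector(b, vocab) for a, b in zip(q1, q2)]

print(len(vocab), len(rows[0]))  # 5 10
```

With `max_features=3000` the same layout gives 3000 columns per question and 6000 per pair.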

3. Train – Test Split

To measure the performance of the model, we need some data that the model has not seen as a test set, so we will split the final dataframe into a training set with 80 percent of the data and a test set with 20 percent.

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(final_df.iloc[:,1:].values,
                final_df.iloc[:,0].values,test_size=0.2,random_state=1)

4. Train the Machine Learning Models

To determine which model works best, we will train both Random Forest and XGBoost models.

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
y_pred = rf.predict(X_test)
accuracy_score(y_test,y_pred)

from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X_train,y_train)
y_pred = xgb.predict(X_test)
accuracy_score(y_test,y_pred)

5. Analyzing Model Performance

Random forest performs the best, and we can estimate that adding the seven features improves performance by 3 to 3.5 percent.

Optimizing the Current Model Performance

The main work now is increasing the model performance and getting a well-generalized model. So far, we have not performed any text preprocessing, so from now on we will clean and analyze the text in different ways to generate some advanced features, which we can call advanced feature engineering. I found these features in multiple notebooks available on Kaggle, where they are often referred to as magic features. For this, we will load the data again into a new dataframe because we will clean the text and calculate all features again.

import re
from bs4 import BeautifulSoup

df = pd.read_csv("train.csv")
new_df = df.sample(30000,random_state=2)

Text Preprocessing

The first step in a regular NLP project is to clean up the text and remove irregularities from the dataset. We will do the following text-cleaning steps.

Lowercase: If the text falls into one case, it is simple to vectorize and interpret because the vectorizer considers token and Token to be different words. So we will convert the entire text into lowercase.
String Equivalents: The text contains multiple symbols, so we will replace them with the corresponding words.
Expand Contraction: A contraction is a shortened written form of one or more words. For example, don’t stands for do not, so there are multiple contractions that we need to expand to their complete forms.
Remove HTML tags: The text contains some unnecessary HTML tags, so we will remove them.
Remove Punctuation: Punctuation does not carry much meaning for this task, so it is better to remove it.

Below is the code snippet that performs all these operations on a question when the function is called.

def preprocess(q):
    
    q = str(q).lower().strip()
    
    # Replace certain special characters with their string equivalents
    q = q.replace('%', ' percent')
    q = q.replace('
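The five cleaning steps listed above can be sketched with the standard library alone. This is a minimal sketch under stated assumptions, not the article's `preprocess` function: the name `clean_question`, the short `CONTRACTIONS` mapping, the `$` and `@` replacements, and the regex used in place of an HTML parser are all illustrative choices. HTML tags are stripped first here so that words wrapped in tags can still be matched as contractions.

```python
import re

# Illustrative subset only; a real mapping would cover many more contractions.
CONTRACTIONS = {"don't": "do not", "can't": "cannot", "what's": "what is", "i'm": "i am"}

def clean_question(q):
    """Apply the five cleaning steps to a single question string."""
    # 1. Lowercase
    q = str(q).lower().strip()

    # 2. Strip HTML tags (regex stand-in for an HTML parser)
    q = re.sub(r"<[^>]+>", " ", q)

    # 3. Replace symbols with words (assumed replacements)
    q = q.replace("%", " percent").replace("$", " dollar ").replace("@", " at ")

    # 4. Normalise curly apostrophes, then expand known contractions word by word
    q = q.replace("\u2019", "'")
    q = " ".join(CONTRACTIONS.get(word, word) for word in q.split())

    # 5. Replace punctuation with spaces and collapse repeated whitespace
    q = re.sub(r"[^\w\s]", " ", q)
    return re.sub(r"\s+", " ", q).strip()

cleaned = clean_question("<b>Don't</b> pay 10% more!")
print(cleaned)  # do not pay 10 percent more
```

A function like this would be applied to both question columns before any feature is recalculated.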

Add the Past 7 Features

The features we added above during basic feature engineering must be recalculated on the cleaned text. I am not adding the earlier code again; you only need to rerun those code cells to update the dataframe. If you have created a new notebook or loaded the data again, copy the old code and run it.

Advance Feature Engineering

Token Based Features

Tokens are the units, such as words, that make up a sentence. Here a single word is a token, so we will add some features at the token level that will help to calculate some valuable columns.

Cwc min – The ratio of the number of common words (excluding stop words) to the word count of the question with fewer words.
Cwc max – The ratio of the number of common words to the word count of the question with more words.
Csc min – The ratio of the number of common stop words to the smaller stop word count of the two questions.
Csc max – The ratio of the number of common stop words to the larger stop word count of the two questions.
Ctc min – The ratio of the number of common tokens to the smaller token count of the two questions.
Ctc max – The ratio of the number of common tokens to the larger token count of the two questions.
Last word equal – A binary feature with the value 1 if the last word of both questions is the same, else 0.
First word equal – A binary feature with the value 1 if the first word of both questions is the same, else 0.

from nltk.corpus import stopwords

def fetch_token_features(row):
    
    q1 = row['question1']
    q2 = row['question2']
    
    SAFE_DIV = 0.0001 

    STOP_WORDS = stopwords.words("english")
    
    token_features = [0.0]*8
    
    # Converting the Sentence into Tokens: 
    q1_tokens = q1.split() #words list
    q2_tokens = q2.split()
    
    if len(q1_tokens) == 0 or len(q2_tokens) == 0:
        return token_features

    # Get the non-stopwords in Questions
    q1_words = set([word for word in q1_tokens if word not in STOP_WORDS])
    q2_words = set([word for word in q2_tokens if word not in STOP_WORDS])
    
    #Get the stopwords in Questions
    q1_stops = set([word for word in q1_tokens if word in STOP_WORDS])
    q2_stops = set([word for word in q2_tokens if word in STOP_WORDS])
    
    # Get the common non-stopwords from Question pair
    common_word_count = len(q1_words.intersection(q2_words))
    
    # Get the common stopwords from Question pair
    common_stop_count = len(q1_stops.intersection(q2_stops))
    
    # Get the common Tokens from Question pair
    common_token_count = len(set(q1_tokens).intersection(set(q2_tokens)))
    
    
    token_features[0] = common_word_count / (min(len(q1_words), len(q2_words)) + SAFE_DIV)
    token_features[1] = common_word_count / (max(len(q1_words), len(q2_words)) + SAFE_DIV)
    token_features[2] = common_stop_count / (min(len(q1_stops), len(q2_stops)) + SAFE_DIV)
    token_features[3] = common_stop_count / (max(len(q1_stops), len(q2_stops)) + SAFE_DIV)
    token_features[4] = common_token_count / (min(len(q1_tokens), len(q2_tokens)) + SAFE_DIV)
    token_features[5] = common_token_count / (max(len(q1_tokens), len(q2_tokens)) + SAFE_DIV)
    
    # Last word of both question is same or not
    token_features[6] = int(q1_tokens[-1] == q2_tokens[-1])
    
    # First word of both question is same or not
    token_features[7] = int(q1_tokens[0] == q2_tokens[0])
    
    return token_features
    
#call the func and get the list of each row- of len 8
token_features = new_df.apply(fetch_token_features, axis=1)

#add the respective value in the dataframe
new_df["cwc_min"]       = list(map(lambda x: x[0], token_features))
new_df["cwc_max"]       = list(map(lambda x: x[1], token_features))
new_df["csc_min"]       = list(map(lambda x: x[2], token_features))
new_df["csc_max"]       = list(map(lambda x: x[3], token_features))
new_df["ctc_min"]       = list(map(lambda x: x[4], token_features))
new_df["ctc_max"]       = list(map(lambda x: x[5], token_features))
new_df["last_word_eq"]  = list(map(lambda x: x[6], token_features))
new_df["first_word_eq"] = list(map(lambda x: x[7], token_features))
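To see what these ratios look like on real text, here is a reduced sketch for one made-up pair. It assumes a tiny hand-written stop-word set instead of the NLTK list, and the function name `overlap_ratios` is hypothetical. It computes only the word, stop-word, and first/last-word features.

```python
# Tiny illustrative stop-word set; the code above uses NLTK's English list.
STOP_WORDS = {"what", "is", "the", "of", "a", "in"}
SAFE_DIV = 0.0001  # avoids division by zero when a set is empty

def overlap_ratios(q1, q2):
    t1, t2 = q1.split(), q2.split()
    words1 = {w for w in t1 if w not in STOP_WORDS}
    words2 = {w for w in t2 if w not in STOP_WORDS}
    stops1 = {w for w in t1 if w in STOP_WORDS}
    stops2 = {w for w in t2 if w in STOP_WORDS}
    common_words = len(words1 & words2)
    common_stops = len(stops1 & stops2)
    return {
        "cwc_min": common_words / (min(len(words1), len(words2)) + SAFE_DIV),
        "cwc_max": common_words / (max(len(words1), len(words2)) + SAFE_DIV),
        "csc_min": common_stops / (min(len(stops1), len(stops2)) + SAFE_DIV),
        "csc_max": common_stops / (max(len(stops1), len(stops2)) + SAFE_DIV),
        "last_word_eq": int(t1[-1] == t2[-1]),
        "first_word_eq": int(t1[0] == t2[0]),
    }

features = overlap_ratios("what is the capital of india",
                          "what is the current capital city of india")
for name, value in features.items():
    print(name, round(value, 2))
```

Here the shorter question has 2 non-stop words and the longer one has 4, with 2 in common, so `cwc_min` is close to 1 while `cwc_max` is close to 0.5.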

Length Based Features

1. Mean Length

It is a basic but useful feature: the average of the lengths of both questions, measured in words (tokens). For example, if the first question has 8 words and the second has 6 words, the average is 14 divided by 2, so the mean length is 7.

2. Absolute Length Difference

The absolute difference between the lengths of the two questions (in number of words).

3. Longest Substring Ratio

The ratio of the length of the longest common substring of the two questions to the length of the shorter question. First find the substrings shared by both questions, determine the longest one, and then divide its length in characters by the character length of the shorter question (the code adds 1 to the denominator to avoid division by zero).

import distance

def fetch_length_features(row):
    
    q1 = row['question1']
    q2 = row['question2']
    
    length_features = [0.0]*3
    
    # Converting the Sentence into Tokens: 
    q1_tokens = q1.split()
    q2_tokens = q2.split()
    
    if len(q1_tokens) == 0 or len(q2_tokens) == 0:
        return length_features
    
    # Absolute length features
    length_features[0] = abs(len(q1_tokens) - len(q2_tokens))
    
    #Average Token Length of both Questions
    length_features[1] = (len(q1_tokens) + len(q2_tokens))/2
    
    # Longest common substring; guard against pairs that share no characters
    strs = list(distance.lcsubstrings(q1, q2))
    if strs:
        length_features[2] = len(strs[0]) / (min(len(q1), len(q2)) + 1)
    
    return length_features
    
#call the func and add the features to df
length_features = new_df.apply(fetch_length_features, axis=1)

new_df['abs_len_diff'] = list(map(lambda x: x[0], length_features))
new_df['mean_len'] = list(map(lambda x: x[1], length_features))
new_df['longest_substr_ratio'] = list(map(lambda x: x[2], length_features))
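If the `distance` package is not available, the longest substring ratio can also be sketched with the standard library's `difflib`. This is an alternative under that assumption, not the method used above, and the function name `longest_substr_ratio` is only for illustration.

```python
from difflib import SequenceMatcher

def longest_substr_ratio(q1, q2):
    """Longest common substring length over (shorter question length + 1)."""
    if not q1 or not q2:
        return 0.0
    # autojunk=False so frequent characters are not skipped on long strings
    matcher = SequenceMatcher(None, q1, q2, autojunk=False)
    match = matcher.find_longest_match(0, len(q1), 0, len(q2))
    return match.size / (min(len(q1), len(q2)) + 1)

ratio = longest_substr_ratio("how to learn python", "best way to learn python fast")
print(round(ratio, 2))  # 0.8
```

The shared substring here is " to learn python" (16 characters), and the shorter question has 19 characters, so the ratio is 16 / 20.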

Fuzzy Features

These features are generated using the fuzzywuzzy library. To learn more about them, I recommend the blog written by the creators of the library, where they explain in detail how these features are calculated.

# Fuzzy Features
from fuzzywuzzy import fuzz

def fetch_fuzzy_features(row):
    
    q1 = row['question1']
    q2 = row['question2']
    
    fuzzy_features = [0.0]*4
    
    # fuzz_ratio
    fuzzy_features[0] = fuzz.QRatio(q1, q2)

    # fuzz_partial_ratio
    fuzzy_features[1] = fuzz.partial_ratio(q1, q2)

    # token_sort_ratio
    fuzzy_features[2] = fuzz.token_sort_ratio(q1, q2)

    # token_set_ratio
    fuzzy_features[3] = fuzz.token_set_ratio(q1, q2)

    return fuzzy_features
    
fuzzy_features = new_df.apply(fetch_fuzzy_features, axis=1)

# Creating new feature columns for fuzzy features
new_df['fuzz_ratio'] = list(map(lambda x: x[0], fuzzy_features))
new_df['fuzz_partial_ratio'] = list(map(lambda x: x[1], fuzzy_features))
new_df['token_sort_ratio'] = list(map(lambda x: x[2], fuzzy_features))
new_df['token_set_ratio'] = list(map(lambda x: x[3], fuzzy_features))
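To build some intuition for what these scores measure, here is a rough approximation using only the standard library's `difflib`. It is a sketch of the idea, not fuzzywuzzy's implementation, so the numbers it produces may differ from the library's. The names `simple_ratio` and `token_sort` are hypothetical.

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    """Similarity score from 0 to 100 based on matching character blocks."""
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_sort(a, b):
    """Sort the words of each string first, so word order stops mattering."""
    return simple_ratio(" ".join(sorted(a.split())), " ".join(sorted(b.split())))

q1 = "learn python fast"
q2 = "fast python learn"

plain = simple_ratio(q1, q2)
sorted_score = token_sort(q1, q2)
print(plain, sorted_score)
```

The plain score is penalised by the different word order, while the token-sorted score reaches 100 because both questions contain the same words.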

EDA of Newly Created Features

We have added 22 new features to our dataset (7 basic and 15 advanced), so before modeling, we want to be confident that these features help predict the output variable, so we run some analysis to identify specific patterns.

1. Minimum Variables with Target Variable

We plot the features calculated with the minimum against the target feature to see how they relate to duplication.

sns.pairplot(new_df[['ctc_min', 'cwc_min', 'csc_min', 'is_duplicate']],hue='is_duplicate')

We can differentiate between duplicate and non-duplicate. At low values the blue curve dominates, and after that the orange curve dominates.

2. Maximum Variable with Target Variable

We have also appended the maximum calculation, so let us plot them against target variables.

sns.pairplot(new_df[['ctc_max', 'cwc_max', 'csc_max', 'is_duplicate']],hue='is_duplicate')

We can see that the blue curve is the most dominant, and the features look helpful in predicting the output.

3. Last Word and First Word Analysis

We plot the first and last word match against the target variable.

sns.pairplot(new_df[['last_word_eq', 'first_word_eq', 'is_duplicate']],hue='is_duplicate')

We can see that if the last word does not match, there is a good chance that the pair is not a duplicate. Similarly, if the first word is not equal, the likelihood of non-duplicate is high.

4. Length-based Feature Analysis

The mean length and the absolute length difference do not give much information because both curves move almost together, but the longest substring ratio is useful, with the blue curve dominating part of its range.

sns.pairplot(new_df[['mean_len', 'abs_len_diff','longest_substr_ratio', 
        'is_duplicate']],hue='is_duplicate')

5. Fuzzy Feature Analysis

All 4 features give a good understanding of the output variables, which can be useful.

sns.pairplot(new_df[['fuzz_ratio', 'fuzz_partial_ratio','token_sort_ratio',
        'token_set_ratio', 'is_duplicate']],hue='is_duplicate')

Dimensionality Reduction

We will employ t-SNE (t-distributed Stochastic Neighbor Embedding), a non-linear unsupervised dimensionality reduction technique for data exploration and visualization. First, we’ll plot the data on a 2-D graph. Then, we’ll use it to visualize the data in 3-D, allowing you to see the impact of the 15 features on the target variable. Visit the notebook to get the code for the plotly 3-D view.

# Using TSNE for dimensionality reduction of the 15 features
# (generated after cleaning the data) to 2 dimensions

from sklearn.preprocessing import MinMaxScaler

X = MinMaxScaler().fit_transform(new_df[['cwc_min', 'cwc_max', 'csc_min', 
    'csc_max' , 'ctc_min' , 'ctc_max' , 'last_word_eq', 'first_word_eq' , 
    'abs_len_diff' , 'mean_len' , 'token_set_ratio' , 'token_sort_ratio' ,  
    'fuzz_ratio' , 'fuzz_partial_ratio' , 'longest_substr_ratio']])
y = new_df['is_duplicate'].values

#train
from sklearn.manifold import TSNE

tsne2d = TSNE(
    n_components=2,
    init='random', # pca
    random_state=101,
    method='barnes_hut',
    n_iter=1000, # this argument is named max_iter in recent scikit-learn releases
    verbose=2,
    angle=0.5
).fit_transform(X)

#visualize
x_df = pd.DataFrame({'x':tsne2d[:,0], 'y':tsne2d[:,1] ,'label':y})

# draw the plot in appropriate place in the grid
sns.lmplot(data=x_df, x='x', y='y', hue='label', fit_reg=False, 
            height=8,palette="Set1",markers=['s','o'])

We can see from the graph that there is some separation between classes 0 and 1, which shows the impact of adding the 15 new features to the data. If you view it in 3-D, the difference will be clearer.
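The code above rescales the features before running t-SNE. Since t-SNE works on distances between points, a feature on a 0 to 100 scale (like the fuzzy scores) would otherwise outweigh ratios that lie between 0 and 1. Here is a minimal sketch of the per-column min-max rescaling, with made-up values and a hypothetical helper name `min_max_scale`:

```python
def min_max_scale(column):
    """Rescale a list of numbers to the range [0, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:               # constant column: nothing to spread out
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

fuzz_ratio = [35, 60, 85, 100]        # made-up values on a 0-100 scale
abs_len_diff = [0, 2, 5, 10]          # made-up values in words

scaled_fuzz = min_max_scale(fuzz_ratio)
scaled_diff = min_max_scale(abs_len_diff)
print(scaled_fuzz)
print(scaled_diff)  # [0.0, 0.2, 0.5, 1.0]
```

After scaling, both columns span the same range, so neither dominates the distance computation.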

Machine Learning Modeling Part

The data is now ready, and you must repeat the steps above to train the Random Forest and XGBoost models. You can rerun the cells or copy and paste the code. The only difference is that we now have 15 more features, which brings the total to 6022 features (6023 columns including the target). The random forest accuracy is approximately 78.7 percent, and XGBoost gives 79.2 percent. So with this much optimization, we could boost the performance by 2 to 2.5 percent. We believe that performance will only increase slowly from here, since the model has already captured the main signal.

Selecting the Best Model

This is where many engineers and practitioners make mistakes when picking the best model for deployment. We have two scenarios to consider while selecting between the models.

When the actual value is non-duplicate, but the model predicts duplicate.
When the actual value is duplicate, but the model predicts non-duplicate.

If you think over both scenarios, the first one, in which the model wrongly predicts duplicate, is riskier because the user experience is terrible. If you plot the confusion matrix for both models, Random Forest is the model that makes fewer mistakes in the first scenario, so we will go with Random Forest for deployment. Use the pickle library to save the model.

import pickle

pickle.dump(rf,open('model.pkl','wb'))
pickle.dump(cv,open('cv.pkl','wb'))
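The two kinds of mistakes used above to choose between the models can be counted directly from the predictions. Here is a minimal sketch with made-up labels for two hypothetical models. The names `error_counts`, `model_a`, and `model_b` are illustrative and do not refer to the trained models.

```python
def error_counts(y_true, y_pred):
    """Count the two kinds of mistakes for a binary duplicate classifier."""
    # Scenario 1: actually non-duplicate (0) but predicted duplicate (1)
    false_pos = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    # Scenario 2: actually duplicate (1) but predicted non-duplicate (0)
    false_neg = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return false_pos, false_neg

# Made-up labels and predictions for two hypothetical models
y_true  = [0, 0, 0, 1, 1, 1, 0, 1]
model_a = [0, 0, 1, 1, 0, 1, 0, 0]
model_b = [0, 1, 1, 1, 1, 1, 0, 1]

errors_a = error_counts(y_true, model_a)
errors_b = error_counts(y_true, model_b)
print("model_a", errors_a)  # model_a (1, 2)
print("model_b", errors_b)  # model_b (2, 0)
```

In this toy case model_b has the higher accuracy (6 of 8 correct against 5 of 8), yet model_a makes fewer scenario-1 mistakes, which is the kind of trade-off described above.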

Methods that you can follow to increase the model performance:

If you want to further increase the model performance for this NLP project, you can follow some of the methods below.

Increase Data: We have about 400,000 rows (4 lakh), but we were able to use only 30k because of RAM. If you have more memory, try to use more data, or use a cloud platform.
Preprocessing: You can explore the data further and apply more preprocessing methods, for example, stemming.
Multiple Models: You can train multiple algorithms like SVM, perceptron, gradient boosting, CatBoost, etc., so that you have a wider range of models to compare and select from.
Hyperparameter Tuning: Perform hyperparameter tuning, which will help increase the model performance.
Research: Try to research more on the problem statement and increase the number of features. If you visit Kaggle notebooks, different people have added different features, so you can find and add some of them, which might help boost the performance.
Implement multiple Techniques: You can use techniques like cross-validation, different feature extraction methods, feature scaling, etc.
Deep Learning: Try to implement a neural network, which may help to increase performance.

Deploy the Machine Learning Model Over the Cloud

It is time to make our machine learning model available for people in the community to use. We need a basic website that accepts two questions from a user and responds with whether the questions are duplicate or non-duplicate. For creating the website, we will use the Streamlit framework.

1. Create a Web App using Streamlit

Streamlit is a Python framework for creating data web apps without knowledge of front-end languages. Its functions generate the HTML for us. First, we accept two questions from the user; then we preprocess those questions and generate all the features before passing them to the model. For this, I have created a helper.py file that holds all the functions to preprocess the questions and generate the required features. To view the complete code of each file, please visit GitHub. Now we can create app.py, the main Streamlit application, with two input fields and one button to submit the form. The app loads the model, predicts the output, and displays it to the user.

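import streamlit as st
import helper
import pickle

# Load the trained model saved earlier with pickle
with open('model.pkl', 'rb') as f:
    model = pickle.load(f)

st.header('Duplicate Question Pairs')

# Two text inputs for the question pair
q1 = st.text_input('Enter question 1')
q2 = st.text_input('Enter question 2')

if st.button('Find'):
    # Preprocess both questions and build the feature vector
    query = helper.query_point_creator(q1, q2)
    result = model.predict(query)[0]

    # A truthy prediction (1) means the pair is a duplicate
    if result:
        st.header('Duplicate')
    else:
        st.header('Not Duplicate')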
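The app relies on helper.query_point_creator, which is not shown here (the real implementation lives in the author's GitHub repository). To give a feel for what such a helper does, here is a minimal sketch that turns a question pair into a few numeric features. The function name `basic_query_features` and the choice of features are assumptions made for illustration; the actual helper builds the full feature set the model was trained on.

```python
def basic_query_features(q1, q2):
    """Build a small, illustrative feature list for one question pair."""
    words1 = set(q1.lower().split())
    words2 = set(q2.lower().split())
    common = len(words1 & words2)       # unique words shared by both questions
    total = len(words1) + len(words2)   # unique words in q1 plus unique words in q2
    share = round(common / total, 2) if total else 0.0
    return [
        len(q1),          # characters in question 1
        len(q2),          # characters in question 2
        len(q1.split()),  # words in question 1
        len(q2.split()),  # words in question 2
        common,
        total,
        share,
    ]


features = basic_query_features("What is NLP", "What is machine learning")
print(features)
```

A scikit-learn model expects a 2-D input with one row per sample, so a single pair would be passed as `[features]`, and the features must match, in number and order, the ones used during training.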

By now, your project directory should contain two files: app.py with the Streamlit code and helper.py with all the preprocessing functions, along with the pickled files saved earlier. Open a terminal in the project directory and run the app with the command streamlit run app.py. Once the app starts, it opens at a localhost URL in your browser, where you can check that everything works.

2. Prepare Cloud files

We are using Heroku as our cloud platform to deploy the project. Before deploying, we need to give the platform a few details so that it can set up the runtime environment with all the libraries the project needs.

Procfile

Create a file named Procfile, which tells Heroku which command to run when starting the server.

web: sh setup.sh && streamlit run app.py

requirements.txt

We need to list all the required libraries so that the platform can install them before the server starts. Note that the package providing the sklearn module is published on PyPI as scikit-learn.

streamlit
scikit-learn
fuzzywuzzy
distance
bs4
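Before pushing to the cloud, it can help to confirm that every library the app imports is actually installed in your local environment. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper name. Keep in mind that import names can differ from PyPI package names (for example, the `sklearn` module is installed by the `scikit-learn` package).

```python
import importlib.util


def missing_modules(module_names):
    """Return the names from module_names that cannot be imported here."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


# Import names used by the app and its helper, per requirements.txt.
required = ["streamlit", "sklearn", "fuzzywuzzy", "distance", "bs4"]
print(missing_modules(required))
```

An empty list means all the modules can be found; any name that is printed needs to be installed and added to requirements.txt.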

setup.sh

This script creates Streamlit's configuration directory and writes a config.toml file that tells Streamlit to listen on the port Heroku assigns ($PORT) and to run in headless mode.

mkdir -p ~/.streamlit/

echo "\
[server]\n\
port = $PORT\n\
enableCORS = false\n\
headless = true\n\
\n\
" > ~/.streamlit/config.toml
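To make it easier to see what the script produces, the sketch below renders the same settings as a string for a given port. The function `render_streamlit_config` is hypothetical and only mirrors the settings in setup.sh; on Heroku the port comes from the $PORT environment variable, while 8501 is Streamlit's default port when running locally.

```python
def render_streamlit_config(port):
    """Return config.toml text with the same settings that setup.sh writes."""
    lines = [
        "[server]",
        f"port = {port}",
        "enableCORS = false",
        "headless = true",
    ]
    return "\n".join(lines) + "\n"


config_text = render_streamlit_config(8501)
print(config_text, end="")
```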

Upload the Code to GitHub

There are two ways to deploy the code to Heroku: one uses the Heroku CLI, and the other uses GitHub. We will use the second. Open GitHub and create a new repository. Then copy the repository’s SSH URL, open Git Bash in your project directory, and use the commands below to upload the code to the GitHub repository.

git init
git remote add origin "SSH URL"
git pull origin master
git add -A
git commit -m "Initial Commit"
git push origin master

Deploy to Heroku

Log in to Heroku and create a new app. Give it a unique name and click Create. On the next screen, choose GitHub as the deployment method; below that, you will find the option to connect to GitHub. Connect your GitHub account to Heroku and select the repository where you uploaded the code. Then simply click to deploy the master branch, and the build will start. Watch the logs; once the server starts successfully, you will get the public URL where your app runs. You can refer to our prior blog, where we have explained the deployment process in full detail.

Conclusion

NLP drives computer programs that translate text from one language to another, respond to spoken commands, and rapidly summarize large volumes of text, even in real time. There’s a good chance you’ve already interacted with NLP in the form of text generated in response to human queries. It was a great experience developing this project, and you can say Hurray! We have successfully deployed a cloud-based NLP application. The project is intensive and research-based, and it teaches a lot about machine learning: we started with fundamental data analysis, dove deeper to create 23 new features, and raised the performance from 74% to approximately 80%. I hope you also enjoyed developing this project and that it cleared many of your doubts about machine learning. Let us quickly summarise the main concepts we have learned from this article.

We learned about the lifecycle of an NLP project and the steps of machine learning model development.
We learned the most important aspect of feature engineering in NLP: how to find helpful new features and add them to the model.
We learned how to optimize a machine learning model using feature engineering.
We learned how to perform EDA across multiple text features and how to look for relationships between them.
We learned how to build a Streamlit app for the project and deploy it to the cloud in a few simple steps.

Resources

Below are the links to the code and files, for easy access and for troubleshooting errors while developing the project.

Python Notebook for data analysis – Kaggle
Code file access – GitHub

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.

Raghav Agrawal 08 Aug 2023

I am a final-year undergraduate who loves to learn and write about technology. I am a passionate learner and a data science enthusiast. I have been learning and working in the data science field for the past 2 years, and I aspire to grow as a Big Data architect.
