Raghav Kachroo — February 4, 2022

Feature Engineering on text data using Natural Language Processing Techniques.

This article focuses primarily on text data feature engineering. Within the same process, we will be going over the following techniques and processes:

  • Lemmatization / Stemming

  • Count Vectorizer

  • One Hot Encoding

  • Train Test Split

  • Principal Component Analysis

  • Some general text cleaning and null value imputation techniques

  • Exploratory Data Analysis

  • Latent Dirichlet Allocation (LDA)

  • Hyper-parameter Tuning

 

Please note that the techniques used in our process were settled on after much trial and error. Whether these or similar techniques suit your problem will always depend on the use case.

The evaluation metric for the algorithms is the F1 score, as we want to balance precision against recall.
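
As a rough illustration of why F1 balances the two quantities, the sketch below computes precision, recall and F1 from confusion-matrix counts. The function name and the counts are hypothetical and chosen only for this example; they are not taken from the dataset.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean, so it is pulled towards the lower of the two
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, for illustration only
p, r, f1 = precision_recall_f1(tp=60, fp=20, fn=40)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Because the harmonic mean is dominated by the smaller value, a model cannot reach a high F1 by maximising precision or recall alone.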

 

feature_engineering | Supreme Court Judgement Prediction

The Data

Let’s start out by understanding a little about the dataset used within this process. The dataset was sourced from the following link, through Kaggle.

https://www.kaggle.com/deepcontractor/supreme-court-judgment-prediction

 

< import pandas as pd

df = pd.read_csv('../input/supreme-court-judgment-prediction/justice.csv', delimiter=',', encoding = "utf8")

df.dropna(inplace=True)

df.head() >

 

The CSV file contains 3,303 rows and 16 columns. The first_party_winner column is the target.

The columns Unnamed: 0, docket, name, first_party, second_party, issue_area, facts_len, majority_vote, minority_vote, href, ID and term were dropped since they contribute little to predicting the target variable.

The remaining independent variables are decision_type, disposition and facts.

Missing values were dropped using .dropna(). Fewer than 5% of the values were null, so the affected rows were dropped directly rather than imputed.
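
One way to check the share of missing values before deciding between dropping and imputing is sketched below. The toy frame and its values are made up for illustration; on the real data the same two calls would be applied to df.

```python
import pandas as pd

# Toy frame standing in for the real dataset (values are made up)
toy = pd.DataFrame({
    "facts": ["a", "b", None, "d"],
    "disposition": ["affirmed", None, "reversed", "affirmed"],
    "term": [2001, 2002, 2003, 2004],
})

# Share of missing values per column, as a percentage
null_pct = toy.isna().mean() * 100
print(null_pct)

# Rows that survive a plain dropna()
kept = toy.dropna()
print(len(kept), "of", len(toy), "rows kept")
```

If the percentages were large, dropping rows would discard too much data and imputation would be worth considering instead.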

 

< df.drop(columns=['Unnamed: 0', 'docket','name','first_party','second_party', 'issue_area', 'facts_len', 'majority_vote', 'minority_vote', 'href', 'ID','term'], inplace=True) >

 

We separate the dataset into the target variable and two groups of independent variables: one (df_cat) that requires one-hot encoding to be machine-readable, and another (df_nlp) that holds text data which must be cleaned before features can be engineered from it.

 

< df_cat = df[['decision_type', 'disposition']]

df_target = df['first_party_winner']

df_nlp = df['facts'] >


Next, we reset the indices to avoid NaN values during concatenation. Resetting the indices also enables us to perform one-hot encoding on categorical data without raising errors.

 

< df_cat.reset_index(drop=True, inplace=True)

df_target.reset_index(drop=True, inplace=True)

df_nlp.reset_index(drop=True, inplace=True) >
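
The effect of skipping the index reset can be seen in a small sketch. The two frames below are invented for illustration: one keeps the gaps that dropna() leaves in its index, the other has a fresh 0..n-1 index, as a frame built from an encoder's array output would.

```python
import pandas as pd

# Toy frame with gaps in its index, as left behind by dropna()
left = pd.DataFrame({"facts": ["x", "y", "z"]}, index=[0, 2, 5])
# Freshly built frame with a default 0..n-1 index
right = pd.DataFrame({"target": [1, 0, 1]})

# concat aligns on index labels, so mismatched labels produce NaN cells
misaligned = pd.concat([left, right], axis=1)
aligned = pd.concat([left.reset_index(drop=True), right], axis=1)

print(misaligned.isna().sum().sum())  # NaN cells caused by the index mismatch
print(aligned.isna().sum().sum())     # NaN cells after resetting the index
```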

 

We begin by label encoding our target column ‘first_party_winner’, converting its values from True or False to 1 and 0 respectively.

 

< from sklearn import preprocessing

label_encoder = preprocessing.LabelEncoder()

df_target= label_encoder.fit_transform(df_target)




df_target1 = pd.DataFrame(df_target, columns=['first_party_winner'])

df_target1 >
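
To see which integer each class receives, the fitted encoder can be inspected. This is a minimal sketch on a hand-written list of booleans rather than on the real column.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoded = encoder.fit_transform([True, False, True, True])

print(encoded)           # integer codes, one per input value
print(encoder.classes_)  # position in this array is the assigned code
```

The classes are sorted before codes are assigned, which is why False maps to 0 and True to 1.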



Next, we perform feature engineering on the ‘facts’ column.



< df_nlp1 = pd.DataFrame(df_nlp, columns=['facts'])




df_nlp1['facts'] = df_nlp1['facts'].str.replace(r'<[^<>]*>', '', regex=True)

df_nlp1 >

 

The code above performs an initial cleaning of our ‘facts’ feature by stripping HTML tags.

Next, we tokenize our corpus and define a function that cleans the text further using regular expressions and applies lemmatization. Remember that you should run either stemming or lemmatization on your data, never both.

 

< import re

import nltk

corpus = df_nlp1["facts"]

lst_tokens = nltk.tokenize.word_tokenize(corpus.str.cat(sep=" "))




ps = nltk.stem.porter.PorterStemmer()

lem = nltk.stem.wordnet.WordNetLemmatizer()




lst_stopwords = nltk.corpus.stopwords.words("english")




def utils_preprocess_text(text, flg_stemm=False, flg_lemm=True, lst_stopwords=None):

    ## clean (convert to lowercase and remove punctuations and characters and then strip)

    text = re.sub(r'[^\w\s]', '', str(text).lower().strip())

            

    ## Tokenize (convert from string to list)

    lst_text = text.split()

    ## remove Stopwords

    if lst_stopwords is not None:

        lst_text = [word for word in lst_text if word not in 

                    lst_stopwords]

                

    ## Stemming (remove -ing, -ly, ...)

    if flg_stemm == True:

        ps = nltk.stem.porter.PorterStemmer()

        lst_text = [ps.stem(word) for word in lst_text]

                

    ## Lemmatisation (convert the word into root word)

    if flg_lemm == True:

        lem = nltk.stem.wordnet.WordNetLemmatizer()

        lst_text = [lem.lemmatize(word) for word in lst_text]

            

    ## back to string from list

    text = " ".join(lst_text)

    return text




df_nlp1["facts_clean"] = df_nlp1["facts"].apply(lambda x: utils_preprocess_text(x, flg_stemm=False, flg_lemm=True, lst_stopwords=lst_stopwords)) >
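
To make the cleaning steps concrete, here is a reduced sketch that uses only the standard library. It covers tag stripping, lowercasing, punctuation removal and stop-word filtering, and leaves out lemmatization. The function name clean_text, the tiny stop-word set and the sample sentence are assumptions made for this example; the article itself uses NLTK's English stop-word list.

```python
import re

# Tiny stop-word set for illustration only (the article uses NLTK's English list)
STOPWORDS = {"the", "a", "an", "of", "to", "was", "in", "and"}


def clean_text(text: str, stopwords=STOPWORDS) -> str:
    """Lowercase, strip punctuation, and drop stop words."""
    text = re.sub(r"[^\w\s]", "", str(text).lower().strip())
    tokens = [tok for tok in text.split() if tok not in stopwords]
    return " ".join(tokens)


sample = "<p>The petitioner was convicted of robbery, and appealed.</p>"
no_tags = re.sub(r"<[^<>]*>", "", sample)  # same tag-stripping pattern as in the article
print(clean_text(no_tags))
```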

EDA

 

The following code snippets let us visualise words and their frequencies across our whole dataset. We also filter the data by target value to draw insights from the shape and distribution of the dataset. Finally, we judge how well an LDA model can discriminate between different topics within the dataset. Note that the code should be run with both y = 0 and y = 1 to visualise each class of the binary target variable.

 

< import gensim

import matplotlib.pyplot as plt

import seaborn as sns

df_nlp2 = pd.concat([df_nlp1,df_target1['first_party_winner']],axis=1, join='inner') 




y = 1 

corpus = df_nlp2[df_nlp2["first_party_winner"]== y]["facts_clean"]

lst_tokens = nltk.tokenize.word_tokenize(corpus.str.cat(sep=" "))

fig, ax = plt.subplots(nrows=2, ncols=1)

fig.suptitle("Most frequent words", fontsize=15)

#figure(figsize=(30, 24))

## unigrams

dic_words_freq = nltk.FreqDist(lst_tokens)

dtf_uni = pd.DataFrame(dic_words_freq.most_common(), 

                       columns=["Word","Freq"])

dtf_uni.set_index("Word").iloc[:10,:].sort_values(by="Freq").plot(

                  kind="barh", title="Unigrams", ax=ax[0], 

                  legend=False).grid(axis='x')

ax[0].set(ylabel=None)

    

## bigrams

dic_words_freq = nltk.FreqDist(nltk.ngrams(lst_tokens, 2))

dtf_bi = pd.DataFrame(dic_words_freq.most_common(), 

                      columns=["Word","Freq"])

dtf_bi["Word"] = dtf_bi["Word"].apply(lambda x: " ".join(

                   string for string in x) )

dtf_bi.set_index("Word").iloc[:10,:].sort_values(by="Freq").plot(

                  kind="barh", title="Bigrams", ax=ax[1],

                  legend=False).grid(axis='x')

ax[1].set(ylabel=None)

plt.show()

 

import wordcloud

wc = wordcloud.WordCloud(background_color='black', max_words=100, 

                         max_font_size=35)

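## join all documents into one string; str(corpus) would only give the truncated Series repr

wc = wc.generate(corpus.str.cat(sep=" "))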

fig = plt.figure(num=1)

plt.axis('off')

plt.imshow(wc, cmap=None)

plt.show()

 

y = 1

corpus = df_nlp2[df_nlp2["first_party_winner"]==y]["facts_clean"]

## pre-process corpus

lst_corpus = []

for string in corpus:

    lst_words = string.split()

    lst_grams = [" ".join(lst_words[i:i + 2]) for i in range(0, 

                     len(lst_words), 2)]

    lst_corpus.append(lst_grams)

## map words to an id

id2word = gensim.corpora.Dictionary(lst_corpus)

## create dictionary word:freq

dic_corpus = [id2word.doc2bow(word) for word in lst_corpus]

## train LDA

lda_model = gensim.models.ldamodel.LdaModel(corpus=dic_corpus, id2word=id2word, num_topics=7, random_state=123, update_every=1, chunksize=100, passes=10, alpha='auto', per_word_topics=True)

   

## output
lst_dics = []

for i in range(0,3):

    lst_tuples = lda_model.get_topic_terms(i)

    for tupla in lst_tuples:

        lst_dics.append({"topic":i, "id":tupla[0], 

                         "word":id2word[tupla[0]], 

                         "weight":tupla[1]})

dtf_topics = pd.DataFrame(lst_dics, 

                         columns=['topic','id','word','weight'])




## plot

fig, ax = plt.subplots()

sns.barplot(y="word", x="weight", hue="topic", data=dtf_topics, dodge=False, ax=ax).set_title('Main Topics')

ax.set(ylabel="", xlabel="Word Importance")

plt.show() >
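
The frequency counts behind the unigram and bigram charts can also be sketched without NLTK, using collections.Counter. The helper top_ngrams and the three toy documents are hypothetical. Unlike the code above, which tokenizes the concatenated corpus, this version counts n-grams within each document, so no bigram spans a document boundary.

```python
from collections import Counter


def top_ngrams(texts, n=1, k=3):
    """Return the k most common n-grams across an iterable of cleaned strings."""
    counts = Counter()
    for text in texts:
        tokens = text.split()
        # Slide a window of length n over the tokens of each document
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)


docs = [
    "district court denied motion",
    "district court granted motion",
    "court appeal affirmed",
]
print(top_ngrams(docs, n=1))
print(top_ngrams(docs, n=2))
```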



We then import CountVectorizer to streamline the vectorization of our facts column. Then, we predict results using different machine learning models.

Before we make our predictions, however, we must perform train_test_split on our data so that the models are evaluated on data they have not seen. Skipping this step would give us very high accuracy on the given data while telling us nothing about how the model performs on unseen data.

 

< from sklearn.model_selection import train_test_split

from sklearn.pipeline import Pipeline

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()



xfeatures = df_nlp2['facts_clean']

ylabel = df_nlp2['first_party_winner']




X_train, X_test, y_train, y_test = train_test_split(xfeatures,ylabel, test_size=0.25) >


After performing a train test split we will fit our pipeline on three different models, namely RandomForest, KNeighbors and Logistic Regression. 

 

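< from sklearn.ensemble import RandomForestClassifier

from sklearn.linear_model import LogisticRegression

from sklearn.neighbors import KNeighborsClassifier

pipe = Pipeline(steps=[('cv',CountVectorizer()),('lr',LogisticRegression(solver='liblinear'))])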

pipe.fit(X_train,y_train)

pipe.score(X_test,y_test)



pipe1= Pipeline(steps=[('cv',CountVectorizer()),('rf',RandomForestClassifier())])

pipe1.fit(X_train,y_train)

pipe1.score(X_test,y_test)



pipe2= Pipeline(steps=[('cv',CountVectorizer()),('knn',KNeighborsClassifier(n_neighbors=3))])

pipe2.fit(X_train,y_train)

pipe2.score(X_test,y_test) >
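
Note that a pipeline's score method delegates to the final estimator, which for scikit-learn classifiers reports mean accuracy. To obtain an F1 score, predictions can be passed to f1_score explicitly. The sketch below does this on a made-up miniature corpus; the texts and labels are purely illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.pipeline import Pipeline

# Made-up miniature corpus; labels are illustrative only
train_texts = ["court affirmed ruling", "court reversed ruling",
               "appeal affirmed", "appeal reversed"]
train_labels = [1, 0, 1, 0]
test_texts = ["ruling affirmed", "ruling reversed"]
test_labels = [1, 0]

toy_pipe = Pipeline(steps=[("cv", CountVectorizer()),
                           ("lr", LogisticRegression(solver="liblinear"))])
toy_pipe.fit(train_texts, train_labels)

pred = toy_pipe.predict(test_texts)
print("accuracy:", accuracy_score(test_labels, pred))  # same value .score() reports
print("f1:", f1_score(test_labels, pred))
```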



Logistic Regression Fit Summary: This model reaches an accuracy of 54%, which is very weak and only slightly better than random guessing.

XGBoost Fit Summary: This model reaches an accuracy of 63%, which reinforces the assumption that logistic regression is not capable of capturing the trend in the data properly. This model is better, but still quite weak.

KNN Fit Summary: This model reaches an accuracy of 59%, which is again quite weak but still captures more of the trend than logistic regression.

Random Forest Fit Summary: This model reaches an accuracy of 64%. This is the best result among the chosen models.

After fitting all these algorithms, we find that Random Forest gives us the best accuracy.

 

Now that we have baseline accuracies, we will add back the one-hot encoded versions of the disposition and decision_type columns. This gives us a total of 20,375 columns (including the one-hot encoded columns and the CountVectorizer columns). While the computation time for the vectorized columns alone was not especially high, such a wide dataset is a natural candidate for dimensionality reduction. Therefore, we perform LDA (Latent Dirichlet Allocation) and reduce our columns down to 200.

 

< df_cat1 = pd.get_dummies(df_cat['decision_type'])

df_cat2 = pd.get_dummies(df_cat['disposition'])

df_cat3=pd.concat([df_cat2,df_cat1],axis=1,join='inner')

## df_nlp2 holds both 'facts_clean' and 'first_party_winner'

vectorize=CountVectorizer()

count_matrix = vectorize.fit_transform(df_nlp2['facts_clean'])

count_array = count_matrix.toarray()

## get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()

data_final = pd.DataFrame(data=count_array,columns = vectorize.get_feature_names_out())

data_final = pd.concat([data_final,df_cat3],axis=1,join='inner')

## split the combined numeric features; the target is kept out of the feature matrix

X_train, X_test, y_train, y_test = train_test_split(data_final, df_nlp2['first_party_winner'], test_size=0.25)

from sklearn.decomposition import LatentDirichletAllocation

lda = LatentDirichletAllocation(n_components=200, random_state=0)

lda_data = lda.fit_transform(X_train)

lda_data_train = pd.DataFrame(data=lda_data)

lda_data_test = pd.DataFrame(data=lda.transform(X_test)) >
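
What the LDA transform returns is easier to see at a small scale. The sketch below fits scikit-learn's LatentDirichletAllocation on a synthetic count matrix (random integers, not real text) with 4 components instead of 200. Each document becomes a row of topic proportions.

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Synthetic count matrix: 30 "documents" over a 50-word vocabulary
toy_counts = rng.integers(0, 5, size=(30, 50))

toy_lda = LatentDirichletAllocation(n_components=4, random_state=0)
doc_topics = toy_lda.fit_transform(toy_counts)

print(doc_topics.shape)             # (documents, topics)
print(doc_topics.sum(axis=1)[:3])   # each row is a topic distribution
```

The number of output columns equals n_components, which is how the wide count matrix is reduced to a fixed, much smaller width.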



Finally, we fit the above algorithms again to score our final models after one-hot encoded data has been included and LDA has been performed.

 

Logistic Regression Fit Summary: This model reaches an accuracy of 58%, which is very weak and only slightly better than random guessing.

XGBoost Fit Summary: This model reaches an accuracy of 65%, which reinforces the assumption that logistic regression is not capable of capturing the trend in the data properly. This model is better, but still quite weak.

KNN Fit Summary: This model reaches an accuracy of 63%, which is again quite weak but still captures more of the trend than logistic regression.

Random Forest Fit Summary: This model reaches an accuracy of 67%. This is the best result among the chosen models.

 

Yet again, Random Forest performed the best amongst all the algorithms with an accuracy of 67%.

But let’s see if we can increase this accuracy by tuning a few hyperparameters of the Random Forest. The problem is that finding the optimal accuracy would mean trying every possible combination of parameter values.

To avoid such heavy computation, we supply a limited range of values for each parameter and let GridSearchCV evaluate the combinations within that grid.
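
A grid search of this kind might be set up roughly as follows. This is a sketch on synthetic data from make_classification, and the grid values are hypothetical; they are not the ranges used to obtain the results reported in this article.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the reduced feature matrix
X_toy, y_toy = make_classification(n_samples=120, n_features=10, random_state=0)

# Hypothetical grid: every combination of these values is tried
param_grid = {
    "max_depth": [4, 8],
    "min_samples_leaf": [1, 2],
    "n_estimators": [20, 40],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    scoring="f1",  # matches the metric stated at the start of the article
    cv=3,          # 3-fold cross-validation for each combination
)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With two values for each of three parameters, the search fits 8 combinations per fold, which shows how quickly the cost grows as the grid widens.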

 

The best parameters found by GridSearchCV are:

 

max_depth= 8, max_features = 100, min_samples_leaf = 2, n_estimators = 200

 

< rand=RandomForestClassifier(max_depth= 8, max_features = 100, min_samples_leaf = 2, n_estimators = 200)

 

rand.fit(lda_data_train,y_train)

## score on the held-out test set rather than on the training data

rand.score(lda_data_test,y_test) >

 

We fit another Random Forest model with the best combination of parameters found.

The overall accuracy increased by 3%.

 

Thank you for taking the time to read through our process. We hope you could take something away to enhance your learning and help enable your own process.

 
