This article starts with the fundamentals of Natural Language Processing (NLP) and then demonstrates how to use Automated Machine Learning (AutoML) to build models that predict the sentiment of text data. Other applications of NLP include translation, speech recognition, and chatbots. You may think this article is generic, because there are already many NLP and sentiment analysis tutorials on the internet. This one tries to show something different: it demonstrates AutoKeras, an AutoML package, to generate Deep Learning models that predict sentiment rating and emotion from text. Before that, let’s briefly cover basic NLP, because it underpins text sentiment prediction.
This article will cover the following topics:
Regular expression
Word tokenization
Named Entity Recognition
Stemming and lemmatization
Word cloud
Bag-of-words (BoW)
Term Frequency — Inverse Document Frequency (TF-IDF)
Sentiment analysis
Text Regression (Automated Machine Learning and Deep Learning)
Text Classification (Automated Deep Learning)
NLP aims to make sense of text data. Examples of text data commonly analyzed in Data Science are product reviews, social media posts, documents, etc. Unlike numerical data, text data cannot be summarized with descriptive statistics. If we have a list of 1000 product prices, we can understand the overall price data by examining the average, median, standard deviation, boxplot, and other techniques. We do not have to read all the numbers to understand them.
Now, if we have thousands of text reviews of products from an e-commerce store, how do we know what the reviews are saying without reading them all? With NLP, those text reviews can be translated into a satisfaction rating, an emotion, etc. This is called sentiment analysis. Machine Learning models are created to predict the sentiment of the text.
Before getting into sentiment analysis, this article starts from the very basics of NLP, from regular expressions and word tokenization up to Bag-of-Words, and shows how they contribute to sentiment analysis. Here is the Python notebook supporting this article.
Regular Expression
A Regular Expression (RegEx) is a pattern used to match, search, find, or split one or more sentences or words. The following code is an example of using 6 RegEx patterns on the same sentence. The patterns \w+, \d+, \s, [a-z]+, [A-Z]+, and (\w+|\d+) look for a word, digits, a whitespace character, lowercase letters, uppercase letters, and a word or digits, respectively. re.match only looks for the pattern at the very beginning of the text and returns the matching text if it is found there. In this case, the text begins with the word “The”.
Notice that the second, third, and fourth calls return “None”, because the text does not start with a digit, a space, or a lowercase letter.
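The re.match behaviour described above can be sketched as follows. The sentence assigned to `text` is an assumption (it is the first sentence of the text used later in this article); the original notebook may use a different one.

```python
import re

# Assumed sample sentence; the article's own `text` variable may differ.
text = 'The monkeys are eating 7 bananas on the tree!'

patterns = {
    'word': r'\w+',
    'digit': r'\d+',
    'space': r'\s',
    'lowercase': r'[a-z]+',
    'uppercase': r'[A-Z]+',
    'word or digit': r'(\w+|\d+)',
}

# re.match only succeeds if the pattern fits at the very start of the text.
match_results = {}
for name, pattern in patterns.items():
    m = re.match(pattern, text)
    match_results[name] = m.group() if m else None
    print(name, '->', match_results[name])
```

With this sentence, only the word, uppercase, and word-or-digit patterns find something at the start of the text.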
re.search returns the first part of the text that fits the pattern. Unlike re.match, which only checks the beginning of the text, re.search can find a match anywhere in the text.
print(re.search(r'\w+', text)) # word
print(re.search(r'\d+', text)) # digit
print(re.search(r'\s', text)) # space
print(re.search(r'[a-z]+', text)) # run of lowercase letters
print(re.search(r'[a-z]', text)) # single lowercase letter
print(re.search(r'[A-Z]+', text)) # run of uppercase letters
print(re.search(r'(\w+|\d+)', text)) # word or digit
re.split searches for the RegEx pattern in the whole text and splits the text into a list of strings at every match of the pattern.
print(re.split(r'\w+', text)) # word
print(re.split(r'\d+', text)) # digit
print(re.split(r'\s', text)) # space
print(re.split(r'[a-z]+', text)) # run of lowercase letters
print(re.split(r'[A-Z]+', text)) # run of uppercase letters
print(re.split(r'(\w+|\d+)', text)) # word or digit
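To make the difference between searching and splitting concrete, here is a small self-contained sketch. The sample sentence is again an assumption, not necessarily the `text` of the original notebook.

```python
import re

text = 'The monkeys are eating 7 bananas on the tree!'  # assumed sample sentence

# re.search returns the first match found anywhere in the text.
first_digit = re.search(r'\d+', text).group()
first_lower = re.search(r'[a-z]+', text).group()

# re.split cuts the text at every match of the pattern.
split_on_space = re.split(r'\s', text)
split_on_digit = re.split(r'\d+', text)

print(first_digit)
print(first_lower)
print(split_on_space)
print(split_on_digit)
```

Note that the first lowercase run is “he”, the tail of “The”, because the search starts at the first character that fits the pattern.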
There are many more RegEx patterns to explore, but this article now moves on to word tokenization. Word tokenization splits sentences into words. The code below shows how it is done. Can you tell which RegEx code above can do the same thing?
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
word_tokenize(text)
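One possible answer to the question above, as a rough sketch: a single RegEx can approximate word tokenization for a simple sentence. The sample sentence is an assumption, and this pattern is not what nltk uses internally; word_tokenize treats contractions, abbreviations, and quotes with more care.

```python
import re

text = 'The monkeys are eating 7 bananas on the tree!'  # assumed sample sentence

# \w+ grabs runs of letters or digits; [^\w\s] grabs each punctuation mark on its own.
tokens = re.findall(r'\w+|[^\w\s]', text)
print(tokens)
```

Splitting on whitespace with re.split(r'\s', text) comes close as well, but it leaves the exclamation mark attached to “tree”.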
Besides word tokenization, we can also perform sentence tokenization.
text2 = 'The monkeys are eating 7 bananas on the tree! The tree will only have 5 bananas left later. One monkey is jumping to another tree.'
sent_tokenize(text2)
Output:
['The monkeys are eating 7 bananas on the tree!',
'The tree will only have 5 bananas left later.',
'One monkey is jumping to another tree.']
Named Entity Recognition (NER)
After word tokenization, we can label each token. Strictly speaking, NER identifies named entities such as people, organizations, and places, while the example below uses pos_tag from nltk, which tags each word with its part of speech (PoS).
Here, DT = determiner, NNS = plural noun, VBP = non-3rd person singular present verb, VBG = gerund/present participle verb, CD = cardinal number, IN = preposition/subordinating conjunction, NN = singular noun, MD = modal verb, RB = adverb, VB = base form verb, and so on.
Another way to do this is with the spacy package. Observe in the following example how spacy reports, for each word, its lemma, PoS, tag, dep, shape, whether it is alphabetic, and whether it is a stop word. We will discuss lemmatization later. PoS and tag describe the part of speech, such as determiner, noun, auxiliary verb, number, etc. “dep” shows the syntactic dependency of the word. “shape” shows the letters of the word as X for uppercase and x for lowercase. “is_alphabet” and “is_stop_words” indicate whether the word consists of alphabetic characters and whether it is a stop word, respectively.
text lemmatized PoS tag dep shape is_alphabet is_stop_words
The the DET DT det Xxx True True
monkeys monkey NOUN NNS nsubj xxxx True False
are be AUX VBP aux xxx True True
eating eat VERB VBG ROOT xxxx True False
7 7 NUM CD nummod d False False
bananas banana NOUN NNS dobj xxxx True False
on on ADP IN prep xx True True
the the DET DT det xxx True True
tree tree NOUN NN pobj xxxx True False
! ! PUNCT . punct ! False False
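The “shape” column in the table above can be imitated in a few lines of plain Python. This is a sketch of the idea only, not spaCy's own code: it assumes uppercase letters map to X, lowercase letters to x, digits to d, and that runs of the same shape character are cut off after 4, which is consistent with “monkeys” appearing as “xxxx”.

```python
def word_shape(token: str) -> str:
    """Rough imitation of a spaCy-style shape string (not spaCy's own code)."""
    shape = []
    run_char, run_len = '', 0
    for ch in token:
        if ch.isalpha():
            mapped = 'X' if ch.isupper() else 'x'
        elif ch.isdigit():
            mapped = 'd'
        else:
            mapped = ch  # punctuation is kept as it is
        if mapped == run_char:
            run_len += 1
        else:
            run_char, run_len = mapped, 1
        # Keep at most 4 repeats of the same shape character
        if run_len <= 4:
            shape.append(mapped)
    return ''.join(shape)

for token in ['The', 'monkeys', '7', 'bananas', '!']:
    print(token, word_shape(token))
```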
Stemming and Lemmatization
Stemming and lemmatization both reduce a word to a simpler root form, and they are similar to each other. To understand the difference, observe the following code. Here, we apply stemming and lemmatization to the word “studies”, and they return different outputs. Stemming returns “studi” as the root form of “studies”. Lemmatization returns “study”. The root form returned by lemmatization is a real word with a meaning. The root form returned by stemming sometimes is not: “studi” has no meaning. Stemming only strips the suffix by rule, so it cannot turn the letter “i” in “studies” back into “y”.
# Stemming
from nltk.stem import PorterStemmer
print(PorterStemmer().stem('studies'))
# Lemmatization
from nltk.stem import WordNetLemmatizer
print(WordNetLemmatizer().lemmatize('studies'))
Output:
studi
study
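Why stemming produces “studi” can be seen from the first rule group of the original Porter algorithm (step 1a), which only rewrites plural suffixes. The sketch below implements that single step; the full stemmer in nltk has several more steps and some extensions, so this is an illustration rather than a replacement.

```python
def porter_step_1a(word: str) -> str:
    """Apply only step 1a of the Porter stemmer (plural suffixes)."""
    if word.endswith('sses'):
        return word[:-2]      # caresses -> caress
    if word.endswith('ies'):
        return word[:-2]      # studies -> studi
    if word.endswith('ss'):
        return word           # caress -> caress
    if word.endswith('s'):
        return word[:-1]      # cats -> cat
    return word

for w in ['studies', 'caresses', 'cats']:
    print(w, '->', porter_step_1a(w))
```

The rule replaces “ies” with “i” without consulting a dictionary, which is why the result is not a real word. A lemmatizer looks the word up in a vocabulary (WordNet in the case of nltk) instead.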
Word Cloud
For this exercise, we are going to use women’s clothing reviews from an e-commerce dataset. The dataset provides the text reviews and a rating score from 1 to 5. We now want to understand what the 23,486 reviews are saying. If the reviews were numbers, we could use descriptive statistics to see the data distribution. But the reviews are in text form. How can we quickly get a summary of the text of 23,486 reviews? Reading them one by one is not an efficient solution.
A simple way is to plot a word cloud. A word cloud displays the words most commonly found in the whole dataset. A larger font size means the word is found more frequently.
From the word cloud, we can notice that the reviews are talking about dress, love, size, top, wear, and so on as they are the most commonly found words. To display the exact frequency number of each word, we can use Counter(). Let’s demonstrate it using the variable “text2”.
from collections import Counter
Counter(word_tokenize(text2))
The following code does the same thing, but it calls only the top 3 most common words.
Counter(word_tokenize(text2)).most_common(3)
Output:
[('tree', 3), ('The', 2), ('bananas', 2)]
Bag-of-Words (BoW)
Bag-of-Words does a similar thing. It returns a table whose features are the words in the reviews. Each row contains the word frequencies of one review. The following code applies BoW to the women’s clothing review dataset. It creates a data frame with tokenized words as the features.
In the CountVectorizer, I set max_features to 100 to limit the number of features. “max_df” and “min_df” determine the maximum and minimum share of documents in which a tokenized word may appear. The selected tokenized words must appear in at least 10% and at most 95% of the documents. This deselects words that appear too rarely or too frequently. “ngram_range” of (1,2) is set to tokenize single words and 2 consecutive words (2-word sequences or bi-grams). This is important to detect two-word sequences, like “black mamba”, “land cover”, “not happy”, “not good”, etc. Notice that a lemmatizer and a regular expression are also used.
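import pandas as pd
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer

# Keep only the rows that have a review text
data = dataset.loc[dataset['Review Text'].notnull(), :]

# Tokenizer that lemmatizes every token of a document
class lemmatizer(object):
    def __init__(self):
        self.wnl = WordNetLemmatizer()
    def __call__(self, df):
        return [self.wnl.lemmatize(word) for word in word_tokenize(df)]

# Apply uni- and bigram vectorizer
# (token_pattern is expected to be ignored when a custom tokenizer is passed)
vectorizer = CountVectorizer(max_features=100, max_df=0.95, min_df=0.1, ngram_range=(1,2),
                             tokenizer=lemmatizer(), lowercase=True, stop_words='english',
                             token_pattern=r'\w+')
vectorizer.fit(data['Review Text'])
count_vector = vectorizer.transform(data['Review Text'])

# Transform into data frame
bow = count_vector.toarray()
# get_feature_names_out() is for scikit-learn >= 1.0; older versions use get_feature_names()
bow = pd.DataFrame(bow, columns=vectorizer.get_feature_names_out())
bow.head()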
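|   | ! | beautiful | … | ordered | perfect | really | run | size | small | soft | wa | wear | work |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | … | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 0 | … | 1 | 0 | 1 | 0 | 1 | 3 | 0 | 3 | 0 | 1 |
| 3 | 2 | 0 | … | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| 4 | 3 | 0 | … | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| … | … | … | … | … | … | … | … | … | … | … | … | … | … |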
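What “ngram_range”, “min_df”, and “max_df” do can be sketched in plain Python on a tiny corpus. The three documents and the thresholds below are hypothetical and chosen only for illustration; CountVectorizer additionally handles lowercasing, stop words, and the feature limit.

```python
from collections import Counter

def ngrams(tokens, n_min=1, n_max=2):
    """Return all n-grams (joined by a space) for n_min <= n <= n_max."""
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(' '.join(tokens[i:i + n]))
    return grams

# Hypothetical mini-corpus, already lowercased and tokenized
docs = [
    ['not', 'happy', 'with', 'size'],
    ['happy', 'with', 'dress'],
    ['dress', 'size', 'not', 'good'],
]

counts = [Counter(ngrams(doc)) for doc in docs]

# Document frequency: in how many documents does each n-gram occur?
doc_freq = Counter(gram for c in counts for gram in c)

# Keep n-grams whose share of documents lies within [min_df, max_df]
min_df, max_df = 0.5, 0.95
vocab = sorted(g for g, df in doc_freq.items()
               if min_df <= df / len(docs) <= max_df)

# One row per document, one column per kept n-gram
bow_rows = [[c[g] for g in vocab] for c in counts]
print(vocab)
print(bow_rows)
```

In this toy corpus, the bigram “happy with” survives the filter because it occurs in two of the three documents, while “not happy” occurs in only one and is dropped.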
Term Frequency — Inverse Document Frequency (TF-IDF)
Similar to BoW, TF-IDF also creates a data frame whose features are tokenized words. However, it scales up rare terms and scales down frequent terms. This is useful because, for example, the word “the” may appear many times but carries no sentiment. The TF-IDF values are generated in 3 steps. Step 1, calculate TF = count of the term in the text / number of words in the text. For example, let’s apply it to these 3 sentences:
1. The monkeys are small. The ducks are also small.
2. The comedians are hungry. The comedians then go to eat.
3. The comedians have small monkeys.
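| Term | TF 1 | TF 2 | TF 3 |
|---|---|---|---|
| the | 2/9 | 2/10 | 1/5 |
| monkeys | 1/9 | 0/10 | 1/5 |
| are | 2/9 | 1/10 | 0/5 |
| small | 2/9 | 0/10 | 1/5 |
| ducks | 1/9 | 0/10 | 0/5 |
| comedians | 0/9 | 2/10 | 1/5 |
| hungry | 0/9 | 1/10 | 0/5 |
| … | … | … | … |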
The first text contains 9 words. It has 2 words of “the”. So, the word “the” in TF 1 is 2/9.
Step 2, calculate IDF = log(number of documents / number of documents with the term). The values in this example use the base-10 logarithm.
| Term | IDF |
|---|---|
| the | log(3/3) |
| monkeys | log(3/2) |
| are | log(3/2) |
| small | log(3/2) |
| ducks | log(3/1) |
| comedians | log(3/2) |
| hungry | log(3/1) |
| … | … |
The word “monkeys” appears in 2 of the 3 documents. So, its IDF is log(3/2).
Step 3, calculate the TF-IDF = TF * IDF
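| Term | TF 1 | TF 2 | TF 3 | IDF | TF-IDF 1 | TF-IDF 2 | TF-IDF 3 |
|---|---|---|---|---|---|---|---|
| the | 2/9 | 2/10 | 1/5 | log(3/3) | 0.000 | 0.000 | 0.000 |
| monkeys | 1/9 | 0/10 | 1/5 | log(3/2) | 0.020 | 0.000 | 0.035 |
| are | 2/9 | 1/10 | 0/5 | log(3/2) | 0.039 | 0.018 | 0.000 |
| small | 2/9 | 0/10 | 1/5 | log(3/2) | 0.039 | 0.000 | 0.035 |
| ducks | 1/9 | 0/10 | 0/5 | log(3/1) | 0.053 | 0.000 | 0.000 |
| comedians | 0/9 | 2/10 | 1/5 | log(3/2) | 0.000 | 0.035 | 0.035 |
| hungry | 0/9 | 1/10 | 0/5 | log(3/1) | 0.000 | 0.048 | 0.000 |
| … | … | … | … | … | … | … | … |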
Here is the output table.
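| Text | the | monkeys | are | small | ducks | comedians | hungry | … |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.000 | 0.020 | 0.039 | 0.039 | 0.053 | 0.000 | 0.000 | … |
| 2 | 0.000 | 0.000 | 0.018 | 0.000 | 0.000 | 0.035 | 0.048 | … |
| 3 | 0.000 | 0.035 | 0.000 | 0.035 | 0.000 | 0.035 | 0.000 | … |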
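The three steps can be reproduced with a few lines of plain Python. This sketch follows the formulas of this article (raw term frequency and a base-10 logarithm); it is not how scikit-learn's TfidfVectorizer computes its values, which uses a smoothed natural-log IDF and normalizes each row, so the numbers from that class will differ.

```python
import math
import re

texts = [
    'The monkeys are small. The ducks are also small.',
    'The comedians are hungry. The comedians then go to eat.',
    'The comedians have small monkeys.',
]

# Lowercase and keep only the words
docs = [re.findall(r'\w+', t.lower()) for t in texts]
vocab = sorted({w for d in docs for w in d})

def tf(term, doc):
    """Step 1: count of the term divided by the number of words in the text."""
    return doc.count(term) / len(doc)

def idf(term):
    """Step 2: log10 of (number of documents / documents containing the term)."""
    n_with_term = sum(term in d for d in docs)
    return math.log10(len(docs) / n_with_term)

# Step 3: TF * IDF for every term of every text
tfidf = [{term: tf(term, d) * idf(term) for term in vocab} for d in docs]

for i, row in enumerate(tfidf, start=1):
    print(i, {t: round(row[t], 3) for t in ['the', 'monkeys', 'ducks', 'hungry']})
```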
Now, let’s compare it with the BoW data frame below. In the BoW data frame, the word “the” has the values 2, 2, and 1 for texts 1, 2, and 3, respectively. In the TF-IDF data frame, the same word has a value of 0 for all 3 texts, because “the” appears in all 3 texts and its IDF is therefore zero (log(3/3)). The words “monkeys” and “ducks” each appear once in the first text, but “ducks” has a higher TF-IDF value (0.053) than “monkeys” (0.020). The word “ducks” appears in fewer documents than “monkeys” does, so TF-IDF gives more weight to “ducks”.
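| Text | the | monkeys | are | small | ducks | comedians | hungry | … |
|---|---|---|---|---|---|---|---|---|
| 1 | 2 | 1 | 2 | 2 | 1 | 0 | 0 | … |
| 2 | 2 | 0 | 1 | 0 | 0 | 2 | 1 | … |
| 3 | 1 | 1 | 0 | 1 | 0 | 1 | 0 | … |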
Here is the code to apply TF-IDF to the women’s clothing dataset.
What is the use of text converted into a BoW or TF-IDF data frame? It is essential if we want to apply Machine Learning to text data. Machine Learning algorithms do not understand raw text, so the text must be converted into a numeric data frame. One of the common uses of Machine Learning on text is sentiment analysis. Sentiment analysis predicts the sentiment of the review text. Instead of us reading the reviews one by one, sentiment analysis converts the text into a measure of how satisfied the reviews sound.
Sentiment Analysis
Sentiment analysis can be run by using TextBlob or by training a Machine Learning model. TextBlob does not require training. It reports the polarity and subjectivity of the reviews. The polarity ranges from -1 (negative sentiment) to 1 (positive sentiment). Here is the code to apply sentiment analysis to text2 = ‘The monkeys are eating 7 bananas on the tree! The tree will only have 5 bananas left later. One monkey is jumping to another tree.’
from textblob import TextBlob
TextBlob(text2).sentiment
Output:
Sentiment(polarity=-0.0125, subjectivity=0.25)
Now, let’s see how it works on the women’s clothing dataset.
The table above displays the polarity of the first 5 rows of review text. Examine how the words in “Review Text” produce the “polarity”. In the same dataset, each reviewer also gives a satisfaction rating in the feature “Rating”. The “Rating” ranges from 1 to 5. The figure below visualizes the polarity distribution of each rating class. Notice that a higher rating tends to have a more positive polarity.
Fig. 2 Polarity distribution of each rating class (source: image by author)
Text Regression (Automated Machine Learning and Deep Learning)
The sentiment analysis provided by TextBlob is a pre-trained model. Users can apply it directly to text. But what if we want to train our own model to analyze text? What if we want to build a model that predicts a sentiment rating from 1 to 5, instead of a positivity and negativity score (as TextBlob does)?
Using TextBlob has the advantage that users do not have to train a model on a large amount of data. On the other hand, creating a new Machine Learning model for text analysis requires extra time and resources. But the new model will be much more tailored to a specific topic, based on its training dataset. For example, training on the women’s clothing review dataset results in an NLP model that is more specific in predicting the sentiment of such reviews.
In Machine Learning for text analysis or NLP, there are text regression and text classification. Text regression analyzes text with a continuous or ordinal output. Predicting the “rating” of the women’s clothing reviews dataset calls for text regression because the output ranges from 1 to 5. The Text Regression notebook is available here.
Just like Machine Learning for structured data, Machine Learning for NLP also requires data frames containing engineered features and a label. The engineered features for the predictors can be generated using BoW or TF-IDF. The 2 data frames have been made before. According to how they are generated, let’s call them the BoW data frame and the TF-IDF data frame.
To train a Machine Learning model for text analysis, the technique is the same as for structured data. We can use linear regression, decision tree, gradient boosting tree, etc. The code below uses AutoSklearn to train an NLP Machine Learning model with the BoW data frame. AutoSklearn is an automated Machine Learning package that performs feature engineering, model selection, and hyperparameter tuning automatically. Users can skip those processes and get a model within a specified time allocation. The AutoSklearn below is set to find an optimal model in 3 minutes. To find out more about Automated Machine Learning, visit my previous article.
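from autosklearn.regression import AutoSklearnRegressor
# Create the model (note: the variable name `sklearn` shadows the scikit-learn package name)
sklearn = AutoSklearnRegressor(time_left_for_this_task=3*60, per_run_time_limit=60, n_jobs=-1)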
# Fit the training data
sklearn.fit(X_train, y_train)
The output of the text regression is a continuous number. The objective is to predict the review rating as an integer from 1 to 5. So, the predicted values must be rounded to the closest integer between 1 and 5.
# Sprint Statistics
print(sklearn.sprint_statistics())
# Predict the test data
pred_sklearn = sklearn.predict(X_test)
pred_sklearn2 = [round(i) for i in pred_sklearn]
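The rounding step and the RMSE used for validation can be sketched without any library. The predictions and true ratings below are hypothetical, and the helper names `to_rating` and `rmse` are introduced here only for illustration. The sketch clips before rounding so that a prediction such as 5.7 cannot leave the 1 to 5 range.

```python
import math

def to_rating(pred, low=1, high=5):
    """Clip a continuous prediction to [low, high], then round half up to an integer."""
    clipped = min(max(pred, low), high)
    return int(math.floor(clipped + 0.5))

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

# Hypothetical predictions and true ratings, for illustration only
raw_preds = [5.7, 0.2, 3.4, 4.5, 2.5]
y_true = [5, 1, 4, 5, 2]

ratings = [to_rating(p) for p in raw_preds]
print(ratings)
print(rmse(y_true, ratings))
```

A design note: Python's built-in round() rounds halves to the nearest even integer (round(2.5) gives 2), so the sketch rounds half up explicitly. Either convention works as long as it is applied consistently.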
After creating the model, the next step is to validate it with the unseen dataset. The RMSE of the predictions on the unseen dataset is 0.9886. However, if we compare the predictions with the true values using a confusion matrix, we find that the model does not predict ratings 1, 2, and 3 well. Note that this does not necessarily mean that AutoSklearn is not good enough. AutoSklearn was given only 3 minutes to create models automatically. The resulting RMSE may improve if it is given more time.
auto-sklearn results:
Dataset name: 2fbec688-f37c-11eb-8112-0242ac130202
Metric: r2
Best validation score: 0.287286
Number of target algorithm runs: 27
Number of successful target algorithm runs: 9
Number of crashed target algorithm runs: 0
Number of target algorithms that exceeded the time limit: 9
Number of target algorithms that exceeded the memory limit: 9
RMSE: 0.9885634389230906
Let’s repeat the processes above, but this time using the TF-IDF data frame as the input. AutoSklearn is applied again, and the RMSE on the unseen dataset is 0.9770. The result is similar to the previous one: the model is again poor at predicting ratings 1, 2, and 3.
# Create the model
sklearn_idf = AutoSklearnRegressor(time_left_for_this_task=3*60, per_run_time_limit=60, n_jobs=-1)
# Fit the training data
sklearn_idf.fit(X_train_idf, y_train)
# Sprint Statistics
print(sklearn_idf.sprint_statistics())
# Predict the test data
pred_sklearn_idf = sklearn_idf.predict(X_test_idf)
pred_sklearn_idf2 = [round(i) for i in pred_sklearn_idf]
# Compute the RMSE
rmse_sklearn_idf = mean_squared_error(y_test, pred_sklearn_idf2)**0.5
print('RMSE: ' + str(rmse_sklearn_idf))
Output:
auto-sklearn results:
Dataset name: a573493e-f37f-11eb-8112-0242ac130202
Metric: r2
Best validation score: 0.285567
Number of target algorithm runs: 26
Number of successful target algorithm runs: 8
Number of crashed target algorithm runs: 0
Number of target algorithms that exceeded the time limit: 8
Number of target algorithms that exceeded the memory limit: 10
RMSE: 0.9769930120276676
# Prediction results
print('Confusion Matrix')
pred_sklearn_idf3 = [i if i <= 5 else 5 for i in pred_sklearn_idf2]
print(pd.DataFrame(confusion_matrix(y_test, pred_sklearn_idf3), index=[1,2,3,4,5], columns=[1,2,3,4,5]))
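How a confusion matrix is built can be sketched in plain Python. The ratings below are hypothetical and only illustrate the layout: rows are the true ratings and columns are the predicted ratings, which is also the layout of the data frame printed above.

```python
def confusion(y_true, y_pred, labels):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix

# Hypothetical ratings, for illustration only
y_true = [1, 2, 3, 4, 5, 5, 4, 3]
y_pred = [3, 3, 4, 4, 5, 5, 5, 4]

cm = confusion(y_true, y_pred, labels=[1, 2, 3, 4, 5])
for label, row in zip([1, 2, 3, 4, 5], cm):
    print(label, row)
```

In this made-up example, the rows for ratings 1, 2, and 3 have a zero on the diagonal, which is the kind of pattern to look for when a model cannot predict the low ratings.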
To see which Machine Learning algorithms are created from the AutoSklearn, run the below code.
# Show the models
print(sklearn_idf.show_models())
Now, let’s use another AutoML package, AutoKeras. AutoKeras automatically creates Deep Learning models. It covers not only hyperparameter tuning but also the architecture of the Deep Learning layers. After installing and importing the AutoKeras package, we can start the data preparation. The data are prepared as arrays, separated into training and test datasets. Notice that the feature contains only one column, the “Review Text”. AutoKeras does not require users to apply BoW or TF-IDF first. The label is still the same, the “Rating”.
!pip install autokeras
import autokeras as ak
# Preparing the data for autokeras
X_train_ak = np.array(data.loc[X_train.index, 'Review Text'])
y_train_ak = np.array(data.loc[X_train.index, 'Rating'])
X_test_ak = np.array(data.loc[X_test.index, 'Review Text'])
y_test_ak = np.array(data.loc[X_test.index, 'Rating'])
Then, we create a TextRegressor with a maximum of 3 trials, so AutoKeras will try at most 3 candidate models. The training data is split so that 20% of it is used as a validation dataset. Then, we fit the data with 30 epochs for this demonstration.
# Create the model
keras = ak.TextRegressor(overwrite=True, max_trials=3)
# Fit the training dataset
keras.fit(X_train_ak, y_train_ak, epochs=30, validation_split=0.2)
When the AutoKeras is running, it will show the following output cell. But, it will disappear when the process is done.
After the model creation is done, we can export and summarize it. Observe the layers automatically generated by the AutoKeras. The input layer is followed by expand_last_dim, text_vectorization, embedding, dropout, conv1d, max_pooling, flatten, and other layers.
# Show the built models
keras_export = keras.export_model()
keras_export.summary()
Now, let’s look at the RMSE of the AutoKeras model and compare it. The RMSE is 0.8389. According to the confusion matrix, AutoKeras predicts all of the rating levels better. It still cannot predict rating 1 well enough, though.
from itertools import chain
# Predict the test data
pred_keras = keras.predict(X_test_ak)
pred_keras = list(chain(*pred_keras)) # flatten the column of predictions into a flat list
pred_keras2 = [i if i <= 5 else 5 for i in pred_keras]
pred_keras2 = [i if i >= 1 else 1 for i in pred_keras2]
pred_keras2 = [round(i) for i in pred_keras2]
As mentioned before, text classification is another type of text analysis. Emotion detection, movie genre classification, and book type classification are examples of text classification. Unlike text regression, text classification does not predict a continuous label, but a discrete one. For this exercise, we are going to use the emotion label dataset. The dataset consists of only two columns: text and label. Our task is to create a prediction model that perceives the emotion of the text. The emotions are classified into 4 classes: anger, fear, joy, and sadness.
For text classification, let’s build two Deep Learning models. The first model is a Long Short-Term Memory (LSTM) model. LSTM is a Deep Learning model in the Recurrent Neural Network (RNN) family. It is a more advanced model than the usual multilayer perceptron. This article will not discuss RNN or LSTM further, but will only apply it to text classification. The LSTM for text classifier notebook is available here.
For data preparation, we apply one-hot-encoding to the data label. It creates 4 columns from 1 label column.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.metrics import make_scorer, accuracy_score, classification_report, confusion_matrix
# Load dataset
train = pd.read_csv('/content/emotion-labels-train.csv')
test = pd.read_csv('/content/emotion-labels-test.csv')
# Combine training and test datasets
train = pd.concat([train, test], axis=0)
train.head()
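The one-hot encoding of the label mentioned above can be sketched as follows. The sample labels are hypothetical; in practice a helper such as pandas.get_dummies would typically be used on the label column.

```python
labels = ['joy', 'anger', 'fear', 'sadness', 'joy']  # hypothetical sample of the label column

classes = sorted(set(labels))  # one column per class, in alphabetical order

def one_hot(label):
    """Return a list with a 1 at the position of the label's class and 0 elsewhere."""
    return [1 if label == c else 0 for c in classes]

encoded = [one_hot(label) for label in labels]
for label, row in zip(labels, encoded):
    print(label, row)
```

Each row has exactly one 1, so the 4 columns match the 4 output neurons of a softmax layer.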
After splitting the data into training and validation datasets, a tokenizer is applied to tokenize the text. In this demonstration, the tokenizer keeps the 5000 most common words. Then, “texts_to_sequences” transforms the texts in the dataset into sequences of integers.
# Apply tokenizer and text to sequence
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
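tokenizer = Tokenizer(num_words=5000, oov_token='x') # words outside the vocabulary are replaced by 'x'
tokenizer.fit_on_texts(X_train)
tokenizer.fit_on_texts(X_val) # note: this also adds the validation words to the vocabulary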
seq_train = tokenizer.texts_to_sequences(X_train)
seq_val = tokenizer.texts_to_sequences(X_val)
pad_train = pad_sequences(seq_train)
pad_val = pad_sequences(seq_val)
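The idea behind the tokenizer and “texts_to_sequences” can be sketched in plain Python. This is an approximation written for illustration, with hypothetical texts and helper names; it assumes that index 0 is reserved for padding, index 1 for the out-of-vocabulary token, and that more frequent words get smaller indices. The Keras Tokenizer also strips punctuation, which is left out here.

```python
from collections import Counter

def fit_word_index(texts, oov_token='x'):
    """Rank words by frequency; index 0 is left for padding, index 1 for unknown words."""
    counts = Counter(w for t in texts for w in t.lower().split())
    index = {oov_token: 1}
    for rank, (word, _) in enumerate(counts.most_common(), start=2):
        index[word] = rank
    return index

def texts_to_sequences(texts, index, num_words, oov_token='x'):
    """Replace each word by its index; rare or unseen words become the OOV index."""
    oov = index[oov_token]
    seqs = []
    for t in texts:
        seq = []
        for w in t.lower().split():
            i = index.get(w, oov)
            seq.append(i if i < num_words else oov)
        seqs.append(seq)
    return seqs

# Hypothetical texts, for illustration only
train_texts = ['so happy today', 'so sad', 'so so angry']
word_index = fit_word_index(train_texts)
sequences = texts_to_sequences(['so happy', 'so furious'], word_index, num_words=4)
print(word_index)
print(sequences)
```

The unseen word “furious” is mapped to the OOV index 1, which is what the oov_token argument is for.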
As for the “pad_sequences”, let’s just look at this example.
[
[17, 154, 3],
[54, 981, 56, 4],
[20, 8]
]
is turned by pad_sequences (shown here with padding='post'; the default, padding='pre', puts the zeros in front) into
[
[17 154 3 0],
[54 981 56 4],
[20 8 0 0]
]
Observe that zeros are filled in so that all three lists have the same length. The missing commas are only how a NumPy array is printed.
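The padding itself is simple enough to sketch in plain Python. The helper `pad` below is hypothetical and only mimics the two padding modes on the example above; pad_sequences additionally supports a maximum length and truncation.

```python
def pad(sequences, padding='pre', value=0):
    """Pad every sequence with `value` up to the length of the longest one."""
    max_len = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        fill = [value] * (max_len - len(s))
        padded.append(fill + list(s) if padding == 'pre' else list(s) + fill)
    return padded

seqs = [[17, 154, 3], [54, 981, 56, 4], [20, 8]]
post_padded = pad(seqs, padding='post')
pre_padded = pad(seqs, padding='pre')
print(post_padded)
print(pre_padded)
```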
Next, the code below creates the LSTM layers using keras.Sequential. The embedding layer accepts an input dimension of 5000, which corresponds to the 5000 most common words kept by the tokenizer. After several dense layers with ‘relu’ activation and some dropout layers, the last layer has 4 neurons with ‘softmax’ activation. This classifies a text into one of the 4 classes. The model is compiled with the ‘categorical_crossentropy’ loss function, the ‘adam’ optimizer, and ‘accuracy’ as the scoring metric.
import tensorflow as tf
from keras.callbacks import EarlyStopping
# Create the model
lstm = tf.keras.Sequential([
    tf.keras.layers.Embedding(input_dim=5000, output_dim=16),
    tf.keras.layers.SpatialDropout1D(0.3),
    tf.keras.layers.LSTM(128),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(4, activation='softmax')
])
lstm.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
The LSTM model will train for up to 100 epochs. But, before that, let’s create an early stopping callback. Early stopping avoids spending too much time on training once the expected target has been achieved or when training no longer improves. In this exercise, the early stopping callback stops the training before the 100th epoch if a certain target is met. In this case, I set the target to a validation accuracy above 85.5%. When this happens, the LSTM model stops training even though it has not reached 100 epochs, and prints a message saying that the accuracy has exceeded 85.5%. While fitting the training data, the history is saved in the variable ‘history’.
class earlystop(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # Stop training once the validation accuracy exceeds 85.5%
        val_acc = (logs or {}).get('val_accuracy')
        if val_acc is not None and val_acc > 0.855:
            print("Accuracy has reached > 85.5%!")
            self.model.stop_training = True

es = earlystop()
history = lstm.fit(pad_train, y_train, epochs=100, callbacks=[es],
                   validation_data=(pad_val, y_val), verbose=2, batch_size=100)
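The stopping rule itself can be sketched without TensorFlow. The per-epoch validation accuracies below are hypothetical, and the function name `train_with_threshold` is introduced only for this illustration; the sketch just shows that training ends at the first epoch whose validation accuracy exceeds the threshold.

```python
def train_with_threshold(val_accuracies, threshold=0.855):
    """Return the number of epochs run before the stopping rule fires."""
    for epoch, val_acc in enumerate(val_accuracies, start=1):
        if val_acc > threshold:
            print(f'Epoch {epoch}: accuracy has reached > {threshold:.1%}, stopping')
            return epoch
    return len(val_accuracies)

# Hypothetical validation accuracies per epoch, for illustration only
val_accuracies = [0.41, 0.58, 0.70, 0.78, 0.82, 0.85, 0.8564, 0.86]
epochs_run = train_with_threshold(val_accuracies)
print(epochs_run)
```

A threshold rule like this is easy to reason about, but it never fires if the target is set too high. A patience-based rule, which stops when the validation metric has not improved for a number of epochs, is a common alternative.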
Observe in the output cell that the LSTM only runs until the 7th epoch and then stops. At the 7th epoch, the validation accuracy is 85.64%, which exceeds the target. Examine the training and validation accuracy below.
Okay, now let’s try to build another model for emotion classification. This time, AutoKeras will be applied. To do this, the code can be similar to the text regressor AutoKeras performed before. Just change the “TextRegressor” into “TextClassifier”, then the rest will work the same. But, the code below will try to perform an advanced text classifier using AutoModel. With AutoModel, we can specify to use TextToIntSequence and Embedding to transform the texts into integer sequences and embed them. We can also specify to use separable convolutional layers. The AutoKeras Text Classifier notebook is available here.
# Create the model
node_input = ak.TextInput()
node_output = ak.TextToIntSequence()(node_input)
node_output = ak.Embedding()(node_output)
node_output = ak.ConvBlock(separable=True)(node_output)
node_output = ak.ClassificationHead()(node_output)
keras = ak.AutoModel(inputs=node_input, outputs=node_output, overwrite=True, max_trials=3)
# Fit the training dataset
keras.fit(X_train_ak, y_train_ak, epochs=80, validation_split=0.2)
Observe the Deep Learning architecture created by the AutoKeras. It has the layers of embedding, separable_conv1d, and classification_head as specified.
# Show the built models
keras_export = keras.export_model()
keras_export.summary()
The model has an accuracy of 0.80. Examine the confusion matrix generated below. It shows that the model can predict the emotion based on the text well. To know which emotion is predicted the best, pay attention to the f1-score. Joy has the highest f1-score, followed by fear, anger, and sadness respectively. It means that the model can predict the emotion of joy the best.
# Predict the validation data
pred_keras = keras.predict(X_val_ak)
# Compute the accuracy
print('Accuracy: ' + str(accuracy_score(y_val_ak, pred_keras)))
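The per-class f1-score mentioned above combines precision and recall for each emotion. Here is a minimal sketch with hypothetical labels; the helper name `per_class_scores` is an assumption made for this example, and in practice classification_report from scikit-learn reports the same quantities.

```python
def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f1)} for every label in y_true."""
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = (precision, recall, f1)
    return scores

# Hypothetical labels, for illustration only
y_true = ['joy', 'joy', 'fear', 'anger', 'sadness', 'fear']
y_pred = ['joy', 'joy', 'fear', 'fear', 'sadness', 'anger']

scores = per_class_scores(y_true, y_pred)
for label, (p, r, f1) in scores.items():
    print(f'{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}')
```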
NLP aims to analyze a large number of text data. Some examples of the applications are for predicting sentiment and emotion from text using Machine Learning. Unlike numerical data, text data need special preprocessing, like BoW or TF-IDF, before fitting them to Machine Learning algorithms.
A regular expression is one of the basic NLP tools for finding and splitting sentences or words. A word tokenizer splits sentences into words. After that, each tokenized word can be processed with NER, stemming, and lemmatization. The frequency of each word is counted in the form of a BoW data frame. Stop words, which appear frequently but carry little meaning, can be removed. They can be identified with spaCy, as shown in the NER section.
AutoKeras is a package to perform text regression and classification using Deep Learning. Just inputting the text feature will automatically build the model, including word tokenization and preprocessing, hyperparameter-tuning, and deciding the layers.
Hi Rendyk, This is most complete work I find when working with nlp. I think every problem can be solved with this nlp guide when working with text data. Thank you very much for this compiled work !