Disaster Tweet Classification using BERT & Neural Network
This article was published as a part of the Data Science Blogathon
Text classification is one of the most interesting domains today. From chatbot systems to movie recommendations to sentence completion, text classification finds applications in one form or another. In this article, we are going to use BERT along with a neural network to build a model that classifies tweets associated with disasters. The dataset has been taken from Kaggle. This article is purely demonstration-based, to help you understand how to apply BERT and how easy it is to use.
- You should be aware of the theoretical aspects of BERT. If you are new or would like to brush up on the concepts, I highly recommend reading this article on Analytics Vidhya first.
- The language used in the demonstration is Python.
The dataset is freely available and has been taken from Kaggle. The file contains over 11,000 tweets associated with disasters. There is a ‘target’ label that denotes whether a tweet is about a real disaster (1) or not (0).
Loading the dataset
For this demonstration, I am using Google Colaboratory, a web-based environment where you can run Python scripts very easily. After downloading the dataset from Kaggle, I copied the CSV file to Colab for further analysis. The first step is to load the dataset using the pandas package as shown below:
import pandas as pd

# Replace "spam.csv" with the name of the CSV file you copied to Colab
df = pd.read_csv("spam.csv", encoding='ISO-8859-1')
df.head(5)
I have added encoding='ISO-8859-1' to resolve the error UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 135-136: invalid continuation byte.
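To see why the encoding argument matters, here is a minimal sketch using only the Python standard library. The sample string is a made-up assumption, not a row from the dataset; it simply contains one non-ASCII character that is stored as a single byte in ISO-8859-1.

```python
# Hypothetical sample text; any string with a non-ASCII character would do.
raw = "café au lait".encode("ISO-8859-1")  # 'é' is stored as the single byte 0xE9

try:
    raw.decode("utf-8")
    utf8_failed = False
except UnicodeDecodeError as err:
    # In UTF-8, 0xE9 announces a multi-byte sequence, but the next byte does not continue it.
    utf8_failed = True
    print("utf-8 decoding failed:", err.reason)

# Decoding with the encoding the bytes were written in recovers the text.
decoded = raw.decode("ISO-8859-1")
print(decoded)
```

Passing encoding='ISO-8859-1' to pd.read_csv asks pandas to decode the file the same way as the last step of this sketch.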
After running the above code, you should see the first five rows of the dataset.
Let's remove the columns we do not need from our dataset, namely 'id', 'keyword', and 'location'. For now, we are mostly interested in 'text' and 'target' for building the classifier.
df = df.drop(['id', 'keyword', 'location'], axis=1)
df.head()
The next step is to see if the dataset is balanced or not. Run the following code to see the count of both class labels (0 and 1):
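A minimal sketch of such a count, assuming pandas' value_counts on the 'target' column. The five-row DataFrame below is hypothetical toy data standing in for the tweets, so that the snippet runs on its own.

```python
import pandas as pd

# Hypothetical toy data standing in for the tweets DataFrame
df = pd.DataFrame({
    "text": [
        "forest fire near town",
        "i love this song",
        "what a sunny day",
        "new phone today",
        "flood warning issued",
    ],
    "target": [1, 0, 0, 0, 1],
})

# Number of rows per class label, most frequent label first
counts = df["target"].value_counts()
print(counts)
```

On the tweets DataFrame loaded earlier, the same call is df['target'].value_counts().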
The output will show you that this dataset is highly imbalanced: there are 9256 entries for label '0' and only 2114 entries for label '1'. A classifier trained on such data can become biased toward the majority class. Hence, in the next section, we will balance the dataset using the undersampling method.
Balancing the Dataset
As mentioned, this dataset is imbalanced. There are many techniques to balance a dataset, such as SMOTE, clustering the abundant class, and resampling. Of these, the simplest is undersampling, in which entries of the majority class are dropped until their number equals the number of entries in the minority class. Let us understand this via an example. In our current scenario, the class label distribution is as follows:
Label 0: 9256
Label 1: 2114
With undersampling, entries of label 0 are dropped because it is the majority class. Here, 7142 randomly chosen entries of label 0 (9256 - 2114) are dropped so that both classes have 2114 entries. You may wonder whether undersampling results in a loss of data. It does. However, my aim here is to demonstrate the power of BERT without getting into every detail of balancing the dataset. As an exercise, you should try other dataset balancing techniques and see the effect on your model. For now, we will proceed with undersampling the dataset. The code is as follows:
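# Split the rows by class label
df_0_class = df[df['target'] == 0]
df_1_class = df[df['target'] == 1]

# sample() needs a row count, so use shape[0] rather than the shape tuple
df_0_class_undersampled = df_0_class.sample(df_1_class.shape[0])
df = pd.concat([df_0_class_undersampled, df_1_class], axis=0)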
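As a quick sanity check of the undersampling idea, here is a minimal sketch on a hypothetical ten-row DataFrame (7 rows of label 0, 3 of label 1). The toy data and random_state=42 are assumptions added for reproducibility; they are not part of the original walkthrough.

```python
import pandas as pd

# Hypothetical toy data: 7 majority-class rows and 3 minority-class rows
df = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(10)],
    "target": [0] * 7 + [1] * 3,
})

df_0_class = df[df["target"] == 0]
df_1_class = df[df["target"] == 1]

# Draw as many majority rows as there are minority rows; n must be an integer
df_0_undersampled = df_0_class.sample(n=df_1_class.shape[0], random_state=42)

balanced = pd.concat([df_0_undersampled, df_1_class], axis=0)
balanced_counts = balanced["target"].value_counts()
print(balanced_counts)
```

Both labels end up with the same number of rows, which is exactly what undersampling is meant to achieve.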
Splitting the Dataset
If you have worked on any ML/DL-based project before, you are most probably aware of this classic step. We are going to divide the dataset into two parts: the training dataset and the test dataset. A third part, the validation dataset, is also commonly used. For the sake of simplicity, I will split the dataset in two using the sklearn package, which produces the train and test datasets in a single line of code. Passing stratify=df['target'] keeps the class proportions the same in both parts. The code is:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['target'], stratify=df['target'])
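The effect of stratify can be checked on a small example. This is a sketch on hypothetical toy data (100 rows per class); test_size=0.25 is written out explicitly because it is the value scikit-learn falls back to when no size is given, and random_state=42 is an assumption added for reproducibility.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical balanced toy data: 100 rows of each label
df = pd.DataFrame({
    "text": [f"tweet {i}" for i in range(200)],
    "target": [0] * 100 + [1] * 100,
})

X_train, X_test, y_train, y_test = train_test_split(
    df["text"],
    df["target"],
    test_size=0.25,          # scikit-learn's default when no size is specified
    stratify=df["target"],   # keep the 50/50 label ratio in both splits
    random_state=42,         # assumption: fixed seed for reproducibility
)

print(len(X_train), len(X_test))
print(y_train.mean(), y_test.mean())  # share of label 1 in each split
```

Both splits keep the 50/50 label ratio of the full data.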
BERT Preprocessor & Encoder
Now that you have made it this far, it's time to use TensorFlow and build a classification model using BERT. For this example, you need to import three more libraries as below:
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text as text  # registers the ops used by the BERT preprocessor
The next step is to download the BERT preprocessor and encoder. There are two ways to do so: manually or via URL. I prefer to download via URL because it is easier and cleaner. You can define two variables, one for the preprocessor and one for the encoder, as shown in the code below:
bert_preprocess = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3")
bert_encoder = hub.KerasLayer("https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4")
I recommend you visit the above links. BERT comes in two standard sizes that differ in the encoder: BERT Base and BERT Large. The one used in this tutorial is BERT Base, which has 12 encoder layers (the L-12 in the URL), a hidden size of 768 (H-768), and 12 attention heads (A-12). I highly encourage my readers to study the logic behind encoders after this exercise is done. An explanation of encoders is beyond the scope of this article.
Now, all we need to do is create a functional model using the BERT layers and neural network layers as shown below:
# BERT layers: raw text input, preprocessing, encoding
text_input = tf.keras.layers.Input(shape=(), dtype=tf.string, name='text-layer')
preprocessed_text = bert_preprocess(text_input)
outputs = bert_encoder(preprocessed_text)

# Neural network layers: dropout followed by a sigmoid output
d_layer = tf.keras.layers.Dropout(0.1, name="dropout-layer")(outputs['pooled_output'])
d_layer = tf.keras.layers.Dense(1, activation='sigmoid', name="output")(d_layer)

model = tf.keras.Model(inputs=[text_input], outputs=[d_layer])
If you look carefully, you will see that the first three layers are the BERT layers (text input, preprocessor, and encoder), followed by the neural network layers: a dropout layer and an output layer that classifies whether the given text is a disaster tweet or not.
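To make the role of the final layers concrete, here is a minimal NumPy sketch of what the dropout and sigmoid output layers compute at inference time. The random pooled_output, weights, and bias are made-up stand-ins (assumptions), not values from the trained model; only the 768-dimensional size matches the H-768 encoder used above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for outputs['pooled_output']: one 768-dimensional vector per tweet
pooled_output = rng.normal(size=(2, 768))

# Hypothetical parameters of the Dense(1) output layer
weights = rng.normal(scale=0.02, size=(768, 1))
bias = np.zeros(1)

def sigmoid(z):
    """Squash any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Dropout is only active during training; at inference it passes values through unchanged.
logits = pooled_output @ weights + bias
probabilities = sigmoid(logits)

print(probabilities.shape)  # one probability per input tweet
```

Each output is a probability that the tweet is about a real disaster, which is why a threshold such as 0.5 is applied later to obtain a 0/1 label.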
Running model.summary() lists the layers of the model along with their output shapes and parameter counts.
As with any other TensorFlow-based model, we need to compile this newly created model and provide arguments like the optimizer (SGD, Adam, etc.), the loss type, and the metrics (accuracy, precision, recall). This can be done as follows:
m = [
    tf.keras.metrics.BinaryAccuracy(name='accuracy'),
    tf.keras.metrics.Precision(name='precision'),
    tf.keras.metrics.Recall(name='recall')
]

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=m)
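For reference, the three metrics passed to compile can be worked out by hand from the four possible prediction outcomes. The counts below are hypothetical numbers chosen for easy arithmetic, not results from this model.

```python
# Hypothetical outcome counts for a binary classifier
tp = 40  # disaster tweets predicted as disaster
fp = 10  # non-disaster tweets predicted as disaster
fn = 5   # disaster tweets predicted as non-disaster
tn = 45  # non-disaster tweets predicted as non-disaster

accuracy = (tp + tn) / (tp + fp + fn + tn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted disasters that are real
recall = tp / (tp + fn)                     # share of real disasters that were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Precision and recall are worth tracking alongside accuracy because they separate the two kinds of error: false alarms and missed disasters.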
The model is now ready to be trained. Training may take a lot of time, depending on the computation power that you have. To train the model for 10 epochs, run:
model.fit(X_train, y_train, epochs=10)
During training, Keras reports the loss along with the accuracy, precision, and recall after each epoch. Voilà, you have successfully trained your model using BERT and neural networks.
A confusion matrix and a classification report are statistically neat ways to understand the performance of your model and to decide whether any improvement is required.
To print the confusion matrix, use the sklearn package as follows:
import numpy as np
from sklearn.metrics import confusion_matrix, classification_report

# Predicted probabilities, flattened to a 1-D array
y_predicted = model.predict(X_test)
y_predicted = y_predicted.flatten()

# Convert probabilities to class labels with a 0.5 threshold
y_predicted = np.where(y_predicted > 0.5, 1, 0)

matrix = confusion_matrix(y_test, y_predicted)
matrix
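To help read the result, here is a small sketch of how scikit-learn lays out the matrix for binary labels 0 and 1: rows are true labels, columns are predicted labels. The labels and scores below are hypothetical toy values, not outputs of the model.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and sigmoid scores for six tweets
y_true = np.array([0, 0, 1, 1, 1, 0])
scores = np.array([0.2, 0.7, 0.9, 0.4, 0.6, 0.1])

# Same thresholding step as in the article: scores above 0.5 become label 1
y_pred = np.where(scores > 0.5, 1, 0)

matrix = confusion_matrix(y_true, y_pred)

# For labels [0, 1] the layout is [[TN, FP], [FN, TP]]
tn, fp, fn, tp = matrix.ravel()
print(matrix)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

The diagonal holds the correct predictions, so a good model concentrates its counts there.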
To print the classification report:
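A minimal sketch of the call, assuming scikit-learn's classification_report and using hypothetical toy labels so that the snippet runs on its own:

```python
from sklearn.metrics import classification_report

# Hypothetical true and predicted labels for six tweets
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# Per-class precision, recall, F1-score, and support as formatted text
report = classification_report(y_true, y_pred)
print(report)
```

With the variables from this article, the equivalent call is print(classification_report(y_test, y_predicted)).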
In this article, we have built a disaster tweet classification model using BERT for text encoding. I hope this article gave you good hands-on experience of using BERT with a neural network. The same approach can be extended to larger and more complex problems, such as multi-class classification. In case you face any issues, consider commenting below and I will try my best to resolve the problem.
 Handling imbalanced dataset in Machine Learning – YouTube. (n.d.). Retrieved December 14, 2021, from https://www.youtube.com/watch?v=JnlM4yLFNuo
 S, V. (2020, November 12). Disaster tweets. Kaggle. Retrieved December 14, 2021, from https://www.kaggle.com/vstepanenko/disaster-tweets