Amrutha K — Published On June 6, 2023 and Last Modified On June 15th, 2023


Social media has become part of daily life in the digital era. It gives us a platform to express our thoughts and opinions. However, it also has a darker side: the widespread circulation of fake and hateful content. Some people use social media to spread false information, so an application that predicts the probability that a post is fake or hateful can contribute to online safety. Such predictions matter for social media moderation, content filtering, and online security, because they help identify and filter harmful content and combat online harassment, discrimination, and misinformation, creating a safer and more inclusive online environment. In this article, we will build a multi-task model for fake and hate probability prediction using BERT.

Machine learning models are usually trained for a single task. But imagine you have to build a model for sentiment analysis that classifies both the sentiment of a text (positive, negative, or neutral) and its emotion (such as anger, sadness, or happiness). The usual approach is to train two separate models, one per task. Instead, we can train a single model once for both tasks. Let’s explore how in this article.

Source: Packt Hub

Learning Objectives

In this article, we will learn:

  • What multi-task learning is and its types
  • Challenges of using multi-task learning
  • Building a multi-task model to predict fake and hate probabilities using BERT
  • How to create attention masks and apply padding and truncation

This article was published as a part of the Data Science Blogathon.

Multi-Task Learning

Multi-task learning (MTL) is a machine learning technique in which a single model is trained on multiple related tasks at once. It uses shared representations to improve the model’s performance. It is particularly useful when we need to perform several related tasks but do not have enough data to train each one individually. An MTL architecture shares the same lower-level features across tasks while learning task-specific higher-level features: it consists of shared layers connected to multiple task-specific layers, and each task-specific layer uses the shared features to solve its own task. MTL has applications in many fields, including natural language processing (NLP), computer vision, and speech recognition.

Source: Researchgate

For example, take social media platforms, where comments, reviews, and similar texts are generated. To classify these texts for better understanding, we need a model that tells us both the sentiment and the emotion of a text. These are the two tasks the model has to solve. Because the tasks use shared parameters, each can improve the other: detecting the emotion of a post may require understanding its sentiment, and vice versa. The training dataset contains both a sentiment label and an emotion label for every post, and the model trains on both. During training, the model learns to predict the sentiment and emotion of each post simultaneously, using a shared representation of the input text.

Types of Multi-Task Learning

Some of the different types of multi-task learning are as follows:

Hard Parameter Sharing

In hard parameter sharing, the neural network is trained with the same set of shared parameters for all tasks, usually with only small task-specific output layers on top. It assumes that the input features are common to all tasks. Its biggest advantage is simplicity. Sharing parameters reduces the total number of parameters, which makes training more efficient and lowers the risk of overfitting. However, it is not well suited to tasks that differ substantially, because it is difficult to find parameters that work for all of them.
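To make the idea concrete, here is a minimal plain-Python sketch of hard parameter sharing: one shared layer computes features once, and two small task heads reuse them. The weights, the helper names `shared_layer` and `task_head`, and the two-task setup are illustrative assumptions, not values taken from the model built later in this article.

```python
# Hard parameter sharing in miniature: one shared layer, two task heads.
# All weights below are made-up illustrative numbers.

def dot(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]

def shared_layer(x, shared_weights):
    """Shared features with a ReLU, computed once for all tasks."""
    return [max(0.0, h) for h in dot(shared_weights, x)]

def task_head(features, head_weights):
    """Task-specific linear layer on top of the shared features."""
    return dot(head_weights, features)

shared_weights = [[1.0, -1.0], [0.5, 0.5]]   # used by every task
fake_head = [[1.0, 0.0], [0.0, 1.0]]         # only used by the "fake" task
hate_head = [[2.0, 0.0], [0.0, -1.0]]        # only used by the "hate" task

x = [2.0, 1.0]
features = shared_layer(x, shared_weights)   # computed once, reused below
fake_scores = task_head(features, fake_head)
hate_scores = task_head(features, hate_head)
print(features, fake_scores, hate_scores)
```

Both heads read the same `features`, so anything the shared layer learns for one task is automatically available to the other.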

Soft Parameter Sharing

This approach differs from hard parameter sharing in that each task is trained with its own set of parameters. Instead of strictly sharing weights, the parameters of the different tasks are encouraged to stay similar, typically through a regularization term, so the model can learn task-specific representations while still benefiting from what the other tasks learn. This technique is used in domains such as NLP and computer vision. It is particularly useful when the input features of the tasks are similar but not identical.
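One common way to keep the parameters of different tasks similar is to add a penalty on the distance between them to the training loss. The sketch below assumes a squared (L2) distance and a hypothetical `strength` of 0.1; the parameter values and function names are made up for illustration.

```python
# Soft parameter sharing sketch: each task has its own weights, and a penalty
# term keeps the two sets of weights close to each other.

def l2_distance_penalty(params_a, params_b):
    """Sum of squared differences between two equally sized parameter lists."""
    return sum((a - b) ** 2 for a, b in zip(params_a, params_b))

def total_loss(task_a_loss, task_b_loss, params_a, params_b, strength=0.1):
    """Task losses plus the weighted similarity penalty (strength is hypothetical)."""
    penalty = l2_distance_penalty(params_a, params_b)
    return task_a_loss + task_b_loss + strength * penalty

sentiment_params = [0.5, -1.0, 2.0]   # made-up weights for the sentiment task
emotion_params = [0.0, -1.0, 1.0]     # made-up weights for the emotion task

penalty = l2_distance_penalty(sentiment_params, emotion_params)
loss = total_loss(0.4, 0.6, sentiment_params, emotion_params)
print(penalty, loss)
```

A larger `strength` pulls the two sets of parameters closer together, while a value of 0 trains the tasks independently.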


Attention-based multi-task learning uses the attention mechanism, which means that the model focuses on the parts of the data that are important and gives less weight to the rest. During training, the model selectively focuses on task-specific features. This allows it to learn task-specific representations while still benefiting from shared parameters.


Although multi-task learning has many advantages, such as better performance, improved generalization, and reduced complexity, it also poses some challenges:

  • It requires a sufficient amount of data for every task. Because the model learns multiple tasks, limited data for any one of them may lead to poor results on that task.
  • Because the tasks are trained together, training one task may negatively affect another.
  • It requires a more complex architecture, because layers are shared between tasks.
  • If the tasks differ in complexity, the model may prioritize the easier one and neglect the harder one, which hurts overall performance.
  • It requires more computational resources than single-task learning.


Now we will build a model that will predict fake and hate probabilities using Multi-Task learning.


In this project, we will use a fake hate dataset. Download it from here.

Social media platforms offer an extensive range of user-generated content and perspectives. This dataset is a collection of text sentences taken from various social media platforms. It has four columns in total. One is the text column, which contains text sentences in Hinglish. The other three columns are label_f, label_h, and label_s, denoting fake, hate, and sentiment respectively. Every text is multi-labeled: 1 represents true and 0 represents false. For example, if a text sentence is labeled 1 for hate, the text contains hatred.


Let’s start by importing some dependencies. In this project, we will use BertTokenizer for tokenizing texts and BertModel which is a pre-trained model based on BERT architecture. We also use a data loader that loads data in batches and enables efficient processing during training and evaluation.

import pandas as pd
import numpy as np
import torch
from transformers import BertTokenizer, BertModel
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader
from torch.utils.data import RandomSampler
from torch.utils.data import SequentialSampler

Import the dataset file and create a data frame. Then shuffle the entire dataset and reset the index, discarding the old one.

df = pd.read_csv('path to dataset') 
df = df.sample(frac=1).reset_index(drop=True) # Shuffle the dataset

Rename the columns of the data frame: ‘label_f’ becomes ‘fake’, ‘label_h’ becomes ‘hate’, and ‘label_s’ becomes ‘sentiment’.

df = df.rename(columns={'label_f': 'fake', 'label_h': 'hate', 'label_s': 'sentiment'})

Now we have to define task-specific labels. We have three tasks in total, so we define three label arrays: fake_labels, hate_labels, and sentiment_labels. We extract the values from the respective columns and convert them into NumPy arrays.

# Define Task-specific Labels
fake_labels = np.array(df['fake'])
hate_labels = np.array(df['hate'])
sentiment_labels = np.array(df['sentiment'])


The next step is to tokenize the texts using BertTokenizer. We initialize the tokenizer from the pre-trained bert-base-uncased model, loading it from the Hugging Face Transformers library.

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
tokenized_texts = [tokenizer.encode(text, add_special_tokens=True) for text in df['text']]

View a random text and tokenize it.

# rajneeti ko gandhwa diya ha in sapa congress ne I hate this type of rajneeti

Next, we have to perform some preprocessing steps like splitting the dataset into train and test sets, creating attention masks, and finally padding and truncating.

Splitting the Dataset

We will use the train_test_split function from sklearn with a test size of 0.2 for splitting the dataset. This means 20% of the dataset is randomly split for testing and 80% for training.

Attention Masks

We will create attention masks to indicate which tokens are actual tokens and which are padding tokens. An attention mask is a binary tensor with the same shape as the input sequence. A value of 1 marks an actual token, while a value of 0 marks a padding token. With attention masks, the model focuses only on the relevant information, which improves its efficiency and effectiveness.
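The mask rule described above can be sketched in a few lines of plain Python. The example assumes the padding token id is 0, as it is for the BERT tokenizer used in this article; the helper name `make_attention_mask` and the sample token ids are illustrative.

```python
# Build an attention mask from an already padded token-id sequence.
# Assumes the padding token id is 0.
PAD_ID = 0

def make_attention_mask(token_ids, pad_id=PAD_ID):
    """Return 1 for real tokens and 0 for padding tokens."""
    return [int(token_id != pad_id) for token_id in token_ids]

# Illustrative token ids: 101 and 102 stand in for [CLS] and [SEP].
padded_ids = [101, 7592, 2088, 102, 0, 0]
mask = make_attention_mask(padded_ids)
print(mask)  # [1, 1, 1, 1, 0, 0]
```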

Padding and Truncation

Neural networks typically require fixed-length input sequences for efficient processing. To give all input sequences the same fixed length, we use padding and truncation. Padding applies to sequences shorter than the specified maximum length: we add extra padding tokens at the end of the sequence. Truncation applies to sequences longer than the specified maximum length: we remove the last tokens of the sequence to bring it down to the maximum length.

In the below picture, you can see how a text sequence will look after padding.
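The same padding and truncation behaviour can be sketched in plain Python. This is a simplified stand-in for what a padding utility does with 'post' padding and 'post' truncation; the function name `pad_or_truncate` and the sample token ids are assumptions made for illustration.

```python
def pad_or_truncate(token_ids, max_len, pad_id=0):
    """Cut sequences longer than max_len and right-pad shorter ones with pad_id."""
    truncated = token_ids[:max_len]
    return truncated + [pad_id] * (max_len - len(truncated))

short_seq = [101, 7592, 102]
long_seq = [101, 7592, 2088, 2003, 2307, 102]

padded = pad_or_truncate(short_seq, max_len=5)
cut = pad_or_truncate(long_seq, max_len=5)
print(padded)  # [101, 7592, 102, 0, 0]
print(cut)     # [101, 7592, 2088, 2003, 2307]
```

Note that cutting from the end also drops the closing special token of a long sequence, as the second output shows.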

from keras.utils import pad_sequences

MAX_LEN = 256 # Define the maximum length of tokenized texts

# Split the data into train and test sets
# (tokenized_texts holds the token ids produced by the tokenizer)
(train_inputs, test_inputs,
 train_fake_labels, test_fake_labels,
 train_hate_labels, test_hate_labels,
 train_sentiment_labels, test_sentiment_labels) = train_test_split(
    tokenized_texts, fake_labels, hate_labels, sentiment_labels,
    random_state=42, test_size=0.2)

# Create attention masks
train_masks = [[int(token_id > 0) for token_id in input_id] for input_id in train_inputs]
test_masks = [[int(token_id > 0) for token_id in input_id] for input_id in test_inputs]

# Pad and truncate the input_ids and attention_mask to a fixed length
max_length = 256
train_inputs = pad_sequences(train_inputs, maxlen=max_length, dtype='long', 
                             value=0, truncating='post', padding='post')
test_inputs = pad_sequences(test_inputs, maxlen=max_length, dtype='long', 
                             value=0, truncating='post', padding='post')
train_masks = pad_sequences(train_masks, maxlen=max_length, dtype='long', 
                             value=0, truncating='post', padding='post')
test_masks = pad_sequences(test_masks, maxlen=max_length, dtype='long', 
                             value=0, truncating='post', padding='post')


A DataLoader is a PyTorch utility that facilitates efficient data loading and batching during the training or evaluation of a machine learning model.  It provides an iterable over a dataset and automatically handles various aspects of data processing, such as batching, shuffling, and parallel data loading. Each iteration of the loop returns a batch of input samples and their corresponding labels, which can be fed into the model for processing.

First, we define a batch size of 32, so the dataset is processed in batches of 32 samples. The training data is converted to a TensorDataset object that holds the training input sequences, attention masks, fake labels, hate labels, and sentiment labels. A RandomSampler, train_sampler, then draws samples from the training dataset in random order so that the batches are random during training. Finally, a training data loader is created from the training dataset and the random sampler; it provides the batches of data for training.

Similarly, a test data loader is created using test data and a test sampler.
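To illustrate the difference between random and sequential batching without any framework, here is a small plain-Python sketch. It only mimics the idea of a sampler plus a batch size; `make_batches`, its `seed` argument, and the sample counts are hypothetical and not part of PyTorch.

```python
import random

def make_batches(num_samples, batch_size, shuffle, seed=0):
    """Yield lists of sample indices, in random or sequential order."""
    indices = list(range(num_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)   # plays the role of a random sampler
    for start in range(0, num_samples, batch_size):
        yield indices[start:start + batch_size]

train_batches = list(make_batches(10, batch_size=4, shuffle=True))
test_batches = list(make_batches(10, batch_size=4, shuffle=False))
print(test_batches)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

Shuffling is used for training so that batches differ between epochs, while the test set is read in order, which keeps predictions aligned with the original rows.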

#Define Dataloader
batch_size = 32

train_data = TensorDataset(torch.tensor(train_inputs), torch.tensor(train_masks), 
                           torch.tensor(train_fake_labels), torch.tensor(train_hate_labels),
                           torch.tensor(train_sentiment_labels))
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=batch_size)

test_data = TensorDataset(torch.tensor(test_inputs), torch.tensor(test_masks), 
                          torch.tensor(test_fake_labels), torch.tensor(test_hate_labels),
                          torch.tensor(test_sentiment_labels))
test_sampler = SequentialSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=batch_size)

Multi-Task Model

Now we have to create a multi-task model for multi-label classification using BERT (Bidirectional Encoder Representations from Transformers). The ‘bert’ attribute is initialized with the pre-trained “bert-base-uncased” model. Then a dropout layer is added with a dropout rate of 0.1. Dropout is a regularization technique that randomly sets a fraction of input units to 0 during training to prevent overfitting.

Then we defined three linear classifiers. They are ‘fake_classifier’, ‘hate_classifier’, and ‘sentiment_classifier’. These three classifiers perform their respective tasks and produce logits for two classes. Then these are processed through softmax functions that convert logits to probabilities. The  fake_softmax, hate_softmax, and sentiment_softmax are three softmax functions used for three classifiers respectively.

The model takes input_ids and attention_mask as inputs and returns the logits and probabilities for the three tasks: fake classification, hate classification, and sentiment classification. This multi-task model allows for joint training and prediction of multiple classification tasks using a shared BERT backbone, which can capture contextual information and improve performance across different tasks.
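As a worked example of how a softmax function turns the two logits of one classifier into probabilities, here is a plain-Python sketch. The logit values are hypothetical; index 0 corresponds to class 0 and index 1 to class 1.

```python
import math

def softmax(logits):
    """Convert a list of logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(value - largest) for value in logits]  # shift for stability
    total = sum(exps)
    return [value / total for value in exps]

fake_logits = [0.0, 2.0]   # hypothetical scores for class 0 and class 1
fake_probs = softmax(fake_logits)
print([round(p, 4) for p in fake_probs])  # [0.1192, 0.8808]
```

The larger logit gets the larger probability, and the two values always add up to 1.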

# Define Multi-task Model
import torch.nn as nn
from transformers import BertModel

class MultiTaskModel(nn.Module):
    def __init__(self):
        super(MultiTaskModel, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(0.1)
        self.fake_classifier = nn.Linear(768, 2)
        self.hate_classifier = nn.Linear(768, 2)
        self.sentiment_classifier = nn.Linear(768, 2)
        self.fake_softmax = nn.Softmax(dim=1)
        self.hate_softmax = nn.Softmax(dim=1)
        self.sentiment_softmax = nn.Softmax(dim=1)

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids, attention_mask=attention_mask)
        pooled_output = outputs[1]  # pooled representation of the [CLS] token
        pooled_output = self.dropout(pooled_output)

        fake_logits = self.fake_classifier(pooled_output)
        hate_logits = self.hate_classifier(pooled_output)
        sentiment_logits = self.sentiment_classifier(pooled_output)

        fake_probs = self.fake_softmax(fake_logits)
        hate_probs = self.hate_softmax(hate_logits)
        sentiment_probs = self.sentiment_softmax(sentiment_logits)

        return (fake_logits, hate_logits, sentiment_logits,
                fake_probs, hate_probs, sentiment_probs)

Let’s define the loss function and optimizer for training the multi-task model. We use the cross-entropy loss function, which is a standard choice for multi-class classification tasks, and an Adam optimizer with a learning rate of 2e-5. The optimizer is responsible for updating the model’s parameters during training based on the computed gradients.

# Define Loss Function and Optimizer
model = MultiTaskModel()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)  # move the model to the same device as the input batches

criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=2e-5)


It’s time to train our multi-task model. In each step, the model produces logits and probabilities for all tasks, and a loss is calculated for each task. The sum of these losses gives the overall loss of the model.
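The way the three task losses combine can be illustrated with a plain-Python cross-entropy for a single example. The logits and labels below are hypothetical, and the helper `cross_entropy` is a simplified per-example version written for this sketch.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the correct class for one example."""
    largest = max(logits)
    log_sum_exp = largest + math.log(sum(math.exp(v - largest) for v in logits))
    return log_sum_exp - logits[label]

# Hypothetical logits and true labels for one text.
fake_loss = cross_entropy([0.0, 2.0], label=1)
hate_loss = cross_entropy([1.0, 1.0], label=0)
sentiment_loss = cross_entropy([3.0, 0.0], label=0)

# The overall loss is the plain sum of the task losses.
total_loss = fake_loss + hate_loss + sentiment_loss
print(fake_loss, hate_loss, sentiment_loss, total_loss)
```

Because the losses are simply added, each task contributes equally to the gradient that updates the shared BERT layers.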

# torch.optim.AdamW is used here; the AdamW class in transformers is deprecated
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, eps=1e-8)
criterion = nn.CrossEntropyLoss()

model.to(device)  # make sure the model is on the same device as the batches
model.train()     # enable dropout during training

for epoch in range(10):
    for step, batch in enumerate(train_dataloader):
        input_ids = batch[0].to(device)
        attention_mask = batch[1].to(device)
        fake_labels = batch[2].to(device)
        hate_labels = batch[3].to(device)
        sentiment_labels = batch[4].to(device)

        optimizer.zero_grad()  # clear gradients from the previous step

        (fake_logits, hate_logits, sentiment_logits,
         fake_probs, hate_probs, sentiment_probs) = model(input_ids, attention_mask)

        fake_loss = criterion(fake_logits, fake_labels)
        hate_loss = criterion(hate_logits, hate_labels)
        sentiment_loss = criterion(sentiment_logits, sentiment_labels)

        loss = fake_loss + hate_loss + sentiment_loss

        loss.backward()   # backpropagate the combined loss
        optimizer.step()  # update shared and task-specific parameters

        print(f"Epoch: {epoch}, Step: {step}, Loss: {loss.item()}")

After training, save the trained model and the tokenizer associated with it, so you don’t have to train again every time you use the model. You just have to load them to reuse them.

# Save the model weights and the tokenizer
torch.save(model.state_dict(), 'path/model.pth')
torch.save({'tokenizer': tokenizer}, 'path/model_info.pth')

Let’s see how it works. For this, you have to load both model and tokenizer associated with it.

import torch

# Load the model architecture and additional information
model_info = torch.load('path/model_info.pth')
tokenizer = model_info['tokenizer']

# Create an instance of the model class
new_model = MultiTaskModel()

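# Load the saved model weights
new_model.load_state_dict(torch.load('path/model.pth'))
new_model.eval()  # switch off dropout for inference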

Let’s evaluate our model on the test dataset. Create an empty list for storing all the predictions, then iterate over the test data loader to get batches of test data. For each batch, we obtain the logits and apply the softmax function to convert them into probabilities. Then we append the text and all its probabilities to the predictions list.

predictions = []
model.eval()  # disable dropout for evaluation
with torch.no_grad():
    for batch in test_dataloader:
        batch = tuple(t.to(device) for t in batch)
        input_ids, attention_mask, fake_labels, hate_labels, sentiment_labels = batch
        (fake_logits, hate_logits, sentiment_logits,
         fake_probs1, hate_probs1, sentiment_probs1) = model(input_ids, attention_mask)
        fake_probs = nn.Softmax(dim=1)(fake_logits)
        hate_probs = nn.Softmax(dim=1)(hate_logits)
        sentiment_probs = nn.Softmax(dim=1)(sentiment_logits)
        for i in range(len(fake_probs)):
            predictions.append({
                'text': tokenizer.decode(input_ids[i]),
                'fake': fake_probs[i].tolist(),
                'hate': hate_probs[i].tolist(),
                'sentiment': sentiment_probs[i].tolist()
            })

Let’s view the predictions, which contain the text along with the fake, hate, and sentiment probabilities. Each probability list has two values. Because the model was trained with labels 0 (false) and 1 (true), the first value is the probability of class 0 and the second value is the probability of class 1, that is, of the label being true. For example, fake probabilities of [0.6766, 0.3233] mean the model assigns a probability of 0.6766 to the text not being fake and 0.3233 to it being fake. The remaining labels are read the same way.

for i in range(len(predictions)):
    print('Text: {}'.format(predictions[i]['text']))
    print('Fake Probabilities: {}'.format(predictions[i]['fake']))
    print('Hate Probabilities: {}'.format(predictions[i]['hate']))
    print('Sentiment Probabilities: {}'.format(predictions[i]['sentiment']))
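If hard 0/1 labels are needed instead of probabilities, each probability pair can be thresholded. The sketch below assumes that index 1 of each pair holds the probability of class 1 (label true), which follows from training with labels 0 and 1. The helper `to_labels`, the 0.5 threshold, and the example values are hypothetical.

```python
def to_labels(prediction, threshold=0.5):
    """Map each task's [p_class0, p_class1] pair to a 0/1 label."""
    return {task: int(probs[1] >= threshold)
            for task, probs in prediction.items() if task != 'text'}

# Hypothetical prediction in the same shape as the entries of `predictions`.
example = {'text': 'some text', 'fake': [0.32, 0.68],
           'hate': [0.90, 0.10], 'sentiment': [0.45, 0.55]}
labels = to_labels(example)
print(labels)  # {'fake': 1, 'hate': 0, 'sentiment': 1}
```

Raising the threshold makes the labels more conservative, which may be preferable when false positives are costly in moderation.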


Multi-task learning is a powerful technique for training a single model on multiple tasks. We explored multi-task learning in this article, including its types and its challenges. One of the key aspects we focused on was building a multi-task model using BERT for predicting fake and hate probabilities. We walked through the steps involved in preparing the dataset, including tokenization, splitting, and creating attention masks. Additionally, we learned how to handle padding and truncation to ensure consistent input lengths.

  • Multi-task learning allows the model to perform multiple tasks simultaneously by sharing the learned representations across tasks.
  • Multi-task learning models come in a variety of types, each with its own advantages and disadvantages.
  • Machine learning is becoming more advanced and more prevalent in various fields, and multi-task learning with BERT can play an important role in building efficient models.
  • Fake and hate probability prediction models help identify and filter out harmful and false content, reduce its impact, and promote a safer online environment.
  • As the field of NLP continues to evolve, MTL holds great promise for pushing the boundaries of what is possible.
  • The future of MTL is hard to predict, but it has already proved its potential in various fields.

Hope you found this article useful. Connect with me on LinkedIn.

Frequently Asked Questions

Q1. What is a multitask model?

A. A multitask model is a type of machine learning model that is trained to perform multiple related tasks simultaneously. It shares a common underlying representation across tasks, allowing for improved efficiency and knowledge transfer between tasks, leading to better overall performance.

Q2. What is BERT used for?

A. BERT, which stands for Bidirectional Encoder Representations from Transformers, is a state-of-the-art natural language processing (NLP) model. It is used for various NLP tasks such as text classification, named entity recognition, and question-answering. BERT learns contextual word representations by training on a large corpus of text data.

Q3. Why is BERT better than LSTM?

A. BERT is considered better than LSTM (Long Short-Term Memory) in certain NLP tasks due to its ability to capture bidirectional context and handle long-range dependencies. BERT models are pretrained on large-scale unlabeled text data, enabling them to capture richer semantic representations than an LSTM, which processes text sequentially, one token at a time.

Q4. How do you predict using the BERT model?

A. To predict using the BERT model, the input text is first tokenized into subword units. These tokens are then fed into the BERT model, which processes them through multiple transformer layers. The output representations from the model can be used for various downstream tasks, such as classification or named entity recognition, by adding task-specific layers on top and fine-tuning the model on labeled data.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.