Exploring the Use of LLMs and BERT for Language Tasks

Sakshi Raheja Last Updated : 11 Apr, 2024

10 min read

Introduction

In the rapidly evolving landscape of artificial intelligence, especially in NLP, large language models (LLMs) have swiftly transformed how we interact with technology. Since the groundbreaking ‘Attention Is All You Need’ paper in 2017, the Transformer architecture, which underlies models such as BERT and ChatGPT, has become pivotal. GPT-3, a prime example, excels at generating coherent text. This article explores how to put LLMs, and BERT in particular, to work on language tasks through pre-training, fine-tuning, and prompting, and looks at what makes them perform so well.

Prerequisites: Knowledge of Transformers, BERT, and Large Language Models.


Introduction
What are LLMs?
Ways to Train Large Language Models
Finetuning Technique
Finetuning BERT
Conclusion
Frequently Asked Questions

What are LLMs?

LLM stands for Large Language Model. LLMs are deep learning models designed to understand and generate human-like text and to perform tasks such as sentiment analysis, language modeling (next-word prediction), text generation, text summarization, and much more. They are trained on a huge amount of text data.

We use applications based on these LLMs daily without even realizing it. Google uses BERT (Bidirectional Encoder Representations from Transformers) for various applications such as query completion, understanding the context of queries, outputting more relevant and accurate search results, language translation, and more.

Deep learning techniques, specifically deep neural networks and advanced methods like self-attention, underpin the construction of these models. They learn the language’s patterns, structures, and semantics by training on extensive text data. Because they rely on enormous datasets, training them from scratch consumes substantial time and resources, which makes it impractical for most practitioners.

There are techniques by which we can directly use these models for a specific task. So let’s discuss them in detail!

Ways to Train Large Language Models

We can train these models to perform a specific task by conventional fine-tuning, and simpler approaches such as prompting are now possible as well. Before that, let’s discuss the pre-training of an LLM.

Pretraining

In pretraining, a vast amount of unlabeled text serves as the training data for a large language model. The question is, ‘How can we train a model on unlabeled data and then expect the model to predict accurately?’. Here comes the concept of ‘Self-Supervised Learning.’ In self-supervised learning, the labels come from the text itself: part of the text is hidden, and the model tries to predict the hidden part from the words that remain visible.

E.g. Suppose we have a sentence: ‘I am a data scientist’.

The model can create its own labeled data from this sentence like:

Text	Label
I	am
I am	a
I am a	data
I am a data	scientist
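
The table above can be produced mechanically. The following is a minimal sketch, assuming simple whitespace tokenization (real models work on subword tokens); the function name `next_word_pairs` is made up for this illustration.

```python
def next_word_pairs(sentence):
    """Turn one sentence into (context, next word) training pairs."""
    words = sentence.split()  # naive whitespace tokenization (an assumption)
    pairs = []
    for i in range(1, len(words)):
        context = " ".join(words[:i])  # everything seen so far
        label = words[i]               # the word the model should predict
        pairs.append((context, label))
    return pairs


for context, label in next_word_pairs("I am a data scientist"):
    print(f"{context!r} -> {label!r}")
```

Every sentence of length n yields n - 1 labeled examples without any human annotation, which is why unlabeled text is enough for pretraining.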

This is next-word prediction, and models trained this way are auto-regressive. A masked language model (MLM) takes a different route: instead of predicting the next word, it hides some words in the sentence and predicts them. BERT is a masked language model. We can think of MLM as a `fill in the blank` exercise, in which the model predicts which word fits in the blank.

There are different pretraining objectives, but in this article we only talk about BERT, the MLM. Because it does not have to read strictly left to right, BERT can look at both the preceding and the succeeding words to understand the context of the sentence and predict the masked word.
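
To make the `fill in the blank` idea concrete, here is a small sketch that builds masked training examples from a sentence. It is a simplification: it masks every word in turn on whitespace tokens, whereas BERT masks a random subset of subword tokens. The function name `masked_examples` is hypothetical.

```python
def masked_examples(sentence, mask_token="[MASK]"):
    """Return (masked sentence, hidden word) pairs, masking one word at a time."""
    words = sentence.split()  # naive whitespace tokenization (an assumption)
    examples = []
    for i, target in enumerate(words):
        # Words on BOTH sides of the blank stay visible to the model
        masked = words[:i] + [mask_token] + words[i + 1:]
        examples.append((" ".join(masked), target))
    return examples


for text, label in masked_examples("I am a data scientist"):
    print(f"{text!r} -> {label!r}")
```

Unlike the next-word table, each example keeps the words after the blank, which is what lets a masked language model use context from both directions.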

So, as a high-level overview, pre-training is a technique in which the model learns to predict missing words from unlabeled text: the next word for auto-regressive models, and masked words for BERT.

Finetuning

Finetuning is tweaking the model’s parameters to make it suitable for performing a specific task. After pretraining, the model undergoes fine-tuning, where it is trained for specific tasks like sentiment analysis, text generation, and finding document similarity, to name a few. We don’t have to train the model again on a large text corpus. Rather, we start from the pretrained model and adapt it to the task we want to perform. We will discuss how to finetune a Large Language Model in detail later in this article.

Prompting

Prompting is the easiest of the three techniques, but it can be a bit tricky. It involves giving the model a context (the prompt) based on which the model performs the task.

Think of it as teaching a child a chapter from their book in detail, being very thorough with the explanation, and then asking them to solve a problem related to that chapter.

In the context of LLMs, take ChatGPT as an example. We set a context and ask the model to follow the instructions to solve the given problem.

Suppose I want ChatGPT to ask me interview questions on Transformers only.

For a better experience and accurate output, you need to set a proper context and give a detailed task description.

Example:

I am a Data Scientist with 2 years of experience, preparing for a job interview at XYZ company. I love problem-solving and am currently working with state-of-the-art NLP models. I am up to date with the latest trends and technologies. Ask me very tough questions on the Transformer model that the interviewer of this company can ask based on the company’s previous experience. Ask me 10 questions and also give the answers to the questions.

The more detailed and specific your prompt, the better the results. The most fun part is that you can generate the prompt from the model itself and then add a personal touch or the information needed.
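
One way to keep prompts detailed and consistent is to assemble them from parts. The sketch below only builds the prompt string; it does not call any model. The helper `build_prompt` and its parameter names are assumptions made for this illustration.

```python
def build_prompt(background, task, constraints):
    """Assemble a prompt from a context, a task description, and constraints."""
    lines = [background.strip(), "", f"Task: {task.strip()}"]
    if constraints:
        lines.append("Constraints:")
        lines.extend(f"- {c}" for c in constraints)
    return "\n".join(lines)


prompt = build_prompt(
    background="I am a Data Scientist with 2 years of experience, "
               "preparing for a job interview at XYZ company.",
    task="Ask me very tough interview questions on the Transformer model.",
    constraints=["Ask 10 questions", "Give the answer to each question"],
)
print(prompt)
```

Separating the background from the task makes it easy to reuse the same context with a different request.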

Finetuning Technique

How you fine-tune a model depends on the specific problem you want to solve. Let’s discuss the techniques.

There are 3 ways of conventionally finetuning an LLM.

Feature Extraction: This technique is used to extract features (embeddings) from a given text. Why would we want to do that? Since computers do not understand text, there must be some numerical representation of the text that can be used to perform different tasks. Once the embeddings are extracted, they can be used to analyze sentiment, find document similarity, etc. In feature extraction, the backbone layers of the model are frozen, i.e., the parameters of those layers are not updated, and only the parameters of the classifier layers are updated. The classifier layers are a small fully connected network placed on top of the backbone.
Full Model Finetuning: As the name suggests, this technique trains every layer of the model on the custom dataset for several epochs. The parameters of all the layers in the model are adjusted according to the new custom dataset. This can improve the model’s accuracy on the data and the specific task we want to perform. It is computationally expensive and takes a lot of time, considering that an LLM can have billions of parameters.
Adapter-Based Finetuning: Adapter-based finetuning is a comparatively new concept in which an additional randomly initialized layer or module is added to the network and then trained for a specific task. In this technique, the original parameters of the model are left unchanged, and only the adapter parameters are trained. This helps in tuning the model in a computationally efficient manner.
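
The practical difference between the three approaches is which parameters receive gradient updates. The sketch below uses a tiny stand-in network rather than a real LLM, and it only counts trainable parameters; it does not show where an adapter is wired into the forward pass. The names `make_model` and `trainable` are hypothetical.

```python
import torch.nn as nn


def trainable(model):
    """Count parameters that will receive gradient updates."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def make_model():
    # Toy stand-in: a small "backbone" plus a classifier head
    backbone = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 16))
    head = nn.Linear(16, 2)
    return nn.ModuleDict({"backbone": backbone, "head": head})


# 1. Feature extraction: freeze the backbone, train only the classifier head
fe = make_model()
for p in fe["backbone"].parameters():
    p.requires_grad = False

# 2. Full model finetuning: every parameter stays trainable
full = make_model()

# 3. Adapter-based: freeze the original weights, add a small new module and train it
ad = make_model()
for p in ad.parameters():
    p.requires_grad = False
ad["adapter"] = nn.Sequential(nn.Linear(16, 4), nn.ReLU(), nn.Linear(4, 16))

print(trainable(fe), trainable(full), trainable(ad))
```

Even in this toy setting, full finetuning updates several times more parameters than the other two approaches, and the gap grows with model size.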

Finetuning BERT

Now that we know the finetuning techniques, let’s perform sentiment analysis on the IMDB movie reviews using BERT. BERT is an encoder-only language model built from stacked Transformer layers. Google developed it, and it has proven to perform very well on various tasks. BERT comes in different sizes, such as BERT-base-uncased and BERT Large, and has inspired variants like RoBERTa, LegalBERT, and many more.

For free GPU availability, it is recommended to use Google Colab. Let us start by loading some important libraries. Since BERT (Bidirectional Encoder Representations from Transformers) is based on the Transformer architecture, the first step is to install the transformers library in our environment.

!pip install transformers

Let’s load some libraries that will help us to load the data as required by the BERT model, tokenize the loaded data, load the model we will use for classification, perform train-test-split, load our CSV file, and some more functions.

import os

import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertModel

We will run on the GPU when one is available for faster computation, and fall back to the CPU otherwise.

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

The next step would be to load our dataset and look at the first 5 records in the dataset.

df = pd.read_csv('/content/drive/MyDrive/movie.csv')

df.head()

Training and Validation Sets

We will split our dataset into training and validation sets. You can also split the data into train, validation, and test sets, but for the sake of simplicity, I am just splitting the dataset into training and validation.

x_train, x_val, y_train, y_val = train_test_split(df.text, df.label, random_state = 42, test_size = 0.2, stratify = df.label)

Let us import and load the BERT model and tokenizer.

from transformers.models.bert.modeling_bert import BertForSequenceClassification

# import BERT-base pre-trained model
BERT = BertModel.from_pretrained('bert-base-uncased')

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

We will use the tokenizer to convert the text into tokens with a maximum length of 250 and padding and truncation when required.

# Pad every review to 250 tokens and truncate longer ones
train_tokens = tokenizer(x_train.tolist(), max_length=250, padding='max_length', truncation=True)
val_tokens = tokenizer(x_val.tolist(), max_length=250, padding='max_length', truncation=True)

The tokenizer returns a dictionary with three key-value pairs: input_ids, the integer IDs of the tokens in each text; token_type_ids, a list of integers that distinguishes between different segments or parts of the input; and attention_mask, which indicates which tokens the model should attend to (real tokens) and which it should ignore (padding).
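
To see how input_ids and attention_mask relate, here is a simplified sketch of padding and truncation on a list of token IDs. It is not the tokenizer's actual implementation: the real tokenizer also preserves special tokens when truncating. The function name `pad_or_truncate`, the example IDs, and the default `pad_id=0` are assumptions for illustration.

```python
def pad_or_truncate(token_ids, max_length, pad_id=0):
    """Return (input_ids, attention_mask), both exactly max_length long."""
    ids = token_ids[:max_length]          # truncate if too long
    attention_mask = [1] * len(ids)       # 1 = real token
    padding = max_length - len(ids)
    ids = ids + [pad_id] * padding        # pad if too short
    attention_mask = attention_mask + [0] * padding  # 0 = padding, ignored
    return ids, attention_mask


ids, mask = pad_or_truncate([101, 2023, 3185, 102], max_length=6)
print(ids)
print(mask)
```

Because every sequence ends up with the same length, the lists can be stacked directly into a rectangular tensor.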

Converting these values into tensors

train_ids = torch.tensor(train_tokens['input_ids'])
train_masks = torch.tensor(train_tokens['attention_mask'])
train_label = torch.tensor(y_train.tolist())

val_ids = torch.tensor(val_tokens['input_ids'])
val_masks = torch.tensor(val_tokens['attention_mask'])
val_label = torch.tensor(y_val.tolist())

Loading TensorDataset and DataLoaders to preprocess the data further and make it suitable for the model.

from torch.utils.data import TensorDataset, DataLoader

train_data = TensorDataset(train_ids, train_masks, train_label)
val_data = TensorDataset(val_ids, val_masks, val_label)

# Shuffle the training data each epoch; keep the validation order fixed
train_loader = DataLoader(train_data, batch_size=32, shuffle=True)
val_loader = DataLoader(val_data, batch_size=32, shuffle=False)

Our task is to freeze the parameters of BERT, add our own classifier layers on top, and then train only those classifier layers on our custom dataset. So, let’s freeze the parameters of the model.

for param in BERT.parameters():
    param.requires_grad = False

Now, we have to define the forward pass for the layers that we have added. The BERT model acts as a feature extractor, and our classifier layers turn its output into a prediction. We do not need to write the backward pass ourselves, because PyTorch computes the gradients automatically.

class Model(nn.Module):
    def __init__(self, bert):
        super().__init__()
        self.bert = bert
        self.dropout = nn.Dropout(0.1)
        self.relu = nn.ReLU()
        self.fc1 = nn.Linear(768, 512)
        self.fc2 = nn.Linear(512, 2)
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(self, sent_id, mask):
        # Pass the inputs to BERT
        outputs = self.bert(sent_id, attention_mask=mask)
        # Hidden state of the first ([CLS]) token of each sequence
        cls_hs = outputs.last_hidden_state[:, 0, :]
        x = self.fc1(cls_hs)
        x = self.relu(x)
        x = self.dropout(x)
        x = self.fc2(x)
        # Log-probabilities over the two classes
        x = self.softmax(x)
        return x
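
To check the tensor shapes without downloading BERT, the classifier layers can be run on a stand-in encoder. This is only a sketch: `FakeEncoder` is a hypothetical class that returns random hidden states of size 768, and the batch size of 4 is arbitrary.

```python
from types import SimpleNamespace

import torch
import torch.nn as nn


class FakeEncoder(nn.Module):
    """Hypothetical stand-in for BERT: random hidden states of size 768."""

    def forward(self, sent_id, mask):
        batch, seq_len = sent_id.shape
        hidden = torch.randn(batch, seq_len, 768)
        return SimpleNamespace(last_hidden_state=hidden)


# Same layer sizes as the classifier in the article (dropout omitted)
head = nn.Sequential(
    nn.Linear(768, 512), nn.ReLU(), nn.Linear(512, 2), nn.LogSoftmax(dim=1)
)

sent_id = torch.zeros(4, 250, dtype=torch.long)  # 4 sequences of 250 token IDs
mask = torch.ones(4, 250, dtype=torch.long)

# Keep only the first token of each sequence as the sentence representation
cls_hs = FakeEncoder()(sent_id, mask).last_hidden_state[:, 0, :]
log_probs = head(cls_hs)
print(cls_hs.shape, log_probs.shape)
```

Each review is reduced to a single 768-dimensional vector, and the head maps it to two log-probabilities, one per class.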

Let’s move the model to GPU.

model = Model(BERT)

# push the model to the device (GPU if available)
model = model.to(device)

Defining the optimizer

# AdamW optimizer from PyTorch
from torch.optim import AdamW

# define the optimizer
optimizer = AdamW(model.parameters(), lr=1e-5)

We have preprocessed the dataset and defined our model. Now it is time to train it. We have to write the code to train and evaluate the model.

The train function:

def train():
    model.train()
    # The model outputs log-probabilities, so negative log-likelihood is the matching loss
    loss_function = nn.NLLLoss()
    total_loss = 0
    total_preds = []

    for step, batch in enumerate(train_loader):
        # Move batch to the device
        batch = [item.to(device) for item in batch]
        sent_id, mask, labels = batch

        # Clear previously calculated gradients
        optimizer.zero_grad()

        # Get model predictions for the current batch
        preds = model(sent_id, mask)

        # Calculate the loss between predictions and labels
        loss = loss_function(preds, labels)
        total_loss += loss.item()

        # Backward pass and gradient update
        loss.backward()
        optimizer.step()

        # Move predictions to CPU and convert to numpy array
        total_preds.append(preds.detach().cpu().numpy())

    # Compute the average loss
    avg_loss = total_loss / len(train_loader)

    # Concatenate the predictions
    total_preds = np.concatenate(total_preds, axis=0)

    # Return the average loss and predictions
    return avg_loss, total_preds
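
The loss computed here is easier to follow with the arithmetic written out. The classifier ends in a log-softmax, and the loss for one example is the negative log-probability of its true class. PyTorch's CrossEntropyLoss combines the log-softmax and negative log-likelihood steps in a single call. The sketch below is plain Python for one example with made-up scores; it is not how PyTorch implements it.

```python
import math


def log_softmax(scores):
    """Log-probabilities from raw scores, computed in a numerically stable way."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]


def nll(log_probs, label):
    """Negative log-likelihood of the true class."""
    return -log_probs[label]


scores = [2.0, 0.5]               # made-up raw scores for the two classes
log_probs = log_softmax(scores)
loss = nll(log_probs, label=0)
print(log_probs, loss)

# Applying log-softmax to values that are already log-probabilities changes nothing
again = log_softmax(log_probs)
```

The last line shows why feeding log-probabilities into a loss that applies log-softmax internally still yields the same value.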

The evaluation function:

def evaluate():
    model.eval()
    # The model outputs log-probabilities, so negative log-likelihood is the matching loss
    loss_function = nn.NLLLoss()
    total_loss = 0
    total_preds = []

    # Validation must not update the weights, so no gradients are computed
    with torch.no_grad():
        for step, batch in enumerate(val_loader):
            # Move batch to the device
            batch = [item.to(device) for item in batch]
            sent_id, mask, labels = batch

            # Get model predictions for the current batch
            preds = model(sent_id, mask)

            # Calculate the loss between predictions and labels
            loss = loss_function(preds, labels)
            total_loss += loss.item()

            # Move predictions to CPU and convert to numpy array
            total_preds.append(preds.detach().cpu().numpy())

    # Compute the average loss
    avg_loss = total_loss / len(val_loader)

    # Concatenate the predictions
    total_preds = np.concatenate(total_preds, axis=0)

    # Return the average loss and predictions
    return avg_loss, total_preds

Train the Model

We will now use these functions to train the model:

# set initial loss to infinite
best_valid_loss = float('inf')

# defining epochs
epochs = 5

# empty lists to store training and validation loss of each epoch
train_losses = []
valid_losses = []

# for each epoch
for epoch in range(epochs):
    print('\n Epoch {:} / {:}'.format(epoch + 1, epochs))

    # train model
    train_loss, _ = train()

    # evaluate model
    valid_loss, _ = evaluate()

    # save the best model
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        torch.save(model.state_dict(), 'saved_weights.pt')

    # append training and validation loss
    train_losses.append(train_loss)
    valid_losses.append(valid_loss)

    print(f'\nTraining Loss: {train_loss:.3f}')
    print(f'Validation Loss: {valid_loss:.3f}')

And there you have it. You can use your trained model to run inference on any text you choose.
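
The predictions returned by the functions above are log-probabilities, one row per review. A quick way to turn them into class labels and an accuracy score is to take the index of the largest value in each row. The sketch below uses plain Python lists with made-up numbers and assumes the labels are in the same order as the predictions, which only holds when the data loader does not shuffle. The function name `accuracy` is hypothetical.

```python
def accuracy(log_probs, labels):
    """Fraction of rows whose highest-scoring class matches the label."""
    correct = 0
    for row, label in zip(log_probs, labels):
        predicted = max(range(len(row)), key=lambda i: row[i])  # argmax
        correct += int(predicted == label)
    return correct / len(labels)


# Made-up log-probabilities for three reviews and their true labels
preds = [[-0.1, -2.3], [-1.6, -0.2], [-0.7, -0.69]]
labels = [0, 1, 0]
print(accuracy(preds, labels))
```

The third row is a close call that lands on the wrong class, so two of the three predictions are correct.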


Conclusion

This article explored the world of LLMs and BERT and their significant impact on natural language processing (NLP). We discussed the pretraining process, where LLMs are trained on large amounts of unlabeled text using self-supervised learning. We also delved into finetuning, which involves adapting a pre-trained model for specific tasks, and prompting, where models are provided with context to generate relevant outputs. Additionally, we examined different finetuning techniques, such as feature extraction, full model finetuning, and adapter-based finetuning. LLMs have revolutionized NLP and continue to drive advancements in various applications.

Key Takeaways

LLMs, such as BERT, are powerful models trained on vast amounts of text data, enabling them to understand and generate human-like text.
Pretraining involves training LLMs on unlabeled text using self-supervised learning techniques like masked language modeling (MLM).
Finetuning is adapting a pre-trained LLM for specific tasks by extracting features, training the entire model, or using adapter-based techniques, depending on the requirements.


Frequently Asked Questions

Q1. How do LLMs and BERT understand the meaning of text without explicit labels?

A. LLMs employ self-supervised learning techniques such as next-word prediction and masked language modeling. In masked language modeling, which BERT uses, the model predicts hidden words based on the context of the surrounding words, effectively creating labeled data from unlabeled text.

Q2. What is the purpose of finetuning LLMs?

A. Finetuning allows LLMs to adapt to specific tasks by adjusting their parameters, making them suitable for sentiment analysis, text generation, or document similarity tasks. It builds upon the pre-trained knowledge of the model.

Q3. What is the significance of prompting in LLMs?

A. Prompting involves providing context or instructions to LLMs to generate relevant outputs. Users can guide the model to answer questions, generate text, or perform specific tasks based on the given context by setting a specific prompt.
