A Deep Dive Into Transformers Library

Narasimha Last Updated : 07 Sep, 2021

7 min read

This article was published as a part of the Data Science Blogathon

Introduction

We have explored the Pipeline API of the transformers library which can be used for quick inference tasks. You can find more about it in my previous blog post here.

Now let’s go deep dive into the Transformers library and explore how to use available pre-trained models and tokenizers from ModelHub on various tasks like sequence classification, text generation, etc can be used.

So now let’s get started…

To proceed with this tutorial, a jupyter notebook environment with a GPU is recommended. The same can be accessed through Google Colaboratory which provides a cloud-based jupyter notebook environment with a free NVIDIA GPU.

Installation of Transformers Library

First, let’s install the transformers library,

!pip install transformers[sentencepiece]

To install the library in the local environment follow this link.

You should also have an HuggingFace account to fully utilize all the available features from ModelHub.

Getting Started with Transformers Library

We have seen the Pipeline API which takes the raw text as input and gives out model predictions in text format which makes it easier to perform inference and testing on any model. Now let’s explore and understand how the pipeline API works and the different components involved in it.

Tokenizers

Like any other neural network, the transformers also can’t process the raw input text directly, hence it needs to be preprocessed into a form that the model can make sense of. This process is called tokenization where text input is converted into numbers. To do this we use a tokenizer that does the following

Splitting the input text into words, subwords, or individual letters that are called tokens.
Mapping each token with a unique integer.
Arranging and adding required inputs that are useful to the model.

The preprocessing and tokenization process needs to be done in the same way as when the model was trained. Since we are using pre-trained models, we need to use the corresponding tokenizer for the model and this can be achieved by using AutoTokenizer class.

Let’s get the model/checkpoint name from the ModelHub and load the corresponding tokenizer by using the from_pretrained method from AutoTokenizer class.

from transformers import AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

In the above code, we have imported the AutoTokenizer class from the transformers library and initialized the model checkpoint name.

Now let’s initialize the tokenizer,

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=checkpoint)

When the above code is executed, the tokenizer of the model named distilbert-base-uncased-finetuned-sst-2-english is downloaded and cached for further usage. You can find more info about the model on this model here.

Now we can use this tokenizer directly on the raw text to get a dictionary of tensors to feed it directly to the model. The transformers will only accept tensors as input, hence the tokenizer return the input in tensor form.

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.", 
    "I am very excited about training the model !!",
    "I hate this weather which makes me feel irritated  !"
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

Output:

From the above output, we can see that the output is a dictionary containing input_ids and attention_mask keys. The input_ids consist of tokenized inputs and attention_mask contains the 1s and 0s which are used by the attention layer in the transformer defining whether the corresponding token should be attended or not.

Now let’s go through the model…

Model using Transformers Library

We can download the pre-trained model the same way we downloaded the tokenizer.

from transformers import AutoModel
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

We are using the same from_pretrained method from the AutoModel class to load the pre-trained model.

model = AutoModel.from_pretrained(checkpoint)

Similar to the tokenizer, the model is also downloaded and cached for further usage. When the above code is executed, the base model without any head is installed i.e. for any input to the model we will retrieve a high-dimensional vector representing contextual understanding of that input by the Transformer model.

The above transformer model which encoder-decoder architecture is able to extract useful and important features/contextual understanding from the input text, which can be further used for any task like text classification, language modeling, etc…

The output of the model is a high dimensional vector and it has three dimensions,

Batch size: The number of sequences processed at a time.
Sequence Length: The length of each sequence or numerical representation.
Hidden Size: The vector dimension of model input.

We can check the output of our model by following,

outputs = model(**inputs)
print(outputs.last_hidden_state.shape)

Output:

We can get the outputs bypassing the tokenized inputs into the model. From the shape of the output’s hidden state, we can see that our model has a batch size of 3, a sequence length of 16, and a hidden state of size 768.

The above output of the transformer model can be sent a model head to perform our required task.

We can get the model directly from the checkpoint by using different class types of AutoModel i.e

AutoModelForSequenceClassification – This class is used to get a text classification model from the checkpoint.
AutoModelForCasualLM – This class is used to get a language model from the given checkpoint.
AutoModelForQuestionAnswering – This class is used to get a model to perform context-based question answering etc…

There are different architecture available in the Transformers library, each one designed for a specific task.

Now let’s finally look at a Sequence classification task and how it can be done using AutoModel.

Sequence Classification

Now let’s import the AutoModel and AutoTokenizer classes and get our model and tokenizer from the checkpoint.

from transformers import AutoModelForSequenceClassification, AutoTokenizer
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"

Now let’s download the model and get some predictions,

model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

We use the same from_pretrained method we used previously to get our classification model.

raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.", 
    "I am very excited about training the model !!",
    "I hate this weather which makes me feel irritated  !"
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

We now have the tokenized inputs ready. Now let’s get the predictions,

outputs = model(**inputs)
print(outputs)

Output:

print(outputs.logits.shape)

Output:

We can see that our model has returned a 3 x 2 matrix, (since we have 3 sequences in input and there are 2 classes).

We can apply the softmax activation to our outputs for easy understanding,

import torch 
outputs = torch.nn.functional.softmax(outputs.logits, dim = -1)
print(outputs)

Output:

We can see that the outputs are passed through softmax activation to get the probabilities of each class for the input sentences.

We have [0.04, 0.95] as the output for the first input, [0.0002, 0.99] as the output for the second input, and finally [0.99, 0.000006] as the output for the third input sample.

If we check the labels of the model by,

model.config.id2label

We get the output as,

Hence we can see that our model is 95% confident that the first input sample belongs to the POSITIVE class, 99% confident that the second input sample belongs to the POSITIVE class, and 99% confident that the third input sample belongs to the NEGATIVE class. We can see that the model’s output is pretty accurate.

Therefore the Pipeline API for Text classification can be written as,

Code:

# Imports 
import torch 
from transformers import AutoModelForSequenceClassification, AutoTokenizer
# Define the model checkpoint
checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"
# Download and cache the tokenizer and classification model
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)
# Define the inputs and tokenize them
raw_inputs = [
    "I've been waiting for a HuggingFace course my whole life.", 
    "I am very excited about training the model !!",
    "I hate this weather which makes me feel irritated  !"
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)
# Get the outputs from the model
outputs = model(**inputs)
print(outputs)
# Find the class/label probabilities  
outputs = torch.nn.functional.softmax(outputs.logits, dim = -1)
print(outputs)
# Find the label to class mapping for verification
print(model.config.id2label)

Conclusion

We have seen that we can download the pre-trained models and tokenizers and use them directly on our own dataset. From the above example, we have seen that the pre-trained model was able to classify the label/sentiment of input sequences with almost 100% confidence.

Hence we can conclude that these pre-trained models can be used to creating state-of-the-art NLP systems and by fine-tuning it on our own datasets we can be able to get awesome results by yielding higher scores in the required metric.

References:

HuggingFace Course – link
HuggingFace Transformers Documentation – link

Image Sources

Image 1 – discuss.hugging face.co

Link for the colab notebook – https://colab.research.google.com/drive/1OZ-JYvTu12CYppYoPAoydkGp5sDjzuL1?usp=sharing

About the Author:

I’m Narasimha Karthik, Deep Learning Practioner.

I’m a final-year undergraduate student at PES University. Currently working with Computer Vision and NLP. Experience in working with PyTorch, Fastai, Tensorflow, and Keras frameworks. You can contact me through LinkedIn and Twitter for any projects or discussions.

Thank you

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Narasimha

Hi,
I am Narasimha Karthik J, a Data Scientist at Boeing Research in Bengaluru. I have experience in fine-tuning language model models (LLMs) for various domain-specific applications and deploying them. Additionally, I am experienced in LLM training, fine-tuning, RAG, and working with the latest frameworks and technologies.

Thanks and Regards,
Narasimha Karthik J

Advanced Deep Learning Libraries NLP Python

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Abdullah

Hi This is very useful I have a question though. When I pass a large enough list of sentences to the model as input, the kernel repeatedly crashes. The maximum amount it seems to be able to handle is 100 items in the list of sentences. Even if I pass the whole data in batches, the kernel still crashes. Any idea why?

Abe

Hi How can I pass a large list of sentences to the model? When I try to do so, the kernel crashes. Otherwise, if I use a data loader or load the data in batches, it still crashes. The upper value seems to be of a 100 different items.

Reading list

Introduction to NLP

Text Pre-processing

NLP Libraries

Regular Expressions

String Similarity

Spelling Correction

Topic Modeling

Text Representation

Information Retrieval System

Word Vectors

Word Senses

Dependency Parsing

Language Modeling

Getting Started with RNN

Different Variants of RNN

Machine Translation and Attention

Self Attention and Transformers

Transfomers and Pretraining

Question Answering

Text Summarization

Named Entity Recognition

Coreference Resolution

Audio Data

ASR

Audio Separation

Chatbot

Auto NLP

A Deep Dive Into Transformers Library

Introduction

Installation of Transformers Library

Getting Started with Transformers Library

Tokenizers

Model using Transformers Library

Sequence Classification

Conclusion

References:

Image Sources

About the Author:

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Congratulations, You Did It!

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk

SRM_I

SM

CLID

SRM_B

Google (7)

_gid

_ga_#

_gat_#

collect

AEC

G_ENABLED_IDPS

test_cookie

Webengage (2)

_we_us

WebKlipperAuth

LinkedIn (16)

ln_or

JSESSIONID

li_rm

AnalyticsSyncHistory