How to Build NLP Applications with Hugging Face?

Badrinarayan M Last Updated : 12 Jun, 2024

Introduction

Hugging Face (HF) is a pioneering AI platform enabling ML community collaboration on models, datasets, and applications. This article will delve into Hugging Face’s capabilities for building NLP applications, covering key services such as models, datasets, and open-source libraries. Whether you are a beginner or an experienced developer, Hugging Face offers versatile tools to enhance your NLP journey.

Overview

Learn about Hugging Face for building NLP applications using models, datasets, and open-source tools.
Explore Hugging Face’s core services, which include a wide array of models, comprehensive datasets, and essential open-source NLP libraries.
Using Hugging Face’s tools, discover practical NLP applications such as text classification, text summarization, generation, translation, etc.
Learn how to leverage popular Hugging Face libraries like Transformers to develop and fine-tune models for various natural language processing tasks.

What is Hugging Face (HF)?
Models in Hugging Face
Datasets in Hugging Face
Open Source Libraries and Docs
Transformers Library
Various Functionalities Available in HF for NLP
How to Build NLP Applications Using Hugging Face
Frequently Asked Questions

What is Hugging Face (HF)?

Hugging Face has quite a few offerings as an AI platform. Simply put, it is a platform where the ML community collaborates on models, datasets, and applications.

Let’s get started with Hugging Face. Some of the core Hugging Face services are:

Models
Datasets
Spaces
Open Source Libraries and Docs
Community

Models in Hugging Face

Hugging Face hosts many open-source models, such as LLMs, diffusion-based text-to-image models, audio models, and much more! A key advantage of using Hugging Face for this is its CLI tool, which is designed for large model files.

Model pages can also include lots of valuable tools and information. Some models will have direct links to run inference or host the model on a space.

Datasets in Hugging Face

Similar to models, Hugging Face also hosts datasets for training and evaluation. These include text datasets, audio data, image data, and more!

Open Source Libraries and Docs

Hugging Face creates, manages, and documents many open-source libraries that are popular in the ML space, such as:

Transformers
Diffusers
Gradio
Accelerate

We’ll explore the Transformers library in this article, but overall, most of these libraries help developers create and run ML applications, such as LLMs or text-to-image models.

Transformers Library

Transformers is a powerful open-source library for running, building, and fine-tuning pretrained transformer models (often LLMs or other text models) for natural language processing tasks. The library abstracts away much of the complexity involved in working with transformer models, allowing researchers and developers to focus on high-level tasks and rapid experimentation. Its wide adoption and support have made it a go-to library for many NLP projects and applications.

Various Functionalities Available in HF for NLP

In the NLP section of Hugging Face, models are grouped by the task they perform. The next section walks through several of these tasks.

How to Build NLP Applications Using Hugging Face

Some of the interesting NLP applications that we will look into are:

Text classification: Classify text by its nature, such as positive or negative (sentiment analysis) or spam versus ham.
Fill mask: One or more words in a sentence are masked, and the model predicts the masked words.
Text Summarization: Give the text that needs to be summarized to the model, and the model returns a summary.
Text Generation: The model generates text based on the input; it tries to complete the text.
Question and Answering: Give the model some context, and the model will answer when asked a question in that context.
Translation: We use models to translate text from one language to another.
Sentence similarity: Here, the model finds similarities between one sentence and multiple other sentences. It compares one sentence to all the other sentences.

Text Classification

We will now learn how to build NLP applications with Hugging Face for text classification. Text classification is one of the most popular techniques in NLP: we assign each piece of text one of several labels. Common text classification tasks include sentiment analysis, spam classification, and auto-tagging queries. We will do a basic sentiment analysis below. We can use a Hugging Face model in two ways:

Using Pipeline
Using the model directly

We will try both methods for sentiment analysis to get a simple overview of each. In most cases, the pipeline is the better choice unless some customization is required.

Using pipeline

import torch
import transformers
from transformers import pipeline
pipe = pipeline("text-classification", 
  model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")

output = pipe("You have to do better in NLP")
print(output)

output = pipe("It is very easy to create application out of hugging face")
print(output)

In the above code, we import PyTorch and Transformers. We then import pipeline from transformers, which wraps a model together with its preprocessing and postprocessing steps. Here, we use a DistilBERT model for classification. The pipeline takes care of tokenization and of turning the model’s raw output into a label and a score, so we can run inference directly on raw text. DistilBERT is a pretrained model; this checkpoint is additionally fine-tuned on the SST-2 sentiment dataset.

Using the model directly

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert/distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# Tokenize the sentence and run the model without tracking gradients
inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Pick the class with the highest logit and map it to its label
predicted_class_id = logits.argmax().item()
print(model.config.id2label[predicted_class_id])

Here, we import AutoTokenizer, which we use to load the DistilBERT tokenizer. Then, using AutoModelForSequenceClassification, we load the DistilBERT model. Now that we have loaded the tokenizer and model, we use them to classify our sentence. The model returns logits (unnormalized scores) for NEGATIVE and POSITIVE, and argmax picks the label with the highest score.
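To make that last step concrete, here is a minimal pure-Python sketch of how two logits become probabilities and a label. The logit values are made up for illustration, and the id2label mapping is an assumption standing in for model.config.id2label; in real code both would come from the model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for one sentence (illustrative values, not real model output)
logits = [-4.0, 4.0]
# Assumed label mapping; in practice read it from model.config.id2label
id2label = {0: "NEGATIVE", 1: "POSITIVE"}

probs = softmax(logits)
predicted_class_id = max(range(len(probs)), key=probs.__getitem__)
print(id2label[predicted_class_id], round(probs[predicted_class_id], 4))
```

Because softmax preserves the ordering of the scores, argmax over the logits and argmax over the probabilities select the same label. Softmax is only needed when you also want a confidence score.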

Fill Mask

We will learn how to build NLP applications with Hugging Face for Fill Mask. Fill mask is an NLP task where the model predicts the missing word or words in a sentence. This technique is primarily used in training language models to help them understand the context and relationships between words. We replace one or more words in a sentence with a special token (commonly [MASK]), and the model’s job is to predict the masked words correctly. We will use the distilbert-base-uncased model to implement the Fill Mask task.

from transformers import pipeline
unmasker = pipeline('fill-mask', model='distilbert-base-uncased')
unmasker("Hello I'm a [MASK] model.")

unmasker("The White man worked as a [MASK].")

Text Summarization

Next up, we will learn how to build NLP applications with Hugging Face for text summarization. Text summarization in NLP aims to condense a text into fewer words while preserving its key points. The model’s objective is to produce a coherent and fluent summary that captures the main points of the original text.

from transformers import pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

This loads Facebook’s BART (Bidirectional and Auto-Regressive Transformers). This model will do abstractive summarization. Abstractive summarization involves generating new sentences that convey the essential information from the original text, often paraphrasing and rephrasing the content.

ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, 
she got married in Westchester County, New York.
A year later, she got married again in Westchester County, 
but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. 
Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other.
In 2010, she married once more, this time in the Bronx. 
In an application for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of 
"offering a false instrument for filing in the first degree,
" referring to her false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, 
according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of service 
and criminal trespass for allegedly sneaking into the New York subway through an emergency exit,
said Detective Annette Markowski, a police spokeswoman. In total, 
Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island,
 New Jersey or the Bronx. She is believed to still be married to four men,
  and at one time, she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, 
who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. 
It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s 
Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called 
"red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to 
his native Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  
Her next court appearance is scheduled for May 18.
"""

Let us pass this text to the model to summarize.

Output = summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False)
print(Output[0]['summary_text'])

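max_length and min_length set the upper and lower bounds on the length of the summary, measured in tokens. do_sample=False turns off random sampling, so the output is deterministic. The above code illustrates how we can use BART to summarize our text.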

Text Generation

Text Generation is an NLP task ranging from generating the next word to generating an entire paragraph or even longer text. It is applied where new content relevant to the context is needed.

from transformers import pipeline, set_seed
generator = pipeline('text-generation', model='gpt2')
set_seed(42)

We will be using GPT-2 for text generation. At the time of writing, it is the most recent GPT-series model whose weights OpenAI has released openly. GPT-2 was a generational leap in NLP. Newer models such as GPT-4 and GPT-4o are far more capable, but since GPT-2 is openly available, we will use it.

Output = generator("Hello, I'm a language model,", max_length=30, num_return_sequences=5)
Output

max_length restricts each output of GPT-2 to at most 30 tokens (the prompt included), which is not the same as 30 words. num_return_sequences is set to 5, so the pipeline returns five sequences generated by GPT-2.
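The roles of the seed, max_length, and num_return_sequences can be illustrated with a toy sampler. This is only a sketch under loud assumptions: the NEXT_TOKEN table and the sample_sequences helper are hypothetical stand-ins for a language model, and the probabilities are invented rather than taken from GPT-2.

```python
import random

# Toy next-token table: hypothetical probabilities, not taken from GPT-2
NEXT_TOKEN = {
    "model": [("that", 0.5), ("which", 0.3), ("and", 0.2)],
    "that": [("writes", 0.6), ("reads", 0.4)],
    "which": [("writes", 0.5), ("reads", 0.5)],
    "and": [("writes", 0.5), ("reads", 0.5)],
}

def sample_sequences(start, max_length, num_return_sequences, seed):
    """Sample several continuations; the same seed reproduces the same output."""
    rng = random.Random(seed)
    sequences = []
    for _ in range(num_return_sequences):
        tokens = [start]
        # Stop at max_length tokens (prompt included) or when no continuation is known
        while len(tokens) < max_length and tokens[-1] in NEXT_TOKEN:
            words, weights = zip(*NEXT_TOKEN[tokens[-1]])
            tokens.append(rng.choices(words, weights=weights, k=1)[0])
        sequences.append(" ".join(tokens))
    return sequences

first = sample_sequences("model", max_length=3, num_return_sequences=5, seed=42)
second = sample_sequences("model", max_length=3, num_return_sequences=5, seed=42)
print(first)
print(first == second)
```

Fixing the seed makes the random draws repeatable, which is why the tutorial calls set_seed(42) before generating. Without it, each run would return different continuations.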

Question and Answering

Next, we will learn how to build NLP applications with Hugging Face for Question and Answering. In QnA, we use a model that takes a context and answers questions about that context. This is useful for building chatbots: a bot supplied with domain-specific context can answer domain-specific queries.

from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline
model_name = "deepset/roberta-base-squad2"
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)

In the above code, we use a RoBERTa model fine-tuned on SQuAD 2.0 for question answering. Now that we have downloaded and loaded the model, we will provide context and query our model.

QA_input = {
   'question': 'Where did Liana Barrientos get married?',
   'context': """ New York (CNN)When Liana Barrientos was 23 years old, 
   she got married in Westchester County, New York.
A year later, she got married again in Westchester County, 
but to a different man and without divorcing her first husband.
Only 18 days after that marriage, she got hitched yet again. 
Then, Barrientos declared "I do" five more times, sometimes only within 
two weeks of each other.
In 2010, she married once more, this time in the Bronx. In an application
 for a marriage license, she stated it was her "first and only" marriage.
Barrientos, now 39, is facing two criminal counts of "offering a 
false instrument for filing in the first degree," referring to her 
false statements on the
2010 marriage license application, according to court documents.
Prosecutors said the marriages were part of an immigration scam.
On Friday, she pleaded not guilty at State Supreme Court in the Bronx, 
according to her attorney, Christopher Wright, who declined to comment further.
After leaving court, Barrientos was arrested and charged with theft of 
service and criminal trespass for allegedly sneaking into the 
New York subway through an emergency exit, said Detective
Annette Markowski, a police spokeswoman. In total, Barrientos has been 
married 10 times, with nine of her marriages occurring between 1999 and 2002.
All occurred either in Westchester County, Long Island, New Jersey or the Bronx. 
She is believed to still be married to four men, and at one time, 
she was married to eight men at once, prosecutors say.
Prosecutors said the immigration scam involved some of her husbands, 
who filed for permanent residence status shortly after the marriages.
Any divorces happened only after such filings were approved. 
It was unclear whether any of the men will be prosecuted.
The case was referred to the Bronx District Attorney\'s Office by 
Immigration and Customs Enforcement and the Department of Homeland Security\'s
Investigation Division. Seven of the men are from so-called "red-flagged" 
countries, including Egypt, Turkey, Georgia, Pakistan and Mali.
Her eighth husband, Rashid Rajput, was deported in 2006 to his native 
Pakistan after an investigation by the Joint Terrorism Task Force.
If convicted, Barrientos faces up to four years in prison.  
Her next court appearance is scheduled for May 18.
"""
}
res = nlp(QA_input)
res

We can see that the model’s answer to the question “Where did Liana Barrientos get married?” is “Westchester County, New York,” with a confidence score of 0.5. This is not the best, but it is reasonable given that we are not using a state-of-the-art model.
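An extractive QA model of this kind does not write the answer; it scores every token in the context as a possible start and end of the answer span. The following is a minimal sketch of that span selection, assuming made-up start and end logits and a hypothetical best_span helper. The real pipeline works on subword tokens and applies further checks.

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def best_span(start_logits, end_logits, max_answer_len=5):
    """Return (start, end, score) of the highest-scoring span with start <= end."""
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    best = (0, 0, 0.0)
    for i, p_start in enumerate(start_probs):
        # Only consider ends at or after the start, within the length limit
        for j in range(i, min(i + max_answer_len, len(end_probs))):
            score = p_start * end_probs[j]
            if score > best[2]:
                best = (i, j, score)
    return best

# Hypothetical tokens and logits (illustrative values, not real model output)
tokens = ["she", "got", "married", "in", "Westchester", "County", ",", "New", "York"]
start_logits = [0.1, 0.0, 0.2, 0.5, 4.0, 0.3, 0.0, 1.5, 0.2]
end_logits = [0.0, 0.1, 0.3, 0.0, 0.5, 2.0, 0.1, 0.4, 3.5]

start, end, score = best_span(start_logits, end_logits)
answer = " ".join(tokens[start:end + 1])
print(answer, round(score, 2))
```

In this sketch the score is the product of the start and end probabilities, so a moderate score can simply mean the model splits its belief between several plausible spans, such as a shorter and a longer version of the same place name.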

Translation

Machine translation converts text from one language to another while preserving the context and meaning of the original. Translation in NLP has become so advanced that real-time translators are now built on state-of-the-art models. We will use the open-source t5-base model from Google for our translation. This model is not the best, but it gets our job done.

from transformers import pipeline
translate = pipeline('translation_en_to_fr')

Here, you can see that I have not specified a model. The pipeline still works because it downloads the default model for translating from English to French, which is Google’s T5 (t5-base).

result = translate("Hello, my name is Jose. What is your name?")
result

result = translate("How are you?")
result

These translations tend to be literal, so they may not sound the way people naturally speak French. T5 is not the best translator out there, but it is good at its job.

Sentence Similarity

from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F


# Load the model and tokenizer
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

This code downloads and loads our pretrained SBERT model and its tokenizer.

# Function to compute sentence embeddings
def compute_embeddings(sentences):
   inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
   with torch.no_grad():
       outputs = model(**inputs)
   embeddings = outputs.last_hidden_state.mean(dim=1)  # Mean pooling
   return embeddings

The tokenizer converts the sentences into model inputs, and we average the model’s per-token outputs (mean pooling) to get one embedding per sentence.
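One detail worth knowing: compute_embeddings takes a plain mean over the token axis, so when padding=True pads shorter sentences, the padded positions are averaged in as well. A common alternative is to weight the mean by the attention mask. The sketch below shows the idea in pure Python with made-up 2-dimensional vectors; masked_mean_pool is a hypothetical helper, not part of the tutorial’s code.

```python
def masked_mean_pool(token_vectors, attention_mask):
    """Average token vectors, skipping positions where the mask is 0 (padding)."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                summed[k] += vector[k]
    # Guard against division by zero for an all-padding input
    return [s / max(count, 1) for s in summed]

# Hypothetical token vectors; the last two positions are padding
token_vectors = [[1.0, 3.0], [3.0, 5.0], [9.0, 9.0], [9.0, 9.0]]
attention_mask = [1, 1, 0, 0]

plain_mean = [sum(col) / len(token_vectors) for col in zip(*token_vectors)]
pooled = masked_mean_pool(token_vectors, attention_mask)
print(plain_mean)  # padding vectors pull the average away
print(pooled)      # average over the two real tokens only
```

In the tensor version, the same effect would come from multiplying last_hidden_state by inputs["attention_mask"] before summing and dividing by the number of real tokens.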

# Define the sentences
sentence_to_compare = "How are you doing?"
sentences = [
   "I am fine, thank you.",
   "What are you doing today?",
   "How have you been?"
]

We then define our sentences and create embeddings for those sentences.

Now that we have the embeddings, we can use them to find the cosine similarity. We then display the similarity of all sentences with the sentence we intend to compare with. We can infer that sentence two has the highest similarity score. This is how we do sentence similarity.
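The comparison step itself can be sketched in a few lines. This is a pure-Python illustration, assuming small made-up vectors in place of real model embeddings; cosine_similarity and rank_by_similarity are hypothetical helper names, not functions from the Transformers library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, candidate_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(candidate_vecs)]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Hypothetical 3-dimensional embeddings standing in for real model output
query = [1.0, 0.0, 1.0]
candidates = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.9], [1.0, 1.0, 1.0]]

ranking = rank_by_similarity(query, candidates)
for index, score in ranking:
    print(index, round(score, 3))
```

With the tensors from compute_embeddings, one option is to convert them with .tolist() and pass them to these helpers; another is to call F.cosine_similarity from torch.nn.functional, which is already imported above.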

Conclusion

This article explores how to build various NLP applications using the popular Hugging Face Transformers library. Hugging Face is a very effective and versatile tool, and I am sure it will enhance your NLP journey. I recommend that everyone delve deeper into the internal workings of the models we discussed above so that they can be used effectively.

Frequently Asked Questions

Q1. What is Hugging Face, and why is it important in NLP?

A. Hugging Face is an NLP technology company. It provides the open-source Transformers library, which gives access to pre-trained models for many different NLP tasks. This makes it feasible for developers without deep machine learning experience to apply state-of-the-art NLP methods in practice.

Q2. How does text classification work in NLP using Hugging Face?

A. Text classification assigns a piece of text to one of a set of pre-defined categories. With Transformers from Hugging Face, one can use a variety of pre-trained models to classify a given body of text based on its content.

Q3. What is a fill-mask, and how is it used in NLP?

A. Fill-mask is a masked language modeling task: a number of words in a sentence are replaced with placeholder tokens, and the model must predict the missing words. Models such as BERT, available through Hugging Face, are trained on this task and use the surrounding context to recover the masked words.

Q4. How do Hugging Face transformers help with text summarization?

A. Text summarization means taking a long text and reducing its size without losing the main points. Hugging Face provides implementations of models like BART and T5 that take input text and produce a concise summary.

Q5. What do we mean by text generation? How do transformers help in this task?

A. Text generation means producing new text from a given input, a task at which transformers like GPT-2 and GPT-3 excel. Given a prompt, such models generate coherent, contextually relevant continuations. For example, given the opening of a paragraph, GPT-2 or GPT-3 will continue it in a way that follows logically from the input.

Badrinarayan M

Data science Trainee at Analytics Vidhya, specializing in ML, DL and Gen AI. Dedicated to sharing insights through articles on these subjects. Eager to learn and contribute to the field's advancements. Passionate about leveraging data to solve complex problems and drive innovation.
