How to Train an NER model with HuggingFace?

Praveen Pushpkar 23 Jun, 2022 • 7 min read

This article was published as a part of the Data Science Blogathon.

Introduction

Natural Language Processing (NLP) is a subfield of linguistics, computer science and artificial intelligence that focuses on computers’ ability to understand language in the form of text or speech.

NLP tasks include:

Speech Recognition: It is the task of converting voice data to text data. It is used in chatbots, voice search systems, voice commands to IoT devices, etc.

Sentiment Analysis: Sentiment analysis (aka Opinion mining) is an NLP technique used to determine whether a given sentence/phrase delivers a positive, negative, or neutral message.

Named Entity Recognition: Also known as NER, it is used to extract entities that belong to predefined categories.

In this article, we will be focusing on Named Entity Recognition (NER) and its real-world use cases, and in the end, we will train our custom model using HuggingFace embeddings.

What is NER?

NER (Named Entity Recognition), in simple words, is one of the key components of NLP (Natural Language Processing). It is used to recognize and extract entities belonging to predefined categories from plain, unstructured text.

These entities can be anything like Person’s Name, Location, Organization, Country, City, etc., depending on the categories we train our model on.

Let’s take an example:

In the above example, the model extracts PERSON, ORGANIZATION and LOCATION entities from the text.

Yes, indeed, it’s the magic of NER. 😎

In this article, we will go through the basic definition of NER and its use cases and train our own custom NER model using Hugging Face Flair embeddings.

But where can we use NER in the real world?🤔

NER Applications & Use Cases

1) Efficient search algorithms: NER can be used to extract relevant entities from search queries for better search results.

2) Resume parsing: Many companies and MNCs use NER for resume parsing. It extracts relevant information about each applicant, which helps shortlist the best candidates among thousands of applications.

3) PII (Personally Identifiable Information) extraction: Protecting users’ personal information is a crucial task for every company. NER helps extract PII entities such as Name, DOB, Credit Card Number, SSN, Phone Number, etc., so they can be masked.

4) Chatbots: One of the most common uses of NER is in chatbots, which use it to extract keywords for answering user queries.

Etcetera etcetera

Yupp!! Now let’s train our model 🤘😎🤘

We will use Hugging Face (not this 🤗) Flair embeddings to train our own NER model.

Hugging Face is a company that provides open-source NLP technologies. It has significant expertise in developing language processing models.

Training Custom NER Model using HuggingFace Flair Embedding

There is just one problem…NER needs extensive data for training.

But we don’t need to worry, as CONLL_03 comes to the rescue!!!

CoNLL-2003 provides a large annotated dataset, split into training, validation and test sets, along with additional unannotated data.

You can read more about CONLL_03 here.

Before starting the training, we must know the format of the NER training data.

The NER dataset should contain two columns separated by a single space. The first column consists of a single word followed by the Named Entity Tag in the second column.

Note : Column 1 must contain a single word.

Note : CoNLL-03 consists of 4 columns.
Column 1: the individual word
Column 2: part-of-speech tag
Column 3: syntactic chunk tag
Column 4: named entity tag
In most cases, columns 2 and 3 are omitted, as they are optional.
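
To make the column layout concrete, here is a small sketch that keeps only the word and the named entity tag from four-column lines. The function name conll_to_two_columns and the sample lines are made up for illustration and are not taken from the real dataset files.

```python
def conll_to_two_columns(lines):
    """Keep only the word (first column) and NER tag (last column) of each line."""
    rows = []
    for line in lines:
        line = line.strip()
        # Blank lines separate sentences; -DOCSTART- lines mark document boundaries.
        if not line or line.startswith("-DOCSTART-"):
            continue
        columns = line.split()
        rows.append((columns[0], columns[-1]))
    return rows


# Hypothetical sample in the four-column layout (word, POS, chunk, NER tag).
sample = [
    "-DOCSTART- -X- -X- O",
    "",
    "George NNP B-NP B-PER",
    "Washington NNP I-NP I-PER",
    "went VBD B-VP O",
    "to TO B-PP O",
    "Washington NNP B-NP B-LOC",
]

for word, tag in conll_to_two_columns(sample):
    print(word, tag)
```

Because the sketch reads the first and the last column, it also works on files where columns 2 and 3 have already been dropped.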

Let’s take an example :

Let the text be “George Washington went to Washington.”

So the format would be :

George B-PER
Washington I-PER
went O
to O
Washington B-LOC

Okay, I see you are confused. Don’t worry. Let me explain.

You might be wondering WHAT THE HELL IS THIS B-PER AND I-PER !!

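B -> stands for Beginning (the first word of an entity)

I -> stands for Inside (a word that continues the same entity)

O -> stands for Outside (a word that is not part of any entity)

So, if an entity consists of several related words, we use these prefixes to mark where it starts and how far it extends.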

Example: George Washington

Here, George is the first name, and Washington is the last name.

So, for NER to know that these types of words should come together while running the model on a piece of text, we prefix such words with B- or I- followed by the tag (PER, LOC, ORG, etc.) while creating our dataset.

Also, it will help our model differentiate between two similar-looking entities. In the above example, the Washington in George Washington is a last name, while the standalone Washington is a location.
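
If your annotations are stored as entity spans rather than per-word tags, the prefixes can be generated automatically. The following is only a sketch: the helper spans_to_bio and the (start, end, label) span format with an exclusive end index are assumptions made for this example.

```python
def spans_to_bio(tokens, spans):
    """Turn (start, end, label) token spans (end exclusive) into one tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        # The first word of an entity gets B-, every following word gets I-.
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


tokens = ["George", "Washington", "went", "to", "Washington"]
spans = [(0, 2, "PER"), (4, 5, "LOC")]  # hypothetical annotations

for token, tag in zip(tokens, spans_to_bio(tokens, spans)):
    print(token, tag)
```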

Another example: The United States of America

The O
United B-LOC
States I-LOC
of I-LOC
America I-LOC
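
Going the other way, the tags can be read back into whole entities by starting a new entity at every B- tag and extending it while matching I- tags follow. This is a minimal sketch with a made-up helper name (bio_to_entities); it is not how Flair does it internally.

```python
def bio_to_entities(tokens, tags):
    """Group B-/I- tagged tokens into (entity text, label) pairs."""
    entities = []
    current_words, current_label = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one first.
            if current_words:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_label == tag[2:]:
            current_words.append(token)
        else:
            # An O tag (or an I- tag that continues nothing) closes the open entity;
            # in this sketch the token itself is treated as outside any entity.
            if current_words:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
    if current_words:
        entities.append((" ".join(current_words), current_label))
    return entities


print(bio_to_entities(
    ["George", "Washington", "went", "to", "Washington"],
    ["B-PER", "I-PER", "O", "O", "B-LOC"],
))
print(bio_to_entities(
    ["United", "States", "of", "America"],
    ["B-LOC", "I-LOC", "I-LOC", "I-LOC"],
))
```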

That is all you need to know about your dataset before training your own custom NER model!!!

So now let’s break some keys!

The Code

Installing flair :

Flair is an open-source NLP framework that provides embeddings and models for tasks such as NER. Its pre-trained models are also available on the Hugging Face model hub.

To install Flair, we use the following command:

pip install flair
Imports :

We need to import the following classes & embeddings 

from flair.data import Corpus        
from flair.datasets import CONLL_03
from flair.embeddings import WordEmbeddings, StackedEmbeddings, FlairEmbeddings

Here,

Corpus & CONLL_03 are used to get the CONLL_03 corpus.

Flair supports a number of embeddings, which represent words in different ways and can be combined. These embeddings help NER perform better. WordEmbeddings, StackedEmbeddings & FlairEmbeddings are some of them.

Initializing the required variable

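Now, we use CONLL_03 to get our corpus. Note that, depending on your Flair version, the CoNLL-03 files may not be downloaded automatically and may have to be placed in Flair’s dataset folder by hand.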
corpus: Corpus = CONLL_03()
Type of tag you want to predict

As our main goal is to perform Named Entity Recognition, we set tag_type to ‘ner’.

tag_type = 'ner'
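We need to get the tag dictionary from the corpus that we initialized earlier. Newer Flair releases mark make_tag_dictionary as deprecated in favour of corpus.make_label_dictionary(label_type=tag_type), so check which one your version supports.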
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)
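
Conceptually, a tag dictionary is just a mapping from every tag that occurs in the corpus to an index the model can predict. The sketch below shows the idea in plain Python; build_tag_dictionary is a made-up helper, and the dictionary Flair builds may contain extra special entries and a different ordering.

```python
def build_tag_dictionary(tagged_sentences):
    """Assign an index to every distinct tag, in order of first appearance."""
    tag_to_index = {}
    for sentence in tagged_sentences:
        for _word, tag in sentence:
            if tag not in tag_to_index:
                tag_to_index[tag] = len(tag_to_index)
    return tag_to_index


# Toy corpus of (word, tag) pairs, reusing the examples from this article.
sentences = [
    [("George", "B-PER"), ("Washington", "I-PER"), ("went", "O"),
     ("to", "O"), ("Washington", "B-LOC")],
    [("United", "B-LOC"), ("States", "I-LOC"), ("of", "I-LOC"),
     ("America", "I-LOC")],
]

print(build_tag_dictionary(sentences))
```
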
We now initialize the embeddings we want to use for our model training. Flair provides several embeddings, each serving its own purpose. You can read more about them in the official Flair documentation.
embedding_types = [
    # GloVe embeddings
    WordEmbeddings('glove'),
    # contextual string embeddings, forward
    FlairEmbeddings('news-forward'),
    # contextual string embeddings, backward
    FlairEmbeddings('news-backward'),
]
Using StackedEmbeddings to combine all of our embeddings

Since we are using multiple embeddings, we need to stack them together. For this, we use the StackedEmbeddings.

embeddings = StackedEmbeddings(embeddings=embedding_types)
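
Stacking simply means that the vectors each embedding produces for a word are concatenated into one longer vector. Here is a toy illustration in plain Python; the helper stack_embeddings and the tiny vectors are made up for the example, and real embedding vectors are far longer.

```python
def stack_embeddings(*vectors):
    """Concatenate several per-word vectors into a single vector."""
    stacked = []
    for vector in vectors:
        stacked.extend(vector)
    return stacked


# Made-up vectors for one word, standing in for the three embeddings above.
glove_vec = [0.1, 0.2, 0.3]
forward_vec = [0.4, 0.5]
backward_vec = [0.6, 0.7]

token_vec = stack_embeddings(glove_vec, forward_vec, backward_vec)
print(len(token_vec))  # length is the sum of the individual lengths: 3 + 2 + 2
```
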
Init sequence tagger for predicting labels for single tokens
from flair.models import SequenceTagger
tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type=tag_type)

Here,

hidden_size = Hidden size of RNN layer
embeddings = Embeddings to use during training and prediction
tag_dictionary = Dictionary of all tags from the corpus
tag_type = ‘ner’

Training the model

Now let’s import ModelTrainer and initialize it with the tagger and corpus mentioned earlier.

 

from flair.trainers import ModelTrainer
trainer = ModelTrainer(tagger, corpus)

Finally, let’s train our model. 

trainer.train('resources/taggers/ner-english',
              train_with_dev=True,
              max_epochs=150)

Here,
resources/taggers/ner-english is the path where we want to save our model
train_with_dev = True if we also want to use the dev dataset from the corpus for training our model
max_epochs is the maximum number of epochs, i.e. full passes over the (shuffled) training data

Note : A max_epochs value above 100 is suggested here for better results; training can also stop earlier once the learning rate has been annealed below its minimum.

Annnddd Yippieee!!!

We have successfully trained our own custom NER model!
Let’s test it…

Testing the model

Now, as our model is trained, we can test it.

First, we import the required modules.

from flair.data import Sentence
from flair.models import SequenceTagger

Here,
Sentence is used to create a Sentence object, which we pass to our model for the prediction of entities.
SequenceTagger is used to load the trained model

Loading the trained model
model = SequenceTagger.load('resources/taggers/ner-english/final-model.pt')
Creating example sentence
sentence = Sentence("George Washington lives in Washington")
Predicting the tags
model.predict(sentence)
Printing the predicted tags
for entity in sentence.get_spans('ner'):
    print(entity)
Output :

WE DID IT!!!!!

Congratulations!!! You just trained your own custom NER model.

Conclusion

Today we discussed Named Entity Recognition (NER) and its real-world use cases, and trained our own custom model using Hugging Face Flair embeddings.

With NER, you can develop so much awesome stuff. Customer support chatbot, resume parser, custom search engine, content recommendation system, PII entity extraction, etcetera etcetera etcetera. 

Your imagination is the limit!

So scratch your head, get your hands dirty…you might create something revolutionary!!!

Below are some key takeaways from the article:

  • Natural Language Processing is a subfield of linguistics, computer science and artificial intelligence that focuses on the ability of computers to understand language in the form of text or speech.
  • NER is a key component of Natural Language Processing used to extract entities belonging to predefined categories.
  • MNCs use NER to develop efficient search engine algorithms, PII entity extraction, chatbots, etc.
  • We also learned how to train our own custom NER model using HuggingFace flair embeddings and tested our trained model.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion. 
