How to Train an NER model with HuggingFace?
This article was published as a part of the Data Science Blogathon.
Natural Language Processing (NLP) is a field at the intersection of linguistics, computer science, and artificial intelligence that focuses on computers’ ability to understand language in the form of text or speech.
NLP tasks include:
Speech Recognition: It is the task of converting voice data to text data. It is used in chatbots, voice search systems, voice commands to IoT devices, etc.
Sentiment Analysis: Sentiment analysis (aka Opinion mining) is an NLP technique used to determine whether a given sentence/phrase delivers a positive, negative, or neutral message.
Named Entity Recognition: Also known as NER, it is used to extract entities that belong to predefined categories.
In this article, we will focus on Named Entity Recognition (NER) and its real-world use cases, and in the end, we will train our custom model using HuggingFace embeddings.
What is NER?
NER (Named Entity Recognition), in simple words, is one of the key components of NLP (Natural Language Processing), used to recognize and extract entities of predefined categories from plain, unstructured text.
These entities can be anything like Person’s Name, Location, Organization, Country, City, etc., depending on the categories we train our model on.
Let’s take an example:
In the above example, the model extracts PERSON, ORGANIZATION, and LOCATION entities from the text.
Yes, indeed, it’s the magic of NER. 😎
In this article, we will go through the basic definition of NER and its use cases and train our own custom NER model using Hugging Face Flair embeddings.
But where can we use NER in the real world?🤔
NER Applications & Use Cases
1) Efficient search algorithms: NER can be used to extract relevant entities from search queries for better search results.
2) Resume parsing: Many companies & MNCs use NER to parse resumes, extracting relevant information about each candidate so that the best possible candidates can be filtered out of thousands of applicants.
3) PII (Personally Identifiable Information) extraction: Protecting the user’s personal information is one of the crucial tasks that every company needs to take care of. NER helps extract PII entities such as Name, DOB, Credit Card Number, SSN, Phone Number, etc., so they can be masked.
4) Chatbots: One of the most typical uses of NER is in chatbots, which use it to extract keywords for answering user queries.
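To make the PII use case a bit more concrete, here is a minimal sketch of the masking step. The `mask_entities` helper, the sample sentence, and the hard-coded spans are all hypothetical; in a real system the character offsets would come from a trained NER model.

```python
from typing import List, Tuple

# (start, end, label) character offsets; hypothetical output of some NER model
Span = Tuple[int, int, str]


def mask_entities(text: str, spans: List[Span]) -> str:
    """Replace each entity span with a placeholder such as [PERSON]."""
    masked = text
    # Work from right to left so earlier offsets stay valid after each replacement.
    for start, end, label in sorted(spans, reverse=True):
        masked = masked[:start] + f"[{label}]" + masked[end:]
    return masked


text = "John Smith called from 555-0100."
spans = [(0, 10, "PERSON"), (23, 31, "PHONE")]
print(mask_entities(text, spans))
```

Replacing from the end of the string backwards is a small design choice: it means the offsets of the remaining spans never need to be recalculated.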
Yupp!! Now let’s train our model 🤘😎🤘
We will use Hugging Face (not this 🤗) Flair embeddings to train our own NER model.
Hugging Face is a company that provides open-source NLP technologies. It has significant expertise in developing language processing models.
Training Custom NER Model using HuggingFace Flair Embedding
There is just one problem…NER needs extensive data for training.
But we don’t need to worry, as CONLL_03 comes to the rescue!!!
CoNLL-2003 is a large dataset that provides annotated data for training, validation, and testing, along with additional unannotated data.
You can read more about CONLL_03 here.
Before starting the training, we must know the format of the NER training data.
The NER dataset should contain two columns separated by a single space. The first column consists of a single word followed by the Named Entity Tag in the second column.
Note: Column 1 must contain a single word.
Note: CoNLL-03 consists of 4 columns.
Column 1: the individual word
Column 2: part-of-speech tag
Column 3: syntactic chunk tag
Column 4: named entity tag
In most cases, columns 2 and 3 are omitted, as they are optional.
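As a rough sketch of how the four-column layout reduces to the two-column one, the snippet below keeps only the word and the named entity tag from each line. The `SAMPLE` text and the `keep_word_and_ner` helper are made up for illustration and are not taken from the real CoNLL-03 files.

```python
from typing import List, Tuple

# Illustrative four-column lines: word, POS tag, chunk tag, NER tag
SAMPLE = """\
George NNP B-NP B-PER
Washington NNP I-NP I-PER
went VBD B-VP O
to TO B-PP O
Washington NNP B-NP B-LOC
"""


def keep_word_and_ner(conll_text: str) -> List[Tuple[str, str]]:
    """Keep column 1 (word) and the last column (NER tag) of each line."""
    pairs = []
    for line in conll_text.splitlines():
        columns = line.split()
        if not columns:  # blank lines separate sentences; skipped in this sketch
            continue
        pairs.append((columns[0], columns[-1]))
    return pairs


for word, tag in keep_word_and_ner(SAMPLE):
    print(word, tag)
```

Because the helper reads the first and last columns, it works the same way on lines that already have only two columns.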
Let’s take an example :
Let the text be “George Washington went to Washington.”
So the format would be:
George B-PER
Washington I-PER
went O
to O
Washington B-LOC
Okay, I see you are confused. Don’t worry. Let me explain.
You might be wondering WHAT THE HELL IS THIS B-PER AND I-PER !!
B -> stands for Beginning
I -> stands for Inside
O -> stands for Outside (the word is not part of any entity)
So, when an entity consists of several related words, we tag the first word with B- and each following word with I-.
Example: George Washington
Here, George is the first name, and Washington is the last name.
So, for NER to know that these types of words should come together while running the model on a piece of text, we prefix such words with B- or I- followed by the tag (PER, LOC, ORG, etc.) while creating our dataset.
Also, it will help our model differentiate between two similar-looking entities. In the above example, Washington is a last name in George Washington and a location on its own.
Another example: The United States of America
United B-LOC
States I-LOC
of I-LOC
America I-LOC
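To see how the B- and I- prefixes are read back into whole entities, here is a small sketch. The `bio_to_entities` helper is hypothetical and only meant to illustrate the tagging scheme; it is not part of Flair.

```python
from typing import List, Optional, Tuple


def bio_to_entities(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group B-/I- tagged tokens into (entity text, label) pairs."""
    entities: List[Tuple[str, str]] = []
    current_words: List[str] = []
    current_label: Optional[str] = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always starts a new entity; flush any open one first.
            if current_words:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [token], tag[2:]
        elif tag.startswith("I-") and current_words and tag[2:] == current_label:
            # An I- tag with the same label extends the open entity.
            current_words.append(token)
        else:
            # "O" (or a stray I- tag) closes the open entity.
            if current_words:
                entities.append((" ".join(current_words), current_label))
            current_words, current_label = [], None
    if current_words:
        entities.append((" ".join(current_words), current_label))
    return entities


tokens = ["George", "Washington", "went", "to", "Washington"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC"]
print(bio_to_entities(tokens, tags))
```

Notice that the two occurrences of Washington end up in different entities with different labels, which is exactly what the prefixes are for.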
That is all you need to know about your dataset before training your own custom NER model!!!
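So now let’s break some keys!
Flair is an open-source NLP library, originally developed at Zalando Research, that provides embeddings and sequence-labeling models; many of its pretrained models are also hosted on the Hugging Face model hub.
To install Flair, we use the following command: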
pip install flair
We need to import the following classes & embeddings:
from flair.data import Corpus
from flair.datasets import CONLL_03
from flair.embeddings import WordEmbeddings, StackedEmbeddings, FlairEmbeddings
Corpus & CONLL_03 are used to load the CoNLL-03 corpus.
Flair supports a number of embeddings, which represent words in different ways and can be combined. These embeddings help NER perform better. WordEmbeddings, StackedEmbeddings & FlairEmbeddings are some of them.
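Initializing the required variables
corpus: Corpus = CONLL_03()
Note: Depending on your Flair version, the CoNLL-03 files may need to be downloaded manually and their folder passed to CONLL_03 through the base_path argument.
As our main goal is to perform Named Entity Recognition, we set tag_type to 'ner'.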
tag_type = 'ner'
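# Newer Flair releases deprecate make_tag_dictionary in favor of corpus.make_label_dictionary(label_type=tag_type)
tag_dictionary = corpus.make_tag_dictionary(tag_type=tag_type)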
embedding_types = [
    # GloVe embeddings
    WordEmbeddings('glove'),
    # contextual string embeddings, forward
    FlairEmbeddings('news-forward'),
    # contextual string embeddings, backward
    FlairEmbeddings('news-backward'),
]
Since we are using multiple embeddings, we need to stack them together. For this, we use the StackedEmbeddings.
embeddings = StackedEmbeddings(embeddings=embedding_types)
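Conceptually, stacking means that each word’s vectors from the individual embeddings are concatenated into one longer vector. The toy sketch below shows the idea with plain Python lists; the lookup tables, their values, and the `stack` helper are invented for illustration and do not reflect how Flair stores embeddings internally.

```python
from typing import Dict, List

# Toy lookup tables standing in for two embedding models (values are made up).
embedding_a: Dict[str, List[float]] = {"George": [0.1, 0.2], "went": [0.3, 0.4]}
embedding_b: Dict[str, List[float]] = {"George": [0.5, 0.6, 0.7], "went": [0.8, 0.9, 1.0]}


def stack(token: str, tables: List[Dict[str, List[float]]]) -> List[float]:
    """Concatenate the token's vector from every table into one longer vector."""
    stacked: List[float] = []
    for table in tables:
        stacked.extend(table[token])
    return stacked


vector = stack("George", [embedding_a, embedding_b])
print(len(vector), vector)
```

The stacked vector’s length is the sum of the individual lengths, so the tagger sees the information from every embedding at once.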
from flair.models import SequenceTagger

tagger = SequenceTagger(hidden_size=256,
                        embeddings=embeddings,
                        tag_dictionary=tag_dictionary,
                        tag_type=tag_type)
Training the model
from flair.trainers import ModelTrainer

trainer = ModelTrainer(tagger, corpus)
Finally, let’s train our model.
trainer.train('resources/taggers/ner-english', train_with_dev=True, max_epochs=150)
resources/taggers/ner-english is the path where we want to save our model.
train_with_dev=True adds the dev dataset from the corpus to the training data.
max_epochs is the maximum number of training epochs, that is, full passes over the shuffled training data.
Note: The max_epochs value should be more than 100 for better results, so expect training to take a while.
Testing the model
Now, as our model is trained, we can test it.
First, we import the required modules.
from flair.data import Sentence
from flair.models import SequenceTagger
Sentence is used to create a Sentence object, which we pass to our model for entity prediction.
SequenceTagger is used to load the trained model.
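model = SequenceTagger.load('resources/taggers/ner-english/final-model.pt')
sentence = Sentence("George Washington lives in Washington")
# run the model on the sentence; without this call, no entities are attached to it
model.predict(sentence)
for entity in sentence.get_spans('ner'):
    print(entity)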
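Once the entities are printed, a common next step is to organize them by category. The sketch below assumes the predictions have already been collected as (text, label) pairs; the `predicted` list is hard-coded here and the `group_by_label` helper is hypothetical, so no trained model is needed to run it.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_by_label(entities: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Collect entity texts under their label, keeping the original order."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, label in entities:
        grouped[label].append(text)
    return dict(grouped)


# Hard-coded stand-in for the entities a trained model might return
predicted = [("George Washington", "PER"), ("Washington", "LOC")]
print(group_by_label(predicted))
```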
With NER, you can develop so much awesome stuff: customer support chatbots, resume parsers, custom search engines, content recommendation systems, PII entity extraction, and so on.
Your imagination is the limit!
So scratch your head, get your hands dirty…you might create something revolutionary!!!
Below are some key takeaways from the article:
- Natural Language Processing is a field at the intersection of linguistics, computer science, and artificial intelligence that focuses on the ability of computers to understand language in the form of text or speech.
- NER is a key component of Natural Language Processing used to extract entities that belong to predefined categories.
- MNCs use NER to develop efficient search engine algorithms, PII entity extraction, chatbots, etc.
- We also learned how to train our own custom NER model using HuggingFace flair embeddings and tested our trained model.