Named Entity Recognition (NER) in Python with spaCy

prateekmaj21 13 Sep, 2023

Natural Language Processing (NLP) deals with text data, and the amount of text generated today is enormous. Used properly, this data can yield valuable insights. Some of the most important NLP applications are text analytics, part-of-speech tagging, sentiment analysis, and Named Entity Recognition. An important aspect of analyzing text is identifying the named entities it contains. In this article, we discuss Named Entity Recognition (NER) in Python using spaCy.

What is a Named Entity?

A named entity is a real-world object that can be identified by a proper name. Named entities can be a place, person, organization, time, object, or geographic entity.

For example, Roger Federer, Honda City, and Samsung Galaxy S10 are named entities. Named entities are usually instances of entity types: Roger Federer is an instance of a tennis player (a person), Honda City is an instance of a car, and Samsung Galaxy S10 is an instance of a mobile phone.

Named Entity Recognition in Python

Named Entity Recognition is the NLP task of identifying and classifying named entities. Raw or structured text is taken as input, and the named entities in it are identified and segmented into predefined classes such as persons, organizations, places, money, and time.

NER systems are developed with various linguistic approaches, as well as statistical and machine learning methods. NER has many applications in projects and business.

An NER model first identifies an entity and then assigns it to the most suitable class. Some of the common types of named entities are:

1. Organisations:

NASA, CERN, ISRO, etc.

2. Places:

Mumbai, New York, Kolkata.

3. Money:

1 Billion Dollars, 50 Great Britain Pounds.

4. Date:

15th August 2020

5. Person:

Elon Musk, Richard Feynman, Subhas Chandra Bose.

An important thing about NER models is that their ability to understand Named Entities depends on the data they have been trained on. There are many applications of NER.

NER can be used for content classification: the named entities of a text can be collected, and the content's themes can be inferred from them. In academia and research, NER can be used to retrieve information faster from a wide variety of text. NER also helps a lot with information extraction from huge text datasets.

NER Using spaCy

spaCy is an open-source Natural Language Processing library that can be used for various tasks. It has built-in methods for Named Entity Recognition, and its statistical entity recognition system is fast.

spaCy is easy to use for NER tasks. Although we often need to train a model on our own data for business-specific needs, the pretrained spaCy model generally performs well on many types of text.

Let us get started with the code. First, we import spaCy and load the small English model.

import spacy
from spacy import displacy

NER = spacy.load("en_core_web_sm")

Now, we enter our sample text which we shall be testing. The text has been taken from the Wikipedia page of ISRO.

raw_text="The Indian Space Research Organisation or is the national space agency of India, headquartered in Bengaluru. It operates under Department of Space which is directly overseen by the Prime Minister of India while Chairman of ISRO acts as executive of DOS as well."
text1= NER(raw_text)

Now, we print the data on the NEs found in this text sample.

for word in text1.ents:
    print(word.text,word.label_)

The Output: 

The Indian Space Research Organisation ORG
the national space agency ORG
India GPE
Bengaluru GPE
Department of Space ORG
India GPE
ISRO ORG
DOS ORG

We can now see the named entities that the model extracted from this text. If we are unsure what a particular entity label means, we can look it up with the following method.

spacy.explain("ORG")

Output: ‘Companies, agencies, institutions, etc.’

spacy.explain("GPE")

Output: ‘Countries, cities, states’

Now, we try an interesting visual, which shows the NEs directly in the text.

displacy.render(text1,style="ent",jupyter=True)

Output:

[Image: displaCy rendering of the ISRO text with the named entities highlighted]

I will leave the Kaggle link at the end so that readers can try out the code themselves. In the visual, the named entities are highlighted in the text with contrasting colors, which makes them easy to spot. There is another type of visual, which explores the full dataset as a whole; please refer to the Kaggle link at the end.

Let us try the same tasks with some text containing more named entities.
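The `jupyter=True` argument is meant for notebooks. As a rough sketch of the alternative, `displacy.render` can return the markup as an HTML string when `jupyter=False` is passed. The example below also assumes displaCy's `manual=True` mode, which reads entity offsets from a plain dictionary, so it runs without loading a model. The sentence and its offsets are made up for illustration and are not model output.

```python
from spacy import displacy

text = "ISRO is headquartered in Bengaluru."

def span(phrase, label):
    """Build a displaCy entity dict from the first occurrence of phrase."""
    start = text.index(phrase)
    return {"start": start, "end": start + len(phrase), "label": label}

# Hand-written entities (an assumption for this sketch, not model output).
example = {
    "text": text,
    "ents": [span("ISRO", "ORG"), span("Bengaluru", "GPE")],
    "title": None,
}

# jupyter=False makes render() return the HTML instead of displaying it.
html = displacy.render(example, style="ent", manual=True, jupyter=False, page=True)
print(html[:80])
```

The returned string can then be written to a file and opened in a browser. With a real pipeline, the `Doc` object (such as `text1`) would be passed in place of the dictionary and `manual=True` dropped.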

raw_text2="The Mars Orbiter Mission (MOM), informally known as Mangalyaan, was launched into Earth orbit on 5 November 2013 by the Indian Space Research Organisation (ISRO) and has entered Mars orbit on 24 September 2014. India thus became the first country to enter Mars orbit on its first attempt. It was completed at a record low cost of $74 million."

text2= NER(raw_text2)
for word in text2.ents:
    print(word.text,word.label_)

Output: 

The Mars Orbiter Mission PRODUCT
MOM ORG
Mangalyaan GPE
Earth LOC
5 November 2013 DATE
the Indian Space Research Organisation ORG
ISRO ORG
Mars LOC
24 September 2014 DATE
India GPE
first ORDINAL
Mars LOC
first ORDINAL
$74 million MONEY

Here, we get more types of named entities. Let us identify what type they are.

spacy.explain("PRODUCT")

Output: ‘Objects, vehicles, foods, etc. (not services)’

spacy.explain("LOC")

Output: ‘Non-GPE locations, mountain ranges, bodies of water’

spacy.explain("DATE")

Output: ‘Absolute or relative dates or periods’

spacy.explain("ORDINAL")

Output: ‘ “first”, “second”, etc.’

spacy.explain("MONEY")

Output: ‘Monetary values, including unit’
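The individual lookups above can also be collapsed into a single loop. This is a small sketch that assumes only `spacy.explain`, the same function used above; the list of labels is typed in by hand from the two outputs.

```python
import spacy

# Labels that appeared in the two outputs above.
labels = ["ORG", "GPE", "PRODUCT", "LOC", "DATE", "ORDINAL", "MONEY"]

# spacy.explain is a glossary lookup, so no model has to be loaded here.
glossary = {label: spacy.explain(label) for label in labels}

for label, description in glossary.items():
    print(f"{label:<8} {description}")
```

With a processed text at hand, the list could instead be built from the entities themselves, for example `sorted({ent.label_ for ent in text2.ents})`.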

Now, we analyze the text as a whole in the form of a visual.

displacy.render(text2,style="ent",jupyter=True)

Output: 

[Image: displaCy rendering of the Mars Orbiter Mission text with the named entities highlighted]

Here, we see the various named entities in contrasting colors, which helps us understand the overall nature of the text.
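Besides the visual, a quick tally per label gives a feel for what a text is about. The sketch below uses only the standard library and works on `(text, label)` pairs copied from the output printed above; the variable names are chosen for this example.

```python
from collections import Counter, defaultdict

# (text, label) pairs copied from the output for the Mars Orbiter Mission text.
# With spaCy loaded, the same list would come from
# [(ent.text, ent.label_) for ent in text2.ents].
entities = [
    ("The Mars Orbiter Mission", "PRODUCT"),
    ("MOM", "ORG"),
    ("Mangalyaan", "GPE"),
    ("Earth", "LOC"),
    ("5 November 2013", "DATE"),
    ("the Indian Space Research Organisation", "ORG"),
    ("ISRO", "ORG"),
    ("Mars", "LOC"),
    ("24 September 2014", "DATE"),
    ("India", "GPE"),
    ("first", "ORDINAL"),
    ("Mars", "LOC"),
    ("first", "ORDINAL"),
    ("$74 million", "MONEY"),
]

# How many entities were found per label.
label_counts = Counter(label for _, label in entities)

# Distinct entity texts per label, in order of first appearance.
by_label = defaultdict(list)
for text, label in entities:
    if text not in by_label[label]:
        by_label[label].append(text)

for label, count in label_counts.most_common():
    print(f"{label:<8} {count}  {by_label[label]}")
```

Grouping like this also makes questionable predictions easier to spot, such as `MOM` under `ORG` or `Mangalyaan` under `GPE` in the output above.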

NER of a News Article

We shall web scrape data from a news article and do NER on the text data gathered from there.

We shall use Beautiful Soup for web scraping purposes.

from bs4 import BeautifulSoup
import requests
import re

Now, we will use the URL of the news article.

URL="https://www.zeebiz.com/markets/currency/news-cryptocurrency-news-today-june-12-bitcoin-dogecoin-shiba-inu-and-other-top-coins-prices-and-all-latest-updates-158490"
html_content = requests.get(URL).text
soup = BeautifulSoup(html_content, "lxml")

Now, we get the body content.

body=soup.body.text

Now, we use regex to clean the text.

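# Replace newlines, tabs, carriage returns and non-breaking spaces with spaces.
body = body.replace('\n', ' ')
body = body.replace('\t', ' ')
body = body.replace('\r', ' ')
body = body.replace('\xa0', ' ')
# Remove every character that is neither a word character nor whitespace.
body = re.sub(r'[^\w\s]', '', body)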
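The cleaning steps can also be wrapped in a small helper. This is a minimal sketch using only the standard library; the function name `clean_text` and the sample string are made up for illustration, and unlike the steps above it also collapses runs of whitespace into single spaces.

```python
import re

def clean_text(raw: str) -> str:
    """Strip punctuation and normalise whitespace in scraped text."""
    # Replace non-breaking spaces with ordinary spaces.
    text = raw.replace("\xa0", " ")
    # Drop every character that is neither a word character nor whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    # Collapse newlines, tabs and repeated spaces into single spaces.
    return re.sub(r"\s+", " ", text).strip()

# Hypothetical sample resembling the scraped article text.
sample = "Bitcoin was down by 6%,\n\tand was trading at\xa0Rs 27,28,815."
print(clean_text(sample))
```

Note that stripping punctuation also removes symbols such as `%`, `$` and the commas inside numbers, which may change what the model recognises as percentages or amounts of money.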

Let us now have a look at the text.

body[1000:1500]
'       View in App    Bitcoin was down by 6 and was trading at Rs 2728815 after hitting days high of Rs 2900208 Source Reuters        Reported By ZeeBiz WebTeam Written By Ravi Kant Kumar      Updated Sat Jun 12 20210646 pm   Patna ZeeBiz WebDesk    RELATED NEWS            Cryptocurrency Latest News Today June 14 Bitcoin leads crypto rally up over 12 after ELON MUSK TWEET Check Ethereum Polka Dot Dogecoin Shiba Inu and other top coins INR price World India updates             Bitcoin law is only'

Now, let us proceed with Named Entity Recognition.

text3= NER(body)
displacy.render(text3,style="ent",jupyter=True)

Well, the visual formed is very large, but there are some interesting parts which I want to cover.

[Image: excerpt of the displaCy rendering for the news article]

Final Thoughts

Named Entity Recognition (NER) is a crucial technique in natural language processing and can be implemented in Python using various libraries such as spaCy, NLTK, and StanfordNLP. Our Blackbelt course on NER in Python likely provides in-depth knowledge and practical skills in implementing NER using Python libraries. Mastering NER is beneficial for various applications such as sentiment analysis, chatbots, and information extraction from unstructured text data.

Frequently Asked Questions

Q1. What is named entity recognition with example?

A. Named Entity Recognition (NER) is a natural language processing technique that identifies and classifies named entities in text into predefined categories, such as people, organizations, and locations. For example, in the sentence “John works at Google in New York”, NER would identify “John” as a person, “Google” as an organization, and “New York” as a location.

Q2. What is name-based entity recognition?

A. Name-based Entity Recognition is a type of NER that specifically focuses on identifying and extracting named entities that are people or organizations. It helps in social media analysis or news articles, where identifying individuals and organizations is important.

Q3. What are the 3 steps in named entity recognition?

A. The three steps in named entity recognition are:

1. Tokenization, which involves breaking the text into individual words or phrases.
2. Part-of-speech tagging, which assigns a grammatical tag to each word.
3. Entity recognition, which identifies and classifies the named entities in the text.

Q4. How does named entity recognition work?

A. NER uses machine learning algorithms to analyze text and identify patterns that indicate the presence of named entities. These algorithms are trained on large datasets of annotated text, where human annotators have labeled the named entities in the text. When presented with new text, the NER algorithm applies the patterns it has learned to identify and classify the named entities in the text.
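To make "annotated text" concrete, here is a small sketch of what one labeled example can look like. The layout (a text paired with `(start, end, label)` character spans) is one common convention and is an assumption here, as are the label names `PERSON`, `ORG` and `GPE` and the helper `to_spans`.

```python
# The sentence from Q1, with the phrases an annotator would mark.
text = "John works at Google in New York"
labelled_phrases = [("John", "PERSON"), ("Google", "ORG"), ("New York", "GPE")]

def to_spans(text, phrases):
    """Convert (phrase, label) pairs into (start, end, label) character spans."""
    spans = []
    for phrase, label in phrases:
        start = text.find(phrase)  # first occurrence only, enough for this sketch
        if start == -1:
            raise ValueError(f"{phrase!r} not found in text")
        spans.append((start, start + len(phrase), label))
    return spans

annotation = (text, {"entities": to_spans(text, labelled_phrases)})

# Check that every span really covers the phrase it was built from.
for start, end, label in annotation[1]["entities"]:
    print(f"{text[start:end]!r:<12} {label}")
```

A model is trained on many such examples and learns which contexts tend to surround each label.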

Q5. What is spaCy NER?

A. spaCy NER (Named Entity Recognition) is a feature of the spaCy library used for natural language processing. It automatically identifies and categorizes named entities (e.g., persons, organizations, locations, dates) in text data. spaCy NER is valuable for information extraction, entity recognition in documents, and improving the understanding of text content in various applications like chatbots, text analytics, and content categorization.


Prateek is a dynamic professional with a strong foundation in Artificial Intelligence and Data Science, currently pursuing his PGP at Jio Institute. He holds a Bachelor's degree in Electrical Engineering and has hands-on experience as a System Engineer at TCS Digital, where he excelled in API management and data integration. Prateek also has a background in product marketing and analytics from his time with start-ups like AppleX and Milkie Way, Inc., where he was involved in growth campaigns and technical blog management. Recognized for his structured thinking and problem-solving abilities, he has received accolades like the Dr. Sudarshan Chakraborty Award for Best Student Performance. Fluent in multiple languages and passionate about technology, Prateek continues to expand his expertise in the rapidly evolving AI and tech landscape.
