Four of the easiest and most effective methods to Extract Keywords from a Single Text using Python

Ali Last Updated : 05 Jan, 2022


This article was published as a part of the Data Science Blogathon.

Introduction

Objectives: In this tutorial, I will introduce you to four methods for extracting keywords/keyphrases from a single text: Rake, Yake, KeyBERT, and TextRank. We will briefly review each method and then apply it to extract keywords from a worked example.

Prerequisite: Basic understanding of Python.

Keywords: keyword extraction, keyphrase extraction, Python, NLP, TextRank, Rake, BERT.

I would like to point out that in my previous article, I presented a method for extracting keywords from documents using a TF-IDF vectorizer. The TF-IDF method relies on corpus statistics to weight the extracted keywords, so it cannot be applied to a single text. This is one of its drawbacks.
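To make this drawback concrete, here is a minimal sketch of the plain inverse document frequency, log(N / df), computed by hand. The function name idf_scores and the toy token lists are my own illustrative assumptions and are not taken from the previous article. With a corpus of one document, every term gets the same IDF, so the weighting cannot separate important words from unimportant ones.

```python
import math
from collections import Counter

def idf_scores(corpus):
    """Plain IDF, log(N / df), for every term in a list of tokenized documents."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

# A "corpus" holding a single text: N = 1 and df = 1 for every term.
single = [["text", "mining", "text", "vectorization"]]
print(idf_scores(single))

# With a second document, terms that are not shared get a non-zero weight.
pair = [["text", "mining"], ["data", "mining"]]
print(idf_scores(pair))
```

Library implementations often use a smoothed IDF, in which case the single-document value is a constant other than zero, but it is still identical for every term.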

To illustrate how each of the methods (Rake, Yake, KeyBERT, and TextRank) works, I’ll use the abstract of my published scientific article together with the keywords specified by the author, and check which methods return keywords closest to them. Note that keyword extraction tasks distinguish between explicit keywords, which appear verbatim in the text, and implicit keywords, which the author lists as keywords even though they do not appear in the text but relate to its field.

Example of text and its keywords. Source: the author (Ali Mansour)

In the example shown in the image, we have the article title and abstract, and the reference keywords (defined by the author in the original article) are marked in yellow. Note that the keyphrase “machine learning” is implicit: it does not appear in the abstract. Of course, we could use the full text of the article, but for the sake of simplicity we limit ourselves to the abstract.
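The distinction between explicit and implicit keywords, and the comparison against the author's keywords, can be sketched in a few lines. This is only an illustration under stated assumptions: the helper names split_keywords and matched are hypothetical, the excerpt is the first sentence of the abstract, and author_keywords is a partial list containing only keywords that are named in this article.

```python
def split_keywords(text, author_keywords):
    """Separate keywords that occur verbatim in the text (explicit) from the rest (implicit)."""
    lowered = text.lower()
    explicit = [kw for kw in author_keywords if kw.lower() in lowered]
    implicit = [kw for kw in author_keywords if kw.lower() not in lowered]
    return explicit, implicit

def matched(extracted, author_keywords):
    """Author keywords that an extractor returned exactly (case-insensitive)."""
    found = {kw.lower() for kw in extracted}
    return [kw for kw in author_keywords if kw.lower() in found]

# First sentence of the abstract used in this article.
excerpt = ("In the text mining tasks, textual representation should be not only efficient "
           "but also interpretable, as this enables an understanding of the operational "
           "logic underlying the data mining models.")
# Partial, illustrative list: only keywords mentioned in this article.
author_keywords = ["text mining", "data mining", "machine learning"]

explicit, implicit = split_keywords(excerpt, author_keywords)
print("explicit:", explicit)
print("implicit:", implicit)
```

The same matched helper can be applied to the output of any of the extractors below to count how many author keywords were recovered exactly.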

Preparing text

The title is usually combined with the provided text as the title contains valuable information and reflects the content of the article in a nutshell. Thus, we will combine the text and the title simply with a plus sign between the two variables text and title:

title = "VECTORIZATION OF TEXT USING DATA MINING METHODS"
text = "In the text mining tasks, textual representation should be not only efficient but also interpretable, as this enables an understanding of the operational logic underlying the data mining models. Traditional text vectorization methods such as TF-IDF and bag-of-words are effective and characterized by intuitive interpretability, but suffer from the «curse of dimensionality», and they are unable to capture the meanings of words. On the other hand, modern distributed methods effectively capture the hidden semantics, but they are computationally intensive, time-consuming, and uninterpretable. This article proposes a new text vectorization method called Bag of weighted Concepts BoWC that presents a document according to the concepts’ information it contains. The proposed method creates concepts by clustering word vectors (i.e. word embedding) then uses the frequencies of these concept clusters to represent document vectors. To enrich the resulted document representation, a new modified weighting function is proposed for weighting concepts based on statistics extracted from word embedding information. The generated vectors are characterized by interpretability, low dimensionality, high accuracy, and low computational costs when used in data mining tasks. The proposed method has been tested on five different benchmark datasets in two data mining tasks; document clustering and classification, and compared with several baselines, including Bag-of-words, TF-IDF, Averaged GloVe, Bag-of-Concepts, and VLAC. The results indicate that BoWC outperforms most baselines and gives 7% better accuracy on average"
full_text = title +", "+ text
print("The whole text to be used\n", full_text)

The whole text to be used. Source: the author (Ali Mansour)

Now we will start applying each of the mentioned methods to extract keywords.

YAKE!

YAKE! is a lightweight, unsupervised automatic keyword extraction method that relies on statistical text features extracted from individual documents to identify the most relevant keywords in the text. It does not need to be trained on a particular set of documents, nor does it depend on dictionaries, text size, domain, or language. Yake defines a set of five features capturing keyword characteristics, which are heuristically combined to assign a single score to every keyword. The lower the score, the more significant the keyword. You can read more about it in the YAKE! paper listed in the references; a Python package for Yake is also available.

We first install YAKE!, then import it:

pip install git+https://github.com/LIAAD/yake
import yake

Next we build a KeywordExtractor object. From the yake module, we call the KeywordExtractor constructor, which accepts several parameters. The most important are top, the number of keywords to retrieve (set to 10 here); lan, the language, for which we keep the default “en”; and stopwords, an optional list of stop words. We then pass the text to the extract_keywords method, which returns a list of (keyword, score) tuples. With the default settings, keywords range in length from 1 to 3 words.

kw_extractor = yake.KeywordExtractor(top=10, stopwords=None)
keywords = kw_extractor.extract_keywords(full_text)
for kw, v in keywords:
  print("Keyphrase: ",kw, ": score", v)

Yake results. Source: the author (Ali Mansour).

We note that there are three keywords identical to the words provided by the author, which are text mining, data mining and text vectorization methods. It is interesting that YAKE! pays attention to capital letters and gives more importance to words that start with a capital letter.

Rake

Rake is short for Rapid Automatic Keyword Extraction, a method for extracting keywords from individual documents. It can be applied to new domains very easily and is effective on many types of documents, especially text that follows specific grammatical conventions. Rake identifies key phrases in a text by analyzing how frequently each word occurs and how it co-occurs with other words in the text.
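To show what this frequency and co-occurrence analysis looks like, here is a rough sketch of the scoring idea from the RAKE paper: split the text into candidate phrases at stop words and punctuation, then score each word by its degree divided by its frequency and sum these ratios per phrase. This is not the multi_rake implementation used below; the function name rake_sketch and the tiny stop-word list are my own assumptions, and real implementations use much larger stop-word lists.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, far smaller than real lists).
STOPWORDS = {"a", "an", "and", "are", "as", "but", "by", "for",
             "in", "is", "of", "on", "the", "to", "with"}

def rake_sketch(text):
    """Score candidate phrases with the degree/frequency ratio of their words."""
    # 1. Split into candidate phrases at punctuation and stop words.
    phrases = []
    for fragment in re.split(r"[.,;:!?()]", text.lower()):
        current = []
        for word in re.findall(r"[a-z][a-z\-]*", fragment):
            if word in STOPWORDS:
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)

    # 2. Frequency = occurrences of a word; degree = total length of the
    #    phrases it occurs in (co-occurrences, counting the word itself).
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)

    # 3. Phrase score = sum of degree(w) / freq(w) over the words in the phrase.
    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

sample = "Text mining and data mining are effective. Text vectorization is a task of text mining."
for phrase, score in rake_sketch(sample)[:3]:
    print(phrase, round(score, 2))
```

Because the score sums over the words of a phrase, longer phrases tend to score higher, which is why Rake often returns multi-word keyphrases.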

We’ll be using a package called multi_rake. First install it, then import Rake:

pip install multi_rake

from multi_rake import Rake
rake = Rake()
keywords = rake.apply(full_text)
print(keywords[:10])

Rake results. Source: the author (Ali Mansour)

We notice that there are two relevant keywords: text mining and data mining.

TextRank

TextRank is an unsupervised method for extracting keywords and sentences. It is based on a graph in which each node is a word and edges represent the co-occurrence of words within a moving window of a predetermined size. The algorithm is inspired by PageRank, which Google used to rank websites. It first tokenizes the text and annotates it with part-of-speech (PoS) tags. Only single words are considered as graph nodes; no n-grams are used at this stage, and multi-word keywords are reconstructed later. An edge is created if two lexical units co-occur within a window of N words, which yields an unweighted, undirected graph. The TextRank algorithm is then run on this graph to rank the words. The highest-ranked words are selected, and selected words that are adjacent in the text are merged into multi-word keywords.
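The graph-building and ranking steps can be sketched with the standard library only. This is a simplified illustration and not the summa implementation used below: the function name textrank_sketch is hypothetical, PoS tagging and the final merging of adjacent words are skipped, and the input is assumed to be a list of tokens that has already been filtered.

```python
from collections import defaultdict

def textrank_sketch(words, window=2, damping=0.85, iterations=50):
    """Rank words with PageRank over an unweighted, undirected co-occurrence graph."""
    # Connect words that co-occur within `window` consecutive positions.
    neighbors = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + window]:
            if other != word:
                neighbors[word].add(other)
                neighbors[other].add(word)

    # Power iteration of the PageRank update, starting from a score of 1.0 per node.
    scores = {word: 1.0 for word in neighbors}
    for _ in range(iterations):
        scores = {
            word: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[word])
            for word in neighbors
        }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Assumed to be pre-filtered tokens (for example, nouns and adjectives only).
tokens = ["text", "vectorization", "methods", "text", "mining",
          "data", "mining", "text", "vectorization"]
for word, score in textrank_sketch(tokens)[:3]:
    print(word, round(score, 3))
```

Words connected to many other words accumulate a higher score, which is the same intuition as a web page that is linked from many other pages.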

To generate keywords using TextRank, first install the summa package and then import its keywords module.

pip install summa
from summa import keywords

After that, you simply call the keywords function and pass it the text to be processed. We’ll also set scores=True to print the relevance score of each resulting keyword.

TR_keywords = keywords.keywords(full_text, scores=True)
print(TR_keywords[0:10])

Textrank results. Source: the author

KeyBert

KeyBERT is a simple, easy-to-use keyword extraction algorithm that uses SBERT embeddings to find the keywords and keyphrases that are most similar to the document itself. First, a document embedding (a vector representation) is generated with a Sentence-BERT model. Next, embeddings are computed for candidate words and N-gram phrases. The similarity of each candidate to the document is then measured using cosine similarity. The most similar candidates are taken as the ones that best describe the entire document and are returned as keywords.
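The ranking step can be illustrated without downloading any model. In the sketch below, the embed function is a deliberately crude stand-in that returns bag-of-words counts instead of Sentence-BERT embeddings, and the names embed, cosine, and rank_candidates are my own assumptions rather than KeyBERT's API. Only the idea carries over: embed the document and each candidate, then sort the candidates by cosine similarity to the document.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in embedding: a bag-of-words count vector (NOT a Sentence-BERT embedding)."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def rank_candidates(document, candidates, top_n=3):
    """Return the top_n candidates most similar to the document."""
    doc_vec = embed(document)
    scored = [(c, cosine(embed(c), doc_vec)) for c in candidates]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_n]

doc = "text mining needs text vectorization and text vectorization needs word embedding"
candidates = ["text vectorization", "word embedding", "image processing"]
for phrase, similarity in rank_candidates(doc, candidates):
    print(phrase, round(similarity, 3))
```

With real sentence embeddings, a candidate can score highly even when it shares no exact words with the document, which a count-based stand-in like this one cannot capture.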

To generate keywords using KeyBERT, first install the keybert package and then import the KeyBERT class.

pip install keybert
from keybert import KeyBERT

Then you create an instance of KeyBERT, which accepts one parameter: the Sentence-BERT model. You can choose any embedding model you want from the pretrained Sentence-Transformers models. According to the author, the all-mpnet-base-v2 model is the best.

kw_model = KeyBERT(model='all-mpnet-base-v2')

The model will then start downloading:

Downloading the BERT pre-trained model. Source: the author (Ali Mansour)

The extract_keywords function accepts several parameters, the most important of which are: the text; keyphrase_ngram_range=(n, m), the minimum and maximum number of words in a keyphrase; top_n, the number of keywords to be retrieved; and finally highlight: if highlight=True, it will print the text and highlight the keywords in yellow.

keywords = kw_model.extract_keywords(full_text,
                                     keyphrase_ngram_range=(1, 3),
                                     stop_words='english',
                                     highlight=False,
                                     top_n=10)
keywords_list = list(dict(keywords).keys())
print(keywords_list)

Keybert results with n-gram range (1,3). Source: the author.

You can change keyphrase_ngram_range to (1, 2), considering that most keyphrases are one or two words long. This time we will set highlight=True.

Keybert results with n-gram range (1,2). Source: the author.

It’s so amazing.

Conclusion

We have presented four state-of-the-art techniques for extracting keywords/keyphrases, with a code implementation for each of them. Each of the four methods has its own advantages, and each succeeded in extracting keywords that are either identical to the keywords specified by the author or close to them and related to the field. The main advantage of all the mentioned methods is that they do not require training on external resources.

This work is related to my scientific activity while working on my Ph.D. I hope that the information provided will be of benefit to all. In the future, we will present an innovative new method for automating keyword extraction, and its performance will be compared with the mentioned baselines and many others.

You can check the code on my repository at GitHub. I would be grateful for any feedback.

About me: My name is Ali Mahmoud Mansour. I am from Syria, and currently (in 2022) I am a graduate student (Ph.D. researcher) in the field of computer science. I am passionate about text mining and data science.

References

Yake: Campos, Ricardo, et al. “YAKE! Keyword extraction from single documents using multiple local features.” Information Sciences 509 (2020): 257-289.
Rake: Rose, Stuart, et al. “Automatic keyword extraction from individual documents.” Text mining: applications and theory 1 (2010): 1-20.
TextRank: Mihalcea, Rada, and Paul Tarau. “Textrank: Bringing order into text.” Proceedings of the 2004 conference on empirical methods in natural language processing. 2004.
KeyBERT: Grootendorst, M. “KeyBERT: Minimal keyword extraction with BERT.” 2020.

The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
