Rapid Keyword Extraction (RAKE) Algorithm in Natural Language Processing
1. Rapid Automatic Keyword Extraction (RAKE) is a domain-independent keyword extraction algorithm in Natural Language Processing.
2. It is an individual, document-oriented, dynamic information retrieval method.
3. The concept of RAKE is built on three metrics: word frequency (freq(w)), word degree (deg(w)), and the ratio of degree to frequency (deg(w)/freq(w)).
In the field of Machine Learning, thanks to the "No Free Lunch" (NFL) theorem, we have multiple algorithm options for solving a problem. Is that a boon? Unfortunately, it is not: you cannot go for the entire buffet menu. This is precisely what happened to me while I was working on an NLP project. Due to time constraints, I had to find a ready-made algorithm to extract text features from vast unstructured text data; there was no time for me to build an algorithm myself. So I did my research, and I was utterly confused by the number of options on my table. Then I stumbled upon an algorithm developed by Stuart Rose et al., Rapid Automatic Keyword Extraction (RAKE), and I immediately stopped searching. My idea here is to give you a brief intuition of the algorithm. Folks interested in NLP-based unsupervised learning may find this algorithm very useful.
Feature extraction from Text
While analyzing unstructured text, such as social media posts and e-commerce feedback, the challenge most of us face is how to filter it. I am not talking about data cleaning here. Thanks to libraries like TextBlob in Python, the cleaning part is easier to handle, because there is always some structure in unstructured data for domain-specific content. On an e-commerce website, for instance, product feedback generally follows the same writing patterns: spelling mistakes, a mixture of languages, Unicode characters, etc. The problem arises when you are looking for specific features, such as the most talked-about ones. Take a mobile phone as an example. What are its most talked-about features? 'Camera,' 'Screen,' 'Performance,' 'Battery life,' etc. Say you rank them from most to least talked about. Do you think the ranking of these features will always stay the same? Or, if you have a predefined static feature list, do you think it will serve the purpose? The answer is no. To rescue us from this, there are some ready-made techniques and algorithms.
Word frequency, word collocations and co-occurrences, term frequency-inverse document frequency (TF-IDF), linguistic approaches, and graph-based approaches are all unsupervised feature extraction techniques. They have diverse applicability based on your needs and requirements. But for myself, at that time, the helpful algorithm I discovered was RAKE.
Let me discuss the Rapid Automatic Keyword Extraction (RAKE) algorithm. First, I will give you the intuition behind the algorithm, and then the Python code perspective.
One of the critical points made by the creators of RAKE is that keywords frequently contain multiple words but rarely contain punctuation, stop words, or other words with minimal lexical meaning. Here the inventors mainly spoke about collocation and co-occurrence of words. While analyzing mobile phone feedback data from an e-commerce website, you will see bi-grams like 'Good Camera' and 'Customer Service.' In the feedback domain of a particular product, these words frequently occur together: that is collocation. Now consider the bi-grams 'Bad Camera' and 'Worst Camera.' Words like 'Bad' and 'Worst' have semantic proximity, which means they are similar. These bi-grams ('Bad Camera' and 'Worst Camera') have a higher probability of co-occurring in domains where a camera module is involved, such as mobile phones, DSLR cameras, etc.
Once we have the text corpus, RAKE splits the text into a list of words and removes the stop words from that list. The returned list is known as the content words. Folks familiar with Natural Language Processing will know the term stop words: words like 'are,' 'not,' 'there,' and 'is' do not add any meaning to a sentence, and ignoring them makes our main corpus lean and clean.
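As a minimal sketch of this step (the stop-word list here is abbreviated for illustration; a real run would use a full list such as NLTK's stopwords corpus):

```python
import re

# Illustrative stop-word list (abbreviated); real code would use a full list
STOP_WORDS = {"is", "not", "that", "there", "are", "many", "can",
              "you", "with", "one", "of", "those"}

text = "Feature extraction is not that complex."
words = re.findall(r"[a-z]+", text.lower())          # split into a word list
content = [w for w in words if w not in STOP_WORDS]  # drop the stop words
print(content)  # ['feature', 'extraction', 'complex']
```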
Let’s take a live example of a sentence :
“Feature extraction is not that complex. There are many algorithms available that can help you with feature extraction. Rapid Automatic Keyword Extraction is one of those”.
Initial word list (consider converting the entire corpus into lowercase; you can use TextBlob):
· Corpus=[ feature, extraction, is, not, that, complex, there, are, many, algorithms, available, that, can, help, you, with, feature, extraction, rapid, automatic, keyword, extraction, is, one, of, those]
Let’s highlight the stop words:
“Feature extraction is not that complex. There are many algorithms available that can help you with feature extraction. Rapid Automatic Keyword Extraction is one of those”
** Note: I have considered ‘many’ as a stop word intentionally. You can ignore this while practicing.
· Stopword=[ is, not, that, there, are, many, that, can, you, with, is, one, of, those]
Content_Word= Corpus – Stopwords – Delimiter
· Content_Word=[ feature, extraction, complex, algorithms, available, help, feature, extraction, rapid, automatic, keyword, extraction]
Now that we have the content words, RAKE also splits the text into candidate key phrases: the sequences of content words that sit between stop words and delimiters. Below is an example, with the candidate phrases highlighted.
“[feature extraction] is not that [complex]. There are many [algorithms available] that can [help] you with [feature extraction]. [rapid automatic keyword extraction] is one of those”.
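The candidate phrase extraction above can be sketched as follows; the stop-word list is abbreviated for the example (and includes 'many,' as noted earlier):

```python
import re

STOP_WORDS = {"is", "not", "that", "there", "are", "many", "can",
              "you", "with", "one", "of", "those"}

text = ("Feature extraction is not that complex. There are many algorithms "
        "available that can help you with feature extraction. "
        "Rapid Automatic Keyword Extraction is one of those")

# Split at sentence delimiters, then break each sentence at stop words;
# each run of content words becomes one candidate key phrase
phrases = []
for sentence in re.split(r"[.!?]", text.lower()):
    current = []
    for word in re.findall(r"[a-z]+", sentence):
        if word in STOP_WORDS:
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(word)
    if current:
        phrases.append(" ".join(current))

print(phrases)
# ['feature extraction', 'complex', 'algorithms available', 'help',
#  'feature extraction', 'rapid automatic keyword extraction']
```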
Let’s create a word degree matrix like the one below, where each row displays the number of times a given content word co-occurs with other content words within the candidate key phrases.
We have to give each word a score: calculate the degree of a word, deg(w), as the sum of its co-occurrences in the matrix, then divide it by its occurrence frequency, freq(w). Occurrence frequency means how many times the word occurs in the primary corpus. Check below.
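Here is a sketch of that computation on the candidate phrases of our example; deg(w) counts, for every phrase a word appears in, all the words of that phrase (itself included):

```python
from collections import defaultdict

# Candidate phrases from the example sentence ('many' treated as a stop word)
phrases = [["feature", "extraction"], ["complex"], ["algorithms", "available"],
           ["help"], ["feature", "extraction"],
           ["rapid", "automatic", "keyword", "extraction"]]

freq = defaultdict(int)  # freq(w): occurrences in the corpus
deg = defaultdict(int)   # deg(w): co-occurrences with content words (itself included)
for phrase in phrases:
    for word in phrase:
        freq[word] += 1
        deg[word] += len(phrase)

score = {w: deg[w] / freq[w] for w in freq}
print(score["extraction"])  # 8 / 3 ≈ 2.67 (deg = 2 + 2 + 4, freq = 3)
```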
Now take the candidate key phrases and compute a combined (summed) score of the member words for each candidate key phrase. It will look like the one below.
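Sketching this phrase-scoring step with the word scores from the degree/frequency computation above:

```python
# Word scores deg(w)/freq(w) from the degree/frequency step
word_score = {"feature": 2.0, "extraction": 8 / 3, "complex": 1.0,
              "algorithms": 2.0, "available": 2.0, "help": 1.0,
              "rapid": 4.0, "automatic": 4.0, "keyword": 4.0}

# Distinct candidate key phrases
phrases = [["feature", "extraction"], ["complex"], ["algorithms", "available"],
           ["help"], ["rapid", "automatic", "keyword", "extraction"]]

# Score of a candidate phrase = sum of its member word scores
phrase_score = {" ".join(p): sum(word_score[w] for w in p) for p in phrases}
best = max(phrase_score, key=phrase_score.get)
print(best)  # rapid automatic keyword extraction (score 4 + 4 + 4 + 8/3 ≈ 14.67)
```

Notice how the long multi-word phrase wins: RAKE favors keywords made of several co-occurring content words.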
Suppose two keywords or key phrases adjoin one another, in the same order, at least twice in the text. Then a new key phrase is created, regardless of how many stop words the pair contains in the original text. The score of this new key phrase is computed just like that of a single key phrase: the sum of its member word scores.
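A tiny sketch of scoring such a merged keyword, using the "axis of evil" example from the original paper with hypothetical word scores; the interior stop word contributes nothing:

```python
# Hypothetical word scores for the content words; 'of' is a stop word
word_score = {"axis": 2.0, "evil": 1.5}
STOP_WORDS = {"of"}

# 'axis of evil' becomes a keyword because 'axis' and 'evil' adjoin
# at least twice, in the same order, in the text
merged = "axis of evil"
score = sum(word_score.get(w, 0.0)
            for w in merged.split() if w not in STOP_WORDS)
print(score)  # 3.5
```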
The first time I used RAKE, it was an incredible feeling. The logic behind the algorithm is simple, but the results were mind-blowing. For the past two years, I have always tried to use RAKE when dealing with feature-extraction-related problems in NLP under tight deadlines. One request: do go through the original research paper; I have given the link below. I have also added some code snippets below.
Coding via Python
# TO INSTALL:
# pip install rake-nltk

# COMMON USE
from rake_nltk import Rake

r = Rake()
text = ("Feature extraction is not that complex. There are many algorithms "
        "available that can help you with feature extraction. "
        "Rapid Automatic Key Word Extraction is one of those")
r.extract_keywords_from_text(text)
r.get_ranked_phrases()
# output:
# ['rapid automatic key word extraction', 'many algorithms available',
#  'feature extraction', 'one', 'help', 'complex']

# SPECIAL CASES
r.get_ranked_phrases_with_scores()
# output:
# [(23.5, 'rapid automatic key word extraction'), (9.0, 'many algorithms available'),
#  (5.5, 'feature extraction'), (1.0, 'one'), (1.0, 'help'), (1.0, 'complex')]

# To control the max or min number of words in a phrase
r = Rake(min_length=2, max_length=4)

# To control whether or not to include repeated phrases in the text.
# In our example, 'feature extraction' occurs twice; we can choose to select it once.

# To include all phrases, even the repeated ones
r = Rake()  # equivalent to Rake(include_repeated_phrases=True)

# To include each phrase only once and ignore the repetitions
r = Rake(include_repeated_phrases=False)