NLP: Building a Stemmer for Punjabi in Python!

Simran Kaur 20 May, 2021 • 4 min read

This article was published as a part of the Data Science Blogathon

Introduction

My professor asked me to build a stemmer for the Punjabi language in Python. When I started looking online, I found a few papers on NLP for Punjabi, but I could not find a proper dataset for it.

IIT Bombay has created a database called WordNet that contains data for many Indian languages, but it is exposed through a web interface backed by SQL rather than Python. Since my professor asked for a stemmer written in Python, I couldn't use that web interface.

So, I wrote my own code in Python for this. I hope you like it.

First, let me introduce you to stemming and the algorithm used in this code.

For grammatical reasons, documents are going to use different forms of a word, such as organize, organizes, and organizing. Additionally, there are families of related words with similar meanings, such as democracy, democratic, and democratization.

In many situations, it seems as if it would be useful for a search for one of these words to return documents that contain another word in the set.

What is stemming?

Stemming is the process of removing the affixes from words to get the root form of the word, without doing complete morphological analysis. The objective of stemming is to reduce similar words to the same stem. For instance,

am, are, is –> be
car, cars, car’s, cars’ –> car

The result of this mapping of text will be something like:

the boy’s cars are different colours –> the boy car be differ color

There is also a related process called lemmatization. Both stemming and lemmatization have the same goal, i.e., to reduce inflectional forms and related forms of a word to a common base form.

But stemming differs from lemmatization in the approach it uses to reach this goal: stemming chops off affixes using simple rules, while lemmatization relies on a vocabulary and morphological analysis to return the dictionary form of a word.

For now, let us focus on stemming. To learn more about stemming, check this link out!

To create a stemmer, I have used the suffix stripping algorithm.

Suffix stripping algorithm

As the name suggests, in this algorithm we strip the suffix from the word to get the root word. This algorithm doesn’t rely on a lookup table consisting of root words and inflected words. Instead, we follow a certain set of rules to remove these suffixes. These suffixes can be simple or compound. To know more about Punjabi grammar check out this link!
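To make the idea concrete before moving to Punjabi, here is a minimal sketch of rule-based suffix stripping on the English examples from the introduction. The rule list and the `min_stem_len` parameter are assumptions made up for this illustration; this is a toy, not a real English stemmer.

```python
# Toy rule list (an assumption for illustration only).
# Longer suffixes come first so that "izing" is tried before "ing".
SUFFIX_RULES = ["izing", "izes", "ize", "ing", "es", "s"]


def strip_suffix(word, min_stem_len=3):
    """Remove the first matching suffix, if enough of the word remains."""
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    # No rule applied: return the word unchanged.
    return word


for w in ["organize", "organizes", "organizing", "cars"]:
    print(w, "->", strip_suffix(w))
```

Notice that no lookup table of root words is involved: the rules alone decide what gets removed, which is why very short words are guarded by a minimum stem length.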

Let's start with the Python code. You can get this code here on GitHub.

I would suggest you open this GitHub repository alongside the article so you can follow what is actually happening in the code. So, let's start.

First, I have created a Punjabi class. Inside this class, I have created a few functions.

In the __init__() method (the constructor, one of Python's special methods), I have created a suffix dictionary. This dictionary stores the suffixes as key-value pairs.
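A suffix dictionary of this kind might look like the sketch below. The class name `SuffixTableSketch`, the three sample suffixes, and the choice of keying by suffix length (counted in Unicode code points) are all assumptions for illustration; the actual table is in the repository.

```python
class SuffixTableSketch:
    """Hypothetical stand-in for the suffix dictionary built in __init__()."""

    def __init__(self):
        # Illustrative Punjabi suffixes only, not the article's real table.
        example_suffixes = ["ਆਂ", "ਾਂ", "ੀਆਂ"]

        # Key = suffix length in Unicode code points,
        # value = list of suffixes of that length.
        self.suffixes = {}
        for suf in example_suffixes:
            self.suffixes.setdefault(len(suf), []).append(suf)


table = SuffixTableSketch()
for length in sorted(table.suffixes):
    print(length, table.suffixes[length])
```

Grouping by `len(suf)` instead of typing the keys by hand keeps each key in step with the suffixes stored under it, which matters because Gurmukhi vowel signs count as separate code points.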

Next, we define a parameterized function, rreplace(). This function replaces the suffix with '' and thus forms a new word.

The first parameter, 'string', is the text in which we perform the replacement. The second parameter, 'old', is the text we want to replace, i.e. the part of the string that gets replaced.

The third parameter, 'new', is the text that takes the place of the old text in the string. The fourth parameter, 'count' (initially set to None), is the number of occurrences to replace in the string. This function returns the final word after all the replacements.
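One common way to write such a right-hand replace uses Python's built-in `str.rsplit`. The sketch below follows the four parameters described above, but the body is an assumption and may differ from the code in the repository.

```python
def rreplace(string, old, new, count=None):
    """Replace the last `count` occurrences of `old` in `string` with `new`.

    With count=None, every occurrence is replaced.
    """
    # str.rsplit needs an integer; -1 means "no limit".
    maxsplit = -1 if count is None else count
    # Splitting from the right and re-joining with `new` swaps the
    # right-most occurrences first.
    return new.join(string.rsplit(old, maxsplit))


print(rreplace("banana", "an", "AN", 1))  # banANa
print(rreplace("cars", "s", "", 1))       # car
```

For stemming, calling it with count=1 and an empty `new` removes only the suffix at the end of the word and leaves any earlier occurrence of the same letters untouched.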

Then we define another parameterized function, gen_replacement(). This function returns the suffix from the suffixes dictionary we created initially. The entries under key = '1' and key = '5' contain letters preceded by a 'laggan' (called 'matra' in Hindi). So, this function removes them and returns the new suffix.

Finally, we define the function stemmer(), which takes a text as a parameter and stems the words in it. We create a list, tag, that contains the keys of the suffix dictionary. We then start a for loop over the words obtained by splitting the text.

For each word in the split text, we start another for loop over every L in tag. Inside it, we first check the flag value. If the flag is equal to 1 (flag == 1), we break out of the loop; otherwise, we check whether the word length is greater than L + 1.

Here, we are checking only those words whose length is greater than 2 (words shorter than that are basically stop words, so we don't actually need them).

Next, we start one more for loop. For each 'suf' in suffixes[L], we check whether our word ends with that particular suffix. If it does not, we continue with the loop; otherwise, we call the rreplace function, and inside that call we invoke the gen_replacement function. The new word is stored in the word1 variable, and word1 is stored in the dictionary dict_punj. We then set the flag variable to 1 and break out of the loop.

At the end, we check whether the flag is still zero. If so, no suffix matched, and we store the word unchanged in dict_punj. In either case, the function returns the dictionary dict_punj.

That's it. We are done with the functions and the class Punjabi. All we have to do now is create an object of the class Punjabi and use it to call the stemmer() function with a parameter containing text in the Punjabi language.
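Putting the walkthrough together, a simplified version of the class could look like the sketch below. It is a reading of the description above and not the repository's code: the class name `PunjabiStemmerSketch`, the three sample suffixes, the longest-suffix-first order, and the `gen_replacement` body (which always returns an empty string and skips the laggan handling) are all assumptions, and the flag variable is replaced by a `break`.

```python
class PunjabiStemmerSketch:
    """Simplified suffix-stripping stemmer (illustrative assumptions only)."""

    def __init__(self):
        # Hypothetical suffixes, grouped by length in Unicode code points.
        self.suffixes = {}
        for suf in ["ੀਆਂ", "ਆਂ", "ਾਂ"]:
            self.suffixes.setdefault(len(suf), []).append(suf)

    def rreplace(self, string, old, new, count=None):
        # Replace the right-most `count` occurrences of `old` with `new`.
        maxsplit = -1 if count is None else count
        return new.join(string.rsplit(old, maxsplit))

    def gen_replacement(self, suf, L):
        # Assumption: the suffix is simply dropped. The original also
        # adjusts suffixes that carry a laggan; that is not reproduced here.
        return ""

    def stemmer(self, text):
        # Try longer suffixes first (an assumption about the order of `tag`).
        tag = sorted(self.suffixes, reverse=True)
        dict_punj = {}
        for word in text.split():
            stem = word  # default when no suffix matches
            for L in tag:
                # Skip words too short to lose a suffix of length L.
                if len(word) <= L + 1:
                    continue
                match = next(
                    (s for s in self.suffixes[L] if word.endswith(s)), None
                )
                if match is not None:
                    stem = self.rreplace(
                        word, match, self.gen_replacement(match, L), 1
                    )
                    break  # plays the role of flag = 1
            dict_punj[word] = stem
        return dict_punj


stemmer = PunjabiStemmerSketch()
print(stemmer.stemmer("ਕਿਤਾਬਾਂ ਘਰ"))
```

The returned dictionary maps each input word to its stem, so words that match no suffix (or are too short) map to themselves.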

Conclusion

Although this algorithm is less efficient than other algorithms, it is much simpler than other methods. To increase the efficiency of your stemmer, combine this suffix stripping algorithm with other algorithms. To read about it, check this paper out.

Check out this repository to learn more about NLP in Punjabi.

Thanks for reading!

The media shown in this article are not owned by Analytics Vidhya and are used at the Author's discretion.