Classifying Sanskrit Shlokas Using an LSTM-based Model

Suvrat Arora 24 Feb, 2023 • 8 min read

This article was published as a part of the Data Science Blogathon.

Introduction

Most of us are familiar with the wealth of knowledge that ancient Indian scriptures hold. Though these scriptures are written in many languages, most of them are in Sanskrit. So, why not employ the power of Natural Language Processing and experiment a little with Sanskrit text?

Sanskrit is one of the most ancient languages in the world and is often described as one of the least ambiguous. It is one of the few languages that distinguish three grammatical genders (masculine, feminine, and neuter) and three grammatical numbers (singular, dual, and plural). Given this regularity, a few studies argue that Sanskrit is one of the best-suited languages for Natural Language Processing. Though Sanskrit is not widely spoken today, its text is available in abundance in Hindu scriptures and ancient Indian literature.

In this article, we will try our hand at NLP in Sanskrit by classifying Sanskrit Shlokas (verses).

Categorization of Sanskrit Shlokas

First, let us understand what Sanskrit Shlokas are and on what grounds we will classify them. A quick Google search defines the term ‘Shloka’ as follows:

Shloka: a couplet of Sanskrit verse, especially one in which each line contains sixteen syllables.

These couplets, written in Sanskrit, usually embody religious praises or knowledge of the ways of life.

The following is a typical example of a Shloka along with its English translation:

For our classification task, we will be classifying the Shlokas into the following three classes:

Chanakya Shlokas: These Shlokas are the ones obtained from The Chanakya Niti Sastra, which is an anthology of Shlokas compiled from various Hindu sastras attributed to the Indian philosopher Chanakya.
Vidur Niti Shlokas: These Shlokas belong to Vidura Niti, an ethical philosophy narrated as a conversation between Vidura and King Dhritarashtra in the Mahabharata (a Hindu epic). It is a rich discourse on polity and religiousness.
Sanskrit Slogans: These Shlokas are not attributed to any particular source. This can be treated as the ‘others’ category.

Dataset

We will be using the iNLTK Sanskrit Shlokas Dataset, which comprises about 500 Shlokas labeled as Chanakya Slokas, Vidur Niti Slokas, and sanskrit-slogan. The Shlokas have already been cleaned and split into training and test sets, provided as CSV files.

The dataset can be obtained from Kaggle. Please note that the dataset is released under a Creative Commons licence.

Building the Shloka Classifier

Let us tackle this stepwise. Firstly, as the classic data science advice says, we should get to know our data better and then build our model accordingly. We’ll proceed in three broad steps:

Exploratory Data Analysis
Data Pre-Processing
Model Building and Evaluation

Exploratory Data Analysis

Step-1: Import all the requisite Python libraries

#Import Necessary Libraries
import pandas as pd
import numpy as np
from wordcloud import WordCloud
import matplotlib.pyplot as plt

Here, we use pandas to load the CSV dataset, NumPy for numerical operations, WordCloud to build a visual of our text, and Matplotlib to plot graphs as required.

Step-2: Load the Dataset

#Loading the dataset
data = pd.read_csv('../input/sanskrit-shlokas-dataset/train.csv')

The resulting DataFrame has a 'Sloka' column holding the verse text and a 'Class' column holding its label.

Step-3: Create text Vocabulary-Frequency distribution

Now we need to create a Vocabulary-Frequency distribution for our text. Vocabulary refers to the set of unique words in a text. So we need to store each unique word and its frequency of occurrence in our dataset.

For this, we have first stored all the Shlokas in the training dataset into a single string named ‘text’. Then we stored each unique word as a key in a dictionary named ‘vocab’ and its frequency as the value.

This distribution will help us identify the stopwords in our text. An excerpt from the distribution is shown below:

'पञ्च': 2,
'यत्र': 10,
'न': 122,
'विद्यन्ते': 2,
'कुर्यात्तत्र': 1,
'संगतिम्': 1,
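
The code for this step is not shown above, so the following is only a minimal sketch of the procedure just described, assuming words are separated by whitespace. The helper name build_vocab and the two sample verses are hypothetical; in the notebook you would pass data['Sloka'].values instead.

```python
from collections import Counter

def build_vocab(shlokas):
    """Join all shlokas into one string and count each whitespace-separated word."""
    text = " ".join(shlokas)
    return dict(Counter(text.split()))

# Hypothetical toy input; in the notebook, use data['Sloka'].values instead.
sample = ["यत्र न विद्यन्ते", "न यत्र न"]
vocab = build_vocab(sample)

# Print words from most to least frequent; very frequent words are stopword candidates.
for word, count in sorted(vocab.items(), key=lambda kv: kv[1], reverse=True):
    print(word, count)
```

Sorting by frequency puts words such as 'न' at the top, which is where stopword candidates usually appear.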

Step-6: Plot the Output class label frequencies

Now, we’ll generate a bar chart for the frequencies of each output class label.

#Plot class label frequencies
class_counts = data['Class'].value_counts()
names = ['Chanakya Slokas', 'Vidur Niti Slokas', 'sanskrit-slogan']
values = [class_counts[name] for name in names]

plt.bar(range(len(values)), values, tick_label=names)
plt.show()

The bar chart turns out as follows:

We can see that the number of training examples in each class is nearly equal, i.e., our dataset is balanced.
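
If you prefer a numeric check over reading the chart, the share of each class can be computed directly. This is a small sketch with a hypothetical helper name (class_shares) and toy labels; in the notebook you would pass data['Class'] instead.

```python
from collections import Counter

def class_shares(labels):
    """Return each label's fraction of the total number of examples."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Hypothetical toy labels; in the notebook, pass data['Class'] instead.
labels = ["Chanakya Slokas", "Vidur Niti Slokas", "sanskrit-slogan",
          "Chanakya Slokas", "Vidur Niti Slokas", "sanskrit-slogan"]
shares = class_shares(labels)
for label, share in shares.items():
    print(f"{label}: {share:.2%}")
```

Shares that are all close to one third would confirm what the bar chart suggests.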

We now have a reasonable feel for our data, so let’s move on to pre-processing it.

Data Pre-Processing

Since we’ll be building an LSTM-based deep learning classifier, we first need to convert our training text into sequences of integers, which the model can later map to embeddings. For this, we’ll use the Keras Tokenizer. First, we fit the tokenizer on the entire training text so that it builds its vocabulary. Then we convert each Shloka to a sequence of word indices using the texts_to_sequences() method. Finally, we pad all the generated sequences to make them equal in length.

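from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

#Keep only the most frequent words (indices below num_words) and split on spaces
tokenizer = Tokenizer(num_words=500, split=' ')
#Build the word index from the training Shlokas
tokenizer.fit_on_texts(data['Sloka'].values)
#Replace each word with its integer index
X = tokenizer.texts_to_sequences(data['Sloka'].values)
#Pad with zeros so that all sequences have the same length
X = pad_sequences(X)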

The padded sequences look like this:

array([[  0,   0,   0, ...,   6, 320,   1],
       [  0,   0,   0, ..., 326, 327,   1],
       [  0,   0,   0, ..., 334,  19, 335],
       ...,
       [  0,   0,   0, ..., 239,  76,  42],
       [  0,   0,   0, ...,   4, 100,  38],
       [  0,   0,   0, ...,   0,   0,   1]], dtype=int32)
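
To make the output above easier to read, here is a simplified pure-Python sketch of the three steps (build a word index, map words to indices, left-pad with zeros). It is an illustration of the idea only, not the Keras implementation: the function names and the toy verses are hypothetical, and it skips details such as punctuation filtering.

```python
from collections import Counter

def fit_word_index(texts):
    """Assign each word an integer rank, 1 = most frequent (0 is reserved for padding)."""
    counts = Counter(word for text in texts for word in text.split())
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index, num_words=None):
    """Replace words by their index, dropping unknown words and indices >= num_words."""
    sequences = []
    for text in texts:
        ids = [word_index[w] for w in text.split() if w in word_index]
        if num_words is not None:
            ids = [i for i in ids if i < num_words]
        sequences.append(ids)
    return sequences

def pad_left(sequences):
    """Left-pad every sequence with zeros up to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [[0] * (max_len - len(seq)) + seq for seq in sequences]

# Hypothetical toy verses, used only to show the mechanics.
texts = ["यत्र न विद्यन्ते", "न यत्र न", "न"]
word_index = fit_word_index(texts)
padded = pad_left(to_sequences(texts, word_index))
print(word_index)
print(padded)
```

The leading zeros in the real output arise the same way: shorter Shlokas are padded on the left, and the most frequent word receives index 1.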

After generating the padded sequences, our text is ready to be fed to a model. But we must note that our output classes are categorical in nature, so they must be one-hot encoded. You can use sklearn’s OneHotEncoder for this. However, here I’ve used pandas’ get_dummies function.

#One Hot Encoding
Y = pd.get_dummies(data['Class'])
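
For intuition, one-hot encoding turns each label into a row with a single 1 in the column of its class. The sketch below shows this in plain Python, together with the reverse mapping you would need to turn a model's class scores back into a label. The helper names one_hot and decode are hypothetical, and ordering the columns by sorted class name is an assumption of this sketch.

```python
def one_hot(labels):
    """Return (classes, rows): sorted class names and one 0/1 row per label."""
    classes = sorted(set(labels))
    column = {name: i for i, name in enumerate(classes)}
    rows = []
    for label in labels:
        row = [0] * len(classes)
        row[column[label]] = 1
        rows.append(row)
    return classes, rows

def decode(row, classes):
    """Map a one-hot row (or a row of class scores) back to its class name."""
    return classes[row.index(max(row))]

# Hypothetical toy labels; in the notebook, the labels come from data['Class'].
labels = ["sanskrit-slogan", "Chanakya Slokas", "Vidur Niti Slokas"]
classes, Y_rows = one_hot(labels)
print(classes)
print(Y_rows)
print(decode([0.1, 0.7, 0.2], classes))
```

Keeping the class order around matters later, because a classifier's output vector is only meaningful relative to the column order used during encoding.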
