Beginner’s Guide To Text Classification Using PyCaret
Have you ever solved a Machine Learning problem in just one go?
Solving a problem using machine learning is rarely straightforward. It takes several steps to arrive at an accurate solution, and the sequence of steps followed to solve an ML problem is known as the ML pipeline or ML cycle.
As shown in the figure, the Machine Learning pipeline consists of different steps like:
Understand Problem Statement, Hypothesis Generation, Exploratory Data Analysis, Data Preprocessing, Feature Engineering, Feature Selection, Model Building, Model Tuning, and Model Deployment.
I would recommend going through the articles below for a detailed understanding of the Machine Learning pipeline:
Solving a machine learning problem takes a lot of time and human effort. Hip Hip Hooray! It no longer has to be a tedious and time-consuming process, thanks to AutoML, which provides quick solutions to ML problems.
AutoML is all about automatically building a high-performance model with the least human intervention.
AutoML libraries offer low-code and no-code programming.
You’ve probably heard of the terms “low-code” and “no-code.”
- No-code frameworks are simple UIs that enable even non-technical users to build models without writing a single line of code.
- Low-code frameworks require only minimal coding.
Though no-code platforms make it simple to train a Machine Learning model using a drag-and-drop interface, they are limited in terms of flexibility. Low-code ML, on the other hand, is the sweet spot: a middle ground that offers both flexibility and easy-to-use code.
In this article, let us understand how to build a text classification model within a few lines of code using a low-code AutoML library, PyCaret.
Table of Contents
- What is PyCaret?
- Why do we need PyCaret?
- Different Approaches to solving text classification in PyCaret
- Topic Modelling
- Count Vectorizer
- Case Study – Text Classification using PyCaret
What is PyCaret?
PyCaret is an open-source, low-code machine learning library in Python that allows you to go from preparing your data to deploying your model within a few minutes.
PyCaret is essentially a low-code library that replaces hundreds of lines of scikit-learn code with 5-6 lines. It increases the team’s productivity and lets the team focus on understanding the problem and on feature engineering rather than on model optimization.
PyCaret is built on top of scikit-learn, so the machine learning algorithms available in scikit-learn are available in PyCaret as well. As of now, PyCaret can solve problems related to Classification, Regression, Clustering, Anomaly Detection, Text Classification, Association Rule Mining, and Time Series.
Now, let us discuss the reasons behind using PyCaret.
Why do we need PyCaret?
PyCaret automatically builds a benchmark model for a given dataset within 5-6 lines of code. Let’s see how PyCaret simplifies each step in the machine learning pipeline.
- Data Preparation: PyCaret does the data cleaning and data preprocessing with the least manual intervention.
- Feature Engineering: PyCaret creates the mathematical features automatically and selects the most important features required for the model.
- Model Building: It greatly simplifies the modeling portion of your project. We can build different models and select the top-performing models with one single line of code.
- Model Tuning: PyCaret fine-tunes the model without you having to pass the hyperparameters explicitly for each model.
Next, we will focus on solving a text classification problem in PyCaret.
Different Approaches to solving text classification in PyCaret
Let’s solve a text classification problem in PyCaret using 2 different techniques:
- Topic Modeling
- Count Vectorizer
I will touch upon each approach in detail.
Topic Modeling
Topic Modeling, as the name conveys, is a technique to identify different topics present in the text data.
Topics are defined as recurring groups of statistically significant tokens (or words) in a corpus. Here, statistical significance refers to how important a word is to a document. Generally, words with higher TF-IDF scores, that is, words that occur often in a document but rarely across the rest of the corpus, are considered statistically significant.
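To make the TF-IDF idea concrete, here is a minimal sketch in plain Python. It uses the textbook formula tf × log(N / df); the toy corpus and the function name tf_idf are my own illustration, and libraries such as scikit-learn use smoothed variants of this formula, so the exact numbers will differ.

```python
import math
from collections import Counter


def tf_idf(corpus):
    """Return one {word: score} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in corpus]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


# Toy corpus, invented for illustration only.
corpus = [
    "the lion sleeps",
    "the dog eats biscuits",
    "the zoo has a lion",
]
scores = tf_idf(corpus)

# "the" appears in every document, so it scores 0 everywhere, while
# "sleeps" appears in a single document and scores highest there.
print(max(scores[0], key=scores[0].get))  # sleeps
```

A word that shows up everywhere carries no signal, which is why raw frequency alone is not a good measure of importance.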
Topic modeling is an unsupervised technique that automatically finds the hidden topics in text data. It can also be seen as a text mining approach for finding recurring patterns in text documents.
Some common use-cases of topic modeling are as follows:
- Solving text classification/regression problems
- Creating relevant tags for documents
- Generating insights from customer feedback forms, customer reviews, survey results, etc.
Example of Topic Modeling
Let’s say you work for a legal firm and you’re working with a company where some money has been embezzled, and you know there’s some key information lying in the emails that have been sent around the company.
- So, you go through the emails and there are hundreds of thousands of emails. Now, what you need to do is, you need to figure out which ones are related to money versus other topics.
- You can either hand label them based on what you read in the text, which would take a long time, or you can use the technique called topic modeling to find out what these labels are and automatically label all these emails.
As explained earlier, the objective of topic modeling is to extract different topics from the raw text. But, what’s the underlying algorithm to achieve it?
This brings us to the different algorithms/techniques for topic modeling: Latent Dirichlet Allocation (LDA), Non-Negative Matrix Factorization (NMF), and Latent Semantic Analysis (LSA).
I would recommend going through the following resources to read about the algorithms in detail:
- Part 2: Topic Modeling and Latent Dirichlet Allocation (LDA) using Gensim and Sklearn
- Beginners Guide to Topic Modeling in Python
- Topic Modelling With LDA -A Hands-on Introduction
Topic modeling is a 2-step process:
- Topic to Term Distribution: Find the most important topics in the corpus, each described by the terms that characterize it.
- Document to Topic Distribution: Assign a score for each topic to each document.
Having understood topic modeling, let us see how to solve text classification with it, using an example.
Consider a corpus:
- Document 1: I want to have fruits for my breakfast.
- Document 2: I like to eat almonds, eggs, and fruits.
- Document 3: I will take fruits and biscuits with me while going to the Zoo.
- Document 4: The zookeeper feeds the lion very carefully.
- Document 5: One should give good quality biscuits to their dogs.
A topic modeling algorithm such as LDA identifies the most important topics in the documents.
- Topic 1: 30% fruits, 15% eggs, 10% biscuits, … (food)
- Topic 2: 20% lion, 10% dogs, 5% zoo, … (animals)
Next, it assigns a score for each topic to every document, which gives a document-topic matrix.
This matrix serves as the feature set for the machine learning algorithm. Next, we’ll look at the bag of words.
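The two steps can be sketched with scikit-learn on the five example documents above. This is a rough illustration rather than what PyCaret runs internally: the choice of scikit-learn, n_components=2, the English stop-word list, and random_state=0 are all my assumptions. On such a tiny corpus the topics will not come out as clean as the food/animals split shown earlier.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "I want to have fruits for my breakfast.",
    "I like to eat almonds, eggs, and fruits.",
    "I will take fruits and biscuits with me while going to the Zoo.",
    "The zookeeper feeds the lion very carefully.",
    "One should give good quality biscuits to their dogs.",
]

# Word counts are the input to LDA; common English stop words are dropped.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(corpus)
# Terms ordered by their column index in the count matrix.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Two topics are assumed here, mirroring the food/animals example.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(counts)  # step 2: one row of topic scores per document
topic_term = lda.components_           # step 1: one row of term weights per topic

for topic_idx, weights in enumerate(topic_term):
    top_terms = [terms[i] for i in weights.argsort()[::-1][:3]]
    print(f"Topic {topic_idx + 1}: {top_terms}")

# Each row sums to 1 and can be used directly as features for a classifier.
print(doc_topic.round(2))
```

The doc_topic array has one row per document and one column per topic, which is exactly the shape a downstream classifier expects.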
Bag Of Words
Bag Of Words (BOW) is another popular technique for representing text as numbers. It relies on word frequency: every document is represented by the counts of the words it contains, so a word’s frequency stands in for its importance in the document. BOW has numerous applications, such as document classification, topic modeling, and text similarity.
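Here is a minimal sketch of the idea in plain Python, with two made-up sentences and simple whitespace tokenization (both are assumptions for illustration; real vectorizers handle punctuation and casing more carefully).

```python
from collections import Counter

# Two toy documents, invented for illustration.
docs = [
    "I like fruits and I like eggs",
    "The lion eats fruits",
]

# Lowercase and split on whitespace.
tokenized = [doc.lower().split() for doc in docs]

# The vocabulary is every distinct word in the corpus, in a fixed order.
vocabulary = sorted({word for tokens in tokenized for word in tokens})

# Each document becomes a vector of word counts over that vocabulary.
bow = [[Counter(tokens)[word] for word in vocabulary] for tokens in tokenized]

print(vocabulary)  # ['and', 'eats', 'eggs', 'fruits', 'i', 'like', 'lion', 'the']
print(bow[0])      # [1, 0, 1, 1, 2, 2, 0, 0]
print(bow[1])      # [0, 1, 0, 1, 0, 0, 1, 1]
```

Note that word order is thrown away: only the counts survive, which is where the "bag" in the name comes from.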
Refer to the article below for a detailed understanding of Bag Of Words:
In the next section, we will solve the text classification problem in PyCaret.
Case Study – Text Classification using PyCaret
Let us understand the problem statement prior to solving it.
Understanding Problem Statement
Steam is a video game digital distribution service with a vast community of gamers globally. A lot of gamers write reviews on the game page and have the option of choosing whether they would recommend this game to others or not. However, determining this sentiment automatically from the text can help Steam to automatically tag such reviews extracted from other forums across the internet and can help them better judge the popularity of games.
Given review text labeled with the user’s recommendation, the task is to predict, from the review text and other information, whether the reviewer recommended each game title in the test set.
In simpler terms, the task at hand is to identify whether a given user review is good or bad. You can download the dataset from here.
For classifying the Steam game reviews using PyCaret, I’ve discussed 2 different approaches in the article.
- The first approach uses topic modeling using PyCaret.
- The second approach extracts Bag Of Words features and uses them for classification with PyCaret.
We will implement the BOW approach now.
Note: The tutorial is implemented on Google Colab. I would recommend running the code on the same.
You can install PyCaret just like any other Python library.
- Installing PyCaret on Google Colab or Azure Notebooks
As PyCaret doesn’t provide a count vectorizer, import the CountVectorizer class from sklearn.feature_extraction.text.
Then, I initialize a CountVectorizer object named ‘tf_vectorizer’.
What exactly does the fit_transform function do to your data?
- “Fit” learns the vocabulary, i.e. the set of features, from the dataset.
- “Transform” converts each document in the dataset into a vector of counts over that vocabulary.
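The difference between the two steps is easiest to see when they are called separately. The sketch below uses three toy reviews of my own (not the Steam data): the vocabulary is learned from the training reviews only, so a word that appears only in new text is simply ignored at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy reviews, invented for illustration.
train_reviews = ["good game", "bad game"]
new_reviews = ["good good story"]

vectorizer = CountVectorizer()

# "Fit": learn the vocabulary from the training text only.
vectorizer.fit(train_reviews)
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)  # ['bad', 'game', 'good']

# "Transform": count the known words; "story" was never seen, so it is dropped.
new_counts = vectorizer.transform(new_reviews).toarray()
print(new_counts)  # [[0 0 2]]
```

Calling fit_transform on a dataset does both steps in one go on the same data.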
Let’s convert the output of fit_transform to a data frame.
Now, concatenate the features and the target column-wise.
Next, we will split the dataset into train and test data.
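The feature extraction steps described above might look roughly like this. Treat it as a sketch under assumptions: the eight reviews are invented stand-ins for the Steam data, the column names user_review and user_suggestion are assumed, and the default CountVectorizer settings, the 75/25 split, and random_state=42 are my choices rather than the original configuration.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the Steam reviews dataset.
reviews = pd.DataFrame({
    "user_review": [
        "Great game, totally worth it",
        "Terrible controls and boring story",
        "Loved the graphics and the gameplay",
        "Crashes all the time, do not buy",
        "Fun with friends, highly recommended",
        "Waste of money",
        "Amazing soundtrack and great levels",
        "Boring and full of bugs",
    ],
    "user_suggestion": [1, 0, 1, 0, 1, 0, 1, 0],
})

# Learn the vocabulary and count the words in one step.
tf_vectorizer = CountVectorizer()
counts = tf_vectorizer.fit_transform(reviews["user_review"])

# Convert the sparse count matrix to a data frame, one column per word.
feature_names = sorted(tf_vectorizer.vocabulary_, key=tf_vectorizer.vocabulary_.get)
features = pd.DataFrame(counts.toarray(), columns=feature_names)

# Concatenate the features and the target column-wise.
target = reviews["user_suggestion"].reset_index(drop=True)
data = pd.concat([features, target], axis=1)

# Hold out a test set, keeping the class balance the same in both parts.
train, test = train_test_split(
    data, test_size=0.25, random_state=42, stratify=data["user_suggestion"]
)
print(train.shape, test.shape)
```

On the real dataset the count matrix can get very wide, so capping the vocabulary (for example with CountVectorizer's max_features parameter) is worth considering before calling toarray().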
Now that feature extraction is done, let’s use these features to build different models. The next step is to set up the environment in PyCaret.
Setting up the environment
- The setup function initializes the training environment and builds the transformation pipeline. It must be called before any other PyCaret function.
- The only mandatory parameters are data and target.
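A hedged sketch of how setup fits together with model creation and tuning is shown below. The wrapper function build_tuned_lightgbm, the target column name, and the seed are hypothetical. The PyCaret calls themselves (setup, create_model, tune_model) come from pycaret.classification, but their behavior and optional arguments vary between releases, and some older releases may pause to ask you to confirm the inferred column types.

```python
def build_tuned_lightgbm(train_df, target="user_suggestion", seed=42):
    """Set up PyCaret on train_df, then train and tune a LightGBM classifier.

    The function name and default arguments are illustrative assumptions.
    """
    # Imported inside the function so the sketch can be defined even
    # when PyCaret is not installed in the current environment.
    from pycaret.classification import create_model, setup, tune_model

    # data and target are the two mandatory arguments; session_id fixes the seed.
    setup(data=train_df, target=target, session_id=seed)

    # Train a base LightGBM model with cross-validation.
    lightgbm = create_model("lightgbm")

    # Search over hyperparameters without passing them by hand.
    tuned_lightgbm = tune_model(lightgbm)
    return lightgbm, tuned_lightgbm
```

Both create_model and tune_model print a table of cross-validated metrics, which makes it easy to compare the base and tuned models side by side.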
From the above output, we can observe that the metrics of the tuned model are better than the base model metrics.
Evaluate and Predict Model
Here, I’ve predicted the flag values for our processed dataset using the tuned model, ‘tuned_lightgbm’.
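The prediction step can be sketched as follows. The helper name predict_flags is hypothetical; predict_model is PyCaret's function for scoring a data frame, and as far as I know the name of the prediction column it appends differs between PyCaret versions, so inspect the returned data frame rather than relying on a fixed column name.

```python
def predict_flags(model, test_df):
    """Return test_df with the model's predictions appended as extra columns.

    The function name is an illustrative assumption.
    """
    # Imported inside the function so the sketch can be defined even
    # when PyCaret is not installed in the current environment.
    from pycaret.classification import predict_model

    # The same transformations learned during setup are applied to test_df.
    return predict_model(model, data=test_df)
```

You would call it as predict_flags(tuned_lightgbm, test), passing the held-out test data frame.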
PyCaret, which trains machine learning models in a low-code environment, piqued my interest. From your preferred notebook environment, PyCaret helps you to go from preparing data to deploying models in seconds. Before using PyCaret, I tried other traditional methods to solve the JanataHack NLP hackathon problem, but the results weren’t very satisfactory!
In this exercise, PyCaret proved to be fast and efficient compared with the other open-source machine learning libraries I tried, and it has the advantage of replacing several lines of code with just a few.
If you set aside the first part of my approach, where I apply the count vectorizer to the dataset, and look only at setting up and creating models with PyCaret, you will notice that transformations such as one-hot encoding and imputing missing values happen automatically behind the scenes, and you get back a data frame with predictions, just like the one we got!
I hope I’ve made clear my overall approach for the hackathon.