SMOTE for Imbalanced Classification with Python

SWASTIK Last Updated : 16 Oct, 2024

8 min read

Introduction

Imbalanced datasets pose a common challenge for machine learning practitioners in binary classification problems. This scenario frequently arises in practical business applications like fraud detection, spam filtering, rare disease discovery, and hardware fault detection. To address this issue, one popular technique is Synthetic Minority Oversampling Technique (SMOTE). SMOTE is specifically designed to tackle imbalanced datasets by generating synthetic samples for the minority class. This article explores the significance of SMOTE in dealing with class imbalance, focusing on its application in improving the performance of classifier models. By mitigating bias and capturing important features of the minority class, SMOTE contributes to more accurate predictions and better model performance. In this article, you will get a proper understanding about smote , smote machine learning, how smote algorithm works, SMOTE technique, smote sampling.

In this article, you will learn about SMOTE in machine learning and why it’s important. We’ll talk about how to use SMOTE in Python and what the term SMOTE really means. You’ll also find out how the SMOTE technique helps by creating more examples of the smaller class, which is great for improving your models. By the end, you’ll feel ready to use SMOTE in your own projects!

This article was published as a part of the Data Science Blogathon.

What is the function of Smote?
The Accuracy Paradox
Dealing with Imbalanced Data
Performance Analysis after Resampling
Conclusion

What is the function of Smote?

The Function of Smote here we have given in following steps:

SMOTE stands for Synthetic Minority Oversampling Technique. It’s a technique used in machine learning to address imbalanced datasets.

Identify the Imbalance: You start by recognizing that your data has a minority class, like rare disease cases in a medical dataset.
Focus on the minority: SMOTE specifically creates new data points for the minority class, not the majority.
Create synthetic samples: It analyzes existing minority data points and generates new ones similar to them.
Increase minority: By adding these synthetic samples, SMOTE balances the data, giving the model a better chance to learn the minority class.

The Accuracy Paradox

Suppose, you’re working on a health insurance based fraud detection problem. In such problems, we generally observe that in every 100 insurance claims 99 of them are non-fraudulent and 1 is fraudulent. So a binary classifier model need not be a complex model to predict all outcomes as 0 meaning non-fraudulent and achieve a great accuracy of 99%. Clearly, in such cases where class distribution is skewed, the accuracy metric is biased and not preferable.

Dealing with Imbalanced Data

Resampling data is one of the most commonly preferred approaches to deal with an imbalanced dataset. There are broadly two types of methods for this i) Undersampling ii) Oversampling. In most cases, oversampling is preferred over undersampling techniques. The reason being, in undersampling we tend to remove instances from data that may be carrying some important information. In this article, I am specifically covering some special data augmentation oversampling techniques: SMOTE and its related counterparts.

SMOTE: Synthetic Minority Oversampling Technique

SMOTE is an oversampling technique where the synthetic samples are generated for the minority class. This algorithm helps to overcome the overfitting problem posed by random oversampling. It focuses on the feature space to generate new instances with the help of interpolation between the positive instances that lie together.

Working Procedure

At first the total no. of oversampling observations, N is set up. Generally, it is selected such that the binary class distribution is 1:1. But that could be tuned down based on need. Then the iteration starts by first selecting a positive class instance at random. Next, the KNN’s (by default 5) for that instance is obtained. At last, N of these K instances is chosen to interpolate new synthetic instances. To do that, using any distance metric the difference in distance between the feature vector and its neighbors is calculated. Now, this difference is multiplied by any random value in (0,1] and is added to the previous feature vector. This is pictorially represented below:

Python Code for SMOTE Algorithm

from imblearn.over_sampling import SMOTE
from collections import Counter

counter = Counter(y_train)
print('Before', counter)

# oversampling the train dataset using SMOTE
smt = SMOTE()
X_train_sm, y_train_sm = smt.fit_resample(X_train, y_train)

counter = Counter(y_train_sm)
print('After', counter)

Though this algorithm is quite useful, it has few drawbacks associated with it.

The synthetic instances generated are in the same direction i.e. connected by an artificial line its diagonal instances. This in turn complicates the decision surface generated by few classifier algorithms.
SMOTE tends to create a large no. of noisy data points in feature space.

ADASYN: Adaptive Synthetic Sampling Approach

ADASYN is a generalized form of the SMOTE algorithm. This algorithm also aims to oversample the minority class by generating synthetic instances for it. But the difference here is it considers the density distribution, r_iwhich decides the no. of synthetic instances generated for samples which difficult to learn. Due to this, it helps in adaptively changing the decision boundaries based on the samples difficult to learn. This is the major difference compared to SMOTE.

Working Procedure

From the dataset, the total no. of majority N^– and minority N⁺ are captured respectively. Then we preset the threshold value, d^thfor the maximum degree of class imbalance. Total no. of synthetic samples to be generated, G = (N^–– N⁺) x β. Here, β = (N⁺/ N^–).
For every minority sample x_i, KNN’s are obtained using Euclidean distance, and ratio r_iis calculated as Δi/k and further normalized as r_x<= r_i/ ∑ rᵢ.
Thereafter, the total synthetic samples for each x_i will be, g_i= r_x x G. Now we iterate from 1 to g_i to generate samples the same way as we did in SMOTE.

The below-given diagram represents the above procedure:

Python Code for ADASYN Algorithm

from imblearn.over_sampling import ADASYN
from collections import Counter

# Counting the number of instances in each class before oversampling
counter = Counter(y_train)
print('Before', counter)

# Oversampling the train dataset using ADASYN
ada = ADASYN(random_state=130)
X_train_ada, y_train_ada = ada.fit_resample(X_train, y_train)

# Counting the number of instances in each class after oversampling
counter = Counter(y_train_ada)
print('After', counter)

Hybridization: SMOTE + Tomek Links

Hybridization techniques involve combining both undersampling and oversampling techniques. This is done to optimize the performance of classifier models for the samples created as part of these techniques.

SMOTE+TOMEK is such a hybrid technique that aims to clean overlapping data points for each of the classes distributed in sample space. After the oversampling is done by SMOTE, the class clusters may be invading each other’s space. As a result, the classifier model will be overfitting. Now, Tomek links are the opposite class paired samples that are the closest neighbors to each other. Therefore the majority of class observations from these links are removed as it is believed to increase the class separation near the decision boundaries. Now, to get better class clusters, Tomek links are applied to oversampled minority class samples done by SMOTE. Thus instead of removing the observations only from the majority class, we generally remove both the class observations from the Tomek links.

Python Code for the SMOTE + Tomek Algorithm

from imblearn.combine import SMOTETomek
from collections import Counter

# Counting the number of instances in each class before oversampling
counter = Counter(y_train)
print('Before', counter)

# Oversampling the train dataset using SMOTE + Tomek
smtom = SMOTETomek(random_state=139)
X_train_smtom, y_train_smtom = smtom.fit_resample(X_train, y_train)

# Counting the number of instances in each class after oversampling
counter = Counter(y_train_smtom)
print('After', counter)

Hybridization: SMOTE + ENN

SMOTE + ENN is another hybrid technique where more no. of observations are removed from the sample space. Here, ENN is yet another undersampling technique where the nearest neighbors of each of the majority class is estimated. If the nearest neighbors misclassify that particular instance of the majority class, then that instance gets deleted.

Integrating this technique with oversampled data done by SMOTE helps in doing extensive data cleaning. Here on misclassification by NN’s samples from both the classes are removed. This results in a more clear and concise class separation.

Python Code for SMOTE + ENN Algorithm

from imblearn.combine import SMOTETomek
from collections import Counter

# Counting the number of instances in each class before oversampling
counter = Counter(y_train)
print('Before', counter)

# Oversampling the train dataset using SMOTE + Tomek
smtom = SMOTETomek(random_state=139)
X_train_smtom, y_train_smtom = smtom.fit_resample(X_train, y_train)

# Counting the number of instances in each class after oversampling
counter = Counter(y_train_smtom)
print('After', counter)

The below-given picture shows how different SMOTE based resampling techniques work out to deal with imbalanced data.

Performance Analysis after Resampling

To understand the effect of oversampling, I will be using a bank customer churn dataset. It is an imbalanced data where the target variable, churn has 81.5% customers not churning and 18.5% customers who have churned.

A comparative analysis was done on the dataset using 3 classifier models: Logistic Regression, Decision Tree, and Random Forest. As discussed earlier, we’ll ignore the accuracy metric to evaluate the performance of the classifier on this imbalanced dataset. Here, we are more interested to know that which are the customers who’ll churn out in the coming months. Thereby, we’ll focus on metrics like precision, recall, F1-score to understand the performance of the classifiers for correctly determining which customers will churn.

Note: The SMOTE and its related techniques are only applied to the training dataset so that we fit our algorithm properly on the data. The test data remains unchanged so that it correctly represents the original data.

From the above, it can be seen on the actual imbalanced dataset, all 3 classifier models were not able to generalize well on the minority class compared to the majority class. As a result, most of the negative class samples were correctly classified. Due to this, there was less FP compared to more FN. After oversampling, a clear surge in Recall is seen on the test data. To understand this better, a comparative barplot is shown below for all 3 models:

There is a decrease in precision, but by achieving a much recall which satisfies the objective of any binary classification problem. Also, the AUC-ROC and F1-score for each model remain more or less the same.

Conclusion

The issue of class imbalance is just not limited to binary classification problems, multi-class classification problems equally suffer with it. Therefore, it is important to apply resampling techniques to such data so as the models perform to their best and give most of the accurate predictions.

You can check the entire implementation in my GitHub repository and try to apply them at your end. Do explore other techniques that help in handling an imbalanced dataset.

Hope you like the article and get a clear understanding for the smote and smote machine learning , how smote algorithm works and also you will get to know smote in python. So in this artic you will clear your doubts.

Q1. What is SMOTE?

A. SMOTE is an oversampling technique that generates synthetic samples from the minority class. It obtains a synthetically class-balanced or nearly class-balanced training set, then trains the classifier.

Q2. What is smote used for?

A. Smote is used for synthetic minority oversampling in machine learning. It generates synthetic samples to balance imbalanced datasets, specifically targeting the minority class.

Q3. When should you use smote?

A. Smote should be used when dealing with imbalanced datasets to improve the performance of machine learning models on minority class predictions.

Q4. How to use smote in a sentence?

A. The warrior smote the beast with a mighty blow.

Q5. What is an example of smote sampling?

Imagine you have a dataset with few spam emails (minority class). SMOTE creates new synthetic spam emails based on existing ones, balancing the dataset for better spam detection.

References:

Learning from Imbalanced Data Sets by Alberto Fernández, Salvador García, Mikel Galar, Ronaldo C. Prati, Bartosz Krawczyk, Francisco Herrera

SWASTIK

Classification Intermediate Machine Learning Python Structured Data

Free Courses

4.7

Generative AI - A Way of Life

Explore Generative AI for beginners: create text and images, use top AI tools, learn practical skills, and ethics.

4.5

Getting Started with Large Language Models

Master Large Language Models (LLMs) with this course, offering clear guidance in NLP and model training made simple.

4.6

Building LLM Applications using Prompt Engineering

This free course guides you on building LLM apps, mastering prompt engineering, and developing chatbots with enterprise data.

4.8

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Explore practical solutions, advanced retrieval strategies, and agentic RAG systems to improve context, relevance, and accuracy in AI-driven applications.

4.7

Microsoft Excel: Formulas & Functions

Master MS Excel for data analysis with key formulas, functions, and LookUp tools in this comprehensive course.

MUID

Used by Microsoft Clarity, to store and track visits across websites.

Expiry: 1 Year

Type: HTTP

_clck

Used by Microsoft Clarity, Persists the Clarity User ID and preferences, unique to that site, on the browser. This ensures that behavior in subsequent visits to the same site will be attributed to the same user ID.

Expiry: 1 Year

Type: HTTP

_clsk

Used by Microsoft Clarity, Connects multiple page views by a user into a single Clarity session recording.

Expiry: 1 Day

Type: HTTP

SRM_I

Collects user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Years

Type: HTTP

SM

Use to measure the use of the website for internal analytics

Expiry: 1 Years

Type: HTTP

CLID

The cookie is set by embedded Microsoft Clarity scripts. The purpose of this cookie is for heatmap and session recording.

Expiry: 1 Year

Type: HTTP

SRM_B

Collected user data is specifically adapted to the user or device. The user can also be followed outside of the loaded website, creating a picture of the visitor's behavior.

Expiry: 2 Months

Type: HTTP

_gid

This cookie is installed by Google Analytics. The cookie is used to store information of how visitors use a website and helps in creating an analytics report of how the website is doing. The data collected includes the number of visitors, the source where they have come from, and the pages visited in an anonymous form.

Expiry: 399 Days

Type: HTTP

_ga_#

Used by Google Analytics, to store and count pageviews.

Expiry: 399 Days

Type: HTTP

_gat_#

Used by Google Analytics to collect data on the number of times a user has visited the website as well as dates for the first and most recent visit.

Expiry: 1 Day

Type: HTTP

collect

Used to send data to Google Analytics about the visitor's device and behavior. Tracks the visitor across devices and marketing channels.

Expiry: Session

Type: PIXEL

AEC

cookies ensure that requests within a browsing session are made by the user, and not by other sites.

Expiry: 6 Months

Type: HTTP

G_ENABLED_IDPS

use the cookie when customers want to make a referral from their gmail contacts; it helps auth the gmail account.

Expiry: 2 Years

Type: HTTP

test_cookie

This cookie is set by DoubleClick (which is owned by Google) to determine if the website visitor's browser supports cookies.

Expiry: 1 Year

Type: HTTP

_we_us

this is used to send push notification using webengage.

Expiry: 1 Year

Type: HTTP

WebKlipperAuth

used by webenage to track auth of webenagage.

Expiry: Session

Type: HTTP

ln_or

Linkedin sets this cookie to registers statistical data on users' behavior on the website for internal analytics.

Expiry: 1 Day

Type: HTTP

JSESSIONID

Use to maintain an anonymous user session by the server.

Expiry: 1 Year

Type: HTTP

li_rm

Used as part of the LinkedIn Remember Me feature and is set when a user clicks Remember Me on the device to make it easier for him or her to sign in to that device.

Expiry: 1 Year

Type: HTTP

AnalyticsSyncHistory

Used to store information about the time a sync with the lms_analytics cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

lms_analytics

Used to store information about the time a sync with the AnalyticsSyncHistory cookie took place for users in the Designated Countries.

Expiry: 6 Months

Type: HTTP

liap

Cookie used for Sign-in with Linkedin and/or to allow for the Linkedin follow feature.

Expiry: 6 Months

Type: HTTP

visit

allow for the Linkedin follow feature.

Expiry: 1 Year

Type: HTTP

li_at

often used to identify you, including your name, interests, and previous activity.

Expiry: 2 Months

Type: HTTP

s_plt

Tracks the time that the previous page took to load

Expiry: Session

Type: HTTP

lang

Used to remember a user's language setting to ensure LinkedIn.com displays in the language selected by the user in their settings

Expiry: Session

Type: HTTP

s_tp

Tracks percent of page viewed

Expiry: Session

Type: HTTP

AMCV_14215E3D5995C57C0A495C55%40AdobeOrg

Indicates the start of a session for Adobe Experience Cloud

Expiry: Session

Type: HTTP

s_pltp

Provides page name value (URL) for use by Adobe Analytics

Expiry: Session

Type: HTTP

s_tslv

Used to retain and fetch time since last visit in Adobe Analytics

Expiry: 6 Months

Type: HTTP

li_theme

Remembers a user's display preference/theme setting

Expiry: 6 Months

Type: HTTP

li_theme_set

Remembers which users have updated their display / theme preferences

Expiry: 6 Months

Type: HTTP

Reading list

Basics of Machine Learning

Machine Learning Lifecycle

Importance of Stats and EDA

Understanding Data

Probability

Exploring Continuous Variable

Exploring Categorical Variables

Missing Values and Outliers

Central Limit theorem

Bivariate Analysis Introduction

Continuous - Continuous Variables

Continuous Categorical

Categorical Categorical

Multivariate Analysis

Different tasks in Machine Learning

Build Your First Predictive Model

Evaluation Metrics

Preprocessing Data

Linear Models

KNN

Selecting the Right Model

Feature Selection Techniques

Decision Tree

Feature Engineering

Naive Bayes

Multiclass and Multilabel

Basics of Ensemble Techniques

Advance Ensemble Techniques

Hyperparameter Tuning

Support Vector Machine

Advance Dimensionality Reduction

Unsupervised Machine Learning Methods

Recommendation Engines

Improving ML models

Working with Large Datasets

Interpretability of Machine Learning Models

Interpretability of Machine Learning Models

Automated Machine Learning

Model Deployment

Deploying ML Models

Embedded Devices

SMOTE for Imbalanced Classification with Python

Introduction

Table of contents

What is the function of Smote?

The Accuracy Paradox

Dealing with Imbalanced Data

SMOTE: Synthetic Minority Oversampling Technique

Working Procedure

Python Code for SMOTE Algorithm

ADASYN: Adaptive Synthetic Sampling Approach

Working Procedure

Python Code for ADASYN Algorithm

Hybridization: SMOTE + Tomek Links

Python Code for the SMOTE + Tomek Algorithm

Hybridization: SMOTE + ENN

Python Code for SMOTE + ENN Algorithm

Performance Analysis after Resampling

Conclusion

Free Courses

Generative AI - A Way of Life

Getting Started with Large Language Models

Building LLM Applications using Prompt Engineering

Improving Real World RAG Systems: Key Challenges & Practical Solutions

Microsoft Excel: Formulas & Functions

Recommended Articles

Responses From Readers

Write for us

Analytics Vidhya (4)

brahmaid

csrftoken

Identityid

sessionid

Google (1)

g_state

Microsoft (7)

MUID

_clck

_clsk