5 Techniques to Handle Imbalanced Data For a Classification Problem

Saikat Mazumder 27 May, 2024
Introduction

Classification problems are common in machine learning. In a classification problem, we try to predict a class label from the input data (the predictors), where the target or output variable is categorical. If you have already dealt with classification problems, you have probably encountered cases where one of the target classes has far fewer observations than the others. Such a dataset is called an imbalanced class dataset, and it is common in practical classification scenarios. The usual approach to solving this kind of machine-learning problem often yields misleading results. In this article, we will discuss what an imbalanced dataset is, the problems it causes for prediction, and how to deal with such data more effectively than the traditional approach.

This article was published as a part of the Data Science Blogathon.

What is Imbalanced Data, and How to Handle it?

Imbalanced data refers to datasets where the target class has an uneven distribution of observations, i.e., one class label has a very high number of observations, and the other has a very low number of observations.

We can better understand imbalanced dataset handling by using an example.

Let’s assume that XYZ is a bank that issues credit cards to its customers. The bank is concerned that some fraudulent transactions are going on, and when it checks its data, it finds that for every 2,000 transactions, only 30 are recorded as fraud. So, fraud accounts for 1.5% of transactions (less than 2%), and more than 98% of transactions are “No Fraud.” Here, the class “No Fraud” is called the majority class, and the much smaller “Fraud” class is called the minority class.

More such examples of imbalanced dataset are:

Class imbalance is generally normal in classification problems. But in some cases, this imbalance is quite acute, where the majority class’s presence is much higher than that of the minority class.

Problems with Handling Imbalanced Data Classification

Put simply, the main problem with imbalanced dataset prediction is how accurately we predict both the majority and the minority class. Let’s start with an example of disease diagnosis. We want to predict disease from an existing dataset where, for every 100 records, only five patients are diagnosed with the disease. So, the majority class is the 95% with no disease, and the minority class is the 5% with the disease. Now, assume our model predicts that all 100 out of 100 patients have no disease.

Sometimes, when the records of one class far outnumber those of another class, our classifier may become biased toward the majority class. The confusion matrix shows how well our model classifies the target classes, and we derive the model’s accuracy from it. Accuracy is the number of correct predictions divided by the total number of predictions. In the above case, with 0 true positives, 95 true negatives, 0 false positives, and 5 false negatives, it is (0+95)/(0+95+0+5)=0.95 or 95%. This means that the model fails to identify a single minority-class case, yet its accuracy score is 95%.

Thus, our traditional approach to classifying and calculating model accuracy is ineffective in the case of an imbalanced dataset.
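The arithmetic of the disease example can be checked with a few lines of Python. This is a minimal sketch: the label arrays are built by hand to match the example (95 patients without the disease, 5 with it) and do not come from a real dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# 100 patients: 95 without the disease (0) and 5 with it (1)
y_true = np.array([0] * 95 + [1] * 5)
# A "model" that predicts "no disease" for every patient
y_pred = np.zeros(100, dtype=int)

# Rows of the confusion matrix are true classes, columns are predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
minority_recall = recall_score(y_true, y_pred)  # share of sick patients found

print("TN, FP, FN, TP:", tn, fp, fn, tp)
print("Accuracy:", accuracy)
print("Recall on the disease class:", minority_recall)
```

The accuracy comes out at 0.95 while the recall on the disease class is 0, which is exactly the gap that accuracy alone hides.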

Why is Imbalanced Data a Problem?

An imbalanced dataset is a problem because it can lead to biased models and inaccurate predictions. Here’s why:

  1. Skewed Class Distribution: An imbalanced dataset occurs when one class (the minority class) is significantly underrepresented compared to another class (the majority class) in a classification problem. This can skew the model’s learning process because it may prioritize the majority class, leading to poor performance on the minority class.
  2. Biased Model Training: Machine learning models aim to minimize errors, often measured by metrics like accuracy. In imbalanced datasets, a model can achieve high accuracy by simply predicting the majority class for all instances, ignoring the minority class completely. As a result, the model is biased towards the majority class and fails to capture patterns in the minority class accurately.
  3. Poor Generalization: Imbalanced data can result in models that generalize poorly to new, unseen data, especially for the minority class. Since the model hasn’t learned enough about the minority class due to its scarcity in the training data, it may struggle to make accurate predictions for instances belonging to that class in real-world scenarios.
  4. Costly Errors: In many real-world applications, misclassifying instances from the minority class can be more costly or have higher consequences than misclassifying instances from the majority class. Imbalanced data exacerbates this issue because the model tends to make more errors on the minority class, potentially leading to significant negative impacts.
  5. Evaluation Metrics Misleading: Traditional evaluation metrics like accuracy can be misleading in imbalanced datasets. For instance, a model achieving high accuracy may perform poorly on the minority class, which is often the class of interest. Using metrics like precision, recall, F1-score, or area under the ROC curve (AUC-ROC) can provide a more nuanced understanding of the model’s performance across different classes.

Techniques to Handle Imbalanced Data Set Problem

In rare-event problems like fraud detection or disease prediction, it is vital to identify the minority class correctly. So, the model should not be biased toward detecting only the majority class but should give equal weight or importance to the minority class, too. Here, I discuss some techniques to handle the imbalanced dataset problem. There is no right or wrong method; different techniques work well with different problems.

1. Choose Proper Evaluation Metric

The first technique to handle imbalanced data is choosing a proper evaluation metric. The accuracy of a classifier is the total number of correct predictions divided by the total number of predictions. This may be good enough for a well-balanced dataset but is not ideal for an imbalanced class problem. Other metrics are more informative: precision measures how accurate the classifier’s predictions of a specific class are, and recall measures the classifier’s ability to identify all instances of a class.

For an imbalanced class dataset, the F1 score is a more appropriate metric. It is the harmonic mean of precision and recall, and the expression is:

$$F_1 = 2 \times \frac{\text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

So, if the classifier predicts the minority class but the predictions are wrong, false positives increase, so precision will be low, and so will the F1 score. Also, if the classifier identifies the minority class poorly, i.e., more of this class is wrongly predicted as the majority class, then false negatives will increase, so recall and the F1 score will be low. The F1 score only increases if both the number of minority-class instances found and the quality of those predictions improve.

F1 score keeps the balance between precision and recall and improves the score only if the classifier identifies more of a certain class correctly.
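To make the difference between accuracy and the F1 score concrete, here is a small sketch with scikit-learn. The labels are invented for illustration: a classifier that finds 3 of the 5 positive cases and raises 4 false alarms among 95 negatives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# True labels: 95 negatives followed by 5 positives
y_true = np.array([0] * 95 + [1] * 5)
# Predictions: 91 true negatives, 4 false positives, 3 true positives, 2 false negatives
y_pred = np.array([0] * 91 + [1] * 4 + [1] * 3 + [0] * 2)

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 7
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 5
f1 = f1_score(y_true, y_pred)                # harmonic mean of the two

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 score:", f1)
```

Accuracy is 0.94 here, but the F1 score is only 0.5, because it reflects both the false alarms and the missed positive cases.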

2. Resampling (Oversampling and Undersampling)

The second technique to handle imbalanced data is resampling: upsampling the minority class or downsampling the majority class. With an imbalanced dataset, we can oversample the minority class by sampling its rows with replacement. This technique is called oversampling. Similarly, we can randomly delete rows from the majority class until it matches the minority class, which is called undersampling. After resampling, we get a balanced dataset for both the majority and minority classes. So, when both classes have a similar number of records in the dataset, we can expect the classifier to give equal importance to both classes.

The example below illustrates this technique with the resample function from sklearn.utils. Here, Is_Lead is our target variable, and a count of its values shows that the classes are imbalanced. So, we’ll upsample the data so that the minority class matches the majority class.

import pandas as pd
from sklearn.utils import resample

# create two different dataframes for the majority and minority class
df_majority = df_train[(df_train['Is_Lead']==0)]
df_minority = df_train[(df_train['Is_Lead']==1)]
# upsample minority class
df_minority_upsampled = resample(df_minority,
                                 replace=True,    # sample with replacement
                                 n_samples=len(df_majority), # to match majority class (131177 rows here)
                                 random_state=42)  # reproducible results
# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_minority_upsampled, df_majority])

After upsampling, both classes have the same number of rows, so the class distribution is balanced.

The resample function from sklearn.utils can be used both to undersample the majority class and to oversample the minority class.
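For completeness, undersampling with the same function might look like the sketch below. The DataFrame is toy data made up for illustration (90 rows of class 0 and 10 rows of class 1); only the Is_Lead column name is borrowed from the example above.

```python
import pandas as pd
from sklearn.utils import resample

# Toy data, made up for illustration: 90 majority rows and 10 minority rows
df = pd.DataFrame({
    "feature": range(100),
    "Is_Lead": [0] * 90 + [1] * 10,
})

df_majority = df[df["Is_Lead"] == 0]
df_minority = df[df["Is_Lead"] == 1]

# Randomly keep only as many majority rows as there are minority rows
df_majority_downsampled = resample(
    df_majority,
    replace=False,               # sample without replacement
    n_samples=len(df_minority),  # match the minority class size
    random_state=42,             # reproducible results
)

df_downsampled = pd.concat([df_majority_downsampled, df_minority])
class_counts = df_downsampled["Is_Lead"].value_counts()
print(class_counts)
```

The trade-off is that undersampling throws away majority-class rows, so it suits cases where the majority class is large enough to spare them.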

3. SMOTE

The third technique to handle imbalanced data is the Synthetic Minority Oversampling Technique, or SMOTE, which is another way to oversample the minority class. Simply adding duplicate records of the minority class often doesn’t add any new information to the model. In SMOTE, new instances are synthesized from the existing data. In simple words, SMOTE takes a minority class instance, finds its k nearest minority class neighbors, picks one of them at random, and creates a synthetic instance at a random point between the two in feature space.

A code sample is shown below:

import pandas as pd
from imblearn.over_sampling import SMOTE

# Resampling the minority class. The strategy can be changed as required.
sm = SMOTE(sampling_strategy='minority', random_state=42)
# Fit the sampler and generate the synthetic data.
# fit_resample replaces fit_sample, which was removed in newer imbalanced-learn releases.
oversampled_X, oversampled_Y = sm.fit_resample(df_train.drop('Is_Lead', axis=1), df_train['Is_Lead'])
oversampled = pd.concat([pd.DataFrame(oversampled_Y), pd.DataFrame(oversampled_X)], axis=1)

After this step, the minority class has as many rows as the majority class, so the classes are balanced.
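To show what "synthesizing" a point means, here is a simplified sketch of the interpolation idea behind SMOTE. It is not the imbalanced-learn implementation: the function name smote_like_samples and the toy data are assumptions made for illustration, and the sketch assumes the minority points contain no exact duplicates.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_like_samples(X_minority, n_new, k=5, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    # Ask for k + 1 neighbors because each point is returned as its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, neighbor_idx = nn.kneighbors(X_minority)

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))           # pick a random minority point
        neighbor = rng.choice(neighbor_idx[base, 1:])  # pick one of its k neighbors
        gap = rng.random()                             # random position along the segment
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbor] - X_minority[base])
    return synthetic

# Toy minority class, made up for illustration: 20 points in two dimensions
X_min = np.random.default_rng(0).normal(size=(20, 2))
X_new = smote_like_samples(X_min, n_new=30)
print(X_new.shape)
```

Each synthetic point lies on the line segment between a minority point and one of its neighbors, so the new rows stay inside the region already covered by the minority class instead of being exact copies.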

4. BalancedBaggingClassifier

When we use a usual classifier on an imbalanced dataset, the model favors the majority class because of its larger presence. A BalancedBaggingClassifier is the same as sklearn’s BaggingClassifier but with additional balancing. It includes an additional step to balance the training set at fit time using a given sampler. This classifier takes two special parameters, “sampling_strategy” and “replacement”. The sampling_strategy decides the type of resampling required (e.g., ‘majority’ – resample only the majority class, ‘all’ – resample all classes, etc.), and replacement decides whether the sampling is done with replacement or not.

An illustrative example is given below:

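from imblearn.ensemble import BalancedBaggingClassifier
from sklearn.tree import DecisionTreeClassifier
# Create an instance.
# Newer imbalanced-learn releases name the first argument estimator; older ones used base_estimator.
# 'auto' undersamples every class except the minority; 'not majority' would leave the majority class untouched.
classifier = BalancedBaggingClassifier(estimator=DecisionTreeClassifier(),
                                sampling_strategy='auto',
                                replacement=False,
                                random_state=42)
classifier.fit(X_train, y_train)
preds = classifier.predict(X_test)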

5. Threshold Moving

Many classifiers actually predict the probability of class membership. We map those predicted probabilities to a class label using a threshold, which is usually 0.5: if the probability is below 0.5, the observation is assigned to one class, and otherwise it is assigned to the other class.

For imbalanced class problems, this default threshold may not work properly. We need to change the threshold to the optimum value so that it can efficiently separate two classes. Also, we can use ROC Curves and Precision-Recall Curves to find the optimal threshold for the classifier. We can also use a grid search method or search within a set of values to identify the optimal value.
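As a rough illustration of the curve-based approach, the sketch below picks the threshold that maximizes the F1 score along the precision-recall curve. The data comes from make_classification and the model is a logistic regression; both are assumptions chosen only to keep the example self-contained.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data, made up for illustration (about 10% positives)
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, test_size=0.3, random_state=42
)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = model.predict_proba(X_val)[:, 1]  # probability of the positive class

# precision and recall have one more entry than thresholds, so drop the last one
precision, recall, thresholds = precision_recall_curve(y_val, proba)
f1 = 2 * precision[:-1] * recall[:-1] / (precision[:-1] + recall[:-1] + 1e-12)

best = np.argmax(f1)
best_threshold = thresholds[best]
print("Best threshold:", best_threshold, "F1 at that threshold:", f1[best])
```

The threshold is chosen on a held-out validation split here rather than on the final test set, so that the reported test performance is not inflated by the tuning itself.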

Searching Optimal Value From a Grid

In this method, we first find the probabilities for the class label, and then we find the optimum threshold to map the probabilities to the proper class label. The predicted probabilities can be obtained from a classifier with the predict_proba() method from sklearn.

from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier()
rf_model.fit(X_train, y_train)
rf_model.predict_proba(X_test)  # probability of each class label
Output:

array([[0.97, 0.03],
       [0.94, 0.06],
       [0.78, 0.22],
       ...,
       [0.95, 0.05],
       [0.11, 0.89],
       [0.72, 0.28]])
After getting the probability we can check for the optimum value.

from sklearn.metrics import roc_auc_score

step_factor = 0.05
threshold_value = 0.2
roc_score = 0
thrsh_score = threshold_value  # best threshold found so far
predicted_proba = rf_model.predict_proba(X_test)  # probability of prediction
# continue to check the best threshold up to probability 0.8
# (the small tolerance guards against floating-point drift in the running sum)
while threshold_value <= 0.8 + 1e-9:
    # change the class boundary for prediction
    predicted = (predicted_proba[:, 1] >= threshold_value).astype('int')
    score = roc_auc_score(y_test, predicted)
    print('Threshold', round(threshold_value, 2), '--', score)
    if roc_score < score:  # store the threshold for best classification
        roc_score = score
        thrsh_score = threshold_value
    threshold_value = threshold_value + step_factor
print('---Optimum Threshold ---', round(thrsh_score, 2), '--ROC--', roc_score)

For the dataset used here, this search gives an optimal threshold of 0.3 instead of our default 0.5.

Conclusion

In conclusion, dealing with imbalanced datasets in classification problems poses significant challenges that traditional approaches often fail to address effectively. The skewed distribution of classes can lead to biased models, inaccurate predictions, and poor generalization to new data. Moreover, the misleading nature of traditional evaluation metrics like accuracy exacerbates these issues, making it crucial to adopt alternative metrics such as precision, recall, F1-score, or AUC-ROC.

To overcome these challenges, various techniques can be employed, including proper selection of evaluation metrics, resampling methods like oversampling and undersampling, utilizing algorithms designed for imbalance such as SMOTE, employing ensemble methods like BalancedBaggingClassifier, and adjusting threshold values for optimal classification. Each technique offers unique advantages and may be more suitable depending on the specific characteristics of the dataset and the problem at hand.

By understanding the complexities of imbalanced datasets and implementing appropriate strategies for handling them, machine learning practitioners can improve the performance and reliability of their models. This will ultimately lead to more accurate predictions and better decision-making in real-world applications.

For those looking to enhance their analytics skills and dive deeper into data science, consider enrolling in Analytics Vidhya’s Program, a comprehensive learning platform for aspiring data scientists.

Frequently Asked Questions

Q1. What are the 3 ways to handle an imbalanced data set?

A. Three ways to handle an imbalanced data set are: 

a) Resampling: Over-sampling the minority class, under-sampling the majority class, or generating synthetic samples. 
b) Using different evaluation metrics: F1-score, AUC-ROC, or precision-recall. 
c) Algorithm selection: Choose algorithms designed for imbalance, like SMOTE or ensemble methods.

Q2. Which algorithms handle imbalanced data?

A. Several algorithms are capable of handling imbalanced data effectively. Random Forest, for instance, can manage class imbalance through bagging and feature selection. SVM can be adjusted by assigning class weights to penalize errors in the minority class. SMOTE generates synthetic samples for the minority class, aiding in balancing the dataset and improving model performance.

Q3. What happens if a dataset is imbalanced?

A. When a dataset is imbalanced, several issues may arise. Models may exhibit bias toward the majority class, resulting in poor predictions for the minority class. Accuracy as an evaluation metric can be misleading, as it may appear high while the model’s performance on the minority class is lacking. In real-world applications, dealing with imbalanced data can pose significant challenges, potentially affecting decision-making, particularly in critical domains where accurate predictions are essential.

The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.

Responses From Readers

oh translate 26 Oct, 2023

This blog post provides an informative overview of imbalanced data and the techniques used to handle it, which can be useful for those working with large datasets that contain both common and rare events.