One Hot Encoding vs. Label Encoding in Machine Learning

Alakh Sethi 28 May, 2024 • 8 min read

Introduction

When working with categorical data in machine learning, it’s crucial to convert these variables into a numerical format that algorithms can understand. Two commonly used techniques for encoding categorical variables are one-hot encoding (OHE) and label encoding. Choosing the appropriate encoding method can significantly impact the performance of a machine learning model. In this article, we will explore the differences between one-hot encoding and label encoding, their use cases, and how to implement them using the Pandas and Scikit-Learn libraries in Python.

What is Categorical Encoding?

A structured dataset typically includes a mix of numerical and categorical variables. Most machine learning algorithms operate on numerical data and cannot process text labels directly. This is where categorical encoding comes into play.

Categorical encoding converts categorical columns into numerical columns, allowing machine learning algorithms to interpret and process the data effectively.

Different Approaches to Categorical Encoding

So, how should we handle categorical variables? There are several methods, but this article will focus on the two most widely used techniques:

1. Label Encoding
2. One-Hot Encoding (OHE)

These techniques are essential for preparing your categorical data for machine learning models, ensuring they can learn and make predictions accurately.


What is Label Encoding?

Label Encoding is a common technique for converting categorical variables into numerical values. Each unique category value is assigned a unique integer based on alphabetical or numerical ordering.
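As a minimal sketch of that mapping, assuming a small made-up `Country` column (the values are illustrative, not the article's dataset): sorting the unique values and numbering them reproduces the alphabetical assignment, and pandas' categorical codes give the same result.

```python
import pandas as pd

# Illustrative values only; not the article's dataset
countries = pd.Series(["France", "Spain", "Germany", "Spain", "France"], name="Country")

# Explicit mapping: number the unique values in alphabetical order
mapping = {category: code for code, category in enumerate(sorted(countries.unique()))}
manual_codes = countries.map(mapping)

# Pandas equivalent: categorical codes follow the sorted category order
pandas_codes = countries.astype("category").cat.codes

print(mapping)                 # {'France': 0, 'Germany': 1, 'Spain': 2}
print(manual_codes.tolist())   # [0, 2, 1, 2, 0]
print(pandas_codes.tolist())   # [0, 2, 1, 2, 0]
```

The `astype("category").cat.codes` route is one way to label encode with Pandas alone, without scikit-learn.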

Let’s walk through how to implement label encoding using both Pandas and the Scikit-Learn libraries in Python:

Implementing Label Encoding using Pandas and Scikit-Learn

Step 1: Import the Required Libraries and Dataset

```python
# Importing the libraries
import pandas as pd
import numpy as np
```

Output:

Understanding the datatypes of features:

```python
# Checking the data types of all columns
df.info()
```

Output:

From the output, we see that the first column, `Country`, is a categorical feature represented by the object data type, while the remaining columns are numerical features represented by `float`.
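The dataset file itself is not reproduced in this article. To follow along, a stand-in DataFrame with the structure described above can be built by hand. This is only a sketch: the `Age` column name and every value below are assumptions made for illustration; only `Country` and `Salary` are named in the article's code.

```python
import pandas as pd

# Hypothetical stand-in for the article's dataset; all values are made up
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany", "France"],
    "Age": [44.0, 27.0, 30.0, 38.0, 40.0, 35.0],  # assumed column name
    "Salary": [72000.0, 48000.0, 54000.0, 61000.0, 63000.0, 58000.0],
})

# Country is a text column; Age and Salary are float64
print(df.dtypes)
```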

Step 2: Implement Label Encoding

With the dataset already loaded, let's go ahead and implement label encoding using scikit-learn's `LabelEncoder`.

```python
# Import the preprocessing module, which provides LabelEncoder
from sklearn import preprocessing

# Create a label encoder object
label_encoder = preprocessing.LabelEncoder()

# Encode labels in the 'Country' column
df['Country'] = label_encoder.fit_transform(df['Country'])
```

Output:

The `Country` values are now transformed into integers.
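To see which integer stands for which country, the fitted encoder can be inspected. The sketch below uses a small made-up list rather than the article's dataset, and shows the `classes_` attribute and the `inverse_transform` method of `LabelEncoder`.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative values only
countries = ["France", "Spain", "Germany", "Spain"]

encoder = LabelEncoder()
codes = encoder.fit_transform(countries)

# classes_ lists the categories; the position of each one is its integer code
print(encoder.classes_.tolist())                  # ['France', 'Germany', 'Spain']
print(codes.tolist())                             # [0, 2, 1, 2]

# inverse_transform maps the integers back to the original labels
print(encoder.inverse_transform(codes).tolist())  # ['France', 'Spain', 'Germany', 'Spain']
```

Note that scikit-learn documents `LabelEncoder` as intended for target labels; for feature columns, `OrdinalEncoder` plays the same role.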

Challenges with Label Encoding

Label encoding imposes an arbitrary order on categorical data, which can be misleading. In the given example, the countries have no inherent order, but label encoding introduces an ordinal relationship based on the encoded integers (e.g., France < Germany < Spain). This can cause the model to falsely interpret these categories as having a meaningful order, potentially leading to incorrect inferences.
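A small numeric sketch of the problem, using the France < Germany < Spain codes from the paragraph above. For contrast, it also builds one-hot vectors of the kind described in the next section; the vectors are written by hand purely for illustration.

```python
import numpy as np

# Label codes as in the text: France=0, Germany=1, Spain=2
label = {"France": 0, "Germany": 1, "Spain": 2}

# Hand-written one-hot vectors for the same three categories
one_hot = {
    "France": np.array([1, 0, 0]),
    "Germany": np.array([0, 1, 0]),
    "Spain": np.array([0, 0, 1]),
}

# With label codes, Spain looks twice as far from France as Germany does
print(abs(label["Germany"] - label["France"]))  # 1
print(abs(label["Spain"] - label["France"]))    # 2

# With one-hot vectors, every pair of categories is equally far apart
d_fg = np.linalg.norm(one_hot["France"] - one_hot["Germany"])
d_fs = np.linalg.norm(one_hot["France"] - one_hot["Spain"])
print(np.isclose(d_fg, d_fs))  # True
```

A model that relies on numeric magnitude or distance would treat the gap between France and Spain as meaningful, even though it only reflects alphabetical order.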

By understanding and implementing label encoding with both Pandas and Scikit-Learn, you can efficiently convert categorical data for machine learning models while being aware of its limitations and the potential for misinterpretation.

What is One Hot Encoding?

One-Hot Encoding is another popular technique for treating categorical variables. It creates one new binary feature for each unique value of the categorical feature: the new column is 1 for rows that have that value and 0 otherwise. These new features are often called dummy variables.

Implementing One-Hot Encoding in Python using Scikit-Learn

Here’s how you can implement one-hot encoding using Scikit-Learn in Python:

Import the Necessary Libraries

```python
from sklearn.preprocessing import OneHotEncoder
import pandas as pd
```

Create a OneHotEncoder Object and Transform the Categorical Data

```python
# Creating a one-hot encoder object
onehotencoder = OneHotEncoder()

# Reshape the 1-D country array to 2-D, as fit_transform expects 2-D input, and fit the encoder
X = onehotencoder.fit_transform(df.Country.values.reshape(-1, 1)).toarray()
```

Convert the Transformed Data into a DataFrame and Concatenate it with the Original DataFrame

```python
# Creating a DataFrame with the encoded data, one column per category
dfOneHot = pd.DataFrame(X, columns=["Country_" + str(i) for i in range(X.shape[1])])

# Concatenating the original DataFrame with the encoded DataFrame
df = pd.concat([df, dfOneHot], axis=1)

# Dropping the original 'Country' column
df = df.drop(['Country'], axis=1)

# Displaying the first few rows of the updated DataFrame
df.head()
```

Output:

As you can see, three new features are added because the `Country` column contains three unique values: France, Spain, and Germany. This method avoids the problem of ranking inherent in label encoding, as each category is represented by a separate binary vector.

Implementing One-Hot Encoding using Pandas

Here's how you can implement one-hot encoding using Pandas.

Import the Necessary Library

```python
import pandas as pd
```

Use the get_dummies Method to Perform One-hot Encoding

```python
# One-Hot Encoding using Pandas
df = pd.get_dummies(df, columns=['Country'], dtype='int')

# Displaying the first few rows of the updated DataFrame
df.head()
```

Output:

Using Pandas’ `get_dummies` method, you can achieve the same result with fewer steps. This method automatically handles the conversion of the categorical `Country` column into multiple binary columns.

Also, specifying the data type as `int` matters because, in pandas 2.0 and later, `get_dummies` returns boolean values (`True` or `False`) by default. Setting `dtype='int'` ensures the new columns contain integer values instead.

Can you see any drawbacks with this approach? Think about it before reading on.

Challenges of One-Hot Encoding: Dummy Variable Trap

One-Hot Encoding can lead to the Dummy Variable Trap: because the dummy columns created for a feature always sum to 1, the value of any one of them can be predicted exactly from the remaining ones. In other words, the Dummy Variable Trap is a scenario in which the variables are highly correlated with each other.

The Dummy Variable Trap leads to the problem known as multicollinearity. Multicollinearity occurs where there is a dependency between the independent features. Multicollinearity is a serious issue in machine learning models like Linear Regression and Logistic Regression.
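The dependency can be checked directly. The sketch below uses made-up values and an explicit intercept column, as a linear regression would; it shows that the full set of dummies makes the design matrix rank-deficient, and that dropping one dummy restores full column rank.

```python
import numpy as np
import pandas as pd

# Illustrative values only
countries = pd.Series(["France", "Spain", "Germany", "Spain", "France", "Germany"])
dummies = pd.get_dummies(countries, dtype=int)

# Every row has exactly one 1, so the dummy columns always sum to 1
print(dummies.sum(axis=1).unique())  # [1]

# With an intercept column of ones: 4 columns but only rank 3
design = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])
print(design.shape[1], np.linalg.matrix_rank(design))  # 4 3

# Dropping one dummy restores full column rank: 3 columns, rank 3
reduced = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()[:, 1:]])
print(reduced.shape[1], np.linalg.matrix_rank(reduced))  # 3 3
```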

So, in order to overcome the problem of multicollinearity, one of the dummy variables has to be dropped. Here, I will practically demonstrate how the problem of multicollinearity is introduced after carrying out the one-hot encoding.

One of the common ways to check for multicollinearity is the Variance Inflation Factor (VIF):

• VIF = 1: very little multicollinearity
• VIF < 5: moderate multicollinearity
• VIF > 5: extreme multicollinearity (this is what we have to avoid)
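For reference, the VIF of feature $i$ is defined from $R_i^2$, the coefficient of determination obtained by regressing feature $i$ on all the other features. The notation here is the standard one, not taken from the article:

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2}$$

So $R_i^2 = 0$ gives a VIF of 1, $R_i^2 = 0.8$ gives a VIF of 5, and the VIF grows without bound as $R_i^2$ approaches 1. The `calculate_vif` function in this section computes the same quantity from the `rsquared` value of an OLS fit.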

Compute the VIF scores:

```python
import statsmodels.api as sm

# Function to calculate VIF
def calculate_vif(data):
    vif_df = pd.DataFrame(columns=['Var', 'Vif'])
    x_var_names = data.columns
    for i in range(0, x_var_names.shape[0]):
        # Regress each feature on all the other features
        y = data[x_var_names[i]]
        x = data[x_var_names.drop([x_var_names[i]])]
        r_squared = sm.OLS(y, x).fit().rsquared
        # VIF = 1 / (1 - R^2)
        vif = round(1 / (1 - r_squared), 2)
        vif_df.loc[i] = [x_var_names[i], vif]
    return vif_df.sort_values(by='Vif', axis=0, ascending=False, inplace=False)

# Exclude the target column before computing VIF
X = df.drop(['Salary'], axis=1)
calculate_vif(X)
```

Output:

From the output, we can see that the dummy variables which are created using one-hot encoding have VIF above 5. We have a multicollinearity problem.

Now, let us drop one of the dummy variables to solve the multicollinearity issue:

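```python
# Drop one dummy column (here the column at position 0;
# adjust the index if the dummy columns sit elsewhere in df)
df = df.drop(df.columns[[0]], axis=1)
calculate_vif(df)
```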

Output:

Wow! VIF has decreased. We solved the problem of multicollinearity. Now, the dataset is ready for building the model.

We recommend reading Going Deeper into Regression Analysis with Assumptions, Plots & Solutions to understand the assumptions of linear regression.

When to Use Label Encoding vs. One-Hot Encoding

This question generally depends on your dataset and the model which you wish to apply. But still, a few points to note before choosing the right encoding technique for your model:

We apply One-Hot Encoding when:

• The categorical feature is not ordinal (like the countries above)
• The number of categories is small, so one-hot encoding can be applied effectively

We apply Label Encoding when:

• The categorical feature is ordinal (like Jr. kg, Sr. kg, Primary school, high school)
• The number of categories is quite large, as one-hot encoding can lead to high memory consumption
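For ordinal features, alphabetical numbering usually does not match the real order. The sketch below swaps in scikit-learn's `OrdinalEncoder` (not the `LabelEncoder` used earlier) because it accepts an explicit category order; the `Level` column name and the values are made up for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative values, using the school levels from the list above
df_demo = pd.DataFrame({"Level": ["Primary school", "Jr. kg", "High school", "Sr. kg"]})

# Spell out the order explicitly; alphabetical order would put "High school" first
order = [["Jr. kg", "Sr. kg", "Primary school", "High school"]]
encoder = OrdinalEncoder(categories=order)
df_demo["Level_encoded"] = encoder.fit_transform(df_demo[["Level"]]).ravel()

print(df_demo["Level_encoded"].tolist())  # [2.0, 0.0, 3.0, 1.0]
```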

Conclusion

Understanding the differences between one-hot encoding and label encoding is crucial for effectively handling categorical data in machine learning projects. By mastering these encoding methods and implementing them with Pandas and Scikit-Learn, data scientists can enhance their skills and deliver more robust and accurate ML solutions.


Q1. What is better, label encoding or one-hot encoding?

A. Neither is better in general. Label encoding and one-hot encoding are both methods for handling categorical variables in machine learning, and the choice between them depends on the specific dataset and the ML algorithm you use. One-hot encoding, for example with `pandas.get_dummies`, produces one column per category, so the result is mostly zeros when the number of categories is large.

Q2. Why not use label encoding?

A. Label encoding is simpler and more space-efficient, but it may introduce an arbitrary order to categorical values. One-hot encoding avoids this issue by creating binary columns for each category, but it can lead to high-dimensional data. Scikit-learn's `OneHotEncoder` class (from `sklearn.preprocessing`), optionally wrapped in a `ColumnTransformer`, can help manage this process, and its `handle_unknown` parameter can be set to `'ignore'` to handle categories that were not seen during fitting.
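A minimal sketch of the `handle_unknown` behaviour, on made-up values: a category that was not present when the encoder was fitted is encoded as an all-zero row instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Illustrative values only
train = pd.DataFrame({"Country": ["France", "Spain", "Germany"]})
new = pd.DataFrame({"Country": ["Spain", "Italy"]})  # Italy was not seen during fit

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The unseen category becomes an all-zero row
print(encoder.transform(new).toarray().tolist())
# [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
```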

Q3. What is the difference between target encoding vs label encoding?

A. Target encoding uses the target variable to encode categorical features, while label encoding assigns unique labels to each category. Target encoding can capture target-related information but may introduce data leakage and overfitting risks. Label encoding, in comparison, never looks at the target, so it carries no leakage risk, but it also captures no target-related information.
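As a rough sketch of the idea (one simple variant, mean encoding, on made-up data; the `Purchased` target column is hypothetical), each category is replaced by the mean of the target over the rows in that category:

```python
import pandas as pd

# Made-up values for illustration
df_demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "France", "Germany"],
    "Purchased": [1, 0, 1, 1, 0, 1],  # hypothetical binary target
})

# Replace each category with the mean target value for that category
means = df_demo.groupby("Country")["Purchased"].mean()
df_demo["Country_target"] = df_demo["Country"].map(means)

print(means.to_dict())  # {'France': 0.5, 'Germany': 1.0, 'Spain': 0.5}
```

Computing the means on the same rows the model is trained on is exactly where the leakage risk comes from; in practice the means would be estimated on training data only, often with cross-validation or smoothing.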

Q4. What are the benefits of using get_feature_names_out and inverse_transform methods in encoding?

A. The `get_feature_names_out` method is useful for retrieving the names of features after encoding, which helps in understanding the transformed data. The `inverse_transform` method allows you to revert the encoded data back to its original form, which is useful for interpretation and debugging. Both methods are available on the `OneHotEncoder` class; `inverse_transform` is also available on `LabelEncoder` and `LabelBinarizer`.
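A short round-trip sketch with `OneHotEncoder.inverse_transform`, on made-up values:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative values only, shaped as a single 2-D column
values = np.array([["France"], ["Spain"], ["Germany"]])

encoder = OneHotEncoder()
encoded = encoder.fit_transform(values).toarray()

# inverse_transform recovers the original categories from the one-hot rows
recovered = encoder.inverse_transform(encoded).ravel().tolist()
print(recovered)  # ['France', 'Spain', 'Germany']
```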

Q5. How can column transformers help in preprocessing?

A. Column transformers allow you to apply different preprocessing steps to different columns in a pandas DataFrame. This is particularly useful when you have a mix of numerical and categorical data. You can use transformers like `OneHotEncoder`, `OrdinalEncoder`, and `StandardScaler` to preprocess your data efficiently. (`LabelEncoder` is intended for target labels rather than feature columns, so it is not used inside a column transformer.)

Q6. Implementing one-hot encoding using pandas vs scikit-learn, which one is preferred and why?

A. Implementing one-hot encoding using pandas (`pandas.get_dummies`) is straightforward and integrates well with pandas DataFrames. However, scikit-learn's `OneHotEncoder` class is preferred for machine learning pipelines because it offers more flexibility, such as handling unknown categories with the `handle_unknown` parameter, creating sparse matrices, and integrating with other scikit-learn preprocessing and modeling tools. Additionally, `OneHotEncoder` can be used inside a `ColumnTransformer` to apply different preprocessing steps to different columns, making it more versatile for complex data preprocessing tasks.


Akanksha 06 Mar, 2020

Hi Alakh, Wanted to ask about the case where variable is not ordinal but number of categories is very large. How to treat those categorical variables.

Ram 08 Aug, 2020

For decision tree algorithms like random forest, even if the categorical variable is nominal, it doesn't seem to have a problem with being represented as ordinal using label or ordinal encoder. Seems unintuitive. Can someone please explain? H2O in fact says that they use enum encoding, where the categories are given a numerical value but the numbers themselves are irrelevant (hence not imposing ordinality on nominal variables). But their classification performance doesn't seem to be much different from sklearn's random forest classifier using ordinal encoder.

Celeste Short 23 Sep, 2020

Thank you.

vamsi 25 Sep, 2020

alakh ...concept is explained in a simple way.

Ali Parahoo 23 Oct, 2020

Hi! Thank you for this very precise article. When we are using decision trees, is it safe to say that one-hot encoding is not recommended? Ali

Abhishek Sau 19 Apr, 2022

What if the categorical variables are not ordinal but The number of categories is quite large?

Daren Purnell 15 Jul, 2022

Thank you !!!!!; please teach at Northwestern U