Feature Transformation and Scaling Techniques to Boost Your Model Performance

Purva Huilgol 15 Aug, 2020 • 10 min read

Overview

Understand the requirement of feature transformation and scaling techniques
Get to know different feature transformation and scaling techniques, including:
- MinMax Scaler
- Standard Scaler
- Power Transformer Scaler
- Unit Vector Scaler/Normalizer

Introduction

In my machine learning journey, more often than not, I have found that feature preprocessing is a more effective technique in improving my evaluation metric than any other step, like choosing a model algorithm, hyperparameter tuning, etc.

Feature preprocessing is one of the most crucial steps in building a Machine learning model. Too few features and your model won’t have much to learn from. Too many features and we might be feeding unnecessary information to the model. Not only this, but the values in each of the features need to be considered as well.

We know that there are some set rules of dealing with categorical data, as in, encoding them in different ways. However, a large chunk of the process involves dealing with continuous variables. There are various methods of dealing with continuous variables. Some of them include converting them to a normal distribution or converting them to categorical variables, etc.

Feature Transformation and Scaling

There are a couple of go-to techniques I always use regardless of the model I am using, or whether it is a classification task or regression task, or even an unsupervised learning model. These techniques are:

Feature Transformation and
Feature Scaling.


Table of Contents

Why do we need Feature Transformation and Scaling?
MinMax Scaler
Standard Scaler
MaxAbsScaler
Robust Scaler
Quantile Transformer Scaler
Log Transformation
Power Transformer Scaler
Unit Vector Scaler/Normalizer

Why do we need Feature Transformation and Scaling?

Oftentimes, we have datasets in which different columns have different units: one column can be in kilograms, while another can be in centimeters. Furthermore, we can have a column like income, which can range from 20,000 to 100,000 and even more, next to an age column, which ranges from 0 to 100 (at the most). Thus, income values are about 1,000 times larger than age values.

But how can we be sure that the model treats both these variables equally? When we feed these features to the model as is, there is every chance that the income will influence the result more due to its larger value, especially in models that rely on distances between points or on gradient descent. But this doesn’t necessarily mean it is more important as a predictor. So, to give comparable importance to both Age and Income, we need feature scaling.
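To make this concrete, here is a small sketch with three made-up people (the values and the assumed ranges of 20,000 to 100,000 for income and 0 to 100 for age are purely illustrative). It compares Euclidean distances before and after rescaling each feature by its range:

```python
import numpy as np

# Hypothetical people described as [income, age]
a = np.array([50_000.0, 20.0])
b = np.array([50_000.0, 80.0])   # same income as a, 60 years older
c = np.array([51_000.0, 20.0])   # same age as a, slightly higher income


def euclidean(p, q):
    """Plain Euclidean distance between two feature vectors."""
    return float(np.sqrt(np.sum((p - q) ** 2)))


# On the raw values, the small income gap outweighs the large age gap
dist_ab = euclidean(a, b)   # driven only by age
dist_ac = euclidean(a, c)   # driven only by income

# Assumed feature ranges, used only for this illustration
low = np.array([20_000.0, 0.0])
high = np.array([100_000.0, 100.0])


def rescale(p):
    """Map each feature to [0, 1] using the assumed ranges."""
    return (p - low) / (high - low)


scaled_ab = euclidean(rescale(a), rescale(b))
scaled_ac = euclidean(rescale(a), rescale(c))

print(dist_ab, dist_ac)       # raw: a looks closer to b than to c
print(scaled_ab, scaled_ac)   # scaled: a is now much closer to c
```

Before scaling, a 60-year age gap counts for less than a 1,000 difference in income. After scaling, the ordering flips, which is closer to what we would intuitively expect.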

In most examples of machine learning models, you would have observed either the Standard Scaler or MinMax Scaler. However, the powerful sklearn library offers many other feature transformation and scaling techniques as well, which we can leverage depending on the data we are dealing with. So, what are you waiting for?

Let us explore them one by one with Python code.

We will work with a simple dataframe:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline


df = pd.DataFrame({
    'Income': [15000, 1800, 120000, 10000],
    'Age': [25, 18, 42, 51],
    'Department': ['HR','Legal','Marketing','Management']
})

Before directly applying any feature transformation or scaling technique, we need to remember the categorical column, Department, and deal with it first. This is because we cannot scale non-numeric values.

For that, we first create a copy of our dataframe and store the numerical feature names in a list, and their values as well:

df_scaled = df.copy()
col_names = ['Income', 'Age']
features = df_scaled[col_names]

We will execute this snippet before using a new scaler every time.

MinMax Scaler

The MinMax scaler is one of the simplest scalers to understand. It just scales all the data between 0 and 1. The formula for calculating the scaled value is:

x_scaled = (x - x_min) / (x_max - x_min)

A point to note is that it does so for every feature separately. Though (0, 1) is the default range, we can define our own range of min and max values as well. So how do we implement the MinMax scaler?

We will first need to import it

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

Apply it on only the values of the features:

df_scaled[col_names] = scaler.fit_transform(features.values)

What do the scaled values look like?

Min-Max Scaling

You can see how the values were scaled. The minimum value in each column became 0, and the maximum value was changed to 1, with the other values in between. However, suppose we don’t want the income or age to have values like 0. Let us take the range to be (5, 10):

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(5, 10))

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

This is what the output looks like:

Min-Max Scaling

Amazing, right? The min-max scaler lets you set the range in which you want the variables to be.
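If you would like to see what happens under the hood, the formula above can be extended to an arbitrary range. The sketch below is my own minimal re-implementation (the helper name min_max_scale is made up for illustration), checked against sklearn on the same four rows:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Income and Age from the example dataframe
X = np.array([[15000, 25],
              [1800, 18],
              [120000, 42],
              [10000, 51]], dtype=float)


def min_max_scale(X, feature_range=(0.0, 1.0)):
    """Column-wise min-max scaling to the given range (illustrative helper)."""
    low, high = feature_range
    x_min = X.min(axis=0)
    x_max = X.max(axis=0)
    # Step 1: squeeze every column into [0, 1]
    x_unit = (X - x_min) / (x_max - x_min)
    # Step 2: stretch and shift to [low, high]
    return x_unit * (high - low) + low


manual = min_max_scale(X, feature_range=(5, 10))
sk = MinMaxScaler(feature_range=(5, 10)).fit_transform(X)

print(np.allclose(manual, sk))
```

Note that this helper assumes no column is constant. If x_max equals x_min, the division would fail, and a real implementation needs to handle that case.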

Standard Scaler

Just like the MinMax Scaler, the Standard Scaler is another popular scaler that is very easy to understand and implement.

For each feature, the Standard Scaler scales the values such that the mean is 0 and the standard deviation (and therefore also the variance) is 1.

x_scaled = (x - mean) / std_dev

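However, the Standard Scaler works best when the distribution of the variable is roughly normal, and it is sensitive to outliers, since they shift the mean and inflate the standard deviation. Thus, in case the variables are far from normally distributed, we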

either choose a different scaler
or first, convert the variables to a normal distribution and then apply this scaler

Implementing the standard scaler is very similar to implementing the min-max scaler. Just like before, we will first import StandardScaler and then use it to transform our variables.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

The output after applying the scaler to our data:

Standard Scaling

Let us check the mean and standard deviation of both the columns by calling the describe() function on df_scaled:

df_scaled.describe()

Output:

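You will notice that the means may not be exactly 0, but they are very close to it. This occurs due to the numerical precision of floating-point numbers in Python. The std row, however, shows roughly 1.15 rather than 1. This is because describe() reports the sample standard deviation (dividing by n - 1), while the Standard Scaler uses the population standard deviation (dividing by n), and with only 4 rows the difference is noticeable.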
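Here is a short check of what the Standard Scaler computes and of why pandas can report a standard deviation that differs from 1. It is a sketch on the same four rows, and the variable names are my own:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

X = pd.DataFrame({
    'Income': [15000, 1800, 120000, 10000],
    'Age': [25, 18, 42, 51],
})

scaled = pd.DataFrame(StandardScaler().fit_transform(X.values),
                      columns=X.columns)

# The same result by hand: subtract the mean, divide by the
# population standard deviation (ddof=0)
manual = (X - X.mean()) / X.std(ddof=0)

pop_std = scaled.std(ddof=0)      # what StandardScaler normalises to 1
sample_std = scaled.std(ddof=1)   # what describe() reports

print(np.allclose(manual.values, scaled.values))
print(pop_std.values, sample_std.values)
```

With n rows, the sample standard deviation of the scaled data is sqrt(n / (n - 1)), which is about 1.15 for n = 4 and approaches 1 as the dataset grows.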

MaxAbsScaler

In simplest terms, the MaxAbs scaler takes the absolute maximum value of each column and divides each value in the column by the maximum value.

Thus, it first takes the absolute value of each value in the column and then takes the maximum out of those. This operation scales the data to the range [-1, 1]. To see how it works, we will add another column called 'Balance' which contains negative values:

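df["Balance"] = [100.0, -263.0, 2000.0, -5.0]

# Rebuild the copy and the column list so that they include Balance
df_scaled = df.copy()
col_names = ['Income', 'Age', 'Balance']
features = df_scaled[col_names]

from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled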

Output:

MaxAbs Scaler

We can confirm that the MaxAbs Scaler works as expected by printing the maximum absolute value of each column before we scaled it:

df["Income"].abs().max(), df["Age"].abs().max(), df['Balance'].abs().max()

Output:

(120000, 51, 2000.0)

Thus, we can see that

each value in the Income column is divided by 120000
each value in the Age column is divided by 51
each value in the Balance column is divided by 2000
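In our data the largest absolute value in each column happens to be positive, so max() and the maximum absolute value coincide. The sketch below uses a hypothetical Balance column with a large negative entry (-3000, a made-up value) to show that the divisor really is the maximum of the absolute values:

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical balances: the largest magnitude is negative
balance = np.array([[100.0], [-263.0], [2000.0], [-3000.0]])

scaler = MaxAbsScaler().fit(balance)
divisor = scaler.max_abs_[0]          # maximum absolute value per column
scaled = scaler.transform(balance)

# The same thing by hand
manual = balance / np.abs(balance).max(axis=0)

print(balance.max(), divisor)   # plain max vs. max of absolute values
print(scaled.ravel())
```

Here the plain maximum is 2000, but every value is divided by 3000, so the scaled column reaches -1 at the bottom and only about 0.67 at the top.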

Robust Scaler

As you may have noticed, each of the scalers we have used so far relies on values like the mean, maximum and minimum of the columns. All these values are sensitive to outliers. If there are too many outliers in the data, they will influence the mean and the max or min value. Thus, even if we scale this data using the above methods, we cannot guarantee that the bulk of the scaled values will be well spread out, let alone normally distributed.

The Robust Scaler, as the name suggests, is not sensitive to outliers. This scaler:

removes the median from the data
scales the data by the InterQuartile Range(IQR)

Are you familiar with the Inter-Quartile Range? It is nothing but the difference between the third and first quartiles of the variable. The interquartile range can be defined as:

IQR = Q3 - Q1

Thus, the formula would be:

x_scaled = (x - median) / (Q3 - Q1)

By default, the range used is the 25th to the 75th percentile, though we can define our own range through the quantile_range parameter if we want to. Now let us see how we can implement the Robust Scaler in Python:

from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

The output of Robust Scaler:

Robust Scaler
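To see the robustness in action, here is a sketch on a made-up income column with one extreme outlier (the values are invented for illustration). It reproduces the Robust Scaler by hand and compares how much room the five ordinary values get under the Robust Scaler versus the Standard Scaler:

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Five ordinary incomes and one extreme outlier (hypothetical values)
income = np.array([[20_000.0], [25_000.0], [30_000.0],
                   [35_000.0], [40_000.0], [1_000_000.0]])

# By hand: subtract the median, divide by the interquartile range
q1, median, q3 = np.percentile(income, [25, 50, 75], axis=0)
manual = (income - median) / (q3 - q1)

robust_scaler = RobustScaler().fit(income)
robust = robust_scaler.transform(income)
standard = StandardScaler().fit_transform(income)

# How spread out are the five ordinary values after scaling?
robust_spread = robust[:5].max() - robust[:5].min()
standard_spread = standard[:5].max() - standard[:5].min()

print(np.allclose(manual, robust))
print(robust_spread, standard_spread)
```

The outlier drags the mean and standard deviation so far that the Standard Scaler squeezes the ordinary incomes into a tiny interval, while the median and IQR barely move. The outlier itself is still present after robust scaling. It simply no longer dictates the scale of everything else.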

Quantile Transformer Scaler

One of the most interesting feature transformation techniques that I have used, the Quantile Transformer Scaler maps the variable to a uniform distribution by default, or to a normal distribution if we set output_distribution='normal', and scales it accordingly. Since it works with the ranks of the values rather than their magnitudes, it also deals with the outliers. Here are a few important points regarding the Quantile Transformer Scaler:

1. It computes an estimate of the cumulative distribution function of the variable

2. It uses this cdf to map the values to a uniform distribution

3. It maps the obtained values to the desired output distribution using the associated quantile function

A caveat to keep in mind though: Since this scaler changes the very distribution of the variables, linear relationships among variables may be destroyed by using this scaler. Thus, it is best to use this for non-linear data. Here is the code for using the Quantile Transformer:

from sklearn.preprocessing import QuantileTransformer
scaler = QuantileTransformer()

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

Output:

Quantile Feature Transformation

The effects of both the RobustScaler and the QuantileTransformer can be seen on a larger dataset instead of one with 4 rows. Thus, I encourage you to take up a larger dataset and try these Scalers on their columns to fully understand the changes to the data.
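As a starting point for such an experiment, here is a sketch on synthetic data: 1,000 right-skewed "incomes" drawn from a log-normal distribution (the distribution, its parameters and n_quantiles=100 are arbitrary choices for illustration). It applies the Quantile Transformer with both output distributions:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic, right-skewed data (not from the article's dataframe)
rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=1, size=(1000, 1))

# Default behaviour: map to a uniform distribution on [0, 1]
to_uniform = QuantileTransformer(n_quantiles=100,
                                 output_distribution='uniform',
                                 random_state=0)
uniform = to_uniform.fit_transform(income)

# Alternative: map to a standard normal distribution
to_normal = QuantileTransformer(n_quantiles=100,
                                output_distribution='normal',
                                random_state=0)
normal = to_normal.fit_transform(income)

print(income.mean() / np.median(income))   # well above 1: right-skewed
print(uniform.min(), uniform.max())        # bounded in [0, 1]
print(np.median(normal), normal.mean())    # both close to 0
```

Plotting histograms of income, uniform and normal side by side makes the change in shape much easier to appreciate than the four-row example above.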

Log Transform

The Log Transform is one of the most popular Transformation techniques out there. It is primarily used to convert a skewed distribution to a normal distribution/less-skewed distribution. In this transform, we take the log of the values in a column and use these values as the column instead.

Why does it work? It is because the log function compresses large numbers far more than small ones. Here is an example using log to the base 10:

log10(10) = 1

log10(100) = 2, and

log10(10000) = 4.

In our example, when we plot the histogram of Income, the axis ranges from 0 to 120,000:

Original Data

Let us see what happens when we apply log on this column (np.log computes the natural logarithm, that is, log to the base e):

df['log_income'] = np.log(df['Income'])
# We created a new column to store the log values

This is what the dataframe looks like:

Log Transformation

Wow! While our Income column had extreme values ranging from 1,800 to 120,000, the log values now range from approximately 7.5 to 11.7! Thus, the log operation had a dual role:

Reducing the impact of too-low values
Reducing the impact of too-high values.

A small caveat though: if our data has zeros or negative values, we cannot apply the log transform directly. The log of zero is negative infinity and the log of a negative number is undefined, so we would get -inf or NaN values in our data. Values between 0 and 1 are fine mathematically, but their logs are negative. In such cases, we can first add a constant to the values to make them all positive (for non-negative data, log(1 + x) is a common choice). Then, we can apply the log transform.

Let us plot a histogram of the above, using 5 bins:

df['log_income'].plot.hist(bins = 5)
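Coming back to the caveat about zeros, here is a small sketch of what NumPy actually returns in those cases and how log(1 + x) avoids the problem. The values are made up, apart from 1,800 and 120,000, which are the extremes of our Income column:

```python
import numpy as np

values = np.array([0.0, 0.5, 1.0, 1800.0, 120000.0])

# Plain natural log: log(0) is -inf, log(0.5) is defined but negative
with np.errstate(divide='ignore'):
    plain = np.log(values)

# log(1 + x) is finite for every x >= 0 and maps 0 to 0
shifted = np.log1p(values)

# expm1 undoes log1p, so the original values can be recovered
restored = np.expm1(shifted)

print(plain)
print(shifted)
```

np.log1p and np.expm1 are standard NumPy functions. For data with negative values, a larger shift or a different transform (such as Yeo-Johnson, discussed in the next section) would be needed.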

Power Transformer Scaler

I often use this feature transformation technique when I am building a linear model. To be more specific, I use it when I am dealing with heteroskedasticity. Like some other scalers we studied above, the Power Transformer also changes the distribution of the variable, as in, it makes it more Gaussian(normal). We are familiar with similar power transforms such as square root, and cube root transforms, and log transforms.

However, to use them, we need to first study the original distribution, and then make a choice. The Power Transformer actually automates this decision making by introducing a parameter called lambda. It decides on a generalized power transform by finding the best value of lambda using either the:

1. Box-Cox transform

2. The Yeo-Johnson transform

While I will not get into too much detail of how each of the above transforms works, it is helpful to know that Box-Cox works with only positive values, while Yeo-Johnson works with both positive and negative values.

In our case, we will use the Box-Cox transform since all our values are positive.

from sklearn.preprocessing import PowerTransformer
# method can be 'box-cox' or 'yeo-johnson'
scaler = PowerTransformer(method = 'box-cox')

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

This is how the Power Transformer scales the data:

Power Feature Transformation
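Two things are worth checking for yourself here: the lambda that was chosen for each column, and what happens when Box-Cox meets negative values. The sketch below reuses the Income and Age values, plus the Balance values from the MaxAbsScaler section. The flag box_cox_ok is just an illustrative variable of mine:

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Income and Age: strictly positive, so Box-Cox is applicable
X_pos = np.array([[15000, 25],
                  [1800, 18],
                  [120000, 42],
                  [10000, 51]], dtype=float)

pt = PowerTransformer(method='box-cox')   # standardize=True by default
transformed = pt.fit_transform(X_pos)
lambdas = pt.lambdas_                     # one fitted lambda per column

# Balance contains negative values, so Box-Cox refuses to fit
balance = np.array([[100.0], [-263.0], [2000.0], [-5.0]])
try:
    PowerTransformer(method='box-cox').fit(balance)
    box_cox_ok = True
except ValueError:
    box_cox_ok = False

# Yeo-Johnson accepts positive and negative values alike
yeo_johnson = PowerTransformer(method='yeo-johnson').fit_transform(balance)

print(lambdas)
print(box_cox_ok)
```

Because standardize=True is the default, the transformed columns also come out with zero mean and unit variance, so there is no need to chain a Standard Scaler afterwards.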

Unit Vector Scaler/Normalizer

Normalization is the process of scaling individual samples to have unit norm. The most interesting part is that unlike the other scalers which work on the individual column values, the Normalizer works on the rows! Each row of the dataframe with at least one non-zero component is rescaled independently of other samples so that its norm (l1, l2, or inf) equals one.

Just like the MinMax Scaler, the Normalizer also brings the values between 0 and 1, or between -1 and 1 when there are negative values in our data.

However, there is a difference in the way it does so.

If we are using L1 norm, the values in each row are scaled so that the sum of their absolute values along the row = 1
If we are using L2 norm, the values in each row are scaled so that the sum of their squares along the row = 1

from sklearn.preprocessing import Normalizer
scaler = Normalizer(norm = 'l2')
# norm = 'l2' is default

df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

The output of Normalizer:

Vector feature transformation

Thus, if you check the first row,

(0.999999)^2 + (0.001667)^2 = 1.000 (approx)

Similarly, you can check for all rows, and try out the above with norm = ‘l1’ as well.
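If you want to verify both norms at once, here is a sketch that computes the row-wise scaling by hand and compares it with the Normalizer on the same Income and Age values:

```python
import numpy as np
from sklearn.preprocessing import Normalizer

X = np.array([[15000, 25],
              [1800, 18],
              [120000, 42],
              [10000, 51]], dtype=float)

l1 = Normalizer(norm='l1').fit_transform(X)
l2 = Normalizer(norm='l2').fit_transform(X)

# By hand: divide every row by its own norm (note axis=1, i.e. rows)
manual_l1 = X / np.abs(X).sum(axis=1, keepdims=True)
manual_l2 = X / np.sqrt((X ** 2).sum(axis=1, keepdims=True))

print(np.abs(l1).sum(axis=1))   # each row sums to 1 in absolute value
print((l2 ** 2).sum(axis=1))    # each row's squares sum to 1
```

Notice that because Income is so much larger than Age in every row, the normalized Income is always close to 1 and the normalized Age close to 0. The Normalizer does not balance the columns against each other, so it is not a substitute for the column-wise scalers above.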

You may refer to this article to understand the difference between Normalization and Standard Scaler – Feature Scaling for Machine Learning: Understanding the Difference Between Normalization vs. Standardization

Custom Transformer

Consider this situation – Suppose you have your own Python function to transform the data. Sklearn also provides the ability to apply this transform to our dataset using what is called a FunctionTransformer.

Let us take a simple example. I have a feature transformation technique that involves taking (log to the base 2) of the values. In NumPy, there is a function called log2 which does that for us.

Thus, we can now apply the FunctionTransformer:

from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log2, validate = True)

df_scaled[col_names] = transformer.fit_transform(features.values)
df_scaled

Here is the output with log-base 2 applied on Age and Income:

Custom Transformer
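One handy extra is the inverse_func argument of FunctionTransformer, which lets us go back to the original scale (for example, to report predictions in the original units). The sketch below pairs np.log2 with np.exp2. Choosing np.exp2 as the inverse is my own addition for illustration:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

X = np.array([[15000, 25],
              [1800, 18],
              [120000, 42],
              [10000, 51]], dtype=float)

# func is applied by transform, inverse_func by inverse_transform
transformer = FunctionTransformer(func=np.log2,
                                  inverse_func=np.exp2,
                                  validate=True)

X_log2 = transformer.fit_transform(X)
X_back = transformer.inverse_transform(X_log2)

print(X_log2)
print(np.allclose(X_back, X))
```

Since the transformer follows the usual fit/transform interface, it can be dropped into an sklearn Pipeline like any of the built-in scalers.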

End Notes

To summarize, we saw the effects of feature transformation and scaling on our data and when to use which scaler. Also, we noticed that some scalers were sensitive to outliers, while others were robust. We also noticed how some scalers change the underlying distribution of the data itself.

Each feature scaling technique has its own characteristics which we can leverage to improve our model. However, just like other steps in building a predictive model, choosing the right scaler is also a trial and error process, and there is no single best scaler that works every time.

Keeping all these in mind, I encourage you to take up various datasets with different kinds of values and try applying these feature transformation and scaling techniques on them. Do comment below with your findings on these scalers and how you used them to improve your evaluation metric!


Radhika 04 Sep, 2020

Great Article.

Leo 30 Jan, 2021

Great article. Very useful. Thanks so much for sharing! I had a question - In MaxAbsScalar Python code, you have taken df['balance'].max(). But shouldn't we first take the modulus/absolute value, and then do max()? In this case, it wouldn't make a difference, but if there was another value in balance = - 3000, then based on the definition of MaxAbsScalar, shouldn't we divide by +3000 instead of 2000?

Indira Vemuri 10 Feb, 2021

Can't ask for more detail! Wonderful Explanation!!!

Senol 25 Feb, 2021

Thanks for the great article. Should we check distribution of each feature and transform them separately or should we use one scaler for all the features?

ling 10 Oct, 2022

Thanks, awesome. Now all this is clearer.
