- Understand why feature transformation and scaling techniques are needed
- Get to know different feature transformation and scaling techniques, including:
  - MinMax Scaler
  - Standard Scaler
  - Power Transformer Scaler
  - Unit Vector Scaler/Normalizer

In my machine learning journey, more often than not, I have found that feature preprocessing is a more effective technique in improving my evaluation metric than any other step, like choosing a model algorithm, hyperparameter tuning, etc.

Feature preprocessing is one of the most crucial steps in building a machine learning model. With too few features, your model won’t have much to learn from. With too many, you might be feeding unnecessary information to the model. Beyond the number of features, the values within each feature need to be considered as well.

We know that there are some set rules for dealing with categorical data, namely encoding it in different ways. However, a large chunk of the process involves dealing with continuous variables. There are various methods for this, such as converting them to a normal distribution or converting them to categorical variables.

There are a couple of go-to techniques I always use regardless of the model I am using, or whether it is a classification task or regression task, or even an unsupervised learning model. These techniques are:

- Feature Transformation and
- Feature Scaling.


- Why do we need Feature Transformation and Scaling?
- MinMax Scaler
- Standard Scaler
- MaxAbsScaler
- Robust Scaler
- Quantile Transformer Scaler
- Log Transformation
- Power Transformer Scaler
- Unit Vector Scaler/Normalizer

Oftentimes, we have datasets in which different columns have different units: one column can be in kilograms, while another can be in centimeters. Furthermore, we can have a column like income, which can range from 20,000 to 100,000 or even more, alongside an age column, which ranges from 0 to 100 (at the most). Thus, income is about 1,000 times larger than age.

But how can we be sure that the model treats both these variables equally? When we feed these features to the model as is, there is every chance that income will influence the result more simply because of its larger values. But this doesn’t necessarily mean it is more important as a predictor. So, to give comparable importance to both Age and Income, we need feature scaling.
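To make the point concrete, here is a minimal sketch of how a distance-based model would "see" unscaled data. The three people and the assumed feature ranges (income 20,000 to 100,000, age 0 to 100) are made up for illustration and are not part of our dataframe.

```python
import numpy as np

# Hypothetical people described as (income, age); values are illustrative only.
a = np.array([50_000.0, 25.0])
b = np.array([51_000.0, 25.0])   # same age as a, income 2% higher
c = np.array([50_000.0, 60.0])   # same income as a, 35 years older

# Raw Euclidean distances: the income axis dominates.
raw_ab = np.linalg.norm(a - b)   # 1000.0
raw_ac = np.linalg.norm(a - c)   # 35.0

# Min-max scale each feature by hand using the assumed ranges.
lo = np.array([20_000.0, 0.0])
hi = np.array([100_000.0, 100.0])

def scale(x):
    """Map each feature to [0, 1] given the assumed lo/hi bounds."""
    return (x - lo) / (hi - lo)

scaled_ab = np.linalg.norm(scale(a) - scale(b))   # about 0.0125
scaled_ac = np.linalg.norm(scale(a) - scale(c))   # about 0.35

print("raw:   ", raw_ab, raw_ac)
print("scaled:", scaled_ab, scaled_ac)
```

Before scaling, a small income difference outweighs a 35-year age gap; after scaling, the ordering flips.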

In most examples of machine learning models, you will have seen either the Standard Scaler or the MinMax Scaler. However, the powerful sklearn library offers many other feature transformation and scaling techniques as well, which we can leverage depending on the data we are dealing with. So, what are you waiting for?

Let us explore them one by one with Python code.

We will work with a simple dataframe:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

df = pd.DataFrame({
    'Income': [15000, 1800, 120000, 10000],
    'Age': [25, 18, 42, 51],
    'Department': ['HR', 'Legal', 'Marketing', 'Management']
})

Before applying any feature transformation or scaling technique, we need to remember the categorical column, Department, and deal with it first. This is because we cannot scale non-numeric values.

For that, we first create a copy of our dataframe, store the numerical feature names in a list, and select their values:

df_scaled = df.copy()
col_names = ['Income', 'Age']
features = df_scaled[col_names]

We will execute this snippet before using a new scaler every time.

The MinMax scaler is one of the simplest scalers to understand. It just scales all the data between 0 and 1. The formula for calculating the scaled value is:

**x_scaled = (x – x_min)/(x_max – x_min)**

A point to note is that it does so for every feature separately. Though (0, 1) is the default range, we can define our own range of min and max values as well. How to implement the MinMax scaler?

- We will first need to import it
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()

- Apply it on only the values of the features:
df_scaled[col_names] = scaler.fit_transform(features.values)

What do the scaled values look like?

You can see how the values were scaled. The minimum value in each column became 0, and the maximum value became 1, with the other values in between. However, suppose we don’t want the income or age to have values like 0. Let us take the range to be (5, 10):

from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(5, 10))
df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

This is what the output looks like:

Amazing, right? The min-max scaler lets you set the range in which you want the variables to be.
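As a sanity check, the same result can be reproduced by hand. The sketch below applies the standard min-max formula with NumPy, stretches it to a custom range, and compares against MinMaxScaler. The variable names are my own, and the array simply repeats the Income and Age values from our dataframe.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Income and Age columns from the example dataframe.
X = np.array([[15000, 25], [1800, 18], [120000, 42], [10000, 51]], dtype=float)

# Step 1: scale each column to [0, 1] using its own min and max.
col_min = X.min(axis=0)
col_max = X.max(axis=0)
unit = (X - col_min) / (col_max - col_min)

# Step 2: stretch and shift to the desired range (5, 10).
new_min, new_max = 5, 10
manual = unit * (new_max - new_min) + new_min

# Compare against sklearn's implementation.
sk = MinMaxScaler(feature_range=(new_min, new_max)).fit_transform(X)
print(np.allclose(manual, sk))
```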

Just like the MinMax Scaler, the Standard Scaler is another popular scaler that is very easy to understand and implement.

For each feature, the Standard Scaler scales the values such that the mean is 0 and the standard deviation is 1 (and therefore the variance is 1 as well).

However, standardization tends to work best when the variable is roughly normally distributed. Thus, in case the variables are far from normally distributed, we

- either choose a different scaler
- or first, convert the variables to a normal distribution and then apply this scaler

Implementing the standard scaler is very similar to implementing the min-max scaler. Just like before, we first import StandardScaler and then use it to transform our variables.

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

The output after applying the scaler to our data:

Let us check the mean and standard deviation of both columns by calling describe() on df_scaled:

df_scaled.describe()

Output:

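You will notice that the means are not exactly 0 but extremely close to it. This is due to the limited numerical precision of floating-point numbers. The standard deviations that describe() reports are about 1.15 rather than 1, because pandas computes the sample standard deviation (dividing by n – 1), whereas the Standard Scaler uses the population standard deviation (dividing by n). With only four rows, the difference is visible.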
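The sketch below checks the mean and both flavors of standard deviation explicitly. It rebuilds the two numeric columns so it can run on its own. The `ddof` argument of pandas' `std` selects between the population (`ddof=0`) and sample (`ddof=1`) formulas.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df_num = pd.DataFrame({
    'Income': [15000, 1800, 120000, 10000],
    'Age': [25, 18, 42, 51],
})

scaled = pd.DataFrame(StandardScaler().fit_transform(df_num), columns=df_num.columns)

col_means = scaled.mean()          # tiny values, zero up to floating-point error
pop_std = scaled.std(ddof=0)       # population std, which StandardScaler normalizes to 1
sample_std = scaled.std(ddof=1)    # sample std, which describe() reports

print(col_means)
print(pop_std)
print(sample_std)                  # sqrt(4/3), about 1.1547, for four rows
```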

In simplest terms, the MaxAbs scaler finds the maximum absolute value of each column and divides each value in the column by it.

Thus, it first takes the absolute value of each value in the column and then takes the maximum of those. This operation scales the data to the range [-1, 1]. To see how it works, we will add another column called ‘Balance’ which contains negative values, and include it in the list of columns to scale:

df["Balance"] = [100.0, -263.0, 2000.0, -5.0]
# Re-create the copy so that the new column is part of the scaled features
df_scaled = df.copy()
col_names = ['Income', 'Age', 'Balance']
features = df_scaled[col_names]
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()
df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

Output:

We can confirm that the MaxAbs Scaler works as expected by printing the maximum absolute value of each column before scaling:

df["Income"].abs().max(), df["Age"].abs().max(), df['Balance'].abs().max()

Output:

(120000, 51, 2000.0)

Thus, we can see that

- each value in the Income column is divided by 120000
- each value in the Age column is divided by 51
- each value in the Balance column is divided by 2000
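The distinction between the maximum and the maximum absolute value matters when the largest-magnitude entry is negative. Here is a minimal sketch with a hypothetical balance column (the -3000 entry is made up for illustration) comparing a hand-rolled version with MaxAbsScaler.

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler

# Hypothetical balances: the largest magnitude (3000) belongs to a negative value.
balance = np.array([[100.0], [-263.0], [2000.0], [-3000.0]])

plain_max = balance.max(axis=0)          # 2000.0, NOT what MaxAbsScaler divides by
abs_max = np.abs(balance).max(axis=0)    # 3000.0

manual = balance / abs_max

scaler = MaxAbsScaler()
sk = scaler.fit_transform(balance)

print(scaler.max_abs_)                   # the per-column divisor learned by the scaler
print(np.allclose(manual, sk))
```

Dividing by the plain maximum instead would have mapped -3000 to -1.5, outside the [-1, 1] range.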

As you may have noticed, each of the scalers we have used so far relies on values like the mean, maximum, and minimum of the columns. All these values are sensitive to outliers. If there are too many outliers in the data, they will influence the mean and the max or min value. Thus, even if we scale this data using the above methods, we cannot guarantee that the scaled data will be well balanced or normally distributed.

The Robust Scaler, as the name suggests, is not sensitive to outliers. This scaler:

- removes the median from the data
- scales the data by the InterQuartile Range(IQR)

Are you familiar with the Inter-Quartile Range? It is nothing but the difference between the third and first quartiles of the variable. The interquartile range can be defined as:

IQR = Q3 – Q1

Thus, the formula would be:

**x_scaled = (x – median)/(Q3 – Q1)**

The range between the 25th and 75th percentiles is the default, though we can define our own quantile range if we want to. Now let us see how we can implement the Robust Scaler in Python:

from sklearn.preprocessing import RobustScaler
scaler = RobustScaler()
df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

The output of Robust Scaler:
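As a cross-check of the two steps (subtract the median, divide by the IQR), here is a small sketch that computes them with NumPy and compares the result to RobustScaler with its default settings. The array repeats the Income and Age values from our dataframe, and the variable names are my own.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# Income and Age columns from the example dataframe.
X = np.array([[15000, 25], [1800, 18], [120000, 42], [10000, 51]], dtype=float)

median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)   # per-column first and third quartiles
iqr = q3 - q1

manual = (X - median) / iqr

sk = RobustScaler().fit_transform(X)

print(manual)
print(np.allclose(manual, sk))
```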

One of the most interesting feature transformation techniques that I have used, the Quantile Transformer Scaler maps a variable to a chosen target distribution: uniform by default, or normal if we pass output_distribution='normal'. Because the mapping is based on quantiles (ranks) rather than raw magnitudes, it also reduces the impact of outliers. Here are a few important points regarding the Quantile Transformer Scaler:

1. It estimates the cumulative distribution function (CDF) of the variable

2. It uses this CDF to map the original values to a uniform distribution

3. It maps the obtained values to the desired output distribution using the associated quantile function

A caveat to keep in mind though: since this scaler changes the very distribution of the variables, linear relationships among variables may be destroyed by using it. Thus, it is best to use this for non-linear data. Here is the code for using the Quantile Transformer:

from sklearn.preprocessing import QuantileTransformer
# n_quantiles defaults to 1000; cap it at the number of rows to avoid a warning on tiny data
scaler = QuantileTransformer(n_quantiles=len(features))
df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

Output:

The effects of both the RobustScaler and the QuantileTransformer are easier to see on a larger dataset than on one with 4 rows. Thus, I encourage you to take up a larger dataset and try these scalers on its columns to fully understand the changes to the data.
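Here is one way such an experiment could look. This is only a sketch on synthetic data: the log-normal "income" column, the sample size, and the seeds are my own assumptions. It measures skewness before and after the transform using `scipy.stats.skew`.

```python
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import QuantileTransformer

# Synthetic, heavily right-skewed "income" column (assumed data, not from the article).
rng = np.random.default_rng(0)
income = rng.lognormal(mean=10, sigma=1, size=(1000, 1))

# Default output is a uniform distribution on [0, 1].
to_uniform = QuantileTransformer(n_quantiles=1000, random_state=0)
uniform = to_uniform.fit_transform(income)

# Ask explicitly for a normal output distribution.
to_normal = QuantileTransformer(n_quantiles=1000, output_distribution='normal', random_state=0)
normal = to_normal.fit_transform(income)

skew_before = skew(income.ravel())
skew_after = skew(normal.ravel())

print("skew before:", skew_before)
print("skew after: ", skew_after)
print("uniform range:", uniform.min(), uniform.max())
```

The raw column is strongly right-skewed, while the normal-mapped column is close to symmetric.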

The Log Transform is one of the most popular Transformation techniques out there. It is primarily used to convert a skewed distribution to a normal distribution/less-skewed distribution. In this transform, we take the log of the values in a column and use these values as the column instead.

Why does it work? Because the log function compresses large numbers far more than small ones. Here is an example using log base 10:

log10(10) = 1

log10(100) = 2, and

log10(10000) = 4.

Thus, in our example, the histogram of Income spans a range from roughly 0 to 120,000:

Let us see what happens when we apply log on this column:

df['log_income'] = np.log(df['Income']) # We created a new column to store the log values

This is what the dataframe looks like:

Wow! While our Income column had extreme values ranging from 1,800 to 120,000, the natural-log values now range from approximately 7.5 to 11.7! Thus, the log operation had a dual role:

- Reducing the impact of too-low values
- Reducing the impact of too-high values.

A small caveat though: if our data has zero or negative values, we cannot apply the log transform directly. The log of zero is undefined (NumPy returns -inf) and the log of a negative number is not a real number (NumPy returns NaN). Values between 0 and 1 are valid inputs, but their logs are negative. In such cases, we can add a constant to the values to make them all positive (or all greater than 1, if we also want to avoid negative outputs) and then apply the log transform.

Let us plot a histogram of the above, using 5 bins:

df['log_income'].plot.hist(bins = 5)
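The caveat about zero and negative inputs is easy to verify. The sketch below uses a few made-up values to show what `np.log` returns in each case, and then uses `np.log1p`, which computes log(1 + x), as one common way to handle non-negative data that contains zeros.

```python
import numpy as np

# Made-up values covering each case: negative, zero, between 0 and 1, one, large.
values = np.array([-5.0, 0.0, 0.5, 1.0, 100.0])

# Suppress the runtime warnings NumPy emits for log(0) and log(negative).
with np.errstate(divide='ignore', invalid='ignore'):
    logged = np.log(values)

print(logged)   # nan for -5, -inf for 0, a negative number for 0.5, 0 for 1

# For non-negative data, log1p shifts by 1 first, so zero maps to zero.
non_negative = np.array([0.0, 0.5, 1.0, 100.0])
shifted = np.log1p(non_negative)
print(shifted)
```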

I often use the Power Transformer when I am building a linear model. To be more specific, I use it when I am dealing with heteroskedasticity. Like some other scalers we studied above, the Power Transformer also changes the distribution of the variable, as in, it makes it more Gaussian (normal). We are familiar with similar power transforms such as square root, cube root, and log transforms.

However, to use them, we need to first study the original distribution and then make a choice. The Power Transformer automates this decision by introducing a parameter called *lambda*. It decides on a generalized power transform by finding the best value of lambda using either the:

- Box-Cox transform
- Yeo-Johnson transform

While I will not get into too much detail of how each of the above transforms works, it is helpful to know that Box-Cox works with only positive values, while Yeo-Johnson works with both positive and negative values.

In our case, we will use the Box-Cox transform since all our values are positive.

from sklearn.preprocessing import PowerTransformer
# method can be 'box-cox' or 'yeo-johnson'
scaler = PowerTransformer(method='box-cox')
df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

This is how the Power Transformer scales the data:
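To see the fitted lambda values and the positive-only restriction of Box-Cox in action, here is a short sketch. It reuses the Income and Age values and the Balance values with negatives from earlier. `lambdas_` is the attribute where PowerTransformer stores the per-column lambda it found.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Strictly positive data: Box-Cox is applicable.
X_pos = np.array([[15000, 25], [1800, 18], [120000, 42], [10000, 51]], dtype=float)
bc = PowerTransformer(method='box-cox')
X_bc = bc.fit_transform(X_pos)
print("Box-Cox lambdas:", bc.lambdas_)   # one fitted lambda per column

# Data with negative values: Box-Cox refuses, Yeo-Johnson works.
balance = np.array([[100.0], [-263.0], [2000.0], [-5.0]])
try:
    PowerTransformer(method='box-cox').fit(balance)
    box_cox_failed = False
except ValueError:
    box_cox_failed = True
print("Box-Cox failed on negatives:", box_cox_failed)

yj = PowerTransformer(method='yeo-johnson')
balance_yj = yj.fit_transform(balance)
print("Yeo-Johnson lambda:", yj.lambdas_)
```

By default the transformer also standardizes its output (standardize=True), so each transformed column has zero mean and unit variance.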

Normalization is the process of scaling individual samples to have unit norm. The most interesting part is that unlike the other scalers which work on the individual column values, the Normalizer works on the rows! Each row of the dataframe with at least one non-zero component is rescaled independently of other samples so that its norm (l1, l2, or inf) equals one.

Just like MinMax Scaler, the Normalizer also converts the values between 0 and 1, and between -1 to 1 when there are negative values in our data.

However, there is a difference in the way it does so.

- If we are using the L1 norm, the values in each row are rescaled so that the sum of their absolute values along the row = 1
- If we are using the L2 norm, the values in each row are rescaled so that the sum of their squares along the row = 1

from sklearn.preprocessing import Normalizer
scaler = Normalizer(norm='l2')  # norm = 'l2' is the default
df_scaled[col_names] = scaler.fit_transform(features.values)
df_scaled

The output of Normalizer:

Thus, if you check the first row:

(0.999999)^2 + (0.001667)^2 = 1.000 (approx)

Similarly, you can check for all rows, and try out the above with norm = ‘l1’ as well.
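The row-wise check can be automated for every row and both norms. The sketch below recomputes the norms with NumPy on the Income and Age values; the variable names are my own.

```python
import numpy as np
from sklearn.preprocessing import Normalizer

# Income and Age columns from the example dataframe.
X = np.array([[15000, 25], [1800, 18], [120000, 42], [10000, 51]], dtype=float)

l2 = Normalizer(norm='l2').fit_transform(X)
l1 = Normalizer(norm='l1').fit_transform(X)

# L2: the sum of squares along each row is 1 (so the Euclidean length is 1).
l2_norms = np.sqrt((l2 ** 2).sum(axis=1))

# L1: the sum of absolute values along each row is 1.
l1_norms = np.abs(l1).sum(axis=1)

print(l2_norms)
print(l1_norms)
```

Note that because Income is so much larger than Age in every row, the normalized Income is close to 1 and the normalized Age close to 0, which is one reason the Normalizer is usually not a substitute for column-wise scaling.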

You may refer to this article to understand the difference between Normalization and Standard Scaler – Feature Scaling for Machine Learning: Understanding the Difference Between Normalization vs. Standardization

Consider this situation: suppose you have your own Python function to transform the data. Sklearn provides the ability to apply this function to our dataset using what is called a FunctionTransformer.

Let us take a simple example. I have a feature transformation technique that involves taking the log to the base 2 of the values. In NumPy, there is a function called log2 which does that for us.

Thus, we can now apply the FunctionTransformer:

from sklearn.preprocessing import FunctionTransformer
transformer = FunctionTransformer(np.log2, validate=True)
df_scaled[col_names] = transformer.fit_transform(features.values)
df_scaled

Here is the output with log-base 2 applied on Age and Income:
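FunctionTransformer also accepts an `inverse_func`, which is handy when predictions need to be mapped back to the original scale. The following is a small sketch, not part of the example above: the choice of `np.log1p` and `np.expm1` as a matching pair is mine.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

# Income and Age columns from the example dataframe.
X = np.array([[15000, 25], [1800, 18], [120000, 42], [10000, 51]], dtype=float)

# Pair a forward function with its exact inverse.
log_tf = FunctionTransformer(func=np.log1p, inverse_func=np.expm1, validate=True)

X_log = log_tf.fit_transform(X)            # log(1 + x), applied element-wise
X_back = log_tf.inverse_transform(X_log)   # exp(x) - 1, undoing the transform

print(X_log)
print(np.allclose(X_back, X))
```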

To summarize, we saw the effects of feature transformation and scaling on our data and when to use which scaler. We also noticed that some scalers are sensitive to outliers, while others are robust, and that some scalers change the underlying distribution of the data itself.

Each feature scaling technique has its own characteristics which we can leverage to improve our model. However, just like other steps in building a predictive model, choosing the right scaler is also a trial and error process, and there is no single best scaler that works every time.

Keeping all this in mind, I encourage you to take up various datasets with different kinds of values and try applying these feature transformation and scaling techniques to them. Do comment below with your findings on these scalers and how you used them to improve your evaluation metric!
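One way to make that trial and error systematic is to put each candidate scaler in a pipeline and compare cross-validated scores. The sketch below is only an illustration under my own assumptions: it uses sklearn's built-in wine dataset and a k-nearest-neighbors classifier, neither of which appears elsewhere in this article.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

X, y = load_wine(return_X_y=True)

# Candidate preprocessing choices; None means "no scaling" as a baseline.
candidates = {
    'none': None,
    'minmax': MinMaxScaler(),
    'standard': StandardScaler(),
    'robust': RobustScaler(),
}

scores = {}
for name, scaler in candidates.items():
    steps = [KNeighborsClassifier()] if scaler is None else [scaler, KNeighborsClassifier()]
    # The pipeline refits the scaler on each training fold, avoiding leakage.
    model = make_pipeline(*steps)
    scores[name] = cross_val_score(model, X, y, cv=5).mean()

for name, score in scores.items():
    print(f"{name:>8}: {score:.3f}")
```

Which scaler wins depends on the dataset and the model, so results like these should be read as evidence for one problem rather than a general ranking.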


Great Article.

Great article. Very useful. Thanks so much for sharing! I had a question - In MaxAbsScalar Python code, you have taken df['balance'].max(). But shouldn't we first take the modulus/absolute value, and then do max()? In this case, it wouldn't make a difference, but if there was another value in balance = - 3000, then based on the definition of MaxAbsScalar, shouldn't we divide by +3000 instead of 2000?

Can't ask for more detail! Wonderful Explanation!!!

Thanks for the great article. Should we check distribution of each feature and transform them separately or should we use one scaler for all the features?

Thanks, awesome. Now all this is clearer.

Best article ever! concise, informative, and easy to follow!