Feature Scaling: Engineering, Normalization, and Standardization (Updated 2024)

Aniruddha Bhandari 04 Jan, 2024 • 10 min read

Feature scaling is a critical step in building accurate and effective machine learning models. It is a key part of feature engineering and covers techniques such as normalization and standardization, which transform the data to make it more suitable for modeling. These techniques can help improve model performance, speed up training, and ensure that the features are on the same scale. In this article, we will explore the concepts of scaling, normalization, and standardization, including why they are important and how to apply them to different types of data. By the end of this article, you’ll have a thorough understanding of these essential feature engineering techniques and be able to apply them to your own machine learning projects.

What is Feature Scaling?

Feature scaling is a data preprocessing technique used to transform the values of features or variables in a dataset to a similar scale. The purpose is to ensure that all features contribute equally to the model and to avoid the domination of features with larger values.

Feature scaling becomes necessary when dealing with datasets containing features that have different ranges, units of measurement, or orders of magnitude. In such cases, the variation in feature values can lead to biased model performance or difficulties during the learning process.

The two most common techniques for feature scaling are standardization and normalization (also called min-max scaling). Both are linear transformations: they change the location and spread of each feature while preserving the relative ordering of its values and the shape of its distribution.

By applying feature scaling, the dataset’s features can be transformed to a more consistent scale, making it easier to build accurate and effective machine learning models. Scaling facilitates meaningful comparisons between features, improves model convergence, and prevents certain features from overshadowing others based solely on their magnitude.

Why Should We Use Feature Scaling?

Some machine learning algorithms are sensitive to feature scaling, while others are virtually invariant to it. Let’s explore these in more depth:

1. Gradient Descent Based Algorithms

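Machine learning algorithms like linear regression, logistic regression, and neural networks that use gradient descent as an optimization technique work best when the data is scaled. (PCA, principal component analysis, is also sensitive to feature scale, but for a different reason: it looks for directions of maximum variance rather than relying on gradient descent.) Take a look at the formula for gradient descent below: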

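For linear regression with a squared-error cost, the update for each weight is:

$$\theta_j := \theta_j - \alpha \frac{1}{m} \sum_{i=1}^{m} \left( h_\theta\left(X^{(i)}\right) - y^{(i)} \right) X_j^{(i)}$$

Here $\theta_j$ is the weight of feature $j$, $\alpha$ is the learning rate, $m$ is the number of training examples, and $h_\theta$ is the model's prediction.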

The presence of feature value X in the formula will affect the step size of the gradient descent. The difference in the ranges of features will cause different step sizes for each feature. To ensure that the gradient descent moves smoothly towards the minima and that the steps for gradient descent are updated at the same rate for all the features, we scale the data before feeding it to the model.

Having features on a similar scale can help the gradient descent converge more quickly towards the minima.
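As a rough illustration of this point, here is a minimal NumPy sketch of batch gradient descent on a linear regression problem. The data is synthetic, and the feature names, the step count of 500, and the choice of step size (one over the largest eigenvalue of the Hessian) are all assumptions made for the demonstration rather than anything taken from the Big Mart example.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200

# Hypothetical features on very different scales (illustrative only).
cgpa = rng.uniform(0, 5, m)
income = rng.uniform(0, 100_000, m)
X_raw = np.column_stack([cgpa, income])
y = 10 * cgpa + 1e-4 * income + rng.normal(0, 1, m)


def final_mse(X, y, n_steps=500):
    """Run batch gradient descent for linear regression; return the final MSE."""
    Xb = np.column_stack([np.ones(len(y)), X])      # add an intercept column
    hessian = Xb.T @ Xb / len(y)
    # Assumed step-size rule: 1 / largest eigenvalue keeps the updates stable.
    lr = 1.0 / np.linalg.eigvalsh(hessian).max()
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_steps):
        grad = Xb.T @ (Xb @ theta - y) / len(y)     # gradient of the squared-error cost
        theta -= lr * grad
    return float(np.mean((Xb @ theta - y) ** 2))


# Standardize each column: zero mean, unit standard deviation.
X_std = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

mse_raw = final_mse(X_raw, y)
mse_std = final_mse(X_std, y)
print(f"MSE after 500 steps, raw features:          {mse_raw:.3f}")
print(f"MSE after 500 steps, standardized features: {mse_std:.3f}")
```

On the raw features, the large-scale column forces a tiny step size, so the weights of the other terms barely move in 500 steps. On the standardized features the same number of steps gets much closer to the minimum.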

2. Distance-Based Algorithms

Distance-based algorithms like KNN, K-means clustering, and SVM (support vector machines) are most affected by the range of features. This is because, behind the scenes, they use distances between data points to determine their similarity.

For example, let’s say we have data containing high school CGPA scores of students (ranging from 0 to 5) and their future incomes (in thousands of rupees):

[Figure: unscaled KNN example data]

Since both the features have different scales, there is a chance that higher weightage is given to features with higher magnitudes. This will impact the performance of the machine learning algorithm; obviously, we do not want our algorithm to be biased towards one feature.

Therefore, we scale our data before employing a distance based algorithm so that all the features contribute equally to the result.

[Figure: scaled KNN example data]

The effect of scaling is conspicuous when we compare the Euclidean distance between data points for students A and B, and between B and C, before and after scaling, as shown below:

  • Distance AB before scaling: [equation image: Euclidean distance]
  • Distance BC before scaling: [equation image: Euclidean distance]
  • Distance AB after scaling: [equation image: Euclidean distance]
  • Distance BC after scaling: [equation image: Euclidean distance]
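The comparison can be sketched with a few lines of NumPy. The three students and their values below are hypothetical stand-ins (the figures from the original example are not reproduced here), and standardization is used as the scaling step.

```python
import numpy as np

# Hypothetical students; columns are [CGPA, income in thousands of rupees].
students = np.array([
    [3.0, 60.0],   # A
    [3.0, 40.0],   # B
    [4.0, 40.0],   # C
])


def euclidean(p, q):
    """Euclidean distance between two points."""
    return float(np.sqrt(np.sum((p - q) ** 2)))


# Standardize each column: subtract its mean, divide by its standard deviation.
scaled = (students - students.mean(axis=0)) / students.std(axis=0)

ab_raw = euclidean(students[0], students[1])
bc_raw = euclidean(students[1], students[2])
ab_scaled = euclidean(scaled[0], scaled[1])
bc_scaled = euclidean(scaled[1], scaled[2])

print(f"before scaling: AB = {ab_raw:.2f}, BC = {bc_raw:.2f}")
print(f"after scaling:  AB = {ab_scaled:.2f}, BC = {bc_scaled:.2f}")
```

With these made-up numbers, AB is 20 times larger than BC before scaling purely because income is measured in larger units. After standardization the two distances come out equal, so neither feature dominates.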

3. Tree-Based Algorithms

Tree-based algorithms, on the other hand, are fairly insensitive to the scale of the features. Think about it: a decision tree only splits a node based on a single feature. It picks the feature and threshold that most increase the homogeneity of the node, and other features do not influence this split.

Because a split depends only on the ordering of the values within one feature, and rescaling a feature does not change that ordering, the same split is found before and after scaling. This is what makes tree-based algorithms invariant to the scale of the features!

What is Normalization?

Normalization, a vital aspect of feature scaling, is a data preprocessing technique that brings the values of the features in a dataset onto a common scale. This reduces the influence of differing feature scales on machine learning models.

Normalization is a scaling technique in which values are shifted and rescaled so that they end up ranging between 0 and 1. It is also known as Min-Max scaling.

Here’s the formula for normalization:

$$X' = \frac{X - X_{min}}{X_{max} - X_{min}}$$

Here, Xmax and Xmin are the maximum and the minimum values of the feature, respectively.

  • When the value of X is the minimum value in the column, the numerator will be 0, and hence X’ is 0
  • On the other hand, when the value of X is the maximum value in the column, the numerator is equal to the denominator, and thus the value of X’ is 1
  • If the value of X is between the minimum and the maximum value, then the value of X’ is between 0 and 1
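The three cases above can be checked with a small NumPy sketch. The function name and the sample values are illustrative, and returning zeros for a constant feature is just one possible convention.

```python
import numpy as np


def min_max_normalize(x):
    """Rescale a 1-D array to [0, 1] using X' = (X - Xmin) / (Xmax - Xmin)."""
    x = np.asarray(x, dtype=float)
    x_min, x_max = x.min(), x.max()
    if x_max == x_min:
        # A constant feature has no spread; zeros is an assumed convention here.
        return np.zeros_like(x)
    return (x - x_min) / (x_max - x_min)


values = [10, 20, 30, 50]            # hypothetical feature values
normalized = min_max_normalize(values)
print(normalized)                    # minimum maps to 0, maximum maps to 1
```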

What is Standardization?

Standardization is another Feature scaling method where the values are centered around the mean with a unit standard deviation. This means that the mean of the attribute becomes zero, and the resultant distribution has a unit standard deviation.

Here’s the formula for standardization:

$$X' = \frac{X - \mu}{\sigma}$$

Here, $\mu$ is the mean of the feature values and $\sigma$ is the standard deviation of the feature values. Note that, in this case, the values are not restricted to a particular range.
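A minimal NumPy sketch of the same formula follows. The function name and sample values are illustrative. It uses the population standard deviation (ddof=0), which is also what sklearn's StandardScaler uses.

```python
import numpy as np


def standardize(x):
    """Center a 1-D array on its mean and scale it to unit standard deviation."""
    x = np.asarray(x, dtype=float)
    mu = x.mean()
    sigma = x.std()          # population standard deviation (ddof=0)
    return (x - mu) / sigma


values = [10, 20, 30, 50]    # hypothetical feature values
z = standardize(values)
print(z)
print(z.mean(), z.std())     # approximately 0 and 1
```

Unlike min-max normalization, the result is not confined to [0, 1]: values above the mean come out positive, values below it negative, and their magnitude can exceed 1.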

Now, the big question in your mind must be when should we use normalization and when should we use standardization? Let’s find out!

The Big Question – Normalize or Standardize?

| Normalization | Standardization |
| --- | --- |
| Rescales values to a range between 0 and 1 | Centers data around the mean and scales to a standard deviation of 1 |
| Useful when the distribution of the data is unknown or not Gaussian | Useful when the distribution of the data is Gaussian or unknown |
| Sensitive to outliers | Less sensitive to outliers |
| Linear rescaling, so it retains the shape of the original distribution | Also a linear rescaling, so it retains the shape of the original distribution (it does not make the data Gaussian) |
| Output is bounded by the observed minimum and maximum | Output is not restricted to a particular range |
| Equation: (x – min)/(max – min) | Equation: (x – mean)/standard deviation |

However, at the end of the day, the choice of using normalization or standardization will depend on your problem and the machine learning algorithm you are using. There is no hard and fast rule to tell you when to normalize or standardize your data. You can always start by fitting your model to raw, normalized, and standardized data and comparing the performance for the best results.

It is a good practice to fit the scaler on the training data and then use it to transform the testing data. This would avoid any data leakage during the model testing process. Also, the scaling of target values is generally not required.
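In sklearn terms, this practice usually means calling fit (or fit_transform) on the training split only and plain transform on the test split. The sketch below uses synthetic data as a stand-in, so the array shapes and random seeds are assumptions.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 3))   # synthetic stand-in data
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn mean and std from training data only
X_test_scaled = scaler.transform(X_test)         # reuse those statistics, no refitting

print(scaler.mean_)                   # training means, applied to both splits
print(X_test_scaled.mean(axis=0))     # close to 0, but generally not exactly 0
```

The test set is scaled with the training mean and standard deviation, so no information about the test data leaks into the preprocessing step. The same pattern applies to MinMaxScaler, where test values may then fall slightly outside [0, 1].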

Implementing Feature Scaling in Python

Now comes the fun part – putting what we have learned into practice. I will be applying feature scaling to a few machine learning algorithms on the Big Mart dataset, which I’ve taken from the DataHack platform.

I will skip the preprocessing steps since they are out of the scope of this tutorial. But you can find them neatly explained in this article. Those steps will enable you to reach the top 20 percentile on the hackathon leaderboard, so that’s worth checking out!

So, let’s first split our data into training and testing sets:
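The original code embed is not available here, so the following is only a sketch of what such a split might look like. The DataFrame is a synthetic stand-in that borrows the numerical column names used in this article; the values, the target column name Item_Outlet_Sales, the 80/20 split, and the random seed are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the preprocessed Big Mart data; all values are made up.
rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "Item_Weight": rng.uniform(4, 22, n),
    "Item_Visibility": np.log(rng.uniform(0.005, 0.3, n)),   # log-transformed, hence negative
    "Item_MRP": rng.uniform(30, 270, n),
    "Outlet_Establishment_Year": rng.integers(1985, 2010, n).astype(float),
    "Item_Outlet_Sales": rng.uniform(30, 13000, n),          # assumed target column
})

# Separate the features from the target.
X = df.drop(columns="Item_Outlet_Sales")
y = df["Item_Outlet_Sales"]

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=27
)
print(X_train.shape, X_test.shape)
```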


Before moving to the feature scaling part, let’s glance at the details of our data using the DataFrame’s describe() method (for example, X_train.describe()):

[Figure: summary statistics of the original data]

We can see that there is a huge difference in the range of values present in our numerical features: Item_Visibility, Item_Weight, Item_MRP, and Outlet_Establishment_Year. Let’s try and fix that using feature scaling!

Note: You will notice negative values in the Item_Visibility feature because I applied a log transformation to deal with the skewness in the feature.

Normalization Using sklearn (scikit-learn)

To normalize your data, you need to import the MinMaxScaler from the sklearn library and apply it to our dataset. So, let’s do that!
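A sketch of this step, under stated assumptions, might look like the following. The data is a synthetic stand-in with the article's numerical column names plus one hypothetical one-hot column (Outlet_Type_Grocery), and the scaler is fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; values and the one-hot column name are made up.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "Item_Weight": rng.uniform(4, 22, n),
    "Item_Visibility": np.log(rng.uniform(0.005, 0.3, n)),
    "Item_MRP": rng.uniform(30, 270, n),
    "Outlet_Establishment_Year": rng.integers(1985, 2010, n).astype(float),
    "Outlet_Type_Grocery": rng.integers(0, 2, n).astype(float),   # hypothetical one-hot feature
})
X_train, X_test = train_test_split(X, test_size=0.2, random_state=27)

# Fit the scaler on the training data only, then transform both splits.
norm = MinMaxScaler().fit(X_train)
X_train_norm = pd.DataFrame(
    norm.transform(X_train), columns=X_train.columns, index=X_train.index
)
X_test_norm = pd.DataFrame(
    norm.transform(X_test), columns=X_test.columns, index=X_test.index
)

print(X_train_norm.describe().loc[["min", "max"]])
```

Because the minimum and maximum come from the training split, a test value outside the training range would map slightly below 0 or above 1.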

Let’s see how normalization has affected our dataset:

[Figure: summary statistics of the normalized data]

All the features now have a minimum value of 0 and a maximum value of 1. Perfect!


Next, let’s try to standardize our data.

Standardization Using sklearn

To standardize your data, you need to import the StandardScaler from the sklearn library and apply it to our dataset. Here’s how you can do it:
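One way this could be written is sketched below on synthetic stand-in data. The column values and the one-hot column name Outlet_Type_Grocery are assumptions. Only the numerical columns are passed to the scaler, and the one-hot column is left untouched.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; values and the one-hot column name are made up.
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame({
    "Item_Weight": rng.uniform(4, 22, n),
    "Item_Visibility": np.log(rng.uniform(0.005, 0.3, n)),
    "Item_MRP": rng.uniform(30, 270, n),
    "Outlet_Establishment_Year": rng.integers(1985, 2010, n).astype(float),
    "Outlet_Type_Grocery": rng.integers(0, 2, n).astype(float),   # hypothetical one-hot feature
})
X_train, X_test = train_test_split(X, test_size=0.2, random_state=27)

num_cols = ["Item_Weight", "Item_Visibility", "Item_MRP", "Outlet_Establishment_Year"]

# Work on copies so the unscaled frames stay available for comparison.
X_train_stand = X_train.copy()
X_test_stand = X_test.copy()

# Fit on the training numerical columns only, then transform both splits.
scaler = StandardScaler().fit(X_train[num_cols])
X_train_stand[num_cols] = scaler.transform(X_train[num_cols])
X_test_stand[num_cols] = scaler.transform(X_test[num_cols])

print(X_train_stand.describe().loc[["mean", "std"]].round(2))
```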

You would have noticed that I only applied standardization to my numerical columns, not the other One-Hot Encoded features. Standardizing the One-Hot encoded features would mean assigning a distribution to categorical features. You don’t want to do that!

But why did I not do the same while normalizing the data? Because One-Hot encoded features are already in the range between 0 and 1. So, normalization would not affect their value.

Right, let’s have a look at how standardization has transformed our data:

[Figure: summary statistics of the standardized data]

The numerical features are now centered on the mean with a unit standard deviation. Awesome!

Comparing Unscaled, Normalized, and Standardized Data

It is always great to visualize your data to understand the distribution present. We can see the comparison between our unscaled and scaled data using boxplots.

[Figure: boxplots comparing unscaled, normalized, and standardized data]

You can notice how scaling the features brings everything into perspective. The features are now more comparable and will have a similar effect on the learning models.

Applying Scaling to Machine Learning Algorithms

It’s now time to train some machine learning algorithms on our data and compare the effects of the different feature scaling techniques on their performance. I want to see the effect of scaling on three algorithms in particular: K-Nearest Neighbors, Support Vector Regressor, and Decision Tree.

K-Nearest Neighbors

As we saw before, KNN is a distance-based algorithm that is affected by the range of features. Let’s see how it performs on our data before and after scaling:
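The experiment can be sketched as follows. This is not the Big Mart data: it is a small synthetic regression problem, deliberately built so that a small-scale feature carries most of the signal. The feature names, the target formula, and the value n_neighbors=5 are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Synthetic data: two features on very different scales (illustrative only).
rng = np.random.default_rng(0)
n = 500
cgpa = rng.uniform(0, 5, n)
income = rng.uniform(0, 100_000, n)
X = np.column_stack([cgpa, income])
y = 10 * cgpa + 1e-4 * income + rng.normal(0, 1, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=27
)


def knn_rmse(scaler=None):
    """Train KNN on optionally scaled data and return the test RMSE."""
    if scaler is None:
        Xtr, Xte = X_train, X_test
    else:
        Xtr = scaler.fit_transform(X_train)   # fit on training data only
        Xte = scaler.transform(X_test)
    model = KNeighborsRegressor(n_neighbors=5).fit(Xtr, y_train)
    return float(np.sqrt(mean_squared_error(y_test, model.predict(Xte))))


rmse = {
    "original": knn_rmse(),
    "normalized": knn_rmse(MinMaxScaler()),
    "standardized": knn_rmse(StandardScaler()),
}
for name, value in rmse.items():
    print(f"{name:>12}: RMSE = {value:.2f}")
```

On this constructed data the unscaled model picks neighbors almost entirely by income, so it largely ignores the CGPA signal. The exact numbers will differ on real data.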

[Figure: RMSE of K-Nearest Neighbors on original, normalized, and standardized data]

You can see that scaling the features has brought down the RMSE score of our KNN model. Specifically, the normalized data performs a tad bit better than the standardized data.

Note: I am measuring the RMSE here because this competition evaluates the RMSE.

Support Vector Regressor

SVR is another distance-based algorithm. So let’s check out whether it works better with normalization or standardization:
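A comparable sketch for SVR is shown below, again on synthetic data rather than the Big Mart set. Wrapping the scaler and the model in a pipeline is one convenient way to make sure the scaler is fitted on the training data only. The data, the RBF kernel, and C=10.0 are assumptions chosen for the illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVR

# Synthetic data: two features on very different scales (illustrative only).
rng = np.random.default_rng(0)
n = 500
cgpa = rng.uniform(0, 5, n)
income = rng.uniform(0, 100_000, n)
X = np.column_stack([cgpa, income])
y = 10 * cgpa + 1e-4 * income + rng.normal(0, 1, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=27
)

# Same model three times: no scaling, min-max scaling, standardization.
models = {
    "original": SVR(kernel="rbf", C=10.0),
    "normalized": make_pipeline(MinMaxScaler(), SVR(kernel="rbf", C=10.0)),
    "standardized": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)),
}

rmse = {}
for name, model in models.items():
    model.fit(X_train, y_train)               # pipelines fit their scaler on X_train only
    pred = model.predict(X_test)
    rmse[name] = float(np.sqrt(mean_squared_error(y_test, pred)))
    print(f"{name:>12}: RMSE = {rmse[name]:.2f}")
```

Which scaler comes out ahead depends on the data and the hyperparameters, so this sketch only shows how to set up the comparison.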

[Figure: RMSE of Support Vector Regressor on original, normalized, and standardized data]

We can see that scaling the features does bring down the RMSE score. And the standardized data has performed better than the normalized data. Why do you think that’s the case?

The sklearn documentation states that the RBF kernel of SVMs assumes that all the features are centered around zero and have variance of the same order. A feature whose variance is orders of magnitude larger than that of the others can dominate the objective function and prevent the estimator from learning from the other features. Standardization satisfies both assumptions, whereas normalization only bounds the range.

Decision Tree

We already know that a Decision tree is invariant to feature scaling. But I wanted to show a practical example of how it performs on the data:
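A small check of this invariance is sketched below, on synthetic data with an assumed max_depth of 5 and a fixed random_state. Since rescaling preserves the ordering of each feature's values, the fitted trees should partition the data the same way, up to floating-point effects.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: two features on very different scales (illustrative only).
rng = np.random.default_rng(0)
n = 500
cgpa = rng.uniform(0, 5, n)
income = rng.uniform(0, 100_000, n)
X = np.column_stack([cgpa, income])
y = 10 * cgpa + 1e-4 * income + rng.normal(0, 1, n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=27
)


def tree_rmse(scaler=None):
    """Train a decision tree on optionally scaled data and return the test RMSE."""
    if scaler is None:
        Xtr, Xte = X_train, X_test
    else:
        Xtr = scaler.fit_transform(X_train)   # fit on training data only
        Xte = scaler.transform(X_test)
    model = DecisionTreeRegressor(max_depth=5, random_state=0).fit(Xtr, y_train)
    return float(np.sqrt(mean_squared_error(y_test, model.predict(Xte))))


rmse = {
    "original": tree_rmse(),
    "normalized": tree_rmse(MinMaxScaler()),
    "standardized": tree_rmse(StandardScaler()),
}
for name, value in rmse.items():
    print(f"{name:>12}: RMSE = {value:.4f}")
```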

[Figure: RMSE of Decision Tree on original, normalized, and standardized data]

You can see that the RMSE score has not moved an inch on scaling the features. So rest assured when you are using tree-based algorithms on your data!

Build Effective Machine Learning Models

This tutorial covered the relevance of using feature scaling on your data and how normalization and standardization have varying effects on the working of machine learning algorithms. Remember that there is no correct answer to when to use normalization over standardization and vice-versa. It all depends on your data and the algorithm you are using.


Frequently Asked Questions

Q1. How is Standardization different from Normalization feature scaling?

A. Standardization centers data around a mean of zero and a standard deviation of one, while normalization scales data to a set range, often [0, 1], by using the minimum and maximum values.

Q2. Why is Standardization used in machine learning?

A. Standardization ensures algorithmic stability and prevents sensitivity to the scale of input features, improves optimization algorithms’ convergence and search efficiency, and enhances the performance of certain machine learning algorithms.

Q3. Why is Normalization used in machine learning?

A. Normalization helps in scaling the input features to a fixed range, typically [0, 1], to ensure that no single feature disproportionately impacts the results. It preserves the relationship between the minimum and maximum values of each feature, which can be important for some algorithms. It also improves the convergence and stability of some machine learning algorithms, particularly those that use gradient-based optimization.

Q4. Why do we normalize values?

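A. We normalize values to bring them onto a common scale, making it easier to compare and analyze data. Note that min-max normalization does not reduce the impact of outliers: a single extreme value sets the minimum or maximum and compresses the remaining values into a narrow part of the range. What normalization does improve is the stability and convergence of scale-sensitive models.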

Q5. How do you normalize a set of values?

A. For min-max normalization, we find the minimum and maximum of the data, subtract the minimum from each value, and divide by the range (maximum minus minimum), which scales the values to a range of 0 to 1. The word "normalize" is also used loosely for other techniques: z-score standardization, where we subtract the mean from each value and divide by the standard deviation to obtain values with a mean of 0 and a standard deviation of 1, and unit vector normalization, where we scale the values to have a length of 1.
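Unit vector normalization is the one technique in this answer not shown elsewhere in the article, so here is a minimal NumPy sketch. The function name and sample vector are illustrative. sklearn's Normalizer applies the same idea row by row, which is why it behaves differently from MinMaxScaler, which works column by column.

```python
import numpy as np


def unit_vector_normalize(v):
    """Scale a vector so that its Euclidean (L2) length is 1."""
    v = np.asarray(v, dtype=float)
    length = np.linalg.norm(v)
    if length == 0:
        # A zero vector has no direction; returning it unchanged is an assumed convention.
        return v
    return v / length


sample = [3.0, 4.0]                     # hypothetical sample with two features
unit = unit_vector_normalize(sample)
print(unit, np.linalg.norm(unit))       # the direction is kept, the length becomes 1
```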


Responses From Readers

Ali 12 Apr, 2020

Excellent article! Thank you very much for sharing. I have one question. In the post you say: "It is a good practice to fit the scaler on the training data and then use it to transform the testing data.", but I didn't see that in the code you posted. Am I wrong? How would one "fit the scaler on the training data and then use it to transform the testing data"? Thanks a lot again

sahil kamboj 28 Apr, 2020

Good article! Thank you very much for sharing. I have one question. What the difference between sklearn.preprocessing import MinMaxScaler Normalization and sklearn.preprocessing.Normalizer? When to use MinMaxScaler and when to Normalize?

Subhash Kumar Nadar 23 May, 2020

Excellent article! Easy to understand and good coverage. One question: I see that there is a scale() function as well in sklearn, and its short description suggests it is similar to StandardScaler, i.e. scaling to unit variance. I could not find more than this explanation. Can you please suggest which one to use in which scenario? Thanks in advance!

Golla kedarkumar 24 May, 2020

Hi ANIRUDDHA, If we use the same scaler for train and testing, does it affect the testing data because in standardization we need to use the mean of the data. If we take the mean of the train data and scale the test data, it will influence the test data, right?

Inas 27 May, 2020

Excellent article, thank you for sharing.

soumadip roy 04 Jul, 2020

This is an excellent write up. Thanks for this.

HARSHVARDHAN BHATT 05 Jul, 2020

That graphs really helps in putting things in perspective...thanks !

Arnob 05 Jul, 2020

Hey bro! Great article. It covered a lots of topics that were unclear to me before. I have a basic question. How can I check my data after normalization. You have mentioned to use pd.describe() in "Normalization using sklearn' section. But when I use it I get an error - " module 'pandas' has no attribute 'describe'". Can you tell me how to check my data after normalization? Thank you for your time.

Omobolaji 11 Jul, 2020

I'm quite new to ML, and I definitely find this article very explicit and helpful. Thank you for sharing. My question is regarding outliers. Since normalisation centres the values around 0-1, is it fair to say it rids the data of outliers; as against standardisation, which might not....Simply put, after normalising, will outliers still affect the model?

deeps 15 Jul, 2020

Excellent article !

Niyaz 18 Jul, 2020

Do we need to scale our test data? If yes, then on what basis, because we may not know how the test data varies; it may be much larger than the training data's maximum value. And please let me know how preprocessing is done on the test data before fitting. Thank you

Kunal 31 Jul, 2020

Thanks for Great Article..!!!

Zineb 17 Aug, 2020

Thanks Bhandari. Easy to understand and very helpful.

Tony 06 Dec, 2021

Hi Aniruddha!Quick question - I've seen a few people already mention that standardization is good if the feature follows Gaussian distribution but don't really get why. Mind shedding more light on that?

Justin Ma 26 Jan, 2022

"What could be the reason behind this quirk?" If you're wondering why there is a computational improvement, it's because of the matrix algebra in the background. It also reduces rounding errors. In layperson terms, it's much easier to do a division with similar numbers e.g. 5/3 vs. 5/0.0003. It's not different for computers.

Jasbir Singh 16 Aug, 2022

Hi, Excellent article. Have you worked on an example of linear regression with k-means clustering and feature scaling after train-test split? An example would be helpful to understand the entire process. Cheers for spreading insightful knowledge

Jigme T 24 Oct, 2022

Hi !! I would like to share a sample of my data and seek your advice on the featuring scaling methods.

Geofrey 13 Dec, 2022

Quite an outstanding article! It answered most of my pressing questions in regards to scaling... Thank You so much.

Idelphonse 12 Jan, 2024

Thank you for this article. We know that standardization is recommended when the distribution of the features is not Gaussian while normalization is appropriate for Gaussian features. However, in practice, we always have features with both Gaussian and non-Gaussian distributions. What is the better approach in this case? Are we going to combine both approaches for scaling features (i.e. normalization for non-gaussian features and standardization for Gaussian ones) ?
