Introduction to Batch Normalization

Shipra Saxena Last Updated : 01 May, 2025

One of the most common problems data science professionals face is avoiding over-fitting. Have you encountered a situation when your model performs very well on the training data but cannot accurately predict the test data? The reason is your model is overfitting. The solution to such a problem is regularization. In this article, you will learn what batch normalization is.

Regularization techniques help a model generalize, and some of them also help it converge faster. We have several such tools at our disposal, including early stopping, dropout, weight initialization techniques, and batch normalization. Regularization helps prevent the model from overfitting, and the learning process becomes more efficient.

In this article, you will learn about batch normalization (also spelled batch normalisation) and its significance in deep learning. We will explore how it enhances model performance, stabilizes training, and accelerates convergence.

Learning Objectives:

Understand what batch normalization is and why it is needed in deep neural networks.
Learn how batch normalization works, including the steps of normalization and rescaling/offsetting.
Explore the different techniques of batch normalization and their impact.

What is Batch Normalization?
How does Batch Normalization work?
- Normalization of the Input
- Rescaling and Offsetting
Batch Normalization Techniques
Advantages of Batch Normalization
Batch Normalization in TensorFlow
Conclusion
Frequently Asked Questions

What is Batch Normalization?

Before getting into batch normalization, let’s understand the term “Normalization”.

Normalization is a data pre-processing tool that brings numerical data to a common scale without distorting its shape.

Generally, when we input the data to a machine learning algorithm or deep learning algorithm, we tend to change the values to a balanced scale. We normalize partly to ensure that our model can generalize appropriately.
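As a rough illustration of bringing data to a common scale, the sketch below applies two common pre-processing schemes to a small made-up table. The data values and the variable names are assumptions for the example, not something from this article.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales
# (column 0: age in years, column 1: income in dollars).
data = np.array([
    [25.0, 40_000.0],
    [32.0, 85_000.0],
    [47.0, 120_000.0],
    [51.0, 62_000.0],
])

# Min-max scaling: map each column to the range [0, 1].
col_min = data.min(axis=0)
col_max = data.max(axis=0)
min_max_scaled = (data - col_min) / (col_max - col_min)

# Standardization: give each column zero mean and unit standard deviation.
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

print(min_max_scaled)
print(standardized)
```

Both versions keep the relative ordering of the values within each column; only the scale changes.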

Now, back to batch normalization. It is a process that makes neural networks faster and more stable by adding extra layers to a deep neural network. The new layer standardizes and normalizes the input of a layer coming from the previous layer.

But what is the reason behind the term “Batch” in batch normalization? A typical neural network is trained on a collection of input samples called a batch. Similarly, the normalization is computed over a batch, not over a single input.

Let’s understand this through an example. We have a deep neural network, as shown in the following image.

[Figure: neural network]

[Figure: sigmoid activation function]

Initially, our inputs X1, X2, X3, and X4 are in normalized form, as they come from the pre-processing stage. When the input passes through the first layer, it is transformed by a sigmoid function applied to the dot product of the input X and the weight matrix W. Similarly, this transformation takes place at the second layer and continues until the last layer L, as shown in the following image.

Although our input X was normalized, over time the output will no longer be on the same scale. As the data passes through multiple layers of the neural network and L activation functions are applied, it leads to an internal covariate shift in the data.
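The drift in scale can be seen with a minimal NumPy sketch. It assumes a batch of 256 examples, three hidden layers of four units each, and random weights standing in for a trained network; none of these choices come from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A batch of 256 examples with 4 normalized input features (X1..X4).
X = rng.standard_normal((256, 4))

# Three hidden layers with 4 units each; the weights are random stand-ins.
weights = [rng.standard_normal((4, 4)) for _ in range(3)]

layer_stats = [(X.mean(), X.std())]
h = X
for W in weights:
    h = sigmoid(h @ W)  # sigmoid applied to the dot product of input and weights
    layer_stats.append((h.mean(), h.std()))

for i, (mean, std) in enumerate(layer_stats):
    print(f"layer {i}: mean={mean:.3f}, std={std:.3f}")
```

Because a sigmoid only outputs values between 0 and 1, the activations after the first layer can no longer have a mean of zero and a standard deviation of one, even though the input did.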

How does Batch Normalization work?

Since we now have a clear idea of why we need batch normalization, let’s understand how it works. It is a two-step process. First, the input is normalized, and then rescaling and offsetting are performed.

Normalization of the Input

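Normalization is the process of transforming the data to have a mean of zero and a standard deviation of one. In this step, we take the batch of inputs arriving from layer h and first calculate the mean of these hidden activations.

$$\mu = \frac{1}{m} \sum_{i=1}^{m} h_i$$

Here, m is the number of examples in the mini-batch, and h_i is the activation of a given neuron at layer h for the i-th example, so the mean is computed separately for each neuron.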

Once we have the mean, the next step is to calculate the standard deviation of the hidden activations.

$$\sigma = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (h_i - \mu)^2}$$

Now that we have the mean and the standard deviation, we normalize the hidden activations using these values. For this, we subtract the mean from each input and divide the result by the sum of the standard deviation and the smoothing term (ε).

$$h_i^{norm} = \frac{h_i - \mu}{\sigma + \epsilon}$$

The smoothing term (ε) ensures numerical stability by preventing division by zero.
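The normalization step described above can be sketched in NumPy as follows. This is an illustrative sketch, not a library implementation: the function name `normalize_batch`, the array shape `(batch_size, num_neurons)`, and the default `eps` value are assumptions, and the statistics are taken over the batch axis so that each neuron gets its own mean and standard deviation.

```python
import numpy as np

def normalize_batch(h, eps=1e-5):
    """Normalize hidden activations h of shape (batch_size, num_neurons)."""
    mu = h.mean(axis=0)                # mean of each neuron over the batch
    sigma = h.std(axis=0)              # standard deviation of each neuron over the batch
    h_norm = (h - mu) / (sigma + eps)  # eps prevents division by zero
    return h_norm, mu, sigma

rng = np.random.default_rng(0)
# Hypothetical activations: 64 examples, 3 neurons, deliberately off-center.
h = rng.normal(loc=5.0, scale=3.0, size=(64, 3))

h_norm, mu, sigma = normalize_batch(h)
print("before:", h.mean(axis=0), h.std(axis=0))
print("after: ", h_norm.mean(axis=0), h_norm.std(axis=0))
```

After normalization, each column should have a mean close to zero and a standard deviation close to one. Many implementations divide by the square root of the variance plus ε instead of the standard deviation plus ε; for small ε the two give nearly identical results.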

Rescaling and Offsetting

In the final operation, the normalized input is re-scaled and offset. Here, two components of the BN algorithm come into the picture: γ (gamma) and β (beta). These parameters are used for re-scaling (γ) and shifting (β) the vector of values produced by the previous operations.

These two are learnable parameters. During training, the network updates γ and β together with its weights, so each layer can settle on the scale and offset that work best for it.
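Putting both steps together, a forward pass might look like the sketch below. The function name `batch_norm_forward` and the specific γ and β values are made up for illustration; in a real network these parameters would be learned by gradient descent rather than set by hand.

```python
import numpy as np

def batch_norm_forward(h, gamma, beta, eps=1e-5):
    """Normalize h over the batch axis, then re-scale by gamma and shift by beta."""
    mu = h.mean(axis=0)
    sigma = h.std(axis=0)
    h_norm = (h - mu) / (sigma + eps)
    return gamma * h_norm + beta

rng = np.random.default_rng(1)
h = rng.normal(loc=5.0, scale=3.0, size=(128, 2))

gamma = np.array([2.0, 0.5])   # one scale per neuron (learned in practice)
beta = np.array([-1.0, 3.0])   # one shift per neuron (learned in practice)

out = batch_norm_forward(h, gamma, beta)
print(out.mean(axis=0))  # close to beta
print(out.std(axis=0))   # close to gamma
```

The output of each neuron ends up with a mean near its β and a standard deviation near its γ, regardless of the scale of the incoming activations.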

Batch Normalization Techniques

Batch normalization is a deep learning technique that helps our models learn and adapt quickly. It’s like a teacher who helps students by breaking down complex topics into simpler parts.

Why do we need it?

Imagine you’re trying to hit a moving target with a dart. It would be much harder than hitting a stationary one, right? Similarly, in deep learning, our target keeps changing during training due to the continuous updates in weights and biases. This is known as the “internal covariate shift”. Batch normalization helps us stabilize this moving target, making our task easier.

How does it work?

Batch normalization works by normalizing the output of a previous activation layer by subtracting the batch mean and dividing by the batch standard deviation. However, these normalized values may not follow the original distribution. To tackle this, batch normalization introduces two learnable parameters, gamma and beta, which can shift and scale the normalized values.
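To see why γ and β let the network get back to the original distribution when that is useful, consider the sketch below. It assumes the same σ + ε normalization used in this article and picks γ and β by hand; a trained network would only arrive at such values if they helped reduce the loss.

```python
import numpy as np

eps = 1e-5
rng = np.random.default_rng(2)
h = rng.normal(loc=-4.0, scale=2.5, size=(32, 3))

# Normalize over the batch axis.
mu = h.mean(axis=0)
sigma = h.std(axis=0)
h_norm = (h - mu) / (sigma + eps)

# If training drove gamma and beta to these particular values,
# the layer would reproduce its input.
gamma = sigma + eps
beta = mu
restored = gamma * h_norm + beta

print(np.allclose(restored, h))
```

With these values the scale and shift undo the normalization, up to floating-point rounding, so adding the layer does not take away anything the network could represent before.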

Benefits of Batch Normalization

Speeds up learning: Reducing internal covariate shift helps the model train faster.
Regularizes the model: It adds a little noise to your model, and in some cases, you might not even need to use dropout or other regularization techniques.
Allows higher learning rates: Gradient descent usually requires small learning rates for the network to converge. Batch normalization helps us use much larger learning rates, speeding up the training process.

Advantages of Batch Normalization

Now, let’s look into the advantages the BN process offers.

Speed Up the Training

By normalizing the hidden layer activations, batch normalization speeds up the training process.

Handles internal covariate shift

It addresses the problem of internal covariate shift by ensuring that the input to every layer is distributed around the same mean and standard deviation. If you are unaware of what internal covariate shift is, look at the following example.

Internal covariate shift

Suppose we train an image classification model that classifies images into Dog or Not Dog. Let’s say we have images of white dogs only. These images will also have a certain distribution. Using these images will update the model’s parameters.

Later, if we get a new set of images consisting of non-white dogs, these new images will have a slightly different distribution from the previous images. Now, the model will change its parameters according to these new images. Hence, the distribution of the hidden activation will also change. This change in hidden activation is known as an internal covariate shift.
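The effect of a parameter update on the hidden activations can be imitated with a small NumPy sketch. Everything here is a stand-in: the weights are random, and the "update" is an arbitrary scaling and bias shift rather than a real gradient step.

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

X = rng.standard_normal((512, 4))      # the same inputs before and after the update
W = rng.standard_normal((4, 3)) * 0.5  # first-layer weights (random stand-ins)
b = np.zeros(3)

hidden_before = sigmoid(X @ W + b)

# Stand-in for a few training steps: the first layer's parameters move.
W_updated = W * 3.0
b_updated = b + 1.0
hidden_after = sigmoid(X @ W_updated + b_updated)

print("before update: mean", hidden_before.mean(axis=0), "std", hidden_before.std(axis=0))
print("after update:  mean", hidden_after.mean(axis=0), "std", hidden_after.std(axis=0))
```

The inputs are identical in both passes, yet the next layer receives activations with a different mean and spread once the parameters have changed. That moving distribution is what the term internal covariate shift refers to.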

However, according to a study by MIT researchers, batch normalization does not solve the problem of internal covariate shift.

In this research, they trained three models:

Model-1: Standard VGG network without batch normalization.

Model-2: Standard VGG network with batch normalization.

Model-3: Standard VGG with batch normalization and random noise.

This random noise has a non-zero mean and non-unit variance and is added after the batch normalization layer. The experiment led to two conclusions.

First, the third model has a less stable distribution across all layers: the noisy model has a higher variance than the other two models.

Second, the training accuracy of the second and third models is higher than that of the first model. So it can be concluded that internal covariate shift might not be a contributing factor in the performance of batch normalization.

Smoothens the Loss Function

Batch normalization smooths the loss landscape, which makes the model parameters easier to optimize and, in turn, improves the model’s training speed.

Batch normalization is a topic of huge research interest, and a large number of researchers are working on it.

Batch Normalization in TensorFlow

Batch normalization can be easily implemented in TensorFlow using the tf.keras.layers.BatchNormalization layer. Here’s a simple example of how to use it in a model:

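import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Conv2D, MaxPooling2D, Flatten

# Small CNN for 28x28 grayscale images with 10 output classes.
model = Sequential([
    Conv2D(32, (3, 3), input_shape=(28, 28, 1), activation='relu'),
    BatchNormalization(),  # normalize the activations of the first conv layer
    Conv2D(64, (3, 3), activation='relu'),
    BatchNormalization(),  # normalize the activations of the second conv layer
    MaxPooling2D(),
    Flatten(),  # flatten the feature maps so Dense sees one vector per image
    Dense(10, activation='softmax')
])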

In this example, batch normalization is applied after the convolutional layers, which is a common practice to help stabilize the training of the model.

Conclusion

To summarize, this article explained Batch Normalization and how it improves neural network performance. However, we need not perform all this manually, as deep learning libraries like PyTorch and TensorFlow handle the implementation complexities. Still, as a data scientist, it is worth understanding the intricacies of the back end.

Key Takeaways:

Batch normalization helps prevent overfitting and speeds up the training of deep neural networks.
It normalizes the activations of each layer by subtracting the mean and dividing by the standard deviation.
Rescaling and offsetting are done using the learnable parameters gamma and beta.
It is intended to handle internal covariate shift, and it smooths the loss landscape.

Frequently Asked Questions

Q1. When should I use batch normalization?

A. Use batch normalization when training deep neural networks to stabilize and accelerate learning, improve model performance, and reduce sensitivity to network initialization and learning rates.

Q2. What is the difference between normalization and BatchNormalization?

A. Normalization scales input data to a standard range, like 0 to 1. BatchNormalization normalizes intermediate layers’ activations during training, adjusting mean and variance to improve convergence.

Q3. What does batch normalization do in Keras?

A. In Keras, batch normalization standardizes each layer’s inputs to have a mean of zero and variance of one, thus stabilizing and accelerating the training process.

Q4. Why is batch normalization a regularization?

A. Batch normalization acts as a regularization method by reducing overfitting. It introduces noise through mini-batch statistics, which provides a slight regularizing effect similar to dropout.
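The noise from mini-batch statistics can be illustrated with a short NumPy sketch. The helper `normalize_batch`, the batch size of 8, and the random batches are assumptions made for this example.

```python
import numpy as np

def normalize_batch(h, eps=1e-5):
    """Normalize each column of h using the statistics of this batch only."""
    return (h - h.mean(axis=0)) / (h.std(axis=0) + eps)

rng = np.random.default_rng(4)
sample = np.array([[1.0, 2.0]])  # one fixed example with 2 features

# Place the same example into two different random mini-batches.
batch_a = np.vstack([sample, rng.normal(size=(7, 2))])
batch_b = np.vstack([sample, rng.normal(size=(7, 2))])

norm_a = normalize_batch(batch_a)[0]  # normalized value of the example in batch A
norm_b = normalize_batch(batch_b)[0]  # normalized value of the example in batch B

print(norm_a)
print(norm_b)
```

The same example is normalized to different values depending on which other examples share its mini-batch. During training this acts like a small amount of random noise on the activations, which is where the mild regularizing effect comes from.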
