Deep learning has revolutionised the AI field by allowing machines to learn far richer representations of our data, loosely inspired by the way neurons and synapses work in the brain. One of the most critical aspects of training deep learning models is how we feed data into the model during training. This is where batch processing and mini-batch training come into play. How we train our models affects their overall performance once they are put into production. In this article, we’ll delve into these concepts, compare their pros and cons, and explore their practical applications.
Training a deep learning model involves minimizing a loss function that measures the difference between the predicted outputs and the actual labels. In other words, training is a constant interplay between forward propagation (computing predictions and the loss) and backward propagation (computing gradients). This minimization is typically achieved using gradient descent, an optimization algorithm that updates the model parameters in the direction that reduces the loss.
In practice, the data is rarely passed one sample at a time or all at once, due to computational and memory constraints. Instead, it is passed in chunks called “batches.”
In the early stages of machine learning and neural network training, two common methods of data processing were used:
1. Stochastic Learning
This method updates the model weights using a single training sample at a time. While it offers the fastest weight updates and can be useful in streaming-data applications, it has significant drawbacks: the gradient estimates are very noisy, convergence can be unstable, and processing one sample at a time makes poor use of vectorised hardware.
2. Full-Batch Learning
Here, the entire training dataset is used to compute the gradients and perform a single update to the model parameters. Its very stable gradients and convergence behaviour are great advantages. On the downside, it consumes a lot of memory, each update is expensive because every sample must be processed, and it does not scale to large datasets.
As datasets grew larger and neural networks became deeper, these approaches proved inefficient in practice. Memory limitations and computational inefficiency pushed researchers and engineers to find a middle ground: mini-batch training.
Now, let us try to understand what batch processing and mini-batch processing are.
For each training step, the entire dataset is fed into the model all at once, a process known as batch processing. Another name for this technique is Full-Batch Gradient Descent.
Key Characteristics:
- The gradient is computed over the entire dataset, so each epoch performs exactly one parameter update.
- Gradients are stable and low-noise, but each update is computationally expensive.
- Memory requirements are high, since the whole dataset must be held and processed at once.
When to Use:
- The dataset is small enough to fit comfortably in memory.
- Stable, reproducible gradient estimates matter more than training speed.
A compromise between batch gradient descent and stochastic gradient descent is mini-batch training. Rather than the entire dataset or a single sample, it uses a small subset of the data for each update.
Key Characteristics:
- Each update uses a small subset of the data (commonly 32, 64, or 128 samples), so there are many updates per epoch.
- Gradients are moderately noisy, which often helps generalisation.
- Memory usage is moderate, and batches map well onto GPU/TPU parallelism.
When to Use:
- The dataset is large or does not fit in memory.
- You are training deep networks on GPUs/TPUs and want a balance of speed, stability, and generalisation.
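As a quick, illustrative sketch (the 1000-sample random dataset here is just a placeholder), a PyTorch DataLoader with batch_size=64 splits the data into mini-batches like this:

import torch
from torch.utils.data import DataLoader, TensorDataset

# 1000 samples with 10 features each, plus one target value per sample
data = TensorDataset(torch.randn(1000, 10), torch.randn(1000, 1))
loader = DataLoader(data, batch_size=64, shuffle=True)

print(len(loader))             # 16 mini-batches per epoch (the last one holds only 40 samples)
for xb, yb in loader:
    print(xb.shape, yb.shape)  # torch.Size([64, 10]) torch.Size([64, 1]) for a full batch
    break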
Let’s summarise the above algorithms in a tabular form:
Type | Batch Size | Update Frequency | Memory Requirement | Convergence | Noise |
---|---|---|---|---|---|
Full-Batch | Entire Dataset | Once per epoch | High | Stable, slow | Low |
Mini-Batch | e.g., 32/64/128 | After each batch | Medium | Balanced | Medium |
Stochastic | 1 sample | After each sample | Low | Noisy, fast | High |
Gradient descent works by iteratively updating the model’s parameters to minimise the loss function. In each step, we compute the gradient of the loss with respect to the model parameters and move in the opposite direction of the gradient.
Update rule: θ = θ − η ⋅ ∇θJ(θ)
Where:
- θ represents the model parameters,
- η is the learning rate (the step size),
- ∇θJ(θ) is the gradient of the loss function J with respect to θ.
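To make the update rule concrete, here is a minimal PyTorch sketch (not part of the article’s main example) of a single gradient-descent step on a toy quadratic loss:

import torch

# Toy parameters and learning rate (eta)
theta = torch.tensor([2.0, -3.0], requires_grad=True)
eta = 0.1

loss = (theta ** 2).sum()        # J(theta) = theta_1^2 + theta_2^2
loss.backward()                  # computes the gradient of J with respect to theta

with torch.no_grad():
    theta -= eta * theta.grad    # theta = theta - eta * gradient
    theta.grad.zero_()           # clear the gradient before the next step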
Imagine that you are blindfolded and trying to reach the lowest point on a playground slide. You take tiny steps downhill after feeling the slope with your feet. The steepness of the slope beneath your feet determines each step. Since we descend gradually, this is similar to gradient descent. The model moves in the direction of the greatest error reduction.
Full-batch descent is like studying a complete map of the slide before deciding on your best course of action. Stochastic descent is like asking a single friend which way to go and immediately taking a step. Mini-batch descent is like conferring with a small group before acting.
Let X ∈ ℝ^(n×d) be the input data, with n samples and d features.
Full-Batch Gradient Descent
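In this notation, the loss is averaged over all n samples, and one epoch performs a single parameter update:

θ = θ − η · ∇θ [ (1/n) Σᵢ L(f(xᵢ; θ), yᵢ) ]

where f(xᵢ; θ) is the model’s prediction for sample xᵢ and L is the per-sample loss.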
Mini-Batch Gradient Descent
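Here the loss is averaged over a mini-batch B of m samples (with m ≪ n), and one epoch performs roughly n/m updates:

θ = θ − η · ∇θ [ (1/m) Σ_{i ∈ B} L(f(xᵢ; θ), yᵢ) ]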
Consider attempting to estimate a product’s cost based on reviews.
It’s full-batch if you read all 1000 reviews before making a choice. Deciding after reading just one review is stochastic. A mini-batch is when you read a small number of reviews (say 32 or 64) and then estimate the price. Mini-batch strikes a good balance between being dependable enough to make wise decisions and quick enough to act quickly.
We will use PyTorch to demonstrate the difference between batch and mini-batch processing. Through this implementation, we will see how each approach converges towards a good minimum of the loss.
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt

# Create synthetic data
X = torch.randn(1000, 10)
y = torch.randn(1000, 1)

# Define model architecture
def create_model():
    return nn.Sequential(
        nn.Linear(10, 50),
        nn.ReLU(),
        nn.Linear(50, 1)
    )

# Loss function
loss_fn = nn.MSELoss()

# Mini-Batch Training
model_mini = create_model()
optimizer_mini = optim.SGD(model_mini.parameters(), lr=0.01)
dataset = TensorDataset(X, y)
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)

mini_batch_losses = []
for epoch in range(64):
    epoch_loss = 0
    for batch_X, batch_y in dataloader:
        optimizer_mini.zero_grad()
        outputs = model_mini(batch_X)
        loss = loss_fn(outputs, batch_y)
        loss.backward()
        optimizer_mini.step()
        epoch_loss += loss.item()
    mini_batch_losses.append(epoch_loss / len(dataloader))

# Full-Batch Training
model_full = create_model()
optimizer_full = optim.SGD(model_full.parameters(), lr=0.01)

full_batch_losses = []
for epoch in range(64):
    optimizer_full.zero_grad()
    outputs = model_full(X)
    loss = loss_fn(outputs, y)
    loss.backward()
    optimizer_full.step()
    full_batch_losses.append(loss.item())

# Plotting the Loss Curves
plt.figure(figsize=(10, 6))
plt.plot(mini_batch_losses, label='Mini-Batch Training (batch_size=64)', marker='o')
plt.plot(full_batch_losses, label='Full-Batch Training', marker='s')
plt.title('Training Loss Comparison')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
Here, we can visualise the training loss over time for both strategies and observe the difference: mini-batch training typically drives the loss down faster per epoch, because it performs many parameter updates within each epoch, while full-batch training makes only one update per epoch and therefore converges more slowly, though with a smoother curve.
In real applications, mini-batch training is often preferred for its better generalisation and computational efficiency.
The batch size we set is a hyperparameter that has to be tuned for the model architecture and the dataset size. An effective way to decide on a batch size is to compare candidate values against a held-out validation set or with cross-validation.
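For example, here is a rough sketch (not from the original example; it reuses the create_model(), loss_fn, X, and y defined above, and the candidate batch sizes and epoch count are arbitrary) of picking a batch size by validating each candidate on a held-out split:

from torch.utils.data import random_split

train_set, val_set = random_split(TensorDataset(X, y), [800, 200])

best_bs, best_val = None, float('inf')
for bs in [16, 32, 64, 128]:
    model = create_model()
    opt = optim.SGD(model.parameters(), lr=0.01)
    loader = DataLoader(train_set, batch_size=bs, shuffle=True)
    # Train briefly with this candidate batch size
    for epoch in range(20):
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
    # Evaluate on the held-out split
    with torch.no_grad():
        X_val = torch.stack([x for x, _ in val_set])
        y_val = torch.stack([t for _, t in val_set])
        val = loss_fn(model(X_val), y_val).item()
    if val < best_val:
        best_bs, best_val = bs, val
print('Selected batch size:', best_bs)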
Here’s a table to help you make this decision:
Feature | Full-Batch | Mini-Batch |
---|---|---|
Gradient Stability | High | Medium |
Convergence Speed | Slow | Fast |
Memory Usage | High | Medium |
Parallelization | Less | More |
Training Time | High | Optimized |
Generalization | Can overfit | Better |
Note: As discussed above, batch_size is a hyperparameter that has to be fine-tuned for our model training, so it is worth understanding how smaller and larger batch size values behave.
Smaller batch sizes mostly fall in the range of 1 to 64. Because gradients are updated more frequently (once per batch), the model starts learning early and adjusts its weights quickly. However, the constant weight updates mean many more iterations per epoch, which can add computational overhead and lengthen the training process.
The “noise” in the gradient estimates helps the model escape sharp local minima and avoid overfitting, often leading to better test performance, i.e. better generalisation. The same noise, however, can cause unstable convergence: if the learning rate is too high, the noisy gradients may make the model overshoot and diverge.
Think of small batch size as taking frequent but shaky steps toward your goal. You may not walk in a straight line, but you might discover a better path overall.
Larger batch sizes generally range from 128 upwards. They allow more stable convergence, since averaging over more samples per batch makes the gradients smoother and closer to the true gradient of the loss function. The flip side of such smooth gradients is that the model might not escape flat or sharp local minima.
Fewer iterations are needed to complete one epoch, which makes each epoch faster. However, large batches require more memory, typically demanding GPUs that can process such large chunks at once. And although each epoch is faster, the model may need more epochs to converge because of the smaller number of update steps and the lack of gradient noise.
A large batch size is like walking steadily towards your goal with preplanned steps, but sometimes you may get stuck because you don’t explore other paths.
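To see this noise effect directly, here is a small, illustrative sketch (it reuses X, y, loss_fn, and model_full from the earlier example; the batch sizes 8 and 256 are arbitrary choices) that measures how much the per-batch gradient norm fluctuates for a small versus a large batch size:

def grad_norm(model, xb, yb):
    # Gradient norm of the loss on a single batch
    model.zero_grad()
    loss_fn(model(xb), yb).backward()
    return torch.cat([p.grad.flatten() for p in model.parameters()]).norm().item()

for bs in (8, 256):
    loader = DataLoader(TensorDataset(X, y), batch_size=bs, shuffle=True)
    norms = [grad_norm(model_full, xb, yb) for xb, yb in loader]
    spread = torch.tensor(norms).std().item()
    print(f'batch_size={bs}: spread of per-batch gradient norms = {spread:.4f}')

With the larger batch size there are far fewer batches per epoch, and their gradient estimates typically scatter much less around the true full-batch gradient.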
Here’s a comprehensive table comparing full-batch and mini-batch training.
Aspect | Full-Batch Training | Mini-Batch Training |
---|---|---|
Pros | – Stable and accurate gradients – Precise loss computation | – Faster training due to frequent updates – Supports GPU/TPU parallelism – Better generalisation due to noise |
Cons | – High memory consumption – Slower per-epoch training – Not scalable for big data | – Noisier gradient updates – Requires tuning of batch size – Slightly less stable |
Use Cases | – Small datasets that fit in memory – When reproducibility is important | – Large-scale datasets – Deep learning on GPUs/TPUs – Real-time or streaming training pipelines |
When choosing between batch and mini-batch training, take the following into account:
- The size of the dataset and whether it fits in memory.
- The available hardware (GPU/TPU parallelism strongly favours mini-batches).
- Whether you value stable gradients or fast, frequent updates more.
- How important generalisation to unseen data is for your application.
Batch processing and mini-batch training are must-know foundational concepts in deep learning model optimisation. While full-batch training provides the most stable gradients, it is rarely feasible for modern, large-scale datasets because of the memory and computation constraints discussed at the start. Mini-batch training, on the other hand, strikes the right balance, offering good speed, generalisation, and compatibility with GPU/TPU acceleration. It has thus become the de facto standard in most real-world deep-learning applications.
Choosing the optimal batch size is not a one-size-fits-all decision. It should be guided by the size of the dataset and the available memory and hardware. The choice of optimiser and its settings (e.g. learning_rate, decay_rate), along with the desired generalisation and convergence speed, should also be taken into account. By understanding these dynamics and using tools such as learning rate schedules, adaptive optimisers (like Adam), and batch size tuning, we can build models more quickly, accurately, and efficiently.
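As a closing illustration, here is a brief sketch (it reuses the create_model(), dataloader, and loss_fn from the earlier example; the learning rate, step_size, and gamma values are arbitrary placeholders) that combines an adaptive optimiser with a learning rate schedule:

model = create_model()
optimizer = optim.Adam(model.parameters(), lr=1e-3)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

for epoch in range(60):
    for xb, yb in dataloader:
        optimizer.zero_grad()
        loss_fn(model(xb), yb).backward()
        optimizer.step()
    scheduler.step()   # halve the learning rate every 20 epochs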