Mini-Batch Gradient Descent
Mini-batch gradient descent is an optimization method widely used to train deep learning models. Instead of processing the full dataset at once (batch gradient descent) or updating after every single sample (stochastic gradient descent), it splits the dataset into small subsets called mini-batches and updates the model parameters once per mini-batch.
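To make the mechanics concrete, here is a minimal sketch of mini-batch gradient descent on a toy least-squares problem. The data, learning rate, batch size, and epoch count are made-up illustrative values, not recommendations.

```python
import numpy as np

# Toy sketch: mini-batch gradient descent on least-squares linear regression.
# All sizes and hyperparameters below are made-up example values.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))           # 1000 samples, 10 features
true_w = rng.normal(size=10)
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(10)                          # parameters to learn
lr, batch_size, epochs = 0.1, 32, 20

for epoch in range(epochs):
    perm = rng.permutation(len(X))        # reshuffle the data each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]           # one mini-batch
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(Xb)   # MSE gradient on the batch
        w -= lr * grad                    # one parameter update per mini-batch
```

Each epoch makes many small updates (one per mini-batch) rather than a single update from the whole dataset.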
Why is it useful?
It strikes a balance: updates are frequent and efficient, without the computational cost of a full pass over the data before every step. Moreover, the slight randomness introduced by mini-batch sampling often helps the optimizer escape poor local minima and find solutions that generalize better.
Where is it applied?
- Training large-scale deep neural networks (e.g. ResNet, BERT).
- Computer vision tasks, where datasets like ImageNet are too massive for full-batch updates.
- Reinforcement learning, where agents update policies based on sampled experience buffers.
What are the trade-offs?
Batch size selection is crucial. Small mini-batches give noisy but cheap updates; larger ones yield smoother gradient estimates but consume more GPU memory per step. Researchers also debate whether a “generalization gap” appears at extremely large batch sizes.
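As a rough illustration of this trade-off, the toy sketch below measures how far mini-batch gradient estimates of different sizes stray from the full-batch gradient on synthetic data; the spread shrinks as the batch grows. All data and constants are made-up for illustration.

```python
import numpy as np

# Toy sketch of the noise/stability trade-off: larger mini-batches give
# gradient estimates closer to the full-batch gradient.
rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 10))
w_true = rng.normal(size=10)
y = X @ w_true + 0.5 * rng.normal(size=10000)

w = np.zeros(10)
full_grad = 2 * X.T @ (X @ w - y) / len(X)    # reference: full-batch gradient

for batch_size in (8, 64, 512):
    deviations = []
    for _ in range(200):                       # sample many mini-batches of this size
        idx = rng.choice(len(X), size=batch_size, replace=False)
        g = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch_size
        deviations.append(np.linalg.norm(g - full_grad))
    print(batch_size, np.mean(deviations))     # larger batch -> smaller average deviation
```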
Mini-batch gradient descent has become the workhorse of deep learning because it balances efficiency with statistical reliability: batches are small enough to inject useful randomness, yet large enough to give stable, well-estimated gradients at reasonable cost.
A key consideration is the choice of batch size. Smaller batches tend to generalize better because of the noise they inject into the updates, which acts as a form of regularization. Larger batches, on the other hand, can speed up training thanks to parallelization on GPUs, but they sometimes converge to sharp minima that hurt generalization. Researchers still debate the “optimal” size, and it often depends on the model, dataset, and hardware.
Beyond efficiency, mini-batch training has practical implications. It pairs naturally with per-epoch shuffling (which reduces bias from the ordering of the data), integrates cleanly with momentum and adaptive optimizers like Adam, and scales well to distributed computing environments. Without mini-batching, modern large-scale training of LLMs or image models would simply not be feasible.
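As a sketch of how these pieces typically fit together, the hypothetical training loop below uses PyTorch's DataLoader for per-epoch shuffling and mini-batch iteration alongside the Adam optimizer. The model, data, and hyperparameters are placeholder values.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data and model; sizes and hyperparameters are made-up examples.
X = torch.randn(1000, 10)
y = torch.randn(1000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)  # reshuffles each epoch

model = nn.Linear(10, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(5):
    for xb, yb in loader:                 # one mini-batch at a time
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()                   # gradient of the loss on this mini-batch
        optimizer.step()                  # Adam update from the mini-batch gradient
```

Changing the batch_size argument is all it takes to move along the small-batch/large-batch trade-off discussed above.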