Variance
What is variance in Machine Learning?
Variance measures how much a model’s predictions fluctuate if we train it on different subsets of the same dataset. High variance means the model is too sensitive, capturing noise instead of the underlying patterns.
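One way to see this concretely is to refit the same model class on random subsets of a single dataset and watch how its prediction at one fixed input moves around. The sketch below is purely illustrative (made-up data, NumPy polynomial fits standing in for "simple" vs "flexible" models):

```python
import numpy as np

rng = np.random.default_rng(0)

# One fixed dataset; we refit on random subsets of it and watch predictions move.
x = np.sort(rng.uniform(-1, 1, 60))
y = np.sin(3 * x) + rng.normal(0, 0.3, 60)

x0 = 0.5                  # probe input where we measure prediction spread
preds = {1: [], 9: []}    # polynomial degree -> predictions across subsets
for _ in range(200):
    idx = rng.choice(60, size=40, replace=False)   # random training subset
    for deg in preds:
        coefs = np.polyfit(x[idx], y[idx], deg)
        preds[deg].append(np.polyval(coefs, x0))

for deg, p in preds.items():
    print(f"degree {deg}: variance of predictions at x=0.5 -> {np.var(p):.5f}")
```

The degree-9 fit's predictions scatter far more across subsets than the degree-1 fit's: that spread is exactly the variance being described here.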
Variance vs Bias: the classic trade-off
- High variance, low bias → The model memorizes the training set but fails to generalize (overfitting). Example: a decision tree grown too deep.
- Low variance, high bias → The model is too simplistic, missing important patterns (underfitting). Example: using a linear model for a highly non-linear problem.
The goal is not to eliminate variance entirely but to control it so that the model remains flexible yet stable.
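The trade-off can also be measured directly: refitting a model on many freshly drawn training sets, the expected squared error at a test point decomposes into squared bias (how far the average prediction sits from the truth) plus variance (how much individual predictions scatter around that average). A minimal NumPy sketch, assuming a known toy target function:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(3 * x)            # toy "true" function, assumed for illustration
x_test = np.linspace(-0.9, 0.9, 50)    # held-out inputs

def fit_predict(deg):
    """Draw a fresh noisy training set and return predictions on x_test."""
    x = rng.uniform(-1, 1, 30)
    y = f(x) + rng.normal(0, 0.3, 30)
    return np.polyval(np.polyfit(x, y, deg), x_test)

results = {}
for deg in (1, 9):
    P = np.stack([fit_predict(deg) for _ in range(300)])  # (runs, test points)
    bias2 = np.mean((P.mean(axis=0) - f(x_test)) ** 2)    # squared bias
    var = np.mean(P.var(axis=0))                          # average variance
    results[deg] = (bias2, var)
    print(f"degree {deg}: bias^2={bias2:.3f}  variance={var:.3f}")
```

Under these assumptions the flexible model shows near-zero bias but much larger variance, and the linear model shows the reverse; the complexity that minimizes test error usually sits somewhere in between.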
Real-world implications
- Healthcare: a high-variance diagnostic model may perform well on data from one hospital yet fail on data from another.

- Finance: a trading algorithm might show excellent backtesting results but collapse in live markets.
Mitigation techniques
- Cross-validation to detect variance early.
- Regularization and pruning to simplify complex models.
- Ensemble methods (bagging, random forests) that reduce variance by combining multiple learners.
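Bagging in particular can be illustrated without any ML library: fit many unstable models on bootstrap resamples and average their predictions. A small NumPy sketch (toy data; degree-9 polynomial fits standing in for deep trees):

```python
import numpy as np

rng = np.random.default_rng(2)

x0 = 0.5                    # probe input where we measure prediction spread
members, bagged = [], []
for _ in range(150):                        # fresh noisy dataset each run
    x = rng.uniform(-1, 1, 40)
    y = np.sin(3 * x) + rng.normal(0, 0.3, 40)
    votes = []
    for _ in range(25):                     # bagging: 25 bootstrap fits
        idx = rng.integers(0, 40, 40)       # bootstrap resample (with replacement)
        votes.append(np.polyval(np.polyfit(x[idx], y[idx], 9), x0))
    members.extend(votes)                   # individual high-variance learners
    bagged.append(np.mean(votes))           # their average: the bagged model

print(f"variance of individual fits: {np.var(members):.5f}")
print(f"variance of bagged average:  {np.var(bagged):.5f}")
```

Averaging cancels much of the learners' individual instability, which is the mechanism behind random forests as well.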
A closer look at variance
In machine learning, variance reflects how sensitive a model is to the specific training data it has seen. A high-variance model essentially “chases” the training set, fitting small idiosyncrasies that are unlikely to reappear in future data. The result is overfitting: strong training performance but poor generalization.
Variance is usually studied together with bias in the bias–variance trade-off. While high variance causes overfitting, low variance combined with high bias leads to underfitting. Effective models strike a balance: flexible enough to capture real patterns, yet constrained enough to ignore noise.
In practice, variance can be reduced with techniques such as cross-validation, ensemble methods (e.g., bagging, random forests), and regularization. Monitoring it is especially important in domains like healthcare or autonomous driving, where an overfitted model can fail under real-world variability, with dangerous consequences.
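As a concrete instance of the cross-validation point, k-fold CV exposes high variance as a gap between training error and validation error. A hand-rolled 5-fold sketch in NumPy (toy data; `kfold_gap` is a hypothetical helper for this illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(-1, 1, 60)
y = np.sin(3 * x) + rng.normal(0, 0.3, 60)

def kfold_gap(deg, k=5):
    """Return (mean train MSE, mean validation MSE) over k folds."""
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    tr_mse, va_mse = [], []
    for i in range(k):
        va = folds[i]                                        # held-out fold
        tr = np.concatenate([folds[j] for j in range(k) if j != i])
        c = np.polyfit(x[tr], y[tr], deg)
        tr_mse.append(np.mean((np.polyval(c, x[tr]) - y[tr]) ** 2))
        va_mse.append(np.mean((np.polyval(c, x[va]) - y[va]) ** 2))
    return np.mean(tr_mse), np.mean(va_mse)

results = {}
for deg in (1, 9):
    results[deg] = kfold_gap(deg)
    tr, va = results[deg]
    print(f"degree {deg}: train MSE={tr:.3f}  CV MSE={va:.3f}  gap={va - tr:.3f}")
```

A flexible model that fits its training folds much better than its validation folds is the classic high-variance signature; a model whose two scores are close but both poor points to high bias instead.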
📖 References
- Bias–variance tradeoff (Wikipedia)
- Domingos, P. (2012). A few useful things to know about machine learning. CACM.