Label Noise
Label noise refers to errors or inconsistencies in the training data labels that compromise the quality of supervised learning models.
Why is it such a hot research topic?
Recent studies highlight that modern deep neural networks can easily memorize noisy labels, leading to poor generalization. Zhang et al. (2017) demonstrated that large models are flexible enough to fit completely random labels to zero training error, which raises serious concerns about their robustness to noisy supervision.
Sources of label noise
- Human annotators: fatigue, lack of domain expertise.
- Ambiguity in data: is a tweet “sarcasm” or “anger”?
- Automatic pipelines: weak supervision and heuristic labeling.
Current research directions
- Noise-tolerant loss functions: e.g. generalized cross-entropy, which dampens the influence of likely mislabeled samples (a sketch follows this list).
- Co-teaching methods: two networks teach each other by focusing on clean samples.
- Active learning: selectively querying human experts to re-label the samples the model is most uncertain about (see the uncertainty-sampling sketch after this list).
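To make the first direction concrete, here is a minimal PyTorch sketch of a generalized cross-entropy loss in the style of Zhang and Sabuncu's (2018) L_q loss; the function name and the choice q=0.7 are illustrative, not taken from any specific implementation.

```python
import torch
import torch.nn.functional as F

def generalized_cross_entropy(logits, targets, q=0.7):
    # L_q loss: (1 - p_y^q) / q. As q -> 0 it approaches standard
    # cross-entropy; at q = 1 it equals mean absolute error, which is
    # more robust to mislabeled samples.
    probs = F.softmax(logits, dim=1)
    # Probability the model assigns to each sample's given label.
    p_y = probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return ((1.0 - p_y.clamp_min(1e-7) ** q) / q).mean()

# Hypothetical usage with random data:
logits = torch.randn(8, 5)
targets = torch.randint(0, 5, (8,))
loss = generalized_cross_entropy(logits, targets)
```

Because low-confidence predictions contribute a bounded loss rather than an unbounded log term, suspicious samples pull less hard on the gradients than they would under plain cross-entropy.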
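For the active-learning direction, a minimal NumPy sketch of one common query strategy, least-confidence sampling; the function name and the batch size k are hypothetical.

```python
import numpy as np

def least_confidence_query(probs, k=10):
    # probs: (n_samples, n_classes) predicted class probabilities.
    # Uncertainty = 1 - max class probability; return the indices of
    # the k most uncertain samples for expert re-labeling.
    uncertainty = 1.0 - probs.max(axis=1)
    return np.argsort(uncertainty)[::-1][:k]
```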
Why it matters
Label noise is not just a technical nuisance; it is a barrier to deploying AI in safety-critical domains. In healthcare, mislabeled X-rays may teach a model to miss early signs of cancer. In autonomous driving, incorrect labels for “pedestrian vs. background” could have tragic consequences.
Label noise is also one of the most underestimated challenges in machine learning: data collection pipelines often optimize for volume, while the hidden cost of mislabeled samples quietly distorts training. Even a small percentage of noisy labels can propagate into systematic errors that erode trust in deployed systems.
There are different types of label noise (a small noise-injection sketch follows the list):
- Random noise — errors distributed without clear pattern, often from inattentive annotation.
- Systematic noise — consistent mistakes tied to bias or flawed guidelines (e.g., always confusing two tumor types).
- Adversarial noise — deliberate mislabeling, which is a growing concern in cybersecurity and content moderation.
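To make the first two categories concrete, here is a small NumPy sketch that injects each kind of noise into a clean label array; the function names and noise rates are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def add_random_noise(labels, n_classes, rate=0.2):
    # Random noise: flip a fraction of labels uniformly to a *different*
    # class, with no pattern across classes.
    noisy = labels.copy()
    flip = rng.random(len(noisy)) < rate
    # Adding an offset in [1, n_classes) modulo n_classes guarantees
    # the new label differs from the original.
    noisy[flip] = (noisy[flip] + rng.integers(1, n_classes, flip.sum())) % n_classes
    return noisy

def add_systematic_noise(labels, src, dst, rate=0.4):
    # Systematic noise: consistently confuse one specific class with
    # another (e.g. always mislabeling tumor type `src` as `dst`).
    noisy = labels.copy()
    mask = (labels == src) & (rng.random(len(labels)) < rate)
    noisy[mask] = dst
    return noisy
```

Systematic noise is usually the more damaging of the two, since the model can learn the consistent confusion as if it were signal.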
Modern approaches to handling label noise include robust loss functions that down-weight suspicious samples, semi-supervised learning in which the model re-estimates uncertain labels, and co-teaching, where two networks learn in parallel and each trains only on the samples its peer flags as reliable (a minimal sketch follows). These methods do not eliminate noise, but they reduce its destructive impact.
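Here is a schematic single co-teaching step in PyTorch, assuming two classifiers and their optimizers are already built; the cross-update on small-loss samples follows the idea of Han et al. (2018), while the names and the forget_rate value are illustrative.

```python
import torch

def coteaching_step(model_a, model_b, opt_a, opt_b, x, y, forget_rate=0.2):
    # Per-sample losses let each network rank the batch by "cleanliness".
    loss_fn = torch.nn.CrossEntropyLoss(reduction="none")
    n_keep = int((1.0 - forget_rate) * len(y))

    with torch.no_grad():
        idx_a = torch.argsort(loss_fn(model_a(x), y))[:n_keep]  # A's small-loss picks
        idx_b = torch.argsort(loss_fn(model_b(x), y))[:n_keep]  # B's small-loss picks

    # Cross-update: each network trains only on the samples its peer
    # selected as likely clean.
    opt_a.zero_grad()
    loss_fn(model_a(x[idx_b]), y[idx_b]).mean().backward()
    opt_a.step()

    opt_b.zero_grad()
    loss_fn(model_b(x[idx_a]), y[idx_a]).mean().backward()
    opt_b.step()
```

The cross-update matters: because the two networks start differently, they memorize different noisy samples, so each acts as a filter for the other.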
Ultimately, label noise highlights the central truth of AI: quality matters more than quantity. The most sophisticated architectures cannot overcome a foundation of flawed supervision, which is why human oversight and thoughtful dataset design remain indispensable.
📖 References
- Zhang, C., et al. (2017). Understanding deep learning requires rethinking generalization. ICLR.
- Label Noise in Machine Learning. arXiv survey.