By clicking "Accept", you agree to the storing of cookies on your device to enhance site navigation, analyze site usage, and assist in our marketing efforts. See our Privacy Policy for more information
Glossary
Data Leakage
AI DEFINITION

Data leakage occurs when information from outside the training dataset (such as from the test set or from future data) is inappropriately used during model training. The result is overly optimistic performance estimates, while the model itself fails to generalize to unseen data.

Examples

  • Normalization statistics (mean, variance) computed on the full dataset rather than on the training split alone.
  • Duplicate or near-duplicate records present in both the train and test splits.
  • Features that would not actually be available at prediction time, such as “time of discharge” when predicting hospital readmission.

Consequences

  • Inflated evaluation metrics (accuracy, F1-score, etc.).
  • Poor real-world performance once deployed.
  • Loss of credibility in model validation and governance processes.

Prevention strategies

Data leakage is sometimes called the “hidden trap” of machine learning. What makes it so dangerous is that it often goes unnoticed until the model is deployed, at which point performance collapses. In practice, leakage is not always obvious: it can stem from subtle correlations between features and labels that exist only in the dataset, not under real-world conditions.

One common form of leakage happens through data preprocessing steps. For example, if normalization parameters (mean, variance) are computed using the entire dataset instead of just the training split, information from the test set inadvertently influences training. Another subtle case is when duplicate or near-duplicate records appear in both train and test splits.
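As a minimal sketch of this preprocessing pitfall, the snippet below contrasts a leaky setup (scaler fitted on all rows before splitting) with a leak-free one (scaler fitted inside a pipeline on the training split only). It uses scikit-learn on synthetic data; the dataset, model choice, and numbers are illustrative, and on synthetic data the metric gap may be small. The point is the pattern, not the magnitude.

```python
# Illustrative sketch: leaky vs. leak-free preprocessing with scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# Leaky version: scaling statistics (mean, variance) are computed on ALL rows,
# so the test split influences the transformation applied during training.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
leaky_model = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
print("leaky accuracy:", leaky_model.score(X_te, y_te))

# Leak-free version: split first, then let the pipeline fit the scaler on the
# training split only; the fitted statistics are simply reused at test time.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clean_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1_000))
clean_model.fit(X_tr, y_tr)
print("leak-free accuracy:", clean_model.score(X_te, y_te))
```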

Preventing leakage requires not just technical controls but also strong collaboration with domain experts. They can often spot variables that are unrealistic in practice (such as including “time of discharge” to predict readmission). Ultimately, avoiding leakage is about respecting the temporal and causal order of information—training models only with what would truly be available at prediction time.
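One way to respect that ordering, sketched below with hypothetical column names, is a strict chronological split: records before a cutoff date go into training, later records into evaluation, and any feature recorded after the prediction point is dropped from the feature set. This is an assumption-laden illustration, not a prescribed recipe.

```python
# Illustrative sketch: enforcing temporal order with a cutoff date, so the model
# only ever sees information that existed before the moment of prediction.
import pandas as pd

events = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "admitted_at": pd.to_datetime(
        ["2023-01-05", "2023-02-10", "2023-06-01", "2023-07-15"]
    ),
    "readmitted_within_30d": [0, 1, 0, 1],
})

cutoff = pd.Timestamp("2023-06-01")
train = events[events["admitted_at"] < cutoff]    # strictly earlier records only
test = events[events["admitted_at"] >= cutoff]    # later period held out for evaluation
print(len(train), "training rows /", len(test), "test rows")

# Any column capturing information recorded after the prediction point (for example,
# a discharge-related field when predictions are made at admission) would also need
# to be removed from the training features.
```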
