By clicking "Accept", you agree to the storing of cookies on your device to enhance site navigation, analyze site usage, and assist in our marketing efforts. See our Privacy Policy for more information
Glossary
Data Leakage
AI DEFINITION

Data leakage occurs when information from outside the training dataset (such as from the test set or from future data) is inappropriately used during model training. The result is overly optimistic performance estimates, while the model itself fails to generalize to unseen data.

Examples

  • Normalization statistics (mean, variance) computed on the full dataset rather than on the training split alone.
  • Duplicate or near-duplicate records present in both the train and test splits.
  • Features that would not actually be available at prediction time, such as “time of discharge” when predicting hospital readmission.

Consequences

  • Inflated evaluation metrics (accuracy, F1-score, etc.).
  • Poor real-world performance once deployed.
  • Loss of credibility in model validation and governance processes.

Prevention strategies

Data leakage is sometimes called the “hidden trap” of machine learning. What makes it so dangerous is that it often goes unnoticed until the model is deployed, at which point performance collapses. In practice, leakage is not always obvious: it can stem from subtle correlations between features and labels that exist only in the dataset, not under real-world conditions.

One common form of leakage happens through data preprocessing steps. For example, if normalization parameters (mean, variance) are computed using the entire dataset instead of just the training split, information from the test set inadvertently influences training. Another subtle case is when duplicate or near-duplicate records appear in both train and test splits.
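As a minimal sketch of this preprocessing pitfall, the snippet below contrasts a leaky setup (scaler fitted on all rows before splitting) with a leak-free one (scaler fitted inside a pipeline on the training split only). It uses scikit-learn on synthetic data; the dataset, model choice, and numbers are illustrative, and on synthetic data the metric gap may be small. The point is the pattern, not the magnitude.

```python
# Illustrative sketch: leaky vs. leak-free preprocessing with scikit-learn.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1_000, n_features=20, random_state=0)

# Leaky version: scaling statistics (mean, variance) are computed on ALL rows,
# so the test split influences the transformation applied during training.
X_scaled = StandardScaler().fit_transform(X)
X_tr, X_te, y_tr, y_te = train_test_split(X_scaled, y, random_state=0)
leaky_model = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)
print("leaky accuracy:", leaky_model.score(X_te, y_te))

# Leak-free version: split first, then let the pipeline fit the scaler on the
# training split only; the fitted statistics are simply reused at test time.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clean_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1_000))
clean_model.fit(X_tr, y_tr)
print("leak-free accuracy:", clean_model.score(X_te, y_te))
```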

Preventing leakage requires not just technical controls but also strong collaboration with domain experts. They can often spot variables that are unrealistic in practice (such as including “time of discharge” to predict readmission). Ultimately, avoiding leakage is about respecting the temporal and causal order of information—training models only with what would truly be available at prediction time.
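One way to respect that ordering, sketched below with hypothetical column names, is a strict chronological split: records before a cutoff date go into training, later records into evaluation, and any feature recorded after the prediction point is dropped from the feature set. This is an assumption-laden illustration, not a prescribed recipe.

```python
# Illustrative sketch: enforcing temporal order with a cutoff date, so the model
# only ever sees information that existed before the moment of prediction.
import pandas as pd

events = pd.DataFrame({
    "patient_id": [1, 2, 3, 4],
    "admitted_at": pd.to_datetime(
        ["2023-01-05", "2023-02-10", "2023-06-01", "2023-07-15"]
    ),
    "readmitted_within_30d": [0, 1, 0, 1],
})

cutoff = pd.Timestamp("2023-06-01")
train = events[events["admitted_at"] < cutoff]    # strictly earlier records only
test = events[events["admitted_at"] >= cutoff]    # later period held out for evaluation
print(len(train), "training rows /", len(test), "test rows")

# Any column capturing information recorded after the prediction point (for example,
# a discharge-related field when predictions are made at admission) would also need
# to be removed from the training features.
```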
