F1 Score
The F1-score is the harmonic mean of precision and recall. It is designed to provide a single number that captures the trade-off between these two key classification metrics. Precision measures how many of the predicted positives are correct, while recall measures how many of the actual positives are correctly identified.
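In terms of true positives (TP), false positives (FP), and false negatives (FN), the three quantities are:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$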
Why it matters
In imbalanced datasets, accuracy alone is often misleading. For example, in fraud detection, a model could achieve 99% accuracy simply by predicting “no fraud” every time, but it would miss all actual fraud cases. The F1-score balances the ability to catch positive cases (recall) with the ability to avoid false alarms (precision).
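As a quick sanity check, here is a minimal sketch of the fraud scenario above, assuming a toy dataset of 1,000 transactions containing 10 fraud cases; the numbers are illustrative only.

```python
y_true = [1] * 10 + [0] * 990   # 1 = fraud, 0 = no fraud
y_pred = [0] * 1000             # the model always predicts "no fraud"

# Accuracy looks excellent: 990 of 1,000 predictions are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Precision, recall, and F1 computed from first principles.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")  # accuracy=0.990, f1=0.000
```

The F1-score of zero immediately exposes what the 99% accuracy hides: the model catches no fraud at all.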
Applications
- Healthcare AI: predicting diseases where both false negatives and false positives have serious consequences.
- Search engines & recommendation systems: balancing the retrieval of all relevant results against filtering out noise.
- Cybersecurity: ensuring robust detection while minimizing false positives that could overwhelm analysts.
Limitations
- It does not consider true negatives.
- It may not reflect business priorities; in medical screening, for instance, recall may be far more critical than precision.
The F1-score is particularly useful when you need a balanced view of a classifier’s performance. By taking the harmonic mean, it punishes extreme differences between precision and recall. For instance, if precision is very high but recall is very low, the F1-score will be closer to the lower value, highlighting the imbalance.
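For example, with precision 0.9 and recall 0.1, the arithmetic mean would be a reassuring 0.5, but the harmonic mean is:

$$
F_1 = 2 \cdot \frac{0.9 \times 0.1}{0.9 + 0.1} = 0.18
$$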
Beyond the basic F1-score, there are variants like Fβ-scores, where β allows you to give more weight to recall or precision depending on the problem. This is valuable in domains such as medical diagnostics (prioritising recall) or spam filtering (prioritising precision).
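Formally, the Fβ-score generalises F1 as:

$$
F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}
$$

Setting β = 1 recovers the standard F1-score; β = 2, which values recall twice as much as precision, is a common choice in recall-critical settings.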
It’s worth noting that the F1-score is class-specific in multi-class settings. To evaluate overall performance, practitioners often compute macro-F1 (the unweighted average of per-class scores), micro-F1 (computed from global counts of true positives, false positives, and false negatives), or weighted-F1 (the per-class average weighted by class support). These choices can significantly affect the interpretation of results, especially with highly imbalanced classes, as the sketch below illustrates.
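A minimal sketch of these averaging modes, assuming scikit-learn is available; the class labels and predictions are purely illustrative:

```python
from sklearn.metrics import f1_score

# Hypothetical three-class ground truth and predictions, with class 2 rare.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 1, 0, 2, 0, 2]

# Per-class F1, then the three common ways of averaging it.
print(f1_score(y_true, y_pred, average=None))        # one score per class
print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean across classes
print(f1_score(y_true, y_pred, average="micro"))     # from global TP/FP/FN counts
print(f1_score(y_true, y_pred, average="weighted"))  # mean weighted by class support
```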