F1 Score
The F1-score is the harmonic mean of precision and recall. It is designed to provide a single number that captures the trade-off between these two key classification metrics. Precision measures how many of the predicted positives are correct, while recall measures how many of the actual positives are correctly identified.
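In terms of true positives (TP), false positives (FP), and false negatives (FN), the three quantities are:

$$
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
$$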
Why it matters
In imbalanced datasets, accuracy alone is often misleading. For example, in fraud detection, a model could achieve 99% accuracy simply by predicting “no fraud” every time, but it would miss all actual fraud cases. The F1-score balances the ability to catch positive cases (recall) with the ability to avoid false alarms (precision).
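As a quick sanity check, here is a minimal sketch of the fraud scenario above, assuming a toy dataset of 1,000 transactions containing 10 fraud cases; the numbers are illustrative only.

```python
y_true = [1] * 10 + [0] * 990   # 1 = fraud, 0 = no fraud
y_pred = [0] * 1000             # the model always predicts "no fraud"

# Accuracy looks excellent: 990 of 1,000 predictions are correct.
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Precision, recall, and F1 computed from first principles.
tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))

precision = tp / (tp + fp) if (tp + fp) else 0.0
recall = tp / (tp + fn) if (tp + fn) else 0.0
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0

print(f"accuracy={accuracy:.3f}, f1={f1:.3f}")  # accuracy=0.990, f1=0.000
```

The F1-score of zero immediately exposes what the 99% accuracy hides: the model catches no fraud at all.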
Applications
- Healthcare AI: predicting diseases where both false negatives and false positives have serious consequences.
- Search engines & recommendation systems: balancing the retrieval of all relevant results against filtering out noise.
- Cybersecurity: ensuring robust detection while minimizing false positives that could overwhelm analysts.
Limitations
- It does not consider true negatives.
- It may not reflect business priorities; in medical screening, for instance, recall may be far more critical than precision.
The F1-score is particularly useful when you need a balanced view of a classifier’s performance. By taking the harmonic mean, it punishes extreme differences between precision and recall. For instance, if precision is very high but recall is very low, the F1-score will be closer to the lower value, highlighting the imbalance.
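For example, with precision 0.9 and recall 0.1, the arithmetic mean would be a reassuring 0.5, but the harmonic mean is:

$$
F_1 = 2 \cdot \frac{0.9 \times 0.1}{0.9 + 0.1} = 0.18
$$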
Beyond the basic F1-score, there are variants like Fβ-scores, where β allows you to give more weight to recall or precision depending on the problem. This is valuable in domains such as medical diagnostics (prioritising recall) or spam filtering (prioritising precision).
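Formally, the Fβ-score generalises F1 as:

$$
F_\beta = (1 + \beta^2) \cdot \frac{\text{precision} \cdot \text{recall}}{\beta^2 \cdot \text{precision} + \text{recall}}
$$

Setting β = 1 recovers the standard F1-score; β = 2, which values recall twice as much as precision, is a common choice in recall-critical settings.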
It’s worth noting that the F1-score is class-specific in multi-class settings. To evaluate overall performance, practitioners often compute macro-F1 (the unweighted average of per-class scores), micro-F1 (computed from global counts of true positives, false positives, and false negatives), or weighted-F1 (the per-class average weighted by class support). These choices can significantly affect the interpretation of results, especially with highly imbalanced classes, as the sketch below illustrates.
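A minimal sketch of these averaging modes, assuming scikit-learn is available; the class labels and predictions are purely illustrative:

```python
from sklearn.metrics import f1_score

# Hypothetical three-class ground truth and predictions, with class 2 rare.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 0, 1, 1, 0, 2, 0, 2]

# Per-class F1, then the three common ways of averaging it.
print(f1_score(y_true, y_pred, average=None))        # one score per class
print(f1_score(y_true, y_pred, average="macro"))     # unweighted mean across classes
print(f1_score(y_true, y_pred, average="micro"))     # from global TP/FP/FN counts
print(f1_score(y_true, y_pred, average="weighted"))  # mean weighted by class support
```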