What is the F1 score?
Explore the F1 score concept to evaluate classifier performance by balancing precision and recall. Understand its calculation for binary and multi-class models and learn how it provides a single metric that fairly compares different classifiers.
As data scientists, we frequently need to evaluate how well a classifier is performing. Two of the most important metrics we use are precision and recall. Precision tells us how trustworthy our positive predictions are, and recall tells us how many of the actual positives our model managed to catch.
But here’s where things get tricky: these two metrics often work against each other. If one classifier has higher recall and another has higher precision, which one is actually better?
This is exactly the problem the F1-score was designed to solve.
What is the F1-score?
The F1-score combines the precision and recall of a classifier into a single metric by taking their harmonic mean. It is primarily used to compare the performance of two classifiers. Suppose classifier A has higher recall and classifier B has higher precision. In this case, the F1-scores of both classifiers can be used to determine which one produces better results overall.
The F1-score of a classification model is calculated as follows:

$$F_1 = \frac{2 \times P \times R}{P + R}$$
Where:
P = the precision of the classification model
R = the recall of the classification model
Why does the F1-score use the harmonic mean instead of the arithmetic mean? Because the harmonic mean punishes extreme imbalance: if either precision or recall is close to zero, the F1-score is also close to zero, whereas the arithmetic mean could still look respectable. A classifier with a precision of 1.0 and a recall of 0.02, for example, has an arithmetic mean of 0.51 but an F1-score of only about 0.04.
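Here is a quick sketch of that effect in Python, using made-up precision and recall values:

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A classifier that almost never predicts "positive": perfect precision, tiny recall.
precision, recall = 1.0, 0.02

arithmetic_mean = (precision + recall) / 2
f1 = f1_from_pr(precision, recall)

print(f"Arithmetic mean: {arithmetic_mean:.2f}")  # 0.51 -- looks deceptively decent
print(f"F1 (harmonic mean): {f1:.2f}")            # 0.04 -- exposes the weak recall
```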
F1-score for a binary classifier
While the F1 score provides a single metric to evaluate overall performance, its practical interpretation becomes clearer when applied to a specific classifier scenario. Let's walk through a concrete example. Consider the following binary classification problem, where our goal is to correctly identify positive instances from a dataset. By calculating precision, recall, and the F1 score, we can see exactly how well the classifier performs.
|                 | Predicted Positive | Predicted Negative |
| --------------- | ------------------ | ------------------ |
| Actual Positive | 65 (TP)            | 15 (FN)            |
| Actual Negative | 20 (FP)            | 100 (TN)           |
Let’s calculate precision and recall using the confusion matrix.
Precision:

$$P = \frac{TP}{TP + FP} = \frac{65}{65 + 20} \approx 0.765$$

Recall:

$$R = \frac{TP}{TP + FN} = \frac{65}{65 + 15} \approx 0.813$$

Plugging these values into our formula:

$$F_1 = \frac{2 \times 0.765 \times 0.813}{0.765 + 0.813} \approx 0.79$$

The F1-score of roughly 0.79 sits between the precision and recall values, giving us a single number that summarizes how well the classifier balances the two.
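As a sanity check, we can rebuild label arrays that reproduce the confusion matrix above and let scikit-learn compute the same metrics. The arrays below are reconstructed purely for illustration:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Reconstruct labels matching the confusion matrix: 65 TP, 15 FN, 20 FP, 100 TN.
y_true = [1] * 65 + [1] * 15 + [0] * 20 + [0] * 100
y_pred = [1] * 65 + [0] * 15 + [1] * 20 + [0] * 100

print(f"Precision: {precision_score(y_true, y_pred):.3f}")  # ≈ 0.765
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")     # ≈ 0.81
print(f"F1-score:  {f1_score(y_true, y_pred):.3f}")         # ≈ 0.788
```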
F1-score for a multi-class classifier
Things get a bit more involved when we move beyond binary classification. When a model predicts across multiple classes, we calculate the F1-score separately for each class and then combine them.
Assume we have already calculated the following precision and recall values for a three-class classifier:
| Class | Precision | Recall |
| ----- | --------- | ------ |
| A     | 84%       | 80%    |
| B     | 79%       | 80%    |
| C     | 69%       | 73%    |
Class A:

$$F_1 = \frac{2 \times 0.84 \times 0.80}{0.84 + 0.80} \approx 0.82$$

Class B:

$$F_1 = \frac{2 \times 0.79 \times 0.80}{0.79 + 0.80} \approx 0.795$$

Class C:

$$F_1 = \frac{2 \times 0.69 \times 0.73}{0.69 + 0.73} \approx 0.709$$
From these calculations, we can see that the classifier performs best for class A (F1 ≈ 0.82) and worst for class C (F1 ≈ 0.71).
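A small Python sketch reproduces the per-class calculation directly from the precision and recall values in the table:

```python
# Per-class precision and recall taken from the table above.
per_class = {
    "A": (0.84, 0.80),
    "B": (0.79, 0.80),
    "C": (0.69, 0.73),
}

f1_per_class = {}
for label, (precision, recall) in per_class.items():
    f1_per_class[label] = 2 * precision * recall / (precision + recall)
    print(f"Class {label}: F1 = {f1_per_class[label]:.3f}")
# Class A: F1 = 0.820, Class B: F1 = 0.795, Class C: F1 = 0.709
```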
Overall model F1-score
One straightforward way to summarize performance across all classes is to take the arithmetic mean of the individual class F1-scores:

$$F_1^{model} = \frac{0.82 + 0.795 + 0.709}{3} \approx 0.775$$
This gives us a single number representing the model's average performance across all classes, a useful starting point for comparing models or tracking improvements over time.
Keep in mind that when classes are imbalanced (i.e., some classes have far more examples than others), a simple arithmetic mean may not tell the full story. In those cases, a weighted average, where each class's F1-score is weighted by the number of samples in that class, gives a more representative picture of overall model performance.
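To see how the two averages can diverge, here is a sketch that assumes hypothetical support counts (the number of samples per class) of 500, 300, and 50 for classes A, B, and C:

```python
# Per-class F1-scores from the example above, with hypothetical support counts.
f1_scores = {"A": 0.820, "B": 0.795, "C": 0.709}
supports  = {"A": 500,   "B": 300,   "C": 50}   # assumed sample counts per class

# Unweighted (macro) average: every class counts equally.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: each class contributes in proportion to its support.
total = sum(supports.values())
weighted_f1 = sum(f1_scores[c] * supports[c] / total for c in f1_scores)

print(f"Macro F1:    {macro_f1:.3f}")     # ≈ 0.775
print(f"Weighted F1: {weighted_f1:.3f}")  # ≈ 0.805, dominated by the larger classes
```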
Implementing F1 score in Python
Now that we understand what the F1-score is, let's look at how to compute it from scratch. To do that, we first need to calculate precision and recall from the raw predictions of our classifier. Both depend on three values we can extract from our model's output:
True Positives (TP): cases the model correctly predicted as positive
False Positives (FP): cases the model incorrectly predicted as positive
False Negatives (FN): positive cases the model missed
Let's say we have the actual labels and the predicted labels from a binary classifier:
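The label lists below are illustrative values chosen for this sketch; we count TP, FP, and FN by comparing the two lists, compute precision and recall, and cross-check the result against scikit-learn's f1_score:

```python
from sklearn.metrics import f1_score  # used only to cross-check our result

# Illustrative labels for a binary classifier (1 = positive, 0 = negative).
actual    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0]

# Count true positives, false positives, and false negatives.
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.3f}")  # 0.778
print(f"Recall:    {recall:.3f}")     # 0.700
print(f"F1-score:  {f1:.3f}")         # 0.737
print(f"scikit-learn f1_score: {f1_score(actual, predicted):.3f}")  # 0.737
```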
Notice that the F1-score we compute by hand matches the value returned by scikit-learn's f1_score function, and that it falls between precision and recall, slightly closer to the lower of the two. That pull toward the weaker metric is the harmonic mean at work.
What does an F1 score of 0.5 mean? It means the balance of precision and recall is mediocre: it can come from precision and recall both sitting at 0.5, or from a lopsided pair such as a recall of 1.0 with a precision of about 0.33.
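A tiny sketch shows how very different precision/recall trade-offs can all land on the same F1 score of 0.5 (the pairs are chosen purely for illustration):

```python
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Very different precision/recall trade-offs, same F1-score of 0.5.
pairs = [(0.5, 0.5), (1.0, 1 / 3), (1 / 3, 1.0)]
for p, r in pairs:
    print(f"precision={p:.2f}, recall={r:.2f} -> F1={f1(p, r):.2f}")
```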
Conclusion
The F1-score bridges the gap between precision and recall, giving us a single balanced metric that captures both how accurate a model's positive predictions are and how well it identifies all actual positives. Whether we're working with a binary classifier or a multi-class problem, the F1-score helps us cut through the noise and make fair, meaningful comparisons between models.