
What is the F1 score?

Explore the F1 score concept to evaluate classifier performance by balancing precision and recall. Understand its calculation for binary and multi-class models and learn how it provides a single metric that fairly compares different classifiers.

As data scientists, we frequently need to evaluate how well a classifier is performing. Two of the most important metrics we use are precision and recall. Precision tells us how trustworthy our positive predictions are, and recall tells us how many of the actual positives our model managed to catch.

But here’s where things get tricky: these two metrics often work against each other. When one classifier has higher recall and another has higher precision, which one is actually better?

This is exactly the problem the F1-score was designed to solve.

What is the F1-score?

The F1-score combines the precision and recall of a classifier into a single metric by taking their harmonic mean. It is primarily used to compare the performance of two classifiers. Suppose classifier A has higher recall and classifier B has higher precision. In this case, the F1-scores for both classifiers can be used to determine which one produces better results overall.

The F1-score of a classification model is calculated as follows:

F1 = 2 × (P × R) / (P + R)

Where:

  • P = the precision of the classification model

  • R = the recall of the classification model

Why does the F1-score use the harmonic mean instead of the arithmetic mean? Unlike the arithmetic mean, the harmonic mean is pulled toward the smaller of the two values, so a classifier can only achieve a high F1-score when precision and recall are both high. An arithmetic mean would let a strong precision mask a very poor recall, or vice versa.
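To make this concrete, here is a small sketch comparing the two means. The precision and recall values are illustrative assumptions, not taken from the examples in this lesson:

```python
# Compare arithmetic vs. harmonic mean for an imbalanced classifier.
# These precision/recall values are assumed for illustration only.
precision, recall = 0.9, 0.1

arithmetic = (precision + recall) / 2
harmonic = 2 * precision * recall / (precision + recall)

print(f"Arithmetic mean: {arithmetic:.2f}")  # 0.50 — looks deceptively balanced
print(f"Harmonic mean:   {harmonic:.2f}")    # 0.18 — exposes the weak recall
```

The arithmetic mean suggests a middling classifier, while the harmonic mean reveals that the model misses almost all actual positives.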

F1-score for a binary classifier

While the F1 score provides a single metric to evaluate overall performance, its practical interpretation becomes clearer when applied to a specific classifier scenario. Let's walk through a concrete example. Consider the following binary classification problem, where our goal is to correctly identify positive instances from a dataset. By calculating precision, recall, and the F1 score, we can see exactly how well the classifier performs.


                   Predicted Positive    Predicted Negative
Actual Positive    65 (TP)               15 (FN)
Actual Negative    20 (FP)               100 (TN)

Let’s calculate precision and recall using the confusion matrix.

  • Precision = TP / (TP + FP) = 65 / (65 + 20) ≈ 76.47%

  • Recall = TP / (TP + FN) = 65 / (65 + 15) ≈ 81.25%

Plugging these values into our formula:

F1 = 2 × (0.7647 × 0.8125) / (0.7647 + 0.8125) ≈ 78.79%

The F1-score of 78.79% gives us a single, balanced number that accounts for both how precise the classifier is and how well it captures actual positives.
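As a quick sanity check, the same numbers can be computed in a few lines of Python, starting from the TP, FP, and FN counts in the confusion matrix above:

```python
# Counts from the confusion matrix above
TP, FP, FN = 65, 20, 15

precision = TP / (TP + FP)  # 65 / 85
recall = TP / (TP + FN)     # 65 / 80

f1 = 2 * precision * recall / (precision + recall)
print(f"F1-score: {f1:.2%}")  # 78.79%
```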

F1-score for a multi-class classifier

Things get a bit more involved when we move beyond binary classification. When a model predicts across multiple classes, we calculate the F1-score separately for each class and then combine them.

Assume we have already calculated the following precision and recall values for a three-class classifier:

Class    Precision    Recall
A        84%          80%
B        79%          80%
C        69%          73%

Class A: F1 = 2 × (0.84 × 0.80) / (0.84 + 0.80) ≈ 81.95%

Class B: F1 = 2 × (0.79 × 0.80) / (0.79 + 0.80) ≈ 79.50%

Class C: F1 = 2 × (0.69 × 0.73) / (0.69 + 0.73) ≈ 70.94%

From these calculations, we can see that the classifier performs best for class A, with an F1-score of 81.95%, and struggles most with class C, where the score drops to 70.94%.
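The per-class calculations can be reproduced with a short loop. This sketch hard-codes the precision/recall table from above:

```python
# Precision and recall per class, taken from the table above
metrics = {"A": (0.84, 0.80), "B": (0.79, 0.80), "C": (0.69, 0.73)}

# Harmonic mean of precision and recall for each class
f1_scores = {cls: 2 * p * r / (p + r) for cls, (p, r) in metrics.items()}

for cls, f1 in f1_scores.items():
    print(f"Class {cls}: {f1:.2%}")
# Class A: 81.95%
# Class B: 79.50%
# Class C: 70.94%
```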

Overall model F1-score

One straightforward way to summarize performance across all classes is to take the arithmetic mean of the individual class F1-scores:

F1 (overall) = (81.95% + 79.50% + 70.94%) / 3 ≈ 77.46%

This gives us a single number representing the model's average performance across all classes, a useful starting point for comparing models or tracking improvements over time.

Keep in mind that when classes are imbalanced (i.e., some classes have far more examples than others), a simple arithmetic mean may not tell the full story. In those cases, a weighted average, where each class F1-score is weighted by how many samples that class contains, gives a more representative picture of overall model performance.
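Here is a sketch comparing the macro (simple) average with the weighted average. The class sample counts are hypothetical, chosen only to show how a rare, weak class shifts the two averages apart:

```python
# Per-class F1-scores from the worked example above
f1_scores = {"A": 0.8195, "B": 0.7950, "C": 0.7094}
# Hypothetical class sizes, assumed for illustration only
support = {"A": 500, "B": 300, "C": 50}

# Macro average: every class counts equally
macro = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: each class weighted by its number of samples
weighted = sum(f1_scores[c] * support[c] for c in f1_scores) / sum(support.values())

print(f"Macro F1:    {macro:.2%}")     # 77.46%
print(f"Weighted F1: {weighted:.2%}")  # 80.44%
```

Because the weakest class (C) is also the rarest in this assumed distribution, the weighted average comes out higher than the macro average.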

Implementing F1 score in Python

Now that we understand what the F1-score is, let's look at how to compute it from scratch. To do so, we first need to calculate precision and recall from the raw predictions of our classifier. Both depend on three values we can extract from our model's output:

  • True Positives (TP): cases the model correctly predicted as positive

  • False Positives (FP): cases the model incorrectly predicted as positive

  • False Negatives (FN): positive cases the model missed

Let's say we have the actual labels and the predicted labels from a binary classifier:

# Actual and predicted labels from a binary classifier
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0, 1, 0]

# Count TP, FP, and FN by iterating over paired labels:
# zip() pairs elements from y_true and y_pred, and sum() with a
# generator expression counts the pairs that satisfy each condition.
TP = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 1)
FP = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 0 and yp == 1)
FN = sum(1 for yt, yp in zip(y_true, y_pred) if yt == 1 and yp == 0)

# Compute precision and recall
precision = TP / (TP + FP)
recall = TP / (TP + FN)

# Compute F1-score
f1 = 2 * (precision * recall) / (precision + recall)

print(f"Precision : {precision:.2f}")
print(f"Recall    : {recall:.2f}")
print(f"F1-Score  : {f1:.2f}")

Notice that the F1-score of 0.73 lands between precision and recall, but closer to the lower value. This is the harmonic mean doing its job, as it pulls the result toward whichever metric is weaker and penalises classifiers that perform well on one metric while neglecting the other.
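The same from-scratch approach extends naturally to multi-class problems: compute F1 one-vs-rest for each class, then average. A sketch with made-up labels:

```python
# Made-up multi-class labels, for illustration only
y_true = [0, 1, 2, 0, 1, 2, 0, 2, 2, 1]
y_pred = [0, 1, 1, 0, 2, 2, 0, 2, 0, 1]

def f1_for_class(y_true, y_pred, cls):
    """One-vs-rest F1-score for a single class label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:  # avoid division by zero when the class is never predicted correctly
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Macro average: mean of the per-class F1-scores
classes = sorted(set(y_true))
macro_f1 = sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)
print(f"Macro F1: {macro_f1:.2f}")  # 0.70
```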

What does an F1 score of 0.5 mean? It means the harmonic mean of precision and recall is 0.5, which can happen either because both metrics sit at a mediocre 0.5, or because one strong metric is dragged down by a much weaker one (for example, precision near 1.0 with recall around 0.33).

Conclusion

The F1-score bridges the gap between precision and recall, giving us a single balanced metric that captures both how accurate a model's positive predictions are and how well it identifies all actual positives. Whether we're working with a binary classifier or a multi-class problem, the F1-score helps us cut through the noise and make fair, meaningful comparisons between models.