What is the F1 score?
Explore the F1 score concept to evaluate classifier performance by balancing precision and recall. Understand its calculation for binary and multi-class models and learn how it provides a single metric that fairly compares different classifiers.
As data scientists, we frequently need to evaluate how well a classifier is performing. Two of the most important metrics we use are precision and recall. Precision tells us how trustworthy our positive predictions are, and recall tells us how many of the actual positives our model managed to catch.
But here’s where things get tricky: these two metrics often work against each other. If one classifier has higher recall and another has higher precision, which one is actually better?
This is exactly the problem the F1-score was designed to solve.
What is the F1-score?
The F1-score combines the precision and recall of a classifier into a single metric by taking their harmonic mean. It is primarily used to compare the performance of two classifiers. Suppose classifier A has higher recall and classifier B has higher precision. In this case, the F1-scores of both classifiers can be used to determine which one produces better results overall.
The F1-score of a classification model is calculated as follows:

$$F_1 = \frac{2 \times P \times R}{P + R}$$
Where:
P = the precision of the classification model
R = the recall of the classification model
Why does the F1-score use the harmonic mean instead of the arithmetic mean? Because the harmonic mean punishes extreme imbalance: if either precision or recall is close to zero, the F1-score is also close to zero, whereas the arithmetic mean could still look respectable. A classifier with a precision of 1.0 and a recall of 0.02, for example, has an arithmetic mean of 0.51 but an F1-score of only about 0.04.
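Here is a quick sketch of that effect in Python, using made-up precision and recall values:

```python
def f1_from_pr(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A classifier that almost never predicts "positive": perfect precision, tiny recall.
precision, recall = 1.0, 0.02

arithmetic_mean = (precision + recall) / 2
f1 = f1_from_pr(precision, recall)

print(f"Arithmetic mean: {arithmetic_mean:.2f}")  # 0.51 -- looks deceptively decent
print(f"F1 (harmonic mean): {f1:.2f}")            # 0.04 -- exposes the weak recall
```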
F1-score for a binary classifier
While the F1 score provides a single metric to evaluate overall performance, its practical interpretation becomes clearer when applied to a specific classifier scenario. Let's walk through a concrete example. Consider the following binary classification problem, where our goal is to correctly identify positive instances from a dataset. By calculating precision, recall, and the F1 score, we can see exactly how well the classifier performs.
|                 | Predicted Positive | Predicted Negative |
| --------------- | ------------------ | ------------------ |
| Actual Positive | 65 (TP)            | 15 (FN)            |
| Actual Negative | 20 (FP)            | 100 (TN)           |
Let’s calculate precision and recall using the confusion matrix.
Precision:

$$P = \frac{TP}{TP + FP} = \frac{65}{65 + 20} \approx 0.765$$

Recall:

$$R = \frac{TP}{TP + FN} = \frac{65}{65 + 15} \approx 0.813$$

Plugging these values into our formula:

$$F_1 = \frac{2 \times 0.765 \times 0.813}{0.765 + 0.813} \approx 0.79$$

The F1-score of roughly 0.79 sits between the precision and recall values, giving us a single number that summarizes how well the classifier balances the two.
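As a sanity check, we can rebuild label arrays that reproduce the confusion matrix above and let scikit-learn compute the same metrics. The arrays below are reconstructed purely for illustration:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Reconstruct labels matching the confusion matrix: 65 TP, 15 FN, 20 FP, 100 TN.
y_true = [1] * 65 + [1] * 15 + [0] * 20 + [0] * 100
y_pred = [1] * 65 + [0] * 15 + [1] * 20 + [0] * 100

print(f"Precision: {precision_score(y_true, y_pred):.3f}")  # ≈ 0.765
print(f"Recall:    {recall_score(y_true, y_pred):.3f}")     # ≈ 0.81
print(f"F1-score:  {f1_score(y_true, y_pred):.3f}")         # ≈ 0.788
```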
F1-score for a multi-class classifier
Things get a bit more involved when we move beyond binary classification. When a model predicts across multiple classes, we calculate the F1-score separately for each class and then combine them.
Assume we have already calculated the following precision and recall values for a three-class classifier:
| Class | Precision | Recall |
| ----- | --------- | ------ |
| A     | 84%       | 80%    |
| B     | 79%       | 80%    |
| C     | 69%       | 73%    |
Class A:

$$F_1 = \frac{2 \times 0.84 \times 0.80}{0.84 + 0.80} \approx 0.82$$

Class B:

$$F_1 = \frac{2 \times 0.79 \times 0.80}{0.79 + 0.80} \approx 0.795$$

Class C:

$$F_1 = \frac{2 \times 0.69 \times 0.73}{0.69 + 0.73} \approx 0.709$$
From these calculations, we can see that the classifier performs best for class A (F1 ≈ 0.82) and worst for class C (F1 ≈ 0.71).
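A small Python sketch reproduces the per-class calculation directly from the precision and recall values in the table:

```python
# Per-class precision and recall taken from the table above.
per_class = {
    "A": (0.84, 0.80),
    "B": (0.79, 0.80),
    "C": (0.69, 0.73),
}

f1_per_class = {}
for label, (precision, recall) in per_class.items():
    f1_per_class[label] = 2 * precision * recall / (precision + recall)
    print(f"Class {label}: F1 = {f1_per_class[label]:.3f}")
# Class A: F1 = 0.820, Class B: F1 = 0.795, Class C: F1 = 0.709
```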
Overall model F1-score
One straightforward way to summarize performance across all classes is to take the arithmetic mean of the individual class F1-scores:

$$F_1^{model} = \frac{0.82 + 0.795 + 0.709}{3} \approx 0.775$$
This gives us a single number representing the model's average performance across all classes, a useful starting point for comparing models or tracking improvements over time.
Keep in mind that when classes are imbalanced (i.e., some classes have far more examples than others), a simple arithmetic mean may not tell the full story. In those cases, a weighted average, where each class's F1-score is weighted by the number of samples in that class, gives a more representative picture of overall model performance.
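To see how the two averages can diverge, here is a sketch that assumes hypothetical support counts (the number of samples per class) of 500, 300, and 50 for classes A, B, and C:

```python
# Per-class F1-scores from the example above, with hypothetical support counts.
f1_scores = {"A": 0.820, "B": 0.795, "C": 0.709}
supports  = {"A": 500,   "B": 300,   "C": 50}   # assumed sample counts per class

# Unweighted (macro) average: every class counts equally.
macro_f1 = sum(f1_scores.values()) / len(f1_scores)

# Weighted average: each class contributes in proportion to its support.
total = sum(supports.values())
weighted_f1 = sum(f1_scores[c] * supports[c] / total for c in f1_scores)

print(f"Macro F1:    {macro_f1:.3f}")     # ≈ 0.775
print(f"Weighted F1: {weighted_f1:.3f}")  # ≈ 0.805, dominated by the larger classes
```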
Implementing F1 score in Python
Now that we understand what the F1-score is, let's look at how to compute it from scratch. To do that, we first need to calculate precision and recall from the raw predictions of our classifier. Both depend on three values we can extract from our model's output:
True Positives (TP): cases the model correctly predicted as positive
False Positives (FP): cases the model incorrectly predicted as positive
False Negatives (FN): positive cases the model missed
Let's say we have the actual labels and the predicted labels from a binary classifier:
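The label lists below are illustrative values chosen for this sketch; we count TP, FP, and FN by comparing the two lists, compute precision and recall, and cross-check the result against scikit-learn's f1_score:

```python
from sklearn.metrics import f1_score  # used only to cross-check our result

# Illustrative labels for a binary classifier (1 = positive, 0 = negative).
actual    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0]

# Count true positives, false positives, and false negatives.
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.3f}")  # 0.778
print(f"Recall:    {recall:.3f}")     # 0.700
print(f"F1-score:  {f1:.3f}")         # 0.737
print(f"scikit-learn f1_score: {f1_score(actual, predicted):.3f}")  # 0.737
```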
Notice that the F1-score we compute by hand matches the value returned by scikit-learn's f1_score function, and that it falls between precision and recall, slightly closer to the lower of the two. That pull toward the weaker metric is the harmonic mean at work.
What does an F1 score of 0.5 mean? It means the balance of precision and recall is mediocre: it can come from precision and recall both sitting at 0.5, or from a lopsided pair such as a recall of 1.0 with a precision of about 0.33.
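A tiny sketch shows how very different precision/recall trade-offs can all land on the same F1 score of 0.5 (the pairs are chosen purely for illustration):

```python
def f1(precision, recall):
    return 2 * precision * recall / (precision + recall)

# Very different precision/recall trade-offs, same F1-score of 0.5.
pairs = [(0.5, 0.5), (1.0, 1 / 3), (1 / 3, 1.0)]
for p, r in pairs:
    print(f"precision={p:.2f}, recall={r:.2f} -> F1={f1(p, r):.2f}")
```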
Conclusion
The F1-score bridges the gap between precision and recall, giving us a single balanced metric that captures both how accurate a model's positive predictions are and how well it identifies all actual positives. Whether we're working with a binary classifier or a multi-class problem, the F1-score helps us cut through the noise and make fair, meaningful comparisons between models.