
Comparison of Evaluation Metrics used in Machine Learning Models

Apr 10, 2025
Contents
The confusion matrix
Model evaluations
Accuracy
Precision and recall
Precision
Recall
Precision vs. Recall—Adjusting the threshold
F1 score
ROC curve (Receiver-operating characteristic curve)
AUC (Area under the curve)
Matthews correlation coefficient (MCC)
Sensitivity and specificity
Sensitivity
Specificity
How to choose the right evaluation metric for your model
Summary

Evaluation metrics are critical for assessing machine learning model performance—how accurate the predictions are, how errors are distributed, and which models outperform others. Whether fine-tuning a facial recognition system or diagnosing diseases, the right metrics help you make informed decisions. In this blog, we’ll break down evaluation metrics, explain when to use each, and why accuracy alone isn’t always enough.

Let’s begin with a simple example to understand these evaluation metrics. I want my model to recognize the pictures in which I am present. I train my machine learning model on my pictures so it learns my facial features and can differentiate between images that contain my face and those that don’t.

After the training, I took some pictures. I was in some of them and not in the rest. For each image, I asked my model whether it could find me in the picture. A positive response means the model thinks my face is present, and a negative response means it thinks my face is absent. If the response matches reality, it is true; if it doesn’t, it is false. These four possible outcomes can be summarized in a matrix, popularly known as the confusion matrix.

The confusion matrix#

The confusion matrix can be understood with the help of the table below:

| | Correct Answer: Positive (Face Present) | Correct Answer: Negative (Face Not Present) |
|---|---|---|
| Model Prediction: Positive (Face Present) | True positive (TP): The model correctly identifies that the face is present. | False positive (FP): The model incorrectly identifies that the face is present. |
| Model Prediction: Negative (Face Not Present) | False negative (FN): The model incorrectly identifies that the face is not present. | True negative (TN): The model correctly identifies that the face is not present. |

In the table above, TP and TN are the counts of samples identified correctly, while FP and FN are the counts of samples identified incorrectly.
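To make the four cells concrete, here is a minimal Python sketch that tallies them from hypothetical ground-truth labels and model predictions (1 meaning the face is present); the labels are made up purely for illustration.

```python
# A minimal sketch: tally the four confusion-matrix cells from
# hypothetical ground-truth labels and model predictions (1 = face present).
y_true = [1, 1, 0, 1, 0, 0, 1, 0]   # what is actually in each picture
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]   # what the model said

TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
TN = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

print(TP, FP, FN, TN)  # 3 1 1 3
```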

Model evaluations#

Now that I have collected these four numbers (TP, TN, FP, and FN), I want to assess my model’s performance. Usually, the first choice is to look at the model’s accuracy, as it’s one of the easiest evaluation techniques.

Accuracy#

Accuracy refers to the percentage of correct predictions out of the total.

The total number of correct predictions is $TP+TN$.

The total number of samples is $TP+TN+FN+FP$.

Therefore, $Accuracy = \frac{TP+TN}{TP+TN+FN+FP} \times 100\%$.

This is a good measure when the samples are distributed evenly in all the classes, i.e., half the images contain my face, and half the images don’t. However, in cases where the samples are not distributed evenly, the results might not provide enough insight into the model’s performance. Here’s why!

Suppose I use pictures from my albums, and my face appears in 95% of the images. An untrained model that simply returns a True response for every image will achieve an accuracy of 95%.
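The pitfall is easy to reproduce. Below is a minimal Python sketch, using hypothetical labels, in which a "model" that always answers True still reaches 95% accuracy on a 95%-positive dataset.

```python
# A small sketch of the pitfall above: an untrained "model" that always
# answers True still scores 95% accuracy on a 95%-positive dataset.
y_true = [1] * 95 + [0] * 5   # 95 pictures with my face, 5 without
y_pred = [1] * 100            # the lazy model says "face present" every time

correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
accuracy = correct / len(y_true) * 100
print(f"Accuracy: {accuracy:.1f}%")  # Accuracy: 95.0%
```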

In everyday scenarios where the samples are not distributed evenly, such as road or bank surveillance footage or medical images, accuracy is not a good performance measure.

This begs the question: if not accuracy, then what? Luckily, we have several alternative performance measures at our disposal.

Let’s examine some commonly used terms, understand their meanings, and consider their advantages.

Precision and recall#

Precision and recall are some of the terms commonly used when discussing model performance. Let’s examine them one by one.

Precision#

Precision measures what percentage of all positive predictions were indeed positive. The formula for calculating precision is given below.

$Precision = \frac{TP}{TP+FP} \times 100\%.$

The cells used in the formula (TP and FP) are shown in bold in the confusion matrix below:

| | Correct Answer: Positive (Face Present) | Correct Answer: Negative (Face Not Present) |
|---|---|---|
| Model Prediction: Positive (Face Present) | **True positive (TP): The model correctly identifies that the face is present.** | **False positive (FP): The model incorrectly identifies that the face is present.** |
| Model Prediction: Negative (Face Not Present) | False negative (FN): The model incorrectly identifies that the face is not present. | True negative (TN): The model correctly identifies that the face is not present. |

This formula focuses on the cases where the model gives a positive prediction. Consider a model that recommends when a particular piece of mechanical equipment needs to be repaired. If the repair is expensive, every false positive adds an unnecessary cost. In such cases, we want precision to be as close to 100% as possible.

Recall#

Recall measures what percentage of all positive cases were correctly predicted as positive. The formula for calculating recall is given below.

$Recall = \frac{TP}{TP+FN} \times 100\%.$

The cells used in the formula (TP and FN) are shown in bold in the confusion matrix below:

| | Correct Answer: Positive (Face Present) | Correct Answer: Negative (Face Not Present) |
|---|---|---|
| Model Prediction: Positive (Face Present) | **True positive (TP): The model correctly identifies that the face is present.** | False positive (FP): The model incorrectly identifies that the face is present. |
| Model Prediction: Negative (Face Not Present) | **False negative (FN): The model incorrectly identifies that the face is not present.** | True negative (TN): The model correctly identifies that the face is not present. |

This formula focuses on the cases where the actual result is positive. Consider a model that identifies diseases such as diabetic retinopathy or cancer. A missed diagnosis (false negative) can delay treatment and harm the patient. In such cases, we want recall to be as close to 100% as possible.
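Both formulas translate directly into code. Here is a minimal Python sketch of precision and recall, assuming the confusion-matrix counts have already been tallied; the example counts are hypothetical.

```python
# A minimal sketch of both formulas, written directly from the
# confusion-matrix counts (TP, FP, FN are assumed to be tallied already).
def precision(TP, FP):
    return TP / (TP + FP) * 100   # % of positive predictions that were right

def recall(TP, FN):
    return TP / (TP + FN) * 100   # % of actual positives that were caught

# Example counts from a hypothetical run
TP, FP, FN = 90, 10, 30
print(f"Precision: {precision(TP, FP):.1f}%")  # 90.0%
print(f"Recall:    {recall(TP, FN):.1f}%")     # 75.0%
```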

Before discussing the need for more metrics, let’s compare precision and recall and see how improving one affects the other.

Precision vs. Recall—Adjusting the threshold#

We can raise or lower the classification threshold. If the threshold is lowered (i.e., we make it easier for a sample to be classified as positive), recall increases because more instances are predicted as positive. However, this often leads to more false positives, which decreases precision.

On the other hand, if we increase the threshold (i.e., we make the model more selective about predicting positives), precision increases because fewer false positives are predicted. However, this might lead to more false negatives, causing recall to decrease.

As an example, consider a model that classifies whether an email is spam or not:

If we set the threshold very low (e.g., classify anything with a slight chance of being spam as spam), we’ll capture more spam (higher recall), but we might also classify a lot of non-spam emails as spam (lower precision). On the other hand, if we set the threshold high (e.g., classify only emails that are very likely to be spam as spam), we’ll produce fewer false positives (higher precision), but some spam emails will slip through (lower recall).
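The trade-off is easy to see in code. The Python sketch below, using hypothetical spam probabilities, sweeps the threshold and recomputes precision and recall at each setting.

```python
# A sketch of the trade-off: sweep the decision threshold over hypothetical
# spam scores and watch precision rise while recall falls.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]   # 1 = spam
scores = [0.95, 0.80, 0.60, 0.40, 0.55, 0.30, 0.20, 0.10, 0.70, 0.45]

for threshold in (0.3, 0.5, 0.7):
    y_pred = [1 if s >= threshold else 0 for s in scores]
    TP = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    FP = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    FN = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    prec = TP / (TP + FP) * 100 if TP + FP else 0.0
    rec = TP / (TP + FN) * 100 if TP + FN else 0.0
    print(f"threshold={threshold:.1f}  precision={prec:.0f}%  recall={rec:.0f}%")
```

With this toy data, raising the threshold from 0.3 to 0.7 pushes precision up while recall drops, mirroring the spam example above.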

F1 score#

Mean, median, and mode are commonly used measures of central tendency. The harmonic mean is another way to average two numbers: the harmonic mean of two numbers, $a$ and $b$, is $\frac{2}{\frac{1}{a}+\frac{1}{b}}$.

The F1 score is the harmonic mean of precision and recall. A good F1 score implies that both precision and recall are good. In mathematical terms, this simply means the following:

$F1\ score = \frac{2}{\frac{1}{precision}+\frac{1}{recall}} \times 100\% = \frac{2 \cdot precision \cdot recall}{precision+recall} \times 100\%.$

In terms of the counts defined in the confusion matrix, the F1 score is as follows:

$F1\ score = \frac{2TP}{2TP+FP+FN} \times 100\%.$

The F1 score balances precision and recall using their harmonic mean. Unlike a regular average, the harmonic mean gives more weight to smaller values, ensuring neither precision nor recall is overlooked. This makes the F1 score useful when both false positives and false negatives are critical, such as in disease diagnosis or fraud detection.
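To make the formula concrete, here is a minimal Python sketch of F1 as the harmonic mean, reusing the hypothetical precision of 90% and recall of 75% from the earlier sketch (expressed as fractions).

```python
# A minimal sketch: F1 as the harmonic mean of precision and recall
# (both passed in as fractions between 0 and 1).
def f1(precision, recall):
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f"F1: {f1(0.90, 0.75):.3f}")  # 0.818
```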

ROC curve (Receiver-operating characteristic curve)#

The ROC curve is a graphical plot that illustrates the performance of a binary classification model across all possible classification thresholds.

The x-axis represents the false positive rate (FPR), the proportion of negative samples incorrectly classified as positive. It is calculated as $FPR = \frac{FP}{FP+TN}$.

The y-axis represents the true positive rate (TPR), or recall, the proportion of actual positive samples correctly identified by the model. It is calculated as $TPR = \frac{TP}{TP+FN}$.

The ROC curve shows how the model’s TPR and FPR change as the decision threshold varies. It helps visualize the trade-off between sensitivity (recall) and specificity (1 - FPR) at various thresholds.
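As a sketch of how the curve is produced in practice, the snippet below uses scikit-learn's roc_curve and matplotlib (both assumed to be installed) on the hypothetical spam scores from earlier; roc_curve returns the FPR and TPR at every threshold implied by the scores.

```python
# A sketch of plotting an ROC curve with scikit-learn (assumed installed),
# using the hypothetical labels and scores from the threshold example above.
from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
scores = [0.95, 0.80, 0.60, 0.40, 0.55, 0.30, 0.20, 0.10, 0.70, 0.45]

fpr, tpr, thresholds = roc_curve(y_true, scores)  # FPR and TPR at each threshold

plt.plot(fpr, tpr, marker="o", label="model")
plt.plot([0, 1], [0, 1], linestyle="--", label="random guess")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate (recall)")
plt.legend()
plt.show()
```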

AUC (Area under the curve)#

The AUC is a numerical summary that quantifies the overall performance of a classifier based on its ROC curve. It represents the area under the ROC curve and ranges from 0 to 1:

AUC = 1 indicates a perfect classifier (100% true positive rate and 0% false positive rate).

AUC = 0.5 indicates a classifier with no discriminative power, i.e., the model is essentially guessing.

AUC < 0.5 suggests the model is worse than random, which might happen if the model is consistently predicting the wrong class (inverted).

Higher AUC values correspond to better model performance.

Imagine you’re building a model to detect spam emails. If your model has an AUC of 0.95, there’s a 95% chance that it will rank a randomly chosen spam email higher than a randomly chosen non-spam email. AUC helps quantify how well your model separates positive and negative classes across all thresholds, making it a go-to metric for tasks where ranking matters.
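Computing the number itself is a one-liner with scikit-learn's roc_auc_score (assuming scikit-learn is installed); for the hypothetical spam scores above it comes out to 0.92.

```python
# A sketch of computing AUC for the same hypothetical data
# with scikit-learn's roc_auc_score.
from sklearn.metrics import roc_auc_score

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
scores = [0.95, 0.80, 0.60, 0.40, 0.55, 0.30, 0.20, 0.10, 0.70, 0.45]

auc = roc_auc_score(y_true, scores)
print(f"AUC: {auc:.2f}")  # AUC: 0.92
```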

Matthews correlation coefficient (MCC)#

The Matthews correlation coefficient (MCC) is a performance metric for evaluating the quality of binary classification models. It is particularly useful for imbalanced datasets, where traditional metrics like accuracy, precision, and recall might not provide a complete picture of a model’s performance.

Let’s say you’re building a model to detect rare diseases where only 1% of the population is positive. A naive model that predicts negative for everyone will achieve 99% accuracy, yet it is completely useless. In contrast, MCC penalizes models for false positives and false negatives, giving you a more realistic sense of performance.

MCC is a correlation coefficient that calculates the relationship between the actual and predicted classifications. It takes into account all four values in the confusion matrix: True positives (TP), True negatives (TN), False positives (FP), and False negatives (FN).

The formula for MCC is as follows:

$MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}.$

MCC ranges from -1 to +1, where +1 indicates perfect classification, 0 indicates performance no better than random guessing, and -1 represents complete misclassification. Unlike accuracy, precision, or recall, it considers all four components of the confusion matrix.

One of MCC’s main advantages is its ability to handle imbalanced datasets effectively, where other metrics may give misleading results. In scenarios like fraud detection or medical diagnostics, where both false positives and false negatives have significant consequences, MCC provides a more reliable evaluation by factoring in both correct and incorrect classifications of both classes. Overall, MCC offers a comprehensive measure of model performance, especially when both classes in a binary classification problem are important.
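Here is a minimal Python sketch of the formula applied to hypothetical counts from an imbalanced test set; scikit-learn's matthews_corrcoef computes the same value directly from label arrays.

```python
# A minimal sketch of MCC computed straight from the confusion-matrix counts;
# scikit-learn's matthews_corrcoef gives the same result from label arrays.
from math import sqrt

def mcc(TP, TN, FP, FN):
    denom = sqrt((TP + FP) * (TP + FN) * (TN + FP) * (TN + FN))
    if denom == 0:
        return 0.0
    return (TP * TN - FP * FN) / denom

# Hypothetical counts from a highly imbalanced test set
print(f"MCC: {mcc(TP=5, TN=980, FP=10, FN=5):.2f}")  # MCC: 0.40
```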

Sensitivity and specificity#

Sensitivity and specificity are two other terms, besides precision and recall, that come up when discussing model performance. Let’s look at them one by one.

Sensitivity#

Sensitivity is also known by a few other names; some researchers prefer the term true positive rate. It is the percentage of correct predictions when we only consider the cases where the actual result is positive. This should sound familiar: sensitivity is the same as recall.

Specificity#

Specificity, also known as the true negative rate, looks at the cases where the actual output is negative. It is analogous to the true positive rate, but the focus shifts from the positive class to the negative class. It can be calculated using the formula given below:

$Specificity = \frac{TN}{TN+FP} \times 100\%.$

The cells used in the formula (TN and FP) are shown in bold in the confusion matrix below:

| | Correct Answer: Positive (Face Present) | Correct Answer: Negative (Face Not Present) |
|---|---|---|
| Model Prediction: Positive (Face Present) | True positive (TP): The model correctly identifies that the face is present. | **False positive (FP): The model incorrectly identifies that the face is present.** |
| Model Prediction: Negative (Face Not Present) | False negative (FN): The model incorrectly identifies that the face is not present. | **True negative (TN): The model correctly identifies that the face is not present.** |
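As a quick sketch, specificity can be computed directly from the TN and FP counts; the numbers below reuse the toy counts from the first code example.

```python
# A minimal sketch of specificity (true negative rate) from hypothetical counts.
def specificity(TN, FP):
    return TN / (TN + FP) * 100   # % of actual negatives correctly rejected

print(f"Specificity: {specificity(TN=3, FP=1):.1f}%")  # 75.0%
```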

How to choose the right evaluation metric for your model#

Choosing the right evaluation metric depends on the specific goals of your machine learning project. Whether you’re building a medical diagnostic tool, a spam filter, or a fraud detection system, each metric brings unique strengths and weaknesses to the table. Here’s a quick guide to help you decide which metric best fits your use case, followed by a short code sketch that compares several of these metrics on an imbalanced dataset:

| Metric | Best Use Case | Avoid When |
|---|---|---|
| Accuracy | Balanced datasets with equal cost for errors | Imbalanced datasets |
| Precision | High false positive cost (e.g., fraud detection) | High false negative cost (e.g., medical diagnosis) |
| Recall | High false negative cost (e.g., disease detection) | False positives matter more |
| F1 Score | Balancing precision and recall equally | One metric is far more important |
| Specificity | Avoiding false positives (e.g., spam detection) | False negatives are more critical |
| AUC | Comparing classifiers across all thresholds | Single-threshold performance is needed |
| MCC | Imbalanced datasets, rare event detection | The dataset is balanced or evenly split |
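To put the guide into practice, the sketch below scores one imbalanced, hypothetical test set with several scikit-learn metrics (assuming scikit-learn is installed); accuracy looks strong while precision, recall, F1, and MCC tell a more sober story.

```python
# A sketch that puts the guide above into practice: compare several metrics on
# one imbalanced, hypothetical test set using scikit-learn.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, matthews_corrcoef)

y_true = [1] * 10 + [0] * 90                       # only 10% positive class
y_pred = [1] * 6 + [0] * 4 + [1] * 5 + [0] * 85    # an imperfect classifier

print("Accuracy :", accuracy_score(y_true, y_pred))    # high despite the misses
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("MCC      :", matthews_corrcoef(y_true, y_pred))
```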

Summary#

Here’s a quick recap of the key evaluation metrics and their formulas:

| Evaluation Metric | Alternate Names | Formula |
|---|---|---|
| Accuracy | - | $\frac{TP+TN}{TP+TN+FN+FP} \times 100\%$ |
| Precision | - | $\frac{TP}{TP+FP} \times 100\%$ |
| Recall | Sensitivity, true positive rate | $\frac{TP}{TP+FN} \times 100\%$ |
| F1 Score | - | $\frac{2TP}{2TP+FP+FN} \times 100\%$ |
| Specificity | True negative rate | $\frac{TN}{TN+FP} \times 100\%$ |
| AUC | Area under the curve | The area under the ROC curve |
| MCC | Matthews correlation coefficient | $\frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$ |

These metrics help evaluate your model under different circumstances. For a hands-on understanding, let’s dive into a project to predict income levels based on census data—putting theory into practice!

Understanding the strengths and weaknesses of different evaluation metrics is critical for building robust machine learning models. Accuracy might be enough for balanced datasets, but real-world problems often demand deeper insights through precision, recall, or the F1 score.

Now that you’ve mastered the theory, why not apply it? Build an income classification project using census data and see how these metrics impact your model’s performance!

Written By:
Khawaja Muhammad Fahd
