Evaluation
Learn how to assess model performance.
As data scientists, our journey doesn’t end once we’ve trained a machine learning model. Understanding how well that model performs is one of the most critical and often insightful phases. This is where model evaluation comes into play. It’s our rigorous method for assessing whether our model delivers on its promise and, crucially, how it will behave in the real world.
“If you can’t measure it, you can’t manage it.” —Peter Drucker
Why evaluation matters
Imagine we’ve just spent weeks developing an intricate predictive model. How can we confidently say it’s ready for deployment without a robust evaluation? We can’t. Model evaluation is the bedrock of responsible machine learning practice. It allows us to:
Gauge reliability: How much can we trust our model’s predictions? Is it consistently accurate, or does it falter in certain situations?
Ensure generalization: This is paramount. A model might perform brilliantly on the data it was trained on, but it’s practically useless if it can’t extend that performance to new, unseen data. Evaluation helps us quantify its ability to generalize.
Facilitate comparison: How does our current model compare against a simpler baseline or alternative, more complex models? Evaluation metrics provide a standardized way to compare different approaches.
Inform decisions: Does the model’s performance meet the specific business or scientific objectives? Sometimes, even a high overall accuracy might not be enough if certain errors (e.g., missing a critical medical diagnosis) carry extremely high costs.
Diagnose issues: Evaluation metrics often serve as diagnostic tools. They can reveal if our model is overfitting (memorizing training data), underfitting (too simple to learn patterns), or exhibiting biases that must be addressed.
In essence, evaluation is our quality control. It prevents us from deploying ineffective or even detrimental models, ensuring our data-driven decisions are sound and impactful.
Evaluation metrics
When we evaluate machine learning models, the metrics we choose should align with the problem we’re tackling. Before diving into the details of each metric, we first need a clear way to mark every prediction as right or wrong. Let’s begin by exploring how to systematically tally the different outcomes a classification model can produce.
The confusion matrix
Imagine we’ve built a model to detect faces in photos. For each image, the model predicts whether the face is present. We can compare these predictions against reality (whether the face actually appears in the photo). This comparison forms the basis of the confusion matrix.
The confusion matrix helps us break down the different types of correct and incorrect predictions the model makes:
| | Actual: Positive (Face Present) | Actual: Negative (Face Not Present) |
| --- | --- | --- |
| Model Prediction: Positive (Face Present) | True positive (TP): The model correctly identifies that the face is present. | False positive (FP): The model incorrectly identifies that the face is present. |
| Model Prediction: Negative (Face Not Present) | False negative (FN): The model incorrectly identifies that the face is not present. | True negative (TN): The model correctly identifies that the face is not present. |
The diagonal cells (TP and TN) are the counts of samples identified correctly, while the off-diagonal cells (FP and FN) are the samples identified incorrectly. These four values form the bedrock for calculating almost all other classification metrics.
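As a minimal sketch of how these four counts might be tallied in practice, we can use scikit-learn's confusion_matrix (the labels and predictions below are made-up examples, not from the face-detection model described above):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical ground-truth labels and model predictions
# (1 = face present, 0 = face not present)
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 1, 0, 1]

# For binary labels, scikit-learn orders the matrix as
# [[TN, FP],
#  [FN, TP]], so ravel() unpacks in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")  # TP=5, TN=3, FP=1, FN=1
```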
Once we have our TP, TN, FP, and FN counts, our first instinct might be to examine the model’s overall accuracy.
Accuracy
Accuracy refers to the percentage of correct predictions out of the total predictions made by the model, that is, (TP + TN) / (TP + TN + FP + FN). It’s the simplest and most intuitive measure: “How often is the model right?”
Accuracy is a good measure when the classes in the dataset are evenly distributed (e.g., roughly half the images contain the face, and half don’t). However, accuracy can be highly misleading on imbalanced datasets. For example, if the face appears in 95% of the album’s pictures, a model that predicts “Face Present” for every image would achieve 95% accuracy. This model is useless for detecting the face, yet it scores highly simply because of the overwhelming majority class.
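To make this pitfall concrete, here is a short sketch (again using scikit-learn and a made-up imbalanced label set) of an always-positive “model” reaching 95% accuracy while never catching a single photo without the face:

```python
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced dataset: the face appears in 95 of 100 photos
y_true = [1] * 95 + [0] * 5

# A useless "model" that always predicts "Face Present"
y_pred = [1] * 100

# Accuracy = (TP + TN) / (TP + TN + FP + FN) = (95 + 0) / 100
print(accuracy_score(y_true, y_pred))  # 0.95, despite never detecting an absent face
```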
This raises the question: if not accuracy, then what? Fortunately, we have several alternatives. Let’s examine some commonly used measures.
Precision, recall, and F1 score
For classification problems, especially with ...