Offline Metrics by Task Type
Explore how to select appropriate offline evaluation metrics based on task type, including classification, ranking, regression, and generation. Understand calibration's role across these tasks and learn to align metric choices with business objectives to demonstrate system design maturity in interviews.
When a candidate proposes a model in a system design interview, such as a fraud detector for payments or a video ranking system, a common follow-up is: “How would you evaluate this offline?” Choosing a metric that does not match the task, or failing to justify the choice, signals that the candidate has not connected model metrics to production behavior. The previous lesson explained that offline metrics act as the first filter for model candidates before online experimentation. This lesson defines the concrete metrics used in that first evaluation phase, organized into four task families common in ML system design interviews: classification, ranking and retrieval, regression, and generation. Calibration is a cross-cutting concern that applies whenever downstream systems rely on predicted probabilities or confidence scores.
Metric misalignment with business objectives is a primary failure mode. A model that maximizes the wrong offline metric may pass the offline gate yet fail online, wasting expensive A/B test slots and engineering cycles.
The following mindmap provides a visual taxonomy of every metric covered in this lesson, organized by task type:
With this taxonomy in view, let’s start with the most common task type you’ll encounter in interviews.
Classification metrics
Classification problems appear in nearly every ML system design scenario, from content moderation at Meta to spam filtering at Gmail. The four metrics below cover the spectrum from threshold-independent evaluation to probability quality assessment.
Discrimination metrics
AUC-ROC: This metric measures a model’s ability to discriminate between positive and negative classes across all possible thresholds; equivalently, it is the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. It works well when classes are roughly balanced and you need a threshold-independent comparison, such as evaluating content moderation models where both harmful and benign posts appear frequently.
AUC-PR (area under the precision-recall curve): When the positive class is rare, AUC-PR focuses evaluation on how well the model retrieves the minority class without being inflated by the large number of true negatives. Use cases such as fraud detection at Stripe and spam filtering at Gmail tend to favor this metric because AUC-ROC can appear misleadingly high under severe class ...
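To make the contrast between these two metrics concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not a production evaluation: the helper names (`roc_auc`, `average_precision`, `make_dataset`), the Gaussian score distributions, and the class ratios are all hypothetical choices made for this example. AUC-ROC is computed from its pairwise reading (the fraction of positive-negative pairs the model orders correctly), and AUC-PR is approximated by average precision, a common step-wise estimate of the area under the precision-recall curve.

```python
import random


def roc_auc(labels, scores):
    """AUC-ROC via its pairwise reading: the fraction of (positive, negative)
    pairs where the positive scores higher; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of precision@k over every rank k that holds a
    positive. Score ties are broken arbitrarily by the sort in this sketch."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits


def make_dataset(n_pos, n_neg, rng):
    """Hypothetical scorer (an assumption for illustration):
    positives ~ N(1.5, 1), negatives ~ N(0, 1)."""
    labels = [1] * n_pos + [0] * n_neg
    scores = [rng.gauss(1.5, 1.0) for _ in range(n_pos)]
    scores += [rng.gauss(0.0, 1.0) for _ in range(n_neg)]
    return labels, scores


rng = random.Random(0)  # fixed seed so the run is repeatable
balanced = make_dataset(500, 500, rng)     # 50% positive
imbalanced = make_dataset(50, 9950, rng)   # 0.5% positive

roc_bal, ap_bal = roc_auc(*balanced), average_precision(*balanced)
roc_imb, ap_imb = roc_auc(*imbalanced), average_precision(*imbalanced)

print(f"balanced   (50% positive):  AUC-ROC={roc_bal:.3f}  AP={ap_bal:.3f}")
print(f"imbalanced (0.5% positive): AUC-ROC={roc_imb:.3f}  AP={ap_imb:.3f}")
```

Because the per-class score distributions are identical in both datasets, AUC-ROC should stay roughly the same, while average precision should drop sharply once positives become rare. That gap is the practical reason to report a precision-recall metric for tasks like fraud or spam detection: a high AUC-ROC alone says little about how many of the top-scored items are true positives.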