Offline Metrics by Task Type

Explore how to select appropriate offline evaluation metrics based on task type, including classification, ranking, regression, and generation. Understand calibration's role across these tasks and learn to align metric choices with business objectives to demonstrate system design maturity in interviews.

We'll cover the following...

Classification metrics
- Discrimination metrics
- Probability quality
Ranking and retrieval metrics
- Position-sensitive metrics
- Coverage and precision metrics
Regression and generation metrics
- Regression metrics
- Generation metrics
Calibration as a cross-cutting concern
Conclusion

In a system design interview, when a candidate proposes a model, such as a fraud detector for payments or a video ranking system, a common follow-up is: “How would you evaluate this offline?” Choosing a metric that does not match the task, or failing to justify the choice, suggests that the candidate has not connected model metrics to production behavior.

The previous lesson explained that offline metrics act as the first filter for model candidates before online experimentation. This lesson defines the concrete metrics used in that phase, organized by the four task families most common in ML system design interviews: classification, ranking and retrieval, regression, and generation. Calibration is a cross-cutting concern that applies whenever downstream systems rely on predicted probabilities or confidence scores.

Metric misalignment with business objectives is a primary failure mode. A model that maximizes the wrong offline metric may pass the offline gate yet fail online, wasting expensive A/B test slots and engineering cycles.

The following mindmap provides a visual taxonomy of every metric covered in this lesson, organized by task type:

With this taxonomy in view, let’s start with the most common task type you’ll encounter in interviews.

Classification metrics

Classification problems appear in nearly every ML system design scenario, from content moderation at Meta to spam filtering at Gmail. The four metrics below cover the spectrum from threshold-independent evaluation to probability quality assessment.

Discrimination metrics

AUC-ROC: This metric measures a model’s ability to discriminate between positive and negative classes across all possible thresholds. Equivalently, it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. It works well when classes are roughly balanced and you need a threshold-independent comparison, such as evaluating content moderation models where both harmful and benign posts appear frequently.
AUC-PR (area under the precision-recall curve): When the positive class is rare, AUC-PR focuses evaluation on how well the model retrieves the minority class without being inflated by the large number of true negatives. Rare-positive problems such as payment fraud detection (for example, at Stripe) and spam filtering (for example, at Gmail) typically favor this metric because AUC-ROC can appear misleadingly high under severe class ...
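To make the contrast between the two discrimination metrics concrete, here is a minimal pure-Python sketch. It is illustrative only: the helper names `roc_auc` and `average_precision`, along with the toy labels and scores, are assumptions made for this example and do not come from any production system. AUC-PR is approximated here by average precision, a common step-wise summary of the precision-recall curve.

```python
def roc_auc(labels, scores):
    """AUC-ROC as a pairwise ranking probability: the fraction of
    (positive, negative) pairs where the positive gets the higher score.
    Ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: the mean of precision@k taken at the rank k of
    each positive. Assumes no tied scores between positives and negatives."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Hypothetical toy data: 2 positives among 10 examples (20% prevalence).
base_labels = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
base_scores = [0.95, 0.90, 0.85, 0.80, 0.75, 0.70, 0.65, 0.60, 0.55, 0.50]

# Same model output plus 90 "easy" negatives that all score very low,
# which drops prevalence to 2% without changing how positives are ranked.
imbalanced_labels = base_labels + [0] * 90
imbalanced_scores = base_scores + [0.10] * 90

print(f"balanced-ish: AUC-ROC={roc_auc(base_labels, base_scores):.3f}, "
      f"AP={average_precision(base_labels, base_scores):.3f}")
# balanced-ish: AUC-ROC=0.750, AP=0.450
print(f"imbalanced:   AUC-ROC={roc_auc(imbalanced_labels, imbalanced_scores):.3f}, "
      f"AP={average_precision(imbalanced_labels, imbalanced_scores):.3f}")
# imbalanced:   AUC-ROC=0.980, AP=0.450
```

In this toy setup, adding easy negatives lifts AUC-ROC from 0.750 to about 0.980 even though the positives are retrieved no better than before, while average precision stays at 0.450. In practice a library routine (for example, scikit-learn's `roc_auc_score` and `average_precision_score`) would normally replace these hand-written helpers.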
