Defining Success: Business Metrics vs. ML Metrics

Explore how to distinguish business metrics like revenue and engagement from ML metrics such as precision and recall. Understand how to align these metrics effectively to ensure ML models deliver real business value. This lesson helps you reason through metric selection, validate proxies, and recognize failure modes in metric alignment for scalable ML system design.

We'll cover the following...

Business metrics in ML systems
ML metrics and what they measure
Aligning ML metrics to business goals
- The three-step alignment framework
  - Example A: Video recommendations and engagement
  - Example B: Fraud detection and revenue protection
When ML metrics betray business goals
Conclusion

A recommendation team at a streaming platform ships a new ranking model after months of development. Offline evaluation shows an 8% improvement in NDCG. The team celebrates. Two weeks after launch, subscriber retention drops by 1.5%. Leadership pulls the model. What went wrong? The model was optimized for a metric that did not reliably proxy the business outcome it was supposed to serve. This is the metric alignment problem, and it is the single most common disconnect exposed in ML system design interviews at L5 and above.

With the functional and non-functional requirements defined in the previous lesson, the next critical step is to specify what success looks like for the system. That specification lives in two distinct metric families: business metrics, which capture the outcomes the product exists to deliver, and ML metrics, which capture how well the model performs its prediction or ranking task. This lesson defines each family, shows how to map between them, and identifies the failure modes where that mapping breaks down.

Business metrics in ML systems

Business metrics measure the outcomes that product, leadership, and business stakeholders track to evaluate whether the system is delivering product or business value. They are usually measured at the population level over longer time horizons, such as days, weeks, or quarters, and help justify continued investment in the ML system. Several business metrics appear repeatedly in ML system design interviews, along with one composite metric that combines multiple business outcomes.

Revenue measures the direct monetary outcome of the system, expressed as revenue per user, average order value, or ad revenue per impression. Ads ranking, e-commerce search, and dynamic pricing systems all optimize toward revenue targets.
Engagement captures user interaction intensity through signals like clicks, time spent, and sessions per day. Feed ranking and video recommendation systems, such as YouTube or Instagram, treat engagement as their primary business objective.
Retention tracks whether users return over time, measured as Day-1, Day-7, or Day-30 retention rates, or inversely as churn rate. Subscription products and platforms with network effects depend heavily on retention.
NPS (Net Promoter Score) is a survey-based measure of user satisfaction and likelihood to recommend the product. It serves as a lagging indicator of product quality and is typically measured quarterly.
Customer lifetime value (CLV) is a composite metric that integrates revenue and retention into a single estimate of the total monetary value a user generates over their relationship with the product. Marketplaces and subscription businesses use CLV to evaluate long-term system impact.
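To make the composite nature of CLV concrete, here is a minimal sketch under a deliberately simple assumption: revenue per user is constant each month and users churn at a constant monthly rate. The function name `simple_clv` and the numbers are hypothetical, and real CLV models usually add margins, discounting, and cohort-specific churn.

```python
def simple_clv(monthly_revenue_per_user: float, monthly_churn_rate: float) -> float:
    """Expected lifetime revenue per user under a constant monthly churn rate.

    With constant churn c, a user is still active in month t (t = 0, 1, 2, ...)
    with probability (1 - c) ** t. Summing that geometric series gives an
    expected lifetime of 1 / c months, so CLV = revenue per month / c.
    """
    if not 0.0 < monthly_churn_rate <= 1.0:
        raise ValueError("monthly_churn_rate must be in (0, 1]")
    return monthly_revenue_per_user / monthly_churn_rate


# Hypothetical numbers: the same revenue per user, with churn reduced by one point.
baseline_clv = simple_clv(monthly_revenue_per_user=10.0, monthly_churn_rate=0.05)
improved_clv = simple_clv(monthly_revenue_per_user=10.0, monthly_churn_rate=0.04)
print(f"baseline CLV: {baseline_clv:.2f}, improved CLV: {improved_clv:.2f}")
```

Under these assumptions, a one-point reduction in monthly churn raises CLV without any change in revenue per user, which is why CLV is sensitive to both revenue and retention.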

Note: Business metrics are not owned by the ML team. They are set by product and leadership. The ML team’s job is to move them through model improvements validated by A/B testing.

The following table summarizes these metrics with their typical ML system contexts and measurement horizons.

Key Business Metrics for ML Systems

Business Metric	Definition	Typical ML System	Measurement Horizon	Example Target
Revenue	Monetary outcome	Ads ranking, e-commerce search	Weekly/Monthly	+2% revenue per session
Engagement	Interaction intensity	Feed ranking, video recommendations	Daily/Weekly	+5% average session duration
Retention	User return rate	Subscription platforms, social networks	7-day/30-day	Reduce 30-day churn by 1%
NPS	User satisfaction score	Any user-facing product	Quarterly	Maintain NPS above 50
CLV	Lifetime monetary value per user	Marketplace, subscription	Quarterly/Annual	+3% average CLV

With business metrics defined, the next step is understanding the ML metrics that models actually optimize during training and evaluation.

ML metrics and what they measure

ML metrics quantify model performance at the prediction or query level. They are computed during offline evaluation and monitored during A/B tests. Unlike business metrics, they are owned by the ML team and operate on individual predictions rather than population-level outcomes. Five ML metrics appear most frequently in system design interviews.

Precision is the fraction of positive predictions that are actually correct, calculated as $\text{Precision} = \frac{TP}{TP + FP}$. It matters most when false positives are costly, such as in spam filtering or fraud detection, where blocking a legitimate user has direct business consequences.
Recall is the fraction of actual positives that the model captures, calculated as $\text{Recall} = \frac{TP}{TP + FN}$. It matters most when missing a positive is dangerous, such as in fraud detection or content moderation, where a missed case causes direct harm.
AUC (Area under the ROC curve) measures the ranking quality of a binary classifier across all decision thresholds. It equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, so a higher AUC means positives are ranked above negatives more consistently. It is widely used in CTR prediction and fraud scoring.
NDCG (Normalized discounted cumulative gain) evaluates ranked list quality by weighting relevance scores with a position-based discount and normalizing by the score of the ideal ordering. It is the standard offline metric for search and recommendation systems.
RMSE (Root mean squared error) measures the magnitude of prediction errors for regression tasks, calculated as $\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$. It is used in ETA prediction, demand forecasting, and pricing systems.
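The five definitions above can be sketched in plain Python. This is an illustrative, unoptimized version using only the standard library. The function names and toy inputs are assumptions made for this example. The DCG here uses the linear-gain variant (relevance divided by $\log_2(\text{rank} + 1)$), and other variants, such as exponential gain, are also common.

```python
import math


def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def dcg(relevances):
    """Discounted cumulative gain with a log2 position discount (rank starts at 1)."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances_in_ranked_order):
    """DCG of the produced ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(relevances_in_ranked_order, reverse=True))
    return dcg(relevances_in_ranked_order) / ideal if ideal > 0 else 0.0


def rmse(y_true, y_pred):
    """Square root of the mean squared difference between targets and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# Toy inputs, invented for illustration.
labels = [1, 0, 1, 1, 0]
hard_predictions = [1, 1, 1, 0, 0]
model_scores = [0.9, 0.8, 0.7, 0.3, 0.2]

print(precision_recall(labels, hard_predictions))  # TP=2, FP=1, FN=1
print(auc(labels, model_scores))                   # 4 of 6 pairs ordered correctly
print(ndcg([1, 2, 3]), ndcg([3, 2, 1]))            # worst vs. ideal ordering
print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Note that precision and recall depend on a chosen decision threshold (they take hard 0/1 predictions), whereas AUC and NDCG depend only on the ordering of scores.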

These metrics are proxies. They approximate business value but do not guarantee it.

The diagram below illustrates the relationship between the two metric families and highlights where alignment gaps can emerge.

Understanding both families sets up the most important skill tested in senior-level interviews: reasoning about how to connect them.

Aligning ML metrics to business goals

This is where junior answers diverge from senior ones. A junior candidate picks an ML metric because it is standard for the task type. A senior candidate derives the ML metric from the business objective and validates the proxy relationship. The alignment process follows a three-step reasoning chain.

The three-step alignment framework

Each step builds on the previous one, forming a chain of reasoning that interviewers expect to hear articulated explicitly.

Step 1 (Identify the business objective): Ask what the product team cares about most. Is it revenue, engagement, retention, or user safety? This determines the direction of the entire design.
Step 2 (Select the ML metric that serves as the best proxy): Choose the measurable model-level quantity that most reliably correlates with the business objective. This is not always the most obvious metric for the task type.
Step 3 (Validate the proxy relationship): Determine whether improving this ML metric actually improves the business metric, or whether the two can diverge. A/B testing is the ultimate validation mechanism: offline ML metric gains must be confirmed against online business metric movement.
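Step 3 can be made concrete with a small statistical check. The sketch below assumes the business metric is a retention rate compared between a control arm and a treatment arm, and it uses a standard pooled two-proportion z-test. The function name, user counts, and retention figures are hypothetical, and a production experiment platform would typically also handle sequential testing, variance reduction, and multiple comparisons.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Compare two rates (e.g. retained users / exposed users) between arms.

    Returns (rate_b - rate_a, z statistic, two-sided p-value) using the
    pooled-variance two-proportion z-test.
    """
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if standard_error == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (rate_b - rate_a) / standard_error
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return rate_b - rate_a, z, p_value


# Hypothetical experiment: the new model won offline, but what happened to retention?
delta, z, p_value = two_proportion_z_test(
    successes_a=40_000, n_a=50_000,   # control: 80.0% retained
    successes_b=39_250, n_b=50_000,   # treatment: 78.5% retained
)
print(f"retention delta={delta:+.4f}, z={z:.2f}, p={p_value:.2g}")
```

In this made-up scenario the retention drop is many standard errors away from zero, so the offline gain would not count as validated even if the ML metric improved.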

Practical tip: In an interview, state the business objective first, then explicitly say “I am choosing [ML metric] as a proxy because [reason].” This signals senior-level thinking to the interviewer.

Example A: Video recommendations and engagement

Consider a YouTube-style video recommendation system where the business goal is engagement measured as total watch time. The naive choice is to optimize CTR (click-through rate) using AUC on click prediction. But optimizing CTR leads the model to surface clickbait thumbnails and sensational titles. Users click more but watch less of each video, and total session duration drops. The better proxy is predicted watch time used directly as the ranking score. This aligns the ML optimization target with the actual business outcome.
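A toy comparison shows how the two ranking scores can disagree. The sketch assumes each candidate has a predicted click probability and a predicted watch time given a click, and it scores by their product as one possible form of "predicted watch time." The class, field names, and numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    video_id: str
    p_click: float                    # predicted click-through probability
    watch_seconds_if_clicked: float   # predicted watch time given a click

    @property
    def expected_watch_seconds(self) -> float:
        # Expected watch time per impression = P(click) * E[watch time | click].
        return self.p_click * self.watch_seconds_if_clicked


candidates = [
    Candidate("clickbait", p_click=0.30, watch_seconds_if_clicked=20.0),
    Candidate("tutorial", p_click=0.12, watch_seconds_if_clicked=400.0),
    Candidate("documentary", p_click=0.08, watch_seconds_if_clicked=900.0),
]

by_ctr = sorted(candidates, key=lambda c: c.p_click, reverse=True)
by_watch_time = sorted(candidates, key=lambda c: c.expected_watch_seconds, reverse=True)

print([c.video_id for c in by_ctr])         # ['clickbait', 'tutorial', 'documentary']
print([c.video_id for c in by_watch_time])  # ['documentary', 'tutorial', 'clickbait']
```

With these numbers the CTR ranking puts the clickbait item first, while the watch-time ranking puts it last, which is the divergence described above.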

Example B: Fraud detection and revenue protection

A payment company wants to reduce fraud losses, which is fundamentally a revenue protection objective. The ML metric is recall at a fixed precision threshold. Missing fraud (low recall) directly causes financial loss. But if precision drops too low, the system blocks legitimate transactions, which also harms revenue. The solution is to optimize recall subject to a precision floor, such as recall at 95% precision. This formulation captures the business trade-off inside the ML metric itself.
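Recall at a precision floor can be computed by sweeping the decision threshold from the highest score downward and keeping the best recall among thresholds that still meet the floor. The sketch below is a minimal, standard-library version. The function name and the toy labels and scores are assumptions for illustration.

```python
def recall_at_precision(y_true, scores, min_precision):
    """Return (best recall, threshold) among thresholds with precision >= min_precision.

    A transaction is flagged as fraud when its score >= threshold.
    Returns (0.0, None) if no threshold satisfies the precision floor.
    """
    total_positives = sum(y_true)
    if total_positives == 0:
        raise ValueError("need at least one positive example")

    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    best_recall, best_threshold = 0.0, None
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score so the threshold is well defined.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        recall = tp / total_positives
        if precision >= min_precision and recall > best_recall:
            best_recall, best_threshold = recall, threshold
    return best_recall, best_threshold


# Toy fraud labels (1 = fraud) and model scores, invented for illustration.
labels = [1, 1, 0, 1, 0, 1, 0, 0]
fraud_scores = [0.95, 0.90, 0.85, 0.80, 0.60, 0.50, 0.30, 0.10]

print(recall_at_precision(labels, fraud_scores, min_precision=0.95))  # (0.5, 0.9)
print(recall_at_precision(labels, fraud_scores, min_precision=0.75))  # (0.75, 0.8)
```

Relaxing the floor from 95% to 75% precision raises the achievable recall from 0.5 to 0.75 on this toy data, which is exactly the trade-off between missed fraud and blocked legitimate transactions.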

Continuous monitoring and recalibration matter here because business goals evolve. A subscription platform might shift its primary objective from user acquisition (engagement) to profitability (revenue per user), which changes which ML metric serves as the best proxy.

With the alignment framework established, the next section examines what happens when the alignment breaks.

When ML metrics betray business goals

A serious failure mode in ML system design is not simply poor model performance. It is a model that improves an offline or proxy ML metric while hurting the business outcome it is meant to improve. Two realistic cases illustrate this pattern.

The engagement trap: A social feed team optimizes CTR using AUC on click prediction. The model learns to surface sensational and divisive content because it generates clicks. Short-term engagement rises. But over weeks, session frequency drops because users feel worse after using the product. The ML metric (AUC on clicks) improved. The business metric (retention) degraded. The proxy relationship broke because clicks and long-term satisfaction are not the same thing.
Recommendation diversity collapse: A music streaming service optimizes NDCG on listening completion. The model converges on a narrow set of safe, popular tracks because they have the highest completion rates. Offline NDCG rises. But users perceive the product as repetitive. NPS drops and churn increases. The ML metric captured only one dimension of the listening experience while ignoring variety, which users value.

Attention: These failures are not hypothetical. They happen at scale when teams treat ML metrics as ends rather than proxies. Always ask “what behavior does this metric incentivize?” before committing to it.

Both cases share a common pattern. The ML metric captures only a slice of the business objective. Optimizing that slice pushes the model toward behaviors that satisfy the metric while violating the broader intent. The solution involves defining secondary metrics that cap harm while allowing primary metric optimization, a technique the next lesson on metric guardrails and cannibalization will address in depth.
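As a preview of the idea, a secondary-metric check can be sketched as a simple launch rule: ship only if the primary metric improves and no secondary metric drops by more than an agreed tolerance. The function name, the metric names, and the tolerance value below are hypothetical. The +8% NDCG and 1.5% retention drop reuse the figures from the opening scenario.

```python
def launch_decision(primary_delta, guardrail_deltas, max_allowed_drops):
    """Return (ship, breached_guardrails).

    primary_delta: relative change in the primary metric (positive is better).
    guardrail_deltas: change per secondary metric (negative means it got worse).
    max_allowed_drops: largest tolerated drop per secondary metric, as a positive number.
    """
    breached = sorted(
        name
        for name, delta in guardrail_deltas.items()
        if delta < -max_allowed_drops[name]
    )
    ship = primary_delta > 0 and not breached
    return ship, breached


# Opening scenario: NDCG up 8%, retention down 1.5%, with an assumed 0.5% tolerance.
print(launch_decision(0.08, {"retention": -0.015}, {"retention": 0.005}))
# (False, ['retention'])

# Same primary gain with a retention change inside the tolerance.
print(launch_decision(0.08, {"retention": -0.002}, {"retention": 0.005}))
# (True, [])
```

The rule encodes the point of this section: a gain in the primary ML metric is necessary but not sufficient for launch.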

Conclusion

Business metrics like revenue, engagement, retention, NPS, and CLV define what success means for the product. ML metrics like precision, recall, AUC, NDCG, and RMSE define what the model optimizes during training and evaluation. The alignment between them determines whether model improvements translate to real-world value. Senior-level interview answers always start from the business objective, work backward to the ML metric, and validate the proxy relationship through A/B testing and continuous monitoring. Even with careful alignment, optimizing one metric can degrade another. The next lesson on metric guardrails and cannibalization addresses exactly how to detect and prevent that failure mode.
