
Defining Success: Business Metrics vs. ML Metrics

Explore how to distinguish business metrics like revenue and engagement from ML metrics such as precision and recall. Understand how to align these metrics effectively to ensure ML models deliver real business value. This lesson helps you reason through metric selection, validate proxies, and recognize failure modes in metric alignment for scalable ML system design.

A recommendation team at a streaming platform ships a new ranking model after months of development. Offline evaluation shows an 8% improvement in NDCG. The team celebrates. Two weeks after launch, subscriber retention drops by 1.5%. Leadership pulls the model. What went wrong? The model was optimized for a metric that did not reliably proxy the business outcome it was supposed to serve. This is the metric alignment problem, and it is the single most common disconnect exposed in ML system design interviews at L5 and above.

With the functional and non-functional requirements defined in the previous lesson, the next critical step is to specify what success looks like for the system. That specification lives in two distinct metric families: business metrics, which capture the outcomes the product exists to deliver, and ML metrics, which capture how well the model performs its prediction or ranking task. This lesson defines each family, shows how to map between them, and identifies the failure modes where that mapping breaks down.

Business metrics in ML systems

Business metrics measure the outcomes that product, leadership, and business stakeholders track to evaluate whether the system is delivering product or business value. They are usually measured at the population level over longer time horizons, such as days, weeks, or quarters, and help justify continued investment in the ML system. Several business metrics appear repeatedly in ML system design interviews, along with one composite metric that combines multiple business outcomes.

  • Revenue measures the direct monetary outcome of the system, expressed as revenue per user, average order value, or ad revenue per impression. Ads ranking, e-commerce search, and dynamic pricing systems all optimize toward revenue targets.

  • Engagement captures user interaction intensity through signals like clicks, time spent, and sessions per day. Feed ranking and video recommendation systems, such as YouTube or Instagram, treat engagement as their primary business objective.

  • Retention tracks whether users return over time, measured as Day-1, Day-7, or Day-30 retention rates, or inversely as churn rate. Subscription products and platforms with network effects depend heavily on retention.

  • NPS (Net Promoter Score) is a survey-based measure of user satisfaction and likelihood to recommend the product. It serves as a lagging indicator of product quality and is typically measured quarterly.

  • Customer lifetime value (CLV) is a composite metric that integrates revenue and retention into a single estimate of the total monetary value a user generates over their relationship with the product. Marketplaces and subscription businesses use CLV to evaluate long-term system impact.
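As a rough illustration of how three of these metrics are computed, the sketch below uses plain Python. The function names are hypothetical, and `simple_clv` rests on a strong simplifying assumption (constant monthly revenue and constant churn, so expected lifetime is 1 / churn months); real CLV models are usually more involved.

```python
def net_promoter_score(ratings):
    """NPS from 0-10 survey ratings: % promoters (9-10) minus % detractors (0-6)."""
    if not ratings:
        raise ValueError("ratings must be non-empty")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)


def retention_rate(cohort_user_ids, returned_user_ids):
    """Fraction of a signup cohort seen again on the target day (e.g. Day-7)."""
    cohort = set(cohort_user_ids)
    if not cohort:
        raise ValueError("cohort must be non-empty")
    return len(cohort & set(returned_user_ids)) / len(cohort)


def simple_clv(monthly_revenue_per_user, monthly_churn_rate):
    """Simplified CLV (assumption): monthly revenue times expected lifetime in months."""
    if not 0 < monthly_churn_rate <= 1:
        raise ValueError("monthly_churn_rate must be in (0, 1]")
    return monthly_revenue_per_user / monthly_churn_rate


if __name__ == "__main__":
    print(net_promoter_score([10, 10, 9, 3]))        # 3 promoters, 1 detractor
    print(retention_rate([1, 2, 3, 4], [2, 4, 9]))   # 2 of 4 cohort users returned
    print(simple_clv(10.0, 0.05))                    # $10/month at 5% monthly churn
```

Note how `simple_clv` makes the composite nature of CLV explicit: it rises when revenue per user rises or when churn falls.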

Note: Business metrics are not owned by the ML team. They are set by product and leadership. The ML team’s job is to move them through model improvements validated by A/B testing.

The following table summarizes these metrics with their typical ML system contexts and measurement horizons.

Key Business Metrics for ML Systems

| Business Metric | Definition | Typical ML System | Measurement Horizon | Example Target |
| --- | --- | --- | --- | --- |
| Revenue | Monetary outcome | Ads ranking, e-commerce search | Weekly/Monthly | +2% revenue per session |
| Engagement | Interaction intensity | Feed ranking, video recommendations | Daily/Weekly | +5% average session duration |
| Retention | User return rate | Subscription platforms, social networks | 7-day/30-day | Reduce 30-day churn by 1% |
| NPS | User satisfaction score | Any user-facing product | Quarterly | Maintain NPS above 50 |
| CLV | Lifetime monetary value per user | Marketplace, subscription | Quarterly/Annual | +3% average CLV |

With business metrics defined, the next step is understanding the ML metrics that models actually optimize during training and evaluation.

ML metrics and what they measure

ML metrics quantify model performance at the prediction or query level. They are computed during offline evaluation and monitored during A/B tests. Unlike business metrics, they are owned by the ML team and operate on individual predictions rather than population-level outcomes. Five ML metrics appear most frequently in system design interviews.

  • Precision is the fraction of positive predictions that are actually correct, calculated as $\text{Precision} = \frac{TP}{TP + FP}$. It matters most when false positives are costly, such as in spam filtering or fraud detection where blocking a legitimate user has direct business consequences.

  • Recall is the fraction of actual positives that the model captures, calculated as $\text{Recall} = \frac{TP}{TP + FN}$. It matters most when missing a positive is dangerous, such as in fraud detection or content moderation where a missed case causes direct harm.

  • AUC (Area under the ROC curve) measures the ranking quality of a binary classifier across all decision thresholds. A model with higher AUC assigns higher scores to positive examples more consistently. It is widely used in CTR prediction and fraud scoring.

  • NDCG (Normalized discounted cumulative gain) evaluates ranked list quality by weighting relevance scores with a position-based discount. It is the standard offline metric for search and recommendation systems.

  • RMSE (Root mean squared error) measures the magnitude of prediction errors for regression tasks, calculated as $\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$. It is used in ETA prediction, demand forecasting, and pricing systems.

These metrics are proxies. They approximate business value but do not guarantee it.
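To make the five definitions above concrete, here is a minimal pure-Python sketch. The function names are illustrative, binary labels are assumed to be 0/1, and `ndcg_at_k` assumes the linear-gain form of DCG (relevance divided by log2 of position + 1); some systems use an exponential gain instead. In practice a library such as scikit-learn would typically be used.

```python
import math


def precision_recall(y_true, y_pred):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN) for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ndcg_at_k(relevances, k):
    """relevances: graded relevance of items, in the order the model ranked them."""
    def dcg(rels):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))

    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def rmse(y_true, y_pred):
    """Root mean squared error over paired targets and predictions."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


if __name__ == "__main__":
    print(precision_recall([1, 0, 1, 1, 0], [1, 1, 0, 1, 0]))
    print(auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))
    print(ndcg_at_k([1, 3, 2], k=3))
    print(rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

Every function here takes individual predictions as input, which is exactly why these metrics cannot, on their own, say anything about population-level outcomes such as retention.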

The diagram below illustrates the relationship between the two metric families and highlights where alignment gaps can emerge.

Business metrics to ML metrics alignment showing where mapping gaps cause system design failures

Understanding both families sets up the most important skill tested in senior-level interviews: reasoning about how to connect them.

Aligning ML metrics to business goals

This is where junior answers diverge from senior ones. A junior candidate picks an ML metric because it is standard for the task type. A senior candidate derives the ML metric from the business objective and validates the proxy relationship. The alignment process follows a three-step reasoning chain.

The three-step alignment framework

Each step builds on the previous one, forming a chain of reasoning that interviewers expect to hear articulated explicitly.

  • Step 1 (Identify the business objective): Ask what the product team cares about most. Is it revenue, engagement, retention, or user safety? This determines the direction of the entire design.

  • Step 2 (Select the ML metric that serves as the best proxy): Choose the measurable model-level quantity that most reliably correlates with the business objective. This is not always the most obvious metric for the task type.

  • Step 3 (Validate the proxy relationship): Determine whether improving this ML metric actually improves the business metric, or whether the two can diverge. A/B testing is the ultimate validation mechanism: offline ML metric gains must be confirmed against online business metric movement.
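Step 3 can be sketched as a simple significance check on an online business metric. The example below assumes retention is recorded as a retained/not-retained outcome per user and applies a standard two-proportion z-test; the function names, the 0.05 significance level, and the sample counts are illustrative assumptions, not a prescribed experimentation setup.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Compare a rate between control (a) and treatment (b).

    Returns (lift, z, two_sided_p_value), where lift = rate_b - rate_a.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; rates are degenerate")
    z = (p_b - p_a) / se
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


def proxy_validated(offline_gain, lift, p_value, alpha=0.05):
    """The proxy holds only if the offline gain comes with a significant online gain."""
    return offline_gain > 0 and lift > 0 and p_value < alpha


if __name__ == "__main__":
    # Hypothetical counts: retained users out of 10,000 per arm.
    lift, z, p = two_proportion_z_test(4000, 10_000, 3850, 10_000)
    print(f"lift={lift:+.3f}, z={z:.2f}, p={p:.3f}")
    print("proxy validated:", proxy_validated(offline_gain=0.08, lift=lift, p_value=p))
```

With these hypothetical counts, an offline gain paired with a statistically significant 1.5-point retention drop fails validation, which mirrors the streaming scenario that opens this lesson.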

Practical tip: In an interview, state the business objective first, then explicitly say “I am choosing [ML metric] as a proxy because [reason].” This signals senior-level thinking to the interviewer.

Example A: Video recommendations and engagement

Consider a YouTube-style video recommendation system where the business goal is engagement measured as total watch time. The naive choice is to optimize CTR (click-through rate) using AUC on click prediction. But optimizing CTR leads the model to surface clickbait thumbnails and sensational titles. Users click more but watch less of each video, and total session duration drops. The better proxy is predicted watch time used directly as the ranking score. This aligns the ML optimization target with the actual business outcome.
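A toy sketch can show how the two ranking objectives diverge. The candidate videos and their numbers are invented for illustration, and the watch-time score is assumed to be click probability multiplied by predicted watch seconds given a click, which is only one possible way to form a predicted watch time.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    video_id: str
    p_click: float                  # predicted click probability
    watch_seconds_if_clicked: float  # predicted watch time given a click


# Hypothetical candidates for a single recommendation slot.
CANDIDATES = [
    Candidate("clickbait", 0.30, 20.0),
    Candidate("tutorial", 0.10, 400.0),
    Candidate("documentary", 0.05, 1200.0),
]


def expected_watch_seconds(c):
    """Expected watch time per impression (assumed decomposition)."""
    return c.p_click * c.watch_seconds_if_clicked


def rank_by_ctr(candidates):
    return sorted(candidates, key=lambda c: c.p_click, reverse=True)


def rank_by_expected_watch(candidates):
    return sorted(candidates, key=expected_watch_seconds, reverse=True)


if __name__ == "__main__":
    print([c.video_id for c in rank_by_ctr(CANDIDATES)])
    print([c.video_id for c in rank_by_expected_watch(CANDIDATES)])
```

Ranking by click probability puts the clickbait video first, while ranking by expected watch time puts it last, because its roughly 6 expected seconds per impression are far below the roughly 40 and 60 seconds of the other two candidates.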

Example B: Fraud detection and revenue protection

A payment company wants to reduce fraud losses, which is fundamentally a revenue protection objective. The ML metric is recall at a fixed precision threshold. Missing fraud (low recall) directly causes financial loss. But if precision drops too low, the system blocks legitimate transactions, which also harms revenue. The solution is to optimize recall subject to a precision floor, such as recall at 95% precision. This formulation captures the business trade-off inside the ML metric itself.
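Recall at a precision floor can be computed by sweeping score thresholds from highest to lowest and keeping the best recall among thresholds that satisfy the floor. The sketch below is a minimal version with a hypothetical function name and toy data; labels are assumed to be 0/1, with 1 meaning fraud.

```python
def recall_at_precision(y_true, scores, min_precision):
    """Best recall over score thresholds whose precision is >= min_precision.

    Returns (recall, threshold). The threshold is None if no cutoff meets the floor.
    A transaction is flagged when its score is >= threshold.
    """
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive example")
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)

    best_recall, best_threshold = 0.0, None
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example tied at this score before evaluating the cutoff.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        precision = tp / (tp + fp)
        recall = tp / total_pos
        if precision >= min_precision and recall > best_recall:
            best_recall, best_threshold = recall, threshold
    return best_recall, best_threshold


if __name__ == "__main__":
    labels = [1, 1, 0, 1, 0, 0]
    fraud_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
    print(recall_at_precision(labels, fraud_scores, min_precision=0.95))
    print(recall_at_precision(labels, fraud_scores, min_precision=0.75))
```

On this toy data, relaxing the floor from 0.95 to 0.75 lowers the chosen threshold and recovers the remaining fraud case at the cost of one blocked legitimate transaction, which is the trade-off the metric is designed to expose.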

Continuous monitoring and recalibration matter here because business goals evolve. A subscription platform might shift its primary objective from user acquisition (engagement) to profitability (revenue per user), which changes which ML metric serves as the best proxy.

The following quiz tests your ability to reason through alignment scenarios:

Lesson Quiz

1. A search ranking team improves NDCG@10 by 6% in offline evaluation but observes no revenue change during an A/B test. What is the most likely cause?

  A. NDCG is not a valid metric for search ranking tasks.

  B. The relevance labels used for training do not capture purchase intent.

  C. The A/B test sample size was insufficient.

  D. Revenue is not a valid business metric for search systems.

With the alignment framework established, the next section examines what happens when the alignment breaks.

When ML metrics betray business goals

A serious failure mode in ML system design is not simply poor model performance. It is a model that improves an offline or proxy ML metric while hurting the business outcome it is meant to improve. Two realistic cases illustrate this pattern.

  • The engagement trap: A social feed team optimizes CTR using AUC on click prediction. The model learns to surface sensational and divisive content because it generates clicks. Short-term engagement rises. But over weeks, session frequency drops because users feel worse after using the product. The ML metric (AUC on clicks) improved. The business metric (retention) degraded. The proxy relationship broke because clicks and long-term satisfaction are not the same thing.

  • Recommendation diversity collapse: A music streaming service optimizes NDCG on listening completion. The model converges on a narrow set of safe, popular tracks because they have the highest completion rates. Offline NDCG rises. But users perceive the product as repetitive. NPS drops and churn increases. The ML metric captured only one dimension of the listening experience while ignoring variety, which users value.

Attention: These failures are not hypothetical. They happen at scale when teams treat ML metrics as ends rather than proxies. Always ask “what behavior does this metric incentivize?” before committing to it.

Both cases share a common pattern. The ML metric captures only a slice of the business objective. Optimizing that slice pushes the model toward behaviors that satisfy the metric while violating the broader intent. The solution involves defining secondary metrics that cap harm while allowing primary metric optimization, a technique the next lesson on metric guardrails and cannibalization will address in depth.
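As a preview of the idea of secondary metrics that cap harm, the sketch below gates a launch on both the primary metric and a set of guardrails. The function name, metric names, and tolerance values are hypothetical, and a real launch process would also account for statistical significance.

```python
def launch_decision(primary_lift, guardrail_changes, guardrail_floors):
    """Ship only if the primary metric improved and no guardrail fell below its floor.

    guardrail_changes: metric name -> observed change (treatment minus control).
    guardrail_floors:  metric name -> most negative change that is still tolerated.
    Returns (ship, violations), where violations lists the breached guardrails.
    """
    violations = sorted(
        name
        for name, change in guardrail_changes.items()
        if change < guardrail_floors.get(name, 0.0)  # default: tolerate no drop
    )
    ship = primary_lift > 0 and not violations
    return ship, violations


if __name__ == "__main__":
    # Hypothetical experiment readout for the engagement-trap scenario.
    print(launch_decision(
        primary_lift=0.08,
        guardrail_changes={"retention_7d": -0.015, "nps": 0.0},
        guardrail_floors={"retention_7d": -0.005, "nps": -1.0},
    ))
```

In this readout the primary metric improves, yet the launch is blocked because 7-day retention fell further than its tolerated floor.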

Conclusion

Business metrics like revenue, engagement, retention, NPS, and CLV define what success means for the product. ML metrics like precision, recall, AUC, NDCG, and RMSE define what the model optimizes during training and evaluation. The alignment between them determines whether model improvements translate to real-world value. Senior-level interview answers always start from the business objective, work backward to the ML metric, and validate the proxy relationship through A/B testing and continuous monitoring. Even with careful alignment, optimizing one metric can degrade another. The next lesson on metric guardrails and cannibalization addresses exactly how to detect and prevent that failure mode.