
A/B Testing Design

Explore the key elements of designing A/B tests in machine learning evaluation. Understand how to form precise hypotheses, select primary and guardrail metrics, plan experiment size and runtime, and avoid pitfalls like novelty effects, network interference, and peeking. This lesson equips you to design rigorous tests that validate model improvements in production and communicate the process effectively in interviews.

Offline metrics like AUC-PR, NDCG, and log loss tell you how well a model performs on historical data. They are proxies. A model that improves NDCG by 2% on a held-out test set might produce no measurable change in user behavior once deployed. Worse, it might degrade the experience in ways the offline evaluation never captured. The gap between offline promise and online reality is why major ML organizations, including Google, Meta, Netflix, and Uber, treat A/B testing as the final gate before a model change is rolled out to all users.

In ML system design interviews, candidates who only mention offline metrics miss the online validation step. You should be able to explain how to test whether an offline metric improvement leads to a measurable lift in a business KPI. This lesson covers the main steps in that process, including hypothesis formation, metric selection, sample-size planning, and common pitfalls that can invalidate production experiments.

Hypothesis formation and metric selection

Every A/B test begins with a falsifiable hypothesis: a specific, testable statement that predicts a measurable outcome and can be proven wrong by experimental data. This hypothesis links a concrete system change to an expected metric movement. Vague goals like “improve the user experience” are not testable. A well-formed hypothesis looks like this:

“Replacing the pointwise ranker with a listwise ranker in YouTube’s recommendation system will increase average watch time per session by at least 1%.”

This statement identifies the change (listwise ranker), the metric (watch time per session), and the threshold (a 1% relative increase). If the measured lift, after accounting for statistical noise, falls short of 1%, the hypothesis is rejected.
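One way to make a hypothesis like this machine-checkable is to store its three parts as data and compare the threshold against a confidence interval for the observed lift. The sketch below is illustrative only: the `Hypothesis` class, the `evaluate` function, and the three-way outcome ("supported", "rejected", "inconclusive") are assumptions introduced here, and the interval uses a simple normal approximation that ignores uncertainty in the control mean.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist, mean, variance


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical container for the three parts of a testable hypothesis."""
    change: str                # the system change under test
    metric: str                # the primary metric expected to move
    min_relative_lift: float   # threshold, e.g. 0.01 for "at least 1%"


def relative_lift_ci(control, treatment, confidence=0.95):
    """Return (lift, lower, upper) for the relative lift in the mean.

    Normal approximation on the difference in means, divided by the
    control mean. Treats the control mean as fixed, which is a
    simplification.
    """
    mean_c, mean_t = mean(control), mean(treatment)
    se = sqrt(variance(control) / len(control)
              + variance(treatment) / len(treatment))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    diff = mean_t - mean_c
    return diff / mean_c, (diff - z * se) / mean_c, (diff + z * se) / mean_c


def evaluate(hypothesis, control, treatment):
    """Compare the lift interval against the hypothesis threshold."""
    _, lower, upper = relative_lift_ci(control, treatment)
    if lower >= hypothesis.min_relative_lift:
        return "supported"      # whole interval clears the threshold
    if upper < hypothesis.min_relative_lift:
        return "rejected"       # whole interval falls short of it
    return "inconclusive"       # interval straddles the threshold


h = Hypothesis(
    change="listwise ranker replaces pointwise ranker",
    metric="watch_time_per_session",
    min_relative_lift=0.01,
)
```

Separating "rejected" from "inconclusive" is a design choice of this sketch: an interval that straddles the threshold usually means the experiment was too small to decide, not that the change failed.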

Primary metrics vs. guardrail metrics

Once the hypothesis is set, the next step is selecting the metrics the experiment will track. These fall into two categories:

  • Primary metrics are the direct target of the experiment. They represent the outcome the system change is designed to improve, such as click-through rate, revenue per session, or booking rate.

  • Guardrail metrics are safety constraints that must not degrade during the experiment. A model that boosts engagement but doubles serving latency would be rejected in production, regardless of how much the primary metric improved.

Consider Airbnb’s search ranking system. The primary metric is booking rate. But the team also monitors page load time and host cancellation rate as guardrails. If a new ranking model increases bookings by 2% but causes page load time to spike, the experiment fails.
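The interaction between the primary metric and the guardrails can be sketched as a small decision rule in which guardrails are checked first and act as a veto. Everything named here (`Guardrail`, `ship_decision`, the metric keys, and the tolerance values) is hypothetical. The sketch also assumes every guardrail is a "lower is better" metric and compares point estimates only, leaving out the significance testing a real experiment would need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    """Hypothetical guardrail: a metric where an increase means degradation."""
    name: str
    max_relative_increase: float  # tolerated worsening, e.g. 0.05 for 5%


def ship_decision(primary_lift, min_lift, observed_changes, guardrails):
    """Return (decision, violated_guardrail_names).

    observed_changes maps guardrail name -> relative change in treatment
    versus control. A missing guardrail measurement raises KeyError on
    purpose, so an unmonitored guardrail cannot pass silently.
    """
    violated = [g.name for g in guardrails
                if observed_changes[g.name] > g.max_relative_increase]
    if violated:
        # Guardrails veto the launch regardless of the primary metric.
        return "reject", violated
    if primary_lift >= min_lift:
        return "ship", []
    return "no_ship", []


# Illustrative tolerances, not real Airbnb values.
guardrails = [
    Guardrail("page_load_time", max_relative_increase=0.05),
    Guardrail("host_cancellation_rate", max_relative_increase=0.02),
]

# Bookings rise 2%, but page load time rises 30%: the guardrail wins.
decision, violated = ship_decision(
    primary_lift=0.02,
    min_lift=0.01,
    observed_changes={"page_load_time": 0.30, "host_cancellation_rate": 0.0},
    guardrails=guardrails,
)
```

Checking guardrails before the primary metric mirrors the rule in the text: a guardrail violation fails the experiment no matter how large the primary lift is.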

Practical tip: In interviews, always define at least two guardrail metrics, one for user experience and one for system health. This signals production maturity.

The following table illustrates how primary and guardrail metrics map across two real-world systems.

Experiment Metric Types and Examples

| Metric Type | Definition | Example (YouTube Recommendations) | Example (Uber ETA) |
| --- | --- | --- | --- |
| Primary Metric | The metric the experiment is designed to move | Watch time per session | Trip completion rate |
| Guardrail Metric (User Experience) | Must not degrade | App crash rate | Rider cancellation rate |
| Guardrail Metric (System Health) | Infrastructure constraint | Recommendation serving latency p99 | ETA model inference latency |
| Guardrail Metric (Business) | Financial safety rail | Ad revenue per session | Driver earnings per hour |

With metrics defined, the next question becomes how large the experiment needs to be and how long it must ...