A/B Testing Design
Explore the key elements of designing A/B tests in machine learning evaluation. Understand how to form precise hypotheses, select primary and guardrail metrics, plan experiment size and runtime, and avoid pitfalls like novelty effects, network interference, and peeking. This lesson equips you to design rigorous tests that validate model improvements in production and communicate the process effectively in interviews.
Offline metrics like AUC-PR, NDCG, and log loss tell you how well a model performs on historical data. They are proxies. A model that improves NDCG by 2% on a held-out test set might produce no measurable change in user behavior once deployed. Worse, it might degrade the experience in ways the offline evaluation never captured. The gap between offline promise and online reality is why major ML organizations, including Google, Meta, Netflix, and Uber, treat A/B testing as the final gate before a model change is rolled out to all users.
In ML system design interviews, candidates who only mention offline metrics miss the online validation step. You should be able to explain how to test whether an offline metric improvement leads to a measurable lift in a business KPI. This lesson covers the main steps in that process, including hypothesis formation, metric selection, sample-size planning, and common pitfalls that can invalidate production experiments.
Hypothesis formation and metric selection
Every A/B test begins with a precise, falsifiable hypothesis. For example:
“Replacing the pointwise ranker with a listwise ranker in YouTube’s recommendation system will increase average watch time per session by at least 1%.”
This statement identifies the change (listwise ranker), the metric (watch time per session), and the threshold (1%). If the experiment shows less than 1% improvement, the hypothesis is rejected.
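The three parts of the hypothesis map onto a check that can be run on experiment data. The sketch below is a minimal illustration, not a production analysis pipeline. The names `Hypothesis`, `Result`, and `evaluate` are hypothetical, and the data is synthetic. It assumes one watch-time value per session, independent sessions, and a large-sample two-sided z-test at a significance level `alpha` of 0.05. The significance requirement is an added assumption; the hypothesis above only states the 1% threshold.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist, mean, variance


@dataclass(frozen=True)
class Hypothesis:
    change: str               # what is being changed
    metric: str               # primary metric the change should move
    min_relative_lift: float  # success threshold, 0.01 means 1%


@dataclass(frozen=True)
class Result:
    relative_lift: float
    p_value: float
    supported: bool


def evaluate(h: Hypothesis, control: list[float], treatment: list[float],
             alpha: float = 0.05) -> Result:
    """Compare per-session metric values from the two experiment arms."""
    mean_c, mean_t = mean(control), mean(treatment)
    # Standard error of the difference in means (unequal variances allowed).
    se = sqrt(variance(control) / len(control)
              + variance(treatment) / len(treatment))
    z = (mean_t - mean_c) / se
    # Two-sided p-value under the normal approximation.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    lift = (mean_t - mean_c) / mean_c
    # Supported only if the lift meets the threshold and is unlikely to be noise.
    supported = lift >= h.min_relative_lift and p_value < alpha
    return Result(lift, p_value, supported)


hypothesis = Hypothesis(
    change="listwise ranker replaces pointwise ranker",
    metric="watch time per session",
    min_relative_lift=0.01,
)

# Synthetic watch times in minutes, made up purely for illustration.
control = [10.0, 11.0, 9.0, 10.0] * 50
treatment = [x * 1.05 for x in control]
result = evaluate(hypothesis, control, treatment)
print(f"lift={result.relative_lift:.2%}, supported={result.supported}")
```

Note that this checks the point estimate of the lift against the threshold. A stricter variant would require the lower bound of a confidence interval on the lift to clear 1%, which demands a larger sample.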
Primary metrics vs. guardrail metrics
Once the hypothesis is set, the next step is selecting the metrics the experiment will track. These fall into two categories:
Primary metrics are the direct target of the experiment. They represent the outcome the system change is designed to improve, such as click-through rate, revenue per session, or booking rate.
Guardrail metrics are safety constraints that must not degrade during the experiment. A model that boosts engagement but doubles serving latency would be rejected in production, regardless of how much the primary metric improved.
Consider Airbnb’s search ranking system. The primary metric is booking rate. But the team also monitors page load time and host cancellation rate as guardrails. If a new ranking model increases bookings by 2% but causes page load time to spike, the experiment fails.
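The decision rule in this example can be sketched in a few lines. This is an illustrative sketch under stated assumptions: `Guardrail` and `ship_decision` are hypothetical names, the tolerance values and the 1% minimum lift are invented, every guardrail is treated as "lower is better", and statistical noise in the measured changes is ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    name: str
    max_relative_increase: float  # hypothetical tolerance; lower is better


def ship_decision(primary_lift: float, min_lift: float,
                  guardrail_changes: dict[str, float],
                  guardrails: list[Guardrail]) -> tuple[bool, list[str]]:
    """Return (ship, names of violated guardrails)."""
    violated = [g.name for g in guardrails
                if guardrail_changes[g.name] > g.max_relative_increase]
    # Ship only if the primary metric wins AND no guardrail is breached.
    ship = primary_lift >= min_lift and not violated
    return ship, violated


guardrails = [
    Guardrail("page_load_time", max_relative_increase=0.05),
    Guardrail("host_cancellation_rate", max_relative_increase=0.02),
]

# Illustrative numbers: bookings rise 2%, but page load time rises 30%.
ship, violated = ship_decision(
    primary_lift=0.02,
    min_lift=0.01,
    guardrail_changes={"page_load_time": 0.30, "host_cancellation_rate": 0.0},
    guardrails=guardrails,
)
print(ship, violated)  # False ['page_load_time']
```

The primary metric clears its threshold here, yet the experiment still fails because one guardrail is breached. Keeping the guardrail check separate from the primary-metric check makes that outcome explicit.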
Practical tip: In interviews, always define at least two guardrail metrics, one for user experience and one for system health. This signals production maturity.
The following table illustrates how primary and guardrail metrics map across two real-world systems.
Experiment Metric Types and Examples
| Metric Type | Definition | Example (YouTube Recommendations) | Example (Uber ETA) |
| --- | --- | --- | --- |
| Primary Metric | The metric the experiment is designed to move | Watch time per session | Trip completion rate |
| Guardrail Metric (User Experience) | Must not degrade | App crash rate | Rider cancellation rate |
| Guardrail Metric (System Health) | Infrastructure constraint | Recommendation serving latency p99 | ETA model inference latency |
| Guardrail Metric (Business) | Financial safety rail | Ad revenue per session | Driver earnings per hour |
With metrics defined, the next question becomes how large the experiment needs to be and how long it must ...