A/B Testing Design

Explore the key elements of designing A/B tests in machine learning evaluation. Understand how to form precise hypotheses, select primary and guardrail metrics, plan experiment size and runtime, and avoid pitfalls like novelty effects, network interference, and peeking. This lesson equips you to design rigorous tests that validate model improvements in production and communicate the process effectively in interviews.

We'll cover the following...

Hypothesis formation and metric selection
- Primary metrics vs. guardrail metrics
Experiment sizing and runtime
- Statistical power analysis
- Minimum detectable effect and sample size
  - Runtime estimation
Common pitfalls in A/B testing
Putting it together for interviews
Conclusion

Offline metrics like AUC-PR, NDCG, and log loss tell you how well a model performs on historical data, but they are only proxies for online behavior. A model that improves NDCG by 2% on a held-out test set might produce no measurable change in user behavior once deployed. Worse, it might degrade the experience in ways the offline evaluation never captured. This gap between offline promise and online reality is why major ML organizations, including Google, Meta, Netflix, and Uber, treat A/B testing as the final gate before a model change is rolled out to all users.

In ML system design interviews, candidates who only mention offline metrics miss the online validation step. You should be able to explain how to test whether an offline metric improvement leads to a measurable lift in a business KPI. This lesson covers the main steps in that process, including hypothesis formation, metric selection, sample-size planning, and common pitfalls that can invalidate production experiments.

Hypothesis formation and metric selection

Every A/B test begins with a falsifiable hypothesis: a specific, testable statement that predicts a measurable outcome and can be proven wrong by experimental data. This hypothesis links a concrete system change to an expected metric movement. Vague goals like “improve the user experience” are not testable. A well-formed hypothesis looks like this:

“Replacing the pointwise ranker with a listwise ranker in YouTube’s recommendation system will increase average watch time per session by at least 1%.”

This statement identifies the change (a listwise ranker), the metric (average watch time per session), and the threshold (at least 1%). If the experiment does not show a statistically significant improvement of at least 1%, the hypothesis is not supported.
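As a rough illustration, the three parts of a hypothesis can be written down as data and checked against experiment results. The sketch below is a minimal example under stated assumptions, not a production analysis: the names `Hypothesis`, `relative_lift`, `one_sided_p_value`, and `is_supported` are hypothetical, the sample values are made up, and the significance check uses a simple normal approximation with an assumed significance level of 0.05.

```python
from dataclasses import dataclass
from math import sqrt
from statistics import NormalDist, mean, variance


@dataclass(frozen=True)
class Hypothesis:
    change: str               # the system change under test
    metric: str               # the primary metric expected to move
    min_relative_lift: float  # threshold, e.g. 0.01 for "at least 1%"


def relative_lift(control, treatment):
    """Relative change in the treatment mean versus the control mean."""
    return (mean(treatment) - mean(control)) / mean(control)


def one_sided_p_value(control, treatment):
    """Normal-approximation p-value for 'treatment mean > control mean'.

    Assumes both samples have at least two values and non-zero variance.
    """
    se = sqrt(variance(control) / len(control)
              + variance(treatment) / len(treatment))
    z = (mean(treatment) - mean(control)) / se
    return 1.0 - NormalDist().cdf(z)


def is_supported(hypothesis, control, treatment, alpha=0.05):
    """True only if the observed lift meets the threshold and the
    difference is unlikely to be noise at significance level alpha."""
    lift = relative_lift(control, treatment)
    p_value = one_sided_p_value(control, treatment)
    return lift >= hypothesis.min_relative_lift and p_value < alpha


# Illustrative data only: watch time per session in minutes.
h = Hypothesis(
    change="listwise ranker replaces pointwise ranker",
    metric="average watch time per session",
    min_relative_lift=0.01,
)
control_sessions = [30.0, 31.0, 29.0, 30.0] * 50
treatment_sessions = [x * 1.02 for x in control_sessions]
print(is_supported(h, control_sessions, treatment_sessions))
```

Requiring both conditions keeps a lucky but noisy result from passing, and keeps a statistically significant but tiny lift from being counted as a win.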

Primary metrics vs. guardrail metrics

Once the hypothesis is set, the next step is selecting the metrics the experiment will track. These fall into two categories:

- Primary metrics are the direct target of the experiment. They represent the outcome the system change is designed to improve, such as click-through rate, revenue per session, or booking rate.
- Guardrail metrics are safety constraints that must not degrade during the experiment. A model that boosts engagement but doubles serving latency would be rejected in production, regardless of how much the primary metric improved.

Consider Airbnb’s search ranking system. The primary metric is booking rate. But the team also monitors page load time and host cancellation rate as guardrails. If a new ranking model increases bookings by 2% but causes page load time to spike, the experiment fails.
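The pass/fail logic in this example can be sketched in a few lines. This is a simplified illustration, not Airbnb's actual process: the `Guardrail` class, the function names, the 5% tolerances, and all metric values are assumptions made for the example, and every guardrail is assumed to be a "lower is better" metric.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Guardrail:
    name: str
    max_relative_increase: float  # tolerated degradation, e.g. 0.05 for +5%


def relative_change(control_value, treatment_value):
    """Relative change of the treatment value versus the control value."""
    return (treatment_value - control_value) / control_value


def breached_guardrails(guardrails, control, treatment):
    """Names of guardrails whose degradation exceeds the tolerance.

    Assumes each guardrail is 'lower is better' (latency, cancellations).
    """
    return [
        g.name
        for g in guardrails
        if relative_change(control[g.name], treatment[g.name])
        > g.max_relative_increase
    ]


def experiment_passes(primary_lift, min_primary_lift, breached):
    """A primary-metric win counts only if no guardrail is breached."""
    return primary_lift >= min_primary_lift and not breached


# Hypothetical tolerances and measurements for illustration.
guardrails = [
    Guardrail("page_load_time_ms", 0.05),
    Guardrail("host_cancellation_rate", 0.05),
]
control = {"page_load_time_ms": 800.0, "host_cancellation_rate": 0.020}
treatment = {"page_load_time_ms": 1200.0, "host_cancellation_rate": 0.020}

breached = breached_guardrails(guardrails, control, treatment)
print(breached)
print(experiment_passes(0.02, 0.01, breached))
```

With these made-up numbers, the 2% booking lift clears the primary threshold, but the page load time guardrail is breached, so the experiment does not pass.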

Practical tip: In interviews, always define at least two guardrail metrics, one for user experience and one for system health. This signals production maturity.

The following table illustrates how primary and guardrail metrics map across two real-world systems.

Experiment Metric Types and Examples

Metric Type	Definition	Example (YouTube Recommendations)	Example (Uber ETA)
Primary Metric	The metric the experiment is designed to move	Watch time per session	Trip completion rate
Guardrail Metric (User Experience)	Must not degrade	App crash rate	Rider cancellation rate
Guardrail Metric (System Health)	Infrastructure constraint	Recommendation serving latency p99	ETA model inference latency
Guardrail Metric (Business)	Financial safety rail	Ad revenue per session	Driver earnings per hour
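
One way to make the table actionable is to encode each metric's role in an experiment spec and check that the spec is complete before launch. The sketch below is a hypothetical structure, not a real experimentation platform API: `MetricRole`, `ExperimentSpec`, and `missing_roles` are names invented for the example, and the metric identifiers are taken from the YouTube column above.

```python
from dataclasses import dataclass
from enum import Enum


class MetricRole(Enum):
    PRIMARY = "primary"
    GUARDRAIL_UX = "guardrail_user_experience"
    GUARDRAIL_SYSTEM = "guardrail_system_health"
    GUARDRAIL_BUSINESS = "guardrail_business"


# Roles this sketch treats as mandatory: one primary metric plus the
# user-experience and system-health guardrails from the practical tip.
REQUIRED_ROLES = (
    MetricRole.PRIMARY,
    MetricRole.GUARDRAIL_UX,
    MetricRole.GUARDRAIL_SYSTEM,
)


@dataclass
class ExperimentSpec:
    name: str
    metrics: dict  # metric name -> MetricRole

    def missing_roles(self):
        """Required roles that no metric in the spec covers."""
        present = set(self.metrics.values())
        return [role for role in REQUIRED_ROLES if role not in present]


youtube_spec = ExperimentSpec(
    name="listwise_ranker_test",
    metrics={
        "watch_time_per_session": MetricRole.PRIMARY,
        "app_crash_rate": MetricRole.GUARDRAIL_UX,
        "recommendation_serving_latency_p99": MetricRole.GUARDRAIL_SYSTEM,
        "ad_revenue_per_session": MetricRole.GUARDRAIL_BUSINESS,
    },
)
print(youtube_spec.missing_roles())
```

A spec that lists only a primary metric would report both guardrail roles as missing, which is a cheap way to catch an under-specified experiment before it starts.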
