Advanced Experimentation Methods

Discover advanced methods in machine learning experimentation beyond standard A/B tests. Learn how interleaving accelerates ranking evaluations, multi-armed bandits adapt traffic dynamically, switchback experiments handle marketplace interference, and quasi-experimental designs enable causal inference without randomization, preparing you to select the best evaluation techniques for complex production constraints.

We'll cover the following...

Interleaving for ranking evaluation
- How interleaving works
- The Team Draft algorithm
Multi-armed bandits as adaptive allocation
- The explore-exploit trade-off
- When bandits beat A/B tests
Switchback experiments for marketplaces
- The network interference problem
- How switchback designs work
Quasi-experimental methods
- Difference-in-differences
- Synthetic control
Conclusion

Standard A/B testing is the default method for online experimentation, but it relies on assumptions that many production ML settings do not satisfy. The previous lesson introduced an A/B testing framework covering hypothesis formation, power analysis, and failure modes such as novelty effects and network interference. Now consider four scenarios where standard A/B testing is not enough.

Ranking system evaluation at Netflix or Spotify often needs to detect tiny quality differences between rankers, which can require millions of users and weeks of runtime under a standard between-subject split. In dynamic optimization problems like ad placement, a standard A/B test locks traffic into a fixed split such as 50/50, so the losing variant keeps costing revenue for the entire experiment duration. Marketplace systems like Uber or DoorDash share a common pool of drivers between treatment and control riders, creating interference that violates the independence assumption. Platform-wide policy changes or country-level launches make randomized user-level assignment impossible.

This lesson covers four advanced methods that solve these problems directly. Interleaving accelerates ranking evaluation. Multi-armed bandits dynamically optimize traffic allocation. Switchback experiments handle marketplace interference. Quasi-experimental designs enable causal inference without randomization. In an ML system design interview, reaching for the right method given a system’s constraints signals that you reason about the full deployment life cycle, not just model architecture.

Interleaving for ranking evaluation

How interleaving works

Interleaving is a within-subject experimental design where each user sees results from both ranking models merged into a single list. User interactions such as clicks, views, watch time, or purchases are used to infer which ranker contributed the items users engaged with. This differs from standard A/B testing, a between-subject design in which each user is assigned to a single ranker for the duration of the experiment and the comparison is made across user groups.
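The merge-and-credit idea can be made concrete with Team Draft interleaving, the method named in the next section. What follows is a minimal sketch rather than a production implementation. The function names (`team_draft_interleave`, `credit_clicks`), the item IDs, and the choice to count each click as one unit of credit are illustrative assumptions, not details taken from this lesson.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, rng=None):
    """Merge two rankings into one list, remembering which ranker picked each item.

    Hypothetical helper for illustration. The team with fewer picks goes next;
    ties are broken by a coin flip, and each pick is that ranker's
    highest-ranked item not already in the merged list.
    """
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved = []
    team_of = {}  # item -> "A" or "B"

    def next_unused(ranking):
        for item in ranking:
            if item not in team_of:
                return item
        return None

    while len(interleaved) < length:
        if picks["A"] < picks["B"]:
            order = ["A", "B"]
        elif picks["B"] < picks["A"]:
            order = ["B", "A"]
        else:
            order = ["A", "B"] if rng.random() < 0.5 else ["B", "A"]

        for team in order:
            item = next_unused(rankings[team])
            if item is not None:
                interleaved.append(item)
                team_of[item] = team
                picks[team] += 1
                break
        else:
            break  # both rankings are exhausted

    return interleaved, team_of


def credit_clicks(clicked_items, team_of):
    """Credit each click to the ranker whose team contributed the clicked item."""
    credit = {"A": 0, "B": 0}
    for item in clicked_items:
        team = team_of.get(item)
        if team is not None:
            credit[team] += 1
    return credit


# Usage with made-up item IDs: one user session, one click.
ranking_a = ["m1", "m2", "m3", "m4"]
ranking_b = ["m3", "m1", "m5", "m6"]
merged, team_of = team_draft_interleave(ranking_a, ranking_b, length=4,
                                        rng=random.Random(0))
credit = credit_clicks(["m3"], team_of)
print(merged, credit)
```

In a real experiment the per-session credits would be aggregated across many users, and the ranker that wins more sessions would be preferred, subject to a significance test. Click-count credit is the simplest choice here; weighting by watch time or purchases is an equally plausible design.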

The statistical efficiency gain comes from a simple insight. In a between-subject test, observed differences in click-through rate reflect both the true ranker quality difference and the enormous variance across individual users. Some users click on everything; others click on almost nothing. Interleaving removes most of this user-level variance because each user acts as their own control: a user's overall tendency to click affects both rankers' items equally and cancels out of the comparison. Research from Netflix and Microsoft has reported that interleaving can detect ranking differences with roughly 100x fewer samples than a standard A/B test.

Note: The 100x efficiency gain is not a universal constant. It depends on the variance structure of your user population and the magnitude of the ranking quality difference, but the order-of-magnitude improvement is consistently observed in practice.
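A small simulation can illustrate why using each user as their own control shrinks the variance. This is a stylized sketch of a paired comparison, not a model of any real interleaving system. The Gaussian engagement model and every parameter (`effect`, `user_sd`, `noise_sd`, `n_users`) are assumptions chosen for illustration, and the size of the gain it shows depends entirely on those choices.

```python
import math
import random
import statistics


def compare_designs(n_users=2000, effect=0.05, user_sd=1.0, noise_sd=0.2, seed=0):
    """Return (between_se, within_se): standard errors of the estimated effect.

    Assumed model: engagement = user baseline + ranker effect + noise,
    where baselines vary much more across users than the noise does.
    """
    rng = random.Random(seed)
    baselines = [rng.gauss(0.0, user_sd) for _ in range(n_users)]

    # Between-subject: half the users see ranker A, the other half see ranker B.
    half = n_users // 2
    group_a = [b + rng.gauss(0.0, noise_sd) for b in baselines[:half]]
    group_b = [b + effect + rng.gauss(0.0, noise_sd) for b in baselines[half:]]
    between_se = math.sqrt(
        statistics.variance(group_a) / len(group_a)
        + statistics.variance(group_b) / len(group_b)
    )

    # Within-subject: every user is exposed to both rankers, so the user's
    # baseline appears in both measurements and cancels in the difference.
    diffs = [
        (b + effect + rng.gauss(0.0, noise_sd)) - (b + rng.gauss(0.0, noise_sd))
        for b in baselines
    ]
    within_se = statistics.stdev(diffs) / math.sqrt(len(diffs))

    return between_se, within_se


between_se, within_se = compare_designs()
print(f"between-subject SE: {between_se:.4f}")
print(f"within-subject SE:  {within_se:.4f}")
# Required sample size scales with the square of the standard error.
print(f"approx. sample-size ratio: {(between_se / within_se) ** 2:.1f}x")
```

Raising `user_sd` relative to `noise_sd` widens the gap, and lowering it narrows the gap, which is one way to see why the efficiency gain is not a universal constant.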

The Team Draft algorithm

The most common interleaving method is the ...
