Advanced Experimentation Methods
Discover advanced methods in machine learning experimentation beyond standard A/B tests. Learn how interleaving accelerates ranking evaluations, multi-armed bandits adapt traffic dynamically, switchback experiments handle marketplace interference, and quasi-experimental designs enable causal inference without randomization, preparing you to select the best evaluation techniques for complex production constraints.
Standard A/B testing is the default method for online experimentation, but it relies on assumptions that many production ML settings do not satisfy. The previous lesson introduced an A/B testing framework covering hypothesis formation, power analysis, and failure modes such as novelty effects and network interference. Now consider four scenarios where standard A/B testing is not enough.
Ranking system evaluation at Netflix or Spotify often needs to detect tiny quality differences between rankers, which can require millions of users and weeks of runtime under a standard between-subject split. In dynamic optimization problems like ad placement, a fixed 50/50 split keeps sending half of the traffic to a losing variant for the entire experiment, wasting revenue. In marketplace systems like Uber or DoorDash, treatment and control riders draw on a common pool of drivers, creating interference that violates the independence assumption. For platform-wide policy changes or country-level launches, randomized user-level assignment is impossible.
This lesson covers four advanced methods that solve these problems directly. Interleaving accelerates ranking evaluation. Multi-armed bandits dynamically optimize traffic allocation. Switchback experiments handle marketplace interference. Quasi-experimental designs enable causal inference without randomization. In an ML system design interview, reaching for the right method given a system’s constraints signals that you reason about the full deployment life cycle, not just model architecture.
Interleaving for ranking evaluation
How interleaving works
Interleaving is a within-subject experimental design where each user sees results from both ranking models merged into a single list. User interactions such as clicks, views, watch time, or purchases are used to infer which ranker contributed the items users engaged with. This differs from standard A/B testing, a between-subject design in which each user is assigned to exactly one ranker or treatment and the comparison is made across user groups.
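The attribution step described above can be sketched in a few lines of Python. This is a minimal illustration, not a production scoring rule: it assumes the merged list already carries a label ("A" or "B") saying which ranker contributed each item, that a click is the only engagement signal, and that a session is won by whichever ranker collects more clicks. The function names and the item IDs are hypothetical.

```python
from collections import Counter

def session_winner(interleaved, clicked):
    """Return "A", "B", or "tie" for one user session.

    interleaved: list of (item_id, team) pairs, where team is "A" or "B"
                 (assumed to be recorded when the list was merged).
    clicked:     set of item_ids the user engaged with.
    """
    # Credit each click to the ranker that contributed the clicked item.
    credit = Counter(team for item, team in interleaved if item in clicked)
    if credit["A"] > credit["B"]:
        return "A"
    if credit["B"] > credit["A"]:
        return "B"
    return "tie"

def preference_share(sessions):
    """Fraction of decided (non-tie) sessions won by ranker B."""
    outcomes = Counter(session_winner(lst, clk) for lst, clk in sessions)
    decided = outcomes["A"] + outcomes["B"]
    # With no decided sessions there is no evidence either way.
    return outcomes["B"] / decided if decided else 0.5

# Hypothetical sessions: (merged list with team labels, clicked items).
sessions = [
    ([("m1", "A"), ("m2", "B"), ("m3", "B"), ("m4", "A")], {"m2", "m3"}),
    ([("m5", "B"), ("m6", "A")], {"m6"}),
    ([("m7", "A"), ("m8", "B")], set()),
    ([("m9", "B"), ("m10", "A")], {"m9"}),
]
share_b = preference_share(sessions)
```

A share well above 0.5 across many sessions suggests users prefer ranker B. In practice the share would be paired with a significance test (for example, a sign test on decided sessions) before drawing a conclusion.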
The statistical efficiency gain comes from a simple insight. In a between-subject test, observed differences in click-through rate reflect both the true ranker quality difference and the enormous variance across individual users. Some users click on everything; others click on almost nothing. Interleaving removes most of this user-level variance because each user acts as their own control: a user's baseline tendency to click affects both rankers equally and cancels out of the comparison. Research from Netflix and Microsoft has reported that interleaving can detect ranking differences with roughly 100x fewer samples than a standard A/B test.
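The variance argument can be checked with a small simulation. The sketch below is an idealized model rather than a simulation of interleaving itself: it assumes each user has a baseline engagement level drawn from a wide distribution, that the better ranker adds a small fixed effect, and that engagement is a continuous score. All parameter values (`effect`, `user_sd`, `noise_sd`) are made up for illustration.

```python
import random
import statistics

def simulate(n_users=2000, effect=0.02, user_sd=0.2, noise_sd=0.05, seed=0):
    """Compare standard errors of a between-subject and a paired design."""
    rng = random.Random(seed)

    # Between-subject: each user sees one ranker, so the user's baseline
    # stays in the measurement and inflates the variance of each group.
    control, treatment = [], []
    for i in range(n_users):
        base = rng.gauss(0.3, user_sd)  # user-level engagement baseline
        if i % 2 == 0:
            control.append(base + rng.gauss(0, noise_sd))
        else:
            treatment.append(base + effect + rng.gauss(0, noise_sd))
    between_se = (
        statistics.variance(treatment) / len(treatment)
        + statistics.variance(control) / len(control)
    ) ** 0.5

    # Within-subject: each user yields a paired difference, so the
    # baseline cancels and only measurement noise remains.
    diffs = []
    for _ in range(n_users):
        base = rng.gauss(0.3, user_sd)
        score_a = base + rng.gauss(0, noise_sd)
        score_b = base + effect + rng.gauss(0, noise_sd)
        diffs.append(score_b - score_a)
    within_se = statistics.stdev(diffs) / len(diffs) ** 0.5

    return between_se, within_se

between_se, within_se = simulate()
# Required sample size scales with the square of the standard error.
sample_ratio = (between_se / within_se) ** 2
```

Under these assumed parameters the paired design has a standard error several times smaller, which translates into a much smaller required sample. The exact ratio depends entirely on how large the user-level spread is relative to the per-measurement noise.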
Note: The 100x efficiency gain is not a universal constant. It depends on the variance structure of your user population and the magnitude of the ranking quality difference, but gains on the order of one to two orders of magnitude have been reported repeatedly in practice.
The Team Draft algorithm
The most common interleaving method is the ...