Exploration vs. Exploitation

Explore the exploration-exploitation trade-off in machine learning systems for content ranking and recommendations. Learn how bandit algorithms such as epsilon-greedy, UCB, and Thompson Sampling balance showing high-confidence items versus exploring uncertain options to improve long-term system performance. Understand contextual bandits for personalization and the distinction between bandits and full reinforcement learning, equipping you to design adaptive, production-ready ML models.

We'll cover the following...

Three bandit algorithms and their trade-offs
Contextual bandits for personalization
- From arms to policies
  - Reward modeling and off-policy evaluation
When RL is the right tool vs. bandits
Summary

Many production ML systems that serve content must make an exploration-exploitation trade-off during ranking or serving. Should the system show an item with strong historical performance, or allocate exposure to an item with more uncertain expected performance that could improve long-term ranking quality by reducing uncertainty? This tension between exploiting high-confidence candidates and exploring lower-confidence candidates is a central design decision in ads, recommendations, and feed ranking systems. In senior-level interviews, explaining this trade-off clearly shows that you understand ranking systems beyond offline model accuracy.

Having covered how to compress and efficiently serve models in the previous lesson, we now shift to a fundamentally different production challenge. The model is deployed and fast, but what should it show? A system that always serves the highest-estimated-CTR ad or the top-ranked recommendation maximizes short-term reward. But it also starves new content of data, creating a feedback loop where popular items get shown more, accumulate more positive signals, and crowd out potentially superior alternatives that never had a chance.

Most candidates default to proposing static A/B tests for this problem. Production systems at Meta, Google, and Netflix go much further. They bake continuous, adaptive exploration directly into the serving path. The formal objective that governs this process is regret minimizationThe cumulative difference between the reward you would have earned by always choosing the true best option and the reward you actually earned by exploring.. A good exploration strategy keeps cumulative regret sublinear, meaning the per-decision cost of exploration shrinks over time.

This lesson covers three bandit algorithms with their trade-offs, contextual bandits for personalization, and the critical boundary between bandits and full reinforcement learning.

The following diagram illustrates how the explore-exploit spectrum plays out in a real serving system:

With this tension clearly framed, let’s examine the three algorithms that production systems use to navigate it.

Three bandit algorithms and their trade-offs

The multi-armed bandit formulationA decision framework where a system faces K options (called arms), each with an unknown reward distribution, and must choose one arm per request, observing a reward such as a click or conversion after each choice. is a decision framework where a system faces K options (called arms), each with an unknown reward distribution, and must choose one arm per request, observing a reward such as a click or conversion after each choice. A common analogy is a multi-armed bandit: a system chooses among actions with unknown reward distributions and tries to maximize cumulative reward over repeated decisions.

Epsilon-greedy

The simplest approach splits decisions into two modes. With probability $\epsilon$ (typically 0.05 to 0.1), the system selects a random arm. Otherwise, it exploits the arm with the highest observed average reward.

Implementation is trivial, which makes epsilon-greedy a strong prototyping baseline. The problem is that exploration is uniform. The system allocates the same exploration budget to an arm it has pulled 100,000 times (and knows is bad) as to an arm it has pulled 10 times (and knows almost nothing about). With a fixed $\epsilon$ , regret grows linearly with time. Decaying $\epsilon$ over time ...

1.The Interview Framework and Communication

2.Problem Formulation and Requirements

3.Data Strategy: Collection, Pipelines, and Features

4.Model Design and Architecture Selection

5.Evaluation: Offline, Online, and Fairness

6.Serving, Deployment, and MLOps

7.Case Study: Video Recommendation System

8.Case Study: Social Feed Ranking System

9.Case Study: Ad Click-Through Rate Prediction System

Mock Interview

10.Case Study: Semantic Search Engine

11.Case Study: Content Moderation System

Mock Interview

12.Case Study: Object Detection System

Mock Interview

13.Case Study: Visual Search System

Mock Interview

14.Case Study: Fraud Detection System

Mock Interview

15.Case Study: RAG-Based Enterprise Knowledge Assistant

16.Case Study: LLM-Powered Code Generation Tool

Exploration vs. Exploitation

Three bandit algorithms and their trade-offs

Epsilon-greedy