Retrieval and Ranking Pipelines
Explore the design of multi-stage retrieval and ranking pipelines that efficiently narrow billions of items to the top results in under 200 milliseconds. Understand candidate generation with two-tower ANN architectures, trade-offs in ANN indexing, metadata filtering strategies, and the role of coarse and fine-grained ranking models. Gain insights into balancing recall, relevance, and latency for large-scale recommendation systems and how to communicate these concepts effectively in ML system design interviews.
When a user opens YouTube or scrolls through Instagram, the platform must select a handful of items from a corpus of hundreds of millions, sometimes billions, of candidates. The entire process completes in under 200 milliseconds. This is not a single model call. It is a carefully staged funnel where each layer narrows the candidate set while increasing model sophistication. With serving paradigms like synchronous inference and batch pre-computation established in the previous lesson, this lesson shows how those patterns combine inside the multi-stage funnel that powers search and recommendations at MAANG scale.
Consider a concrete interview prompt: “Design a video recommendation system that serves 1 billion users against 500 million videos with a p99 latency SLA of 200 ms.” No single model can score 500 million items per request within that budget. Instead, the system decomposes the problem into three stages: candidate generation, coarse ranking, and fine-grained re-ranking. Each stage has its own latency budget, model architecture, and failure modes. Interviewers expect you to articulate not just what each stage does, but why it exists and what breaks if you skip it.
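A quick back-of-the-envelope calculation makes the "no single model" claim concrete. The sketch below takes the corpus size and SLA from the prompt; the per-item scoring cost of 1 microsecond is purely an assumption chosen to be optimistic, and the helper names are illustrative.

```python
# Back-of-the-envelope check: can one model score the whole corpus per request?
CORPUS_SIZE = 500_000_000   # videos, taken from the interview prompt
LATENCY_BUDGET_MS = 200     # p99 SLA, taken from the interview prompt

# ASSUMPTION: an optimistic 1 microsecond of compute per scored item,
# executed sequentially on a single worker. Real rankers are often slower.
ASSUMED_COST_PER_ITEM_US = 1.0


def full_scan_latency_ms(num_items: int, cost_per_item_us: float) -> float:
    """Time to score every item one after another, in milliseconds."""
    return num_items * cost_per_item_us / 1_000.0


def max_items_within_budget(budget_ms: float, cost_per_item_us: float) -> int:
    """How many items fit in the budget if scoring is the only cost."""
    return int(budget_ms * 1_000.0 / cost_per_item_us)


full_scan_ms = full_scan_latency_ms(CORPUS_SIZE, ASSUMED_COST_PER_ITEM_US)
affordable = max_items_within_budget(LATENCY_BUDGET_MS, ASSUMED_COST_PER_ITEM_US)

print(f"Full scan: {full_scan_ms / 1000:.0f} s vs. a {LATENCY_BUDGET_MS} ms budget")
print(f"Items scorable within budget: {affordable:,}")
print(f"Required reduction factor: {CORPUS_SIZE / affordable:,.0f}x")
# Full scan: 500 s vs. a 200 ms budget
# Items scorable within budget: 200,000
# Required reduction factor: 2,500x
```

Under this assumed cost, even perfect 1,000-way parallelism would still take 500 ms for a full scan, and that ignores network hops and feature fetching. This is why the expensive model is reserved for a candidate set that earlier, cheaper stages have already shrunk by several orders of magnitude.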
The following diagram illustrates how the funnel progressively narrows the candidate set from billions to the final ranked slate.
Each stage in this funnel operates under strict constraints, and the design decisions at one stage ripple through every downstream component. The next sections unpack each stage in detail, starting with candidate generation.
Candidate generation with two-tower ANN
The first stage must reduce billions of items to roughly 500–2,000 candidates within about 10–20 milliseconds. The dominant approach uses a two-tower model paired with approximate nearest neighbor (ANN) search.
How the two towers work
The query tower ingests user context (user ID, session history, device type) and produces a dense embedding vector. The item tower ingests item features (item ID, category, metadata) and produces an embedding of the same dimensionality. Both towers are trained jointly so that relevant query-item pairs land close together ...
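The two-tower flow described above can be sketched in a few lines of plain Python. This is a toy illustration under loud assumptions, not a trained model: `feature_embedding` is a hypothetical stand-in for learned embedding tables (it produces deterministic pseudo-random vectors), mean pooling replaces the neural layers a real tower would have, and an exact brute-force top-k stands in for the ANN index. All names, feature prefixes, and the tiny catalog are made up for illustration.

```python
import heapq
import math
import random
import zlib

DIM = 8  # both towers must emit vectors of the same dimensionality


def feature_embedding(feature: str) -> list[float]:
    """Hypothetical stand-in for a learned embedding lookup (untrained)."""
    rng = random.Random(zlib.crc32(feature.encode("utf-8")))
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]


def tower(features: list[str]) -> list[float]:
    """Mean-pool the feature embeddings, then L2-normalize the result."""
    vectors = [feature_embedding(f) for f in features]
    pooled = [sum(column) / len(vectors) for column in zip(*vectors)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]


def query_tower(user_id: str, session_history: list[str], device_type: str) -> list[float]:
    """Encode user context into a dense query embedding."""
    features = [f"user:{user_id}", f"device:{device_type}"]
    features += [f"seen:{item_id}" for item_id in session_history]
    return tower(features)


def item_tower(item_id: str, category: str) -> list[float]:
    """Encode item features into an embedding of the same dimensionality."""
    return tower([f"item:{item_id}", f"category:{category}"])


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def retrieve(query_vec: list[float], item_index: dict, k: int) -> list[tuple]:
    """Exact top-k by dot product; an ANN index would replace this scan."""
    scored = ((dot(query_vec, vec), item_id) for item_id, vec in item_index.items())
    return heapq.nlargest(k, scored)


# Item embeddings depend only on item features, so they can be built ahead of time.
catalog = {"v1": "cooking", "v2": "cooking", "v3": "gaming", "v4": "music"}
item_index = {item_id: item_tower(item_id, cat) for item_id, cat in catalog.items()}

# The query embedding is computed at request time from the user's context.
query_vec = query_tower("u42", ["v1", "v3"], "mobile")
top = retrieve(query_vec, item_index, k=2)
for score, item_id in top:
    print(f"{item_id}: {score:.3f}")
```

Because the embeddings here are untrained, the resulting scores carry no meaning; the point is the shape of the computation. The two towers never see each other's inputs, so the only interaction between user and item is the final dot product, which is what allows item vectors to be indexed independently of any particular request.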