Retrieval and Ranking Pipelines

Explore the design of multi-stage retrieval and ranking pipelines that efficiently narrow billions of items to the top results in under 200 milliseconds. Understand candidate generation with two-tower ANN architectures, trade-offs in ANN indexing, metadata filtering strategies, and the role of coarse and fine-grained ranking models. Gain insights into balancing recall, relevance, and latency for large-scale recommendation systems and how to communicate these concepts effectively in ML system design interviews.

We'll cover the following...

Candidate generation with two-tower ANN
- How the two towers work
- Why recall@k is the critical metric
ANN indexing and vector database design
- Core indexing structures
- Billion-scale design considerations
  - The metadata filtering problem
Coarse ranking and fine-grained re-ranking
- Coarse ranking
- Fine-grained re-ranking
Conclusion

When a user opens YouTube or scrolls through Instagram, the platform must select a handful of items from a corpus of hundreds of millions, sometimes billions, of candidates. The entire process completes in under 200 milliseconds. This is not a single model call. It is a carefully staged funnel where each layer narrows the candidate set while increasing model sophistication. With serving paradigms like synchronous inference and batch pre-computation established in the previous lesson, this lesson shows how those patterns combine inside the multi-stage funnel that powers search and recommendations at MAANG scale.

Consider a concrete interview prompt: “Design a video recommendation system that serves 1 billion users against 500 million videos with a p99 latency SLA of 200 ms.” No single model can score 500 million items per request within that budget. Instead, the system decomposes the problem into three stages: candidate generation, coarse ranking, and fine-grained re-ranking. Each stage has its own latency budget, model architecture, and failure modes. Interviewers expect you to articulate not just what each stage does, but why it exists and what breaks if you skip it.

The following diagram illustrates how the funnel progressively narrows the candidate set from billions to the final ranked slate.

Each stage in this funnel operates under strict constraints, and the design decisions at one stage ripple through every downstream component. The next sections unpack each stage in detail, starting with candidate generation.
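As a rough illustration of the funnel idea, the sketch below chains three stages that each keep fewer items while using a different scorer. Everything in it is an assumption made for illustration: the `run_funnel` and `top_k` helpers, the three toy scoring functions, and the stage sizes of 1,000, 100, and 10 are hypothetical stand-ins, not values from a production system.

```python
from typing import Callable, Iterable, List, Tuple

Stage = Tuple[str, Callable[[int], float], int]  # (name, scorer, items to keep)


def top_k(items: List[int], score: Callable[[int], float], k: int) -> List[int]:
    """Return the k highest-scoring items, best first."""
    return sorted(items, key=score, reverse=True)[:k]


def run_funnel(corpus: Iterable[int], stages: List[Stage]):
    """Apply each stage in order, shrinking the candidate set every time."""
    candidates = list(corpus)
    sizes = []
    for name, scorer, keep in stages:
        candidates = top_k(candidates, scorer, keep)
        sizes.append((name, len(candidates)))
    return candidates, sizes


# Hypothetical scorers: cheap and approximate first, costly and precise last.
def cheap_score(item: int) -> float:
    return -abs(item - 5000)  # stand-in for an ANN similarity score


def medium_score(item: int) -> float:
    return -abs(item - 5000) + 0.1 * (item % 7)  # stand-in for a light ranker


def heavy_score(item: int) -> float:
    return -abs(item - 5003)  # stand-in for a heavy re-ranking model


stages = [
    ("candidate_generation", cheap_score, 1000),
    ("coarse_ranking", medium_score, 100),
    ("fine_reranking", heavy_score, 10),
]
slate, sizes = run_funnel(range(100_000), stages)
print(sizes)
print(slate[0])
```

The point of the sketch is structural: the expensive scorer only ever sees the 100 items that survived the cheaper stages, which is what lets the most sophisticated model fit inside a tight latency budget.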

Candidate generation with two-tower ANN

The first stage must reduce billions of items to roughly 500–2,000 candidates within 10–20 milliseconds. The dominant approach uses a two-tower architecture: a neural network design in which separate encoder networks (towers) independently produce embeddings for queries and items, enabling offline pre-computation of item embeddings and fast online retrieval via approximate nearest neighbor (ANN) search.

How the two towers work

The query tower ingests user context (user ID, session history, device type) and produces a dense embedding vector. The item tower ingests item features (item ID, category, metadata) and produces an embedding of the same dimensionality. Both towers are trained jointly so that relevant query-item pairs land close together ...
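The two towers can be sketched in a few lines, under heavy simplifying assumptions. Here each "tower" is a single hand-written linear layer with L2 normalization, similarity is a dot product (one common choice, assumed here), and a brute-force sort stands in for a real ANN index. The names `linear_tower`, `retrieve`, `QUERY_WEIGHTS`, `ITEM_WEIGHTS`, and all feature values are hypothetical; real towers are deep networks whose weights are learned jointly.

```python
import math
from typing import Dict, List

Vector = List[float]


def linear_tower(features: Vector, weights: List[Vector]) -> Vector:
    """Toy tower: one linear layer followed by L2 normalization."""
    raw = [sum(w * x for w, x in zip(row, features)) for row in weights]
    norm = math.sqrt(sum(v * v for v in raw)) or 1.0
    return [v / norm for v in raw]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


# Hypothetical hand-picked weights; both towers emit 2-dimensional embeddings.
QUERY_WEIGHTS = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]  # 3 user features -> 2 dims
ITEM_WEIGHTS = [[1.0, 0.0], [0.0, 1.0]]             # 2 item features -> 2 dims

# Offline step: embed every item once and store the result.
item_features: Dict[str, Vector] = {
    "cooking_video": [0.9, 0.1],
    "gaming_video": [0.1, 0.9],
    "mixed_video": [0.5, 0.5],
}
item_index = {
    item_id: linear_tower(feats, ITEM_WEIGHTS)
    for item_id, feats in item_features.items()
}


def retrieve(user_features: Vector, k: int) -> List[str]:
    """Online step: embed the query, then rank the stored item embeddings."""
    query = linear_tower(user_features, QUERY_WEIGHTS)
    ranked = sorted(item_index, key=lambda i: dot(query, item_index[i]), reverse=True)
    return ranked[:k]  # exhaustive scan; an ANN index would replace this


top = retrieve([1.0, 0.0, 0.2], k=2)
print(top)
```

Because the item tower never sees the query, `item_index` can be built ahead of time, and only the query tower runs per request.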
