Semantic Search: Serving & Trade-Offs

Explore the design of web-scale semantic search serving systems by understanding ANN index types, sharding and replication strategies, caching, and managing latency budgets. Learn to make trade-offs between recall, memory, and latency to meet strict production requirements while preparing clear, quantifiable interview answers.

We'll cover the following...

ANN index architecture for web scale
- HNSW and IVF-PQ
Sharding, replication, and caching
- Sharding strategies
- Replication and caching
Latency budget allocation
- Tuning re-ranking depth
L4, L5, and Staff+ answer comparison
Conclusion

With evaluation, versioning, and interleaving experiments in place, the challenge shifts to a harder problem: serving billions of vectors under strict latency constraints. This is where most interview candidates stumble. They can describe embeddings and loss functions, but when asked “How would you serve a semantic search system at web scale?”, the answer often lacks architectural depth.

A system like Google Search or Spotify must return results in under 200 ms end-to-end while searching over billions of embeddings. That constraint forces every architectural decision into a tight latency envelope. The serving pipeline that makes this possible has three stages: query understanding (encoding the raw query into a dense vector), ANN retrieval (finding approximate nearest neighbors across a massive index), and cross-encoder re-ranking (applying a more expensive model to refine the top candidates).

This lesson walks through ANN index architecture with HNSW and IVF-PQ, sharding and replication strategies that make billion-scale search feasible, caching at multiple levels, latency budget allocation across pipeline stages, and a leveled answer comparison showing how L4, L5, and Staff+ candidates should approach this design differently.

ANN index architecture for web scale

Brute-force nearest neighbor search computes the distance between a query vector and every vector in the corpus. At billion-scale with 768-dimensional embeddings, that means roughly $10^9$ dot products per query, which is computationally infeasible for real-time serving. Approximate Nearest Neighbor (ANN) indices solve this by trading a small amount of recall for orders-of-magnitude speedup.

Two index families dominate production systems, and understanding their trade-offs is essential for any system design interview.

HNSW and IVF-PQ

HNSW (Hierarchical Navigable Small World): This graph-based index constructs a multi-layer navigable graph where each node represents a vector. During search, the algorithm greedily traverses from coarse upper layers to fine lower layers, converging on nearest neighbors. HNSW achieves high recall, often 95% or above, but stores full float32 vectors in memory, resulting in a large RAM footprint. It works best when recall is paramount and the memory budget can accommodate the full corpus.
IVF-PQ (Inverted File with Product Quantization): This partition-based index first clusters vectors into Voronoi cellsRegions in vector space where every point is closer to a given cluster centroid than to any other centroid, forming a spatial partition of the embedding space.. ...

1.The Interview Framework and Communication

2.Problem Formulation and Requirements

3.Data Strategy: Collection, Pipelines, and Features

4.Model Design and Architecture Selection

5.Evaluation: Offline, Online, and Fairness

6.Serving, Deployment, and MLOps

7.Case Study: Video Recommendation System

8.Case Study: Social Feed Ranking System

9.Case Study: Ad Click-Through Rate Prediction System

Mock Interview

10.Case Study: Semantic Search Engine

11.Case Study: Content Moderation System

Mock Interview

12.Case Study: Object Detection System

Mock Interview

13.Case Study: Visual Search System

Mock Interview

14.Case Study: Fraud Detection System

Mock Interview

15.Case Study: RAG-Based Enterprise Knowledge Assistant

16.Case Study: LLM-Powered Code Generation Tool

Semantic Search: Serving & Trade-Offs

ANN index architecture for web scale

HNSW and IVF-PQ