Semantic Search: Serving & Trade-Offs
Explore the design of web-scale semantic search serving systems by understanding ANN index types, sharding and replication strategies, caching, and managing latency budgets. Learn to make trade-offs between recall, memory, and latency to meet strict production requirements while preparing clear, quantifiable interview answers.
With evaluation, versioning, and interleaving experiments in place, the challenge shifts to a harder problem: serving billions of vectors under strict latency constraints. This is where most interview candidates stumble. They can describe embeddings and loss functions, but when asked “How would you serve a semantic search system at web scale?”, the answer often lacks architectural depth.
A system like Google Search or Spotify must return results in under 200 ms end-to-end while searching over billions of embeddings. That constraint forces every architectural decision into a tight latency envelope. The serving pipeline that makes this possible has three stages: query understanding (encoding the raw query into a dense vector), ANN retrieval (finding approximate nearest neighbors across a massive index), and cross-encoder re-ranking (applying a more expensive model to refine the top candidates).
This lesson walks through ANN index architecture with HNSW and IVF-PQ, sharding and replication strategies that make billion-scale search feasible, caching at multiple levels, latency budget allocation across pipeline stages, and a leveled answer comparison showing how L4, L5, and Staff+ candidates should approach this design differently.
ANN index architecture for web scale
Brute-force nearest neighbor search computes the distance between a query vector and every vector in the corpus. At billion-scale with 768-dimensional embeddings, that means roughly
Two index families dominate production systems, and understanding their trade-offs is essential for any system design interview.
HNSW and IVF-PQ
HNSW (Hierarchical Navigable Small World): This graph-based index constructs a multi-layer navigable graph where each node represents a vector. During search, the algorithm greedily traverses from coarse upper layers to fine lower layers, converging on nearest neighbors. HNSW achieves high recall, often 95% or above, but stores full float32 vectors in memory, resulting in a large RAM footprint. It works best when recall is paramount and the memory budget can accommodate the full corpus.
IVF-PQ (Inverted File with Product Quantization): This partition-based index first clusters vectors into
. ...Voronoi cells Regions in vector space where every point is closer to a given cluster centroid than to any other centroid, forming a spatial partition of the embedding space.