Search⌘ K
AI Features

Semantic Search: Serving & Trade-Offs

Explore the design of web-scale semantic search serving systems by understanding ANN index types, sharding and replication strategies, caching, and managing latency budgets. Learn to make trade-offs between recall, memory, and latency to meet strict production requirements while preparing clear, quantifiable interview answers.

With evaluation, versioning, and interleaving experiments in place, the challenge shifts to a harder problem: serving billions of vectors under strict latency constraints. This is where most interview candidates stumble. They can describe embeddings and loss functions, but when asked “How would you serve a semantic search system at web scale?”, the answer often lacks architectural depth.

A system like Google Search or Spotify must return results in under 200 ms end-to-end while searching over billions of embeddings. That constraint forces every architectural decision into a tight latency envelope. The serving pipeline that makes this possible has three stages: query understanding (encoding the raw query into a dense vector), ANN retrieval (finding approximate nearest neighbors across a massive index), and cross-encoder re-ranking (applying a more expensive model to refine the top candidates).

This lesson walks through ANN index architecture with HNSW and IVF-PQ, sharding and replication strategies that make billion-scale search feasible, caching at multiple levels, latency budget allocation across pipeline stages, and a leveled answer comparison showing how L4, L5, and Staff+ candidates should approach this design differently.

ANN index architecture for web scale

Brute-force nearest neighbor search computes the distance between a query vector and every vector in the corpus. At billion-scale with 768-dimensional embeddings, that means roughly 10910^9 dot products per query, which is computationally infeasible for real-time serving. Approximate Nearest Neighbor (ANN) indices solve this by trading a small amount of recall for orders-of-magnitude speedup.

Two index families dominate production systems, and understanding their trade-offs is essential for any system design interview.

HNSW and IVF-PQ

  • HNSW (Hierarchical Navigable Small World): This graph-based index constructs a multi-layer navigable graph where each node represents a vector. During search, the algorithm greedily traverses from coarse upper layers to fine lower layers, converging on nearest neighbors. HNSW achieves high recall, often 95% or above, but stores full float32 vectors in memory, resulting in a large RAM footprint. It works best when recall is paramount and the memory budget can accommodate the full corpus.

  • IVF-PQ (Inverted File with Product Quantization): This partition-based index first clusters vectors into Voronoi cellsRegions in vector space where every point is closer to a given cluster centroid than to any other centroid, forming a spatial partition of the embedding space.. ...