Semantic Search: Evaluation & Indexing

Explore how to evaluate semantic search quality using offline metrics like MRR and NDCG, monitor online performance with user behavior signals, and manage embedding versioning with zero-downtime reindexing. Understand safe update pipelines and interleaving experiments to maintain and improve search relevance in production environments.

With the retrieval architecture in place (dual encoder, hybrid fusion, and cross-encoder re-ranker), interviewers typically pivot to two follow-up questions: How do you know the system is working? And how do you safely update it? These questions separate candidates who can sketch an architecture from those who can operate one.

The stakes are concrete. Consider embedding drift, where a new model version maps the same text to a different region of the vector space, making existing document vectors incompatible with new query vectors. This failure mode is especially dangerous because it produces plausible-looking results rather than obvious errors: no HTTP 500s, no empty pages, just silently degraded relevance for millions of queries. Imagine updating the embedding model behind Airbnb’s listing search; a naive swap could corrupt results for every active session without triggering a single traditional alert.

This lesson covers three pillars that address these risks: offline and online evaluation metrics, embedding versioning with zero-downtime index rebuilds, and interleaving as the gold-standard online experiment method for search relevance changes.

Offline evaluation metrics

Measuring search quality starts with static, labeled datasets evaluated before any model reaches production. Two ranking metrics dominate this space, and understanding when to use each is a frequent interview differentiator.

MRR and NDCG

MRR (Mean Reciprocal Rank): If a user issues a navigational query like “Stripe API docs” and the first relevant result appears at position 3, that query contributes $\frac{1}{3}$ ...
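As a minimal sketch of the reciprocal-rank arithmetic in the example above, the Python below computes MRR over a small labeled set. The function names and the boolean-list input format are illustrative assumptions, not part of any particular library, and scoring a query with no relevant result as 0 is a common convention assumed here.

```python
from typing import Sequence


def reciprocal_rank(relevance: Sequence[bool]) -> float:
    """Return 1/rank of the first relevant result (1-indexed).

    `relevance[i]` is True when the result at position i + 1 is relevant.
    Assumption: a query with no relevant result contributes 0.0.
    """
    for position, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(queries: Sequence[Sequence[bool]]) -> float:
    """Average the reciprocal rank over all labeled queries."""
    if not queries:
        return 0.0
    return sum(reciprocal_rank(q) for q in queries) / len(queries)


# Hypothetical labels: first relevant result at position 3 for the first
# query, and at position 1 for the second.
labeled_queries = [
    [False, False, True],
    [True, False, False],
]
mrr = mean_reciprocal_rank(labeled_queries)
```

Because only the first relevant position counts, everything ranked after it has no effect on the score, which is why the metric fits queries like the navigational example with a single target result.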