Semantic Search: Evaluation & Indexing

Explore how to evaluate semantic search quality using offline metrics like MRR and NDCG, monitor online performance with user behavior signals, and manage embedding versioning with zero-downtime reindexing. Understand safe update pipelines and interleaving experiments to maintain and improve search relevance in production environments.

We'll cover the following...

Offline evaluation metrics
- MRR and NDCG
Online metrics and observability
- Canary and satisfaction signals
Embedding versioning and index rebuild
- Zero-downtime rebuild pipeline
Interleaving for search relevance testing
- How interleaving works
Bridging to serving and trade-offs

With the retrieval architecture in place (dual encoder, hybrid fusion, and cross-encoder re-ranker), interviewers typically pivot to two follow-up questions: how do you know the system is working, and how do you safely update it? These questions separate candidates who can sketch an architecture from those who can operate one.

The stakes are concrete. Consider embedding drift, where a new model version maps the same text to a different region of the vector space, making existing document vectors incompatible with new query vectors. This failure mode is especially dangerous because it produces plausible-looking results rather than obvious errors: no HTTP 500s, no empty pages, just silently degraded relevance for millions of queries. Imagine updating the embedding model behind Airbnb’s listing search; a naive swap could corrupt results for every active session without triggering a single traditional alert.

This lesson covers three pillars that address these risks: offline and online evaluation metrics, embedding versioning with zero-downtime index rebuilds, and interleaving as the gold-standard online experiment method for search relevance changes.
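To make the embedding drift failure mode concrete, here is a toy sketch in plain Python. It assumes a hypothetical two-dimensional embedding space and models the "new" model version as the old one rotated by 90 degrees. The listing names, vectors, and the rotation itself are illustrative assumptions, not how any real encoder behaves. The point is only that mixing a new-version query vector with an old-version index returns a confident but wrong top hit, with no error raised.

```python
import math

def cosine(a, b):
    """Cosine similarity between two 2-D vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def rotate(v, degrees):
    """Rotate a 2-D vector; stands in for 'the new model moved the space'."""
    t = math.radians(degrees)
    x, y = v
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t))

def top_hit(query, index):
    """Return the name of the document most similar to the query."""
    return max(index, key=lambda name: cosine(query, index[name]))

# Hypothetical v1 embeddings (illustrative values only).
docs_v1 = {"beach villa": (1.0, 0.1), "city loft": (0.1, 1.0)}
query_v1 = (0.9, 0.2)  # e.g. "house by the sea" encoded by model v1

# Assumption: model v2 produces the same geometry, rotated by 90 degrees.
query_v2 = rotate(query_v1, 90)
docs_v2 = {name: rotate(vec, 90) for name, vec in docs_v1.items()}

matched_v1 = top_hit(query_v1, docs_v1)  # query and index both on v1
mixed = top_hit(query_v2, docs_v1)       # v2 query against a stale v1 index
matched_v2 = top_hit(query_v2, docs_v2)  # query and index both on v2

print(matched_v1, "|", mixed, "|", matched_v2)
```

Both matched configurations return the same listing, while the mixed one silently returns a different listing. That is why the query encoder and the document index have to be versioned and switched together.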

Offline evaluation metrics

Measuring search quality starts with static, labeled datasets evaluated before any model reaches production. Two ranking metrics dominate this space, and understanding when to use each is a frequent interview differentiator.

MRR and NDCG

MRR (Mean Reciprocal Rank): If a user issues a navigational query like “Stripe API docs” and the first relevant result appears at position 3, that query contributes $\frac{1}{3}$ ...
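As a minimal sketch of the MRR calculation, the snippet below scores each query by the reciprocal of the position of its first relevant result and averages across queries. The function names, document IDs, and the convention of scoring a query with no relevant result as 0 are assumptions made for illustration.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/position of the first relevant result, or 0.0 if none appears."""
    for position, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)

# Hypothetical labeled queries: (system ranking, set of relevant doc IDs).
runs = [
    (["pricing", "blog", "api-docs"], {"api-docs"}),  # first hit at position 3
    (["login", "home"], {"login"}),                   # first hit at position 1
    (["careers", "press"], {"status-page"}),          # no relevant result
]

per_query = [reciprocal_rank(ranked, rel) for ranked, rel in runs]
mrr = mean_reciprocal_rank(runs)
print(per_query, mrr)
```

The first query mirrors the navigational example: its relevant document sits at position 3, so it contributes 1/3 to the average.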
