Visual Search: Evaluation & Trade-Offs
Explore how to evaluate visual search systems using offline metrics such as Precision@K, Recall@K, NDCG, and MRR to assess retrieval quality. Understand online business impact metrics like click-through rate and purchase lift. Learn to design rigorous A/B tests, manage the trade-off between relevance and diversity, and tailor evaluation discussions to different interview levels.
Your serving architecture is live. ANN indexes are sharded, blue-green rebuilds keep embeddings fresh, and NSFW filtering guards the user experience. But none of that matters if you cannot answer one question every interviewer will ask: How do you know your visual search system is actually working?
Without rigorous evaluation, every design decision you made in previous lessons remains an unvalidated assumption. A common production failure mode illustrates why this matters: a team improves offline recall by 8%, ships the change, and watches click-through rate drop. The model retrieved more relevant items, but it killed result diversity. Users saw ten nearly identical black dresses instead of a varied set. Offline metrics said “better.” Users said “worse.”
This lesson covers the two evaluation planes every interviewer expects you to address: offline retrieval quality metrics and online business impact metrics. You will learn Precision@K, Recall@K, NDCG, MRR, user engagement metrics, A/B testing design, and how to calibrate your answer depth to L4, L5, or Staff+ expectations.
Offline retrieval quality metrics
Offline evaluation measures how well the retrieval system performs against a fixed, human-labeled dataset before any user ever sees the results. Four metrics form the standard toolkit.
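To make "a fixed, human-labeled dataset" concrete, here is a minimal sketch of an offline evaluation loop in Python. It is an illustration under stated assumptions, not a production harness: the function names (`precision_at_k`, `recall_at_k`, `evaluate_offline`), the item IDs, and the figure of 14 labeled relevant chairs are all hypothetical. It uses the standard definitions of the first two metrics: Precision@K is the share of the top K results that are labeled relevant, and Recall@K is the share of all labeled relevant items that appear in the top K.

```python
from typing import Dict, List, Set


def precision_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k retrieved items that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / k


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of all labeled relevant items that appear in the top k."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        return 0.0  # assumption: queries with no labeled positives score 0
    hits = len(set(ranked[:k]) & relevant)
    return hits / len(relevant)


def evaluate_offline(
    labels: Dict[str, Set[str]],
    results: Dict[str, List[str]],
    k: int,
) -> Dict[str, float]:
    """Average both metrics over every labeled query."""
    queries = list(labels)
    p = sum(precision_at_k(results[q], labels[q], k) for q in queries)
    r = sum(recall_at_k(results[q], labels[q], k) for q in queries)
    return {"precision@k": p / len(queries), "recall@k": r / len(queries)}


# Hypothetical labeled set: one query image with 14 relevant catalog items.
labels = {"chair_query": {f"chair_{i}" for i in range(14)}}

# Hypothetical system output: 10 ranked results, 7 of them relevant.
results = {
    "chair_query": [
        "chair_0", "chair_1", "sofa_9", "chair_2", "chair_3",
        "lamp_4", "chair_4", "chair_5", "table_2", "chair_6",
    ]
}

scores = evaluate_offline(labels, results, k=10)
print(scores)
```

Because the labels and the ranked lists are both fixed, the same script can be rerun against every candidate model, which is what makes offline comparison cheap relative to a live experiment.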
Precision@K and Recall@K
Precision@K captures the fraction of the top-K retrieved results that are relevant to the query image. If a user uploads a photo of a mid-century modern chair and the system returns 10 results, 7 of which are relevant chairs, then