Visual Search: Problem Framing & Requirements
Explore how to frame visual search system design problems accurately by distinguishing between image-to-image and image-to-product search. Understand key business metrics such as click-through rate, purchase conversion, and zero-result rate, and how they influence system design. Learn to manage billion-scale indexing within strict latency budgets and plan for operational risks like embedding drift. This lesson guides you to make precise scoping decisions essential for building scalable, efficient visual search systems.
Every time you open Pinterest and snap a photo of a lamp you like, or point Google Lens at a pair of sneakers on the street, a visual search system converts your photo into a mathematical representation, scans billions of indexed images, and returns relevant results, all before you finish blinking. This pipeline, spanning embedding generation, approximate nearest neighbor retrieval, multi-modal ranking, and strict latency enforcement, is exactly why interviewers at MAANG companies reach for visual search as a system design prompt. It tests breadth and depth simultaneously.
The core question you will face sounds deceptively simple: “Design a system where a user uploads a photo and receives visually similar or shoppable results in under 200 ms across a billion-image index.” Answering it well requires precise problem framing before any architecture diagram appears on the whiteboard.
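To see why the billion-image figure in that prompt matters, a back-of-envelope estimate helps. The sketch below is illustrative arithmetic only: the image count comes from the prompt, while the 256-dimensional float32 embedding is an assumption chosen for illustration, not a value the prompt specifies.

```python
# Back-of-envelope cost of exhaustively scanning a billion-image index.
NUM_IMAGES = 1_000_000_000   # from the prompt: a billion-image index
EMBED_DIM = 256              # ASSUMPTION: embedding width, illustrative only
BYTES_PER_VALUE = 4          # ASSUMPTION: uncompressed float32 storage


def index_bytes(num_vectors: int, dim: int, bytes_per_value: int) -> int:
    """Raw storage for the embeddings alone, with no index overhead."""
    return num_vectors * dim * bytes_per_value


def brute_force_flops(num_vectors: int, dim: int) -> int:
    """One dot product per stored vector: dim multiplies plus dim adds."""
    return 2 * num_vectors * dim


raw_bytes = index_bytes(NUM_IMAGES, EMBED_DIM, BYTES_PER_VALUE)
flops_per_query = brute_force_flops(NUM_IMAGES, EMBED_DIM)

print(f"Raw embeddings: {raw_bytes / 1e12:.3f} TB")
print(f"Exhaustive scan: {flops_per_query:.2e} FLOPs per query")
```

Under these assumptions the uncompressed vectors alone occupy about a terabyte, and every query touches all of them. Numbers of this size are what push a design toward compression and approximate retrieval rather than a linear scan.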
This lesson walks through two distinct problem formulations, the business metrics that guide both offline and online evaluation, the scale and latency constraints that eliminate naive solutions, and a leveling comparison that reveals how scoping depth separates an L4 answer from a Staff+ answer. These framing decisions cascade into every downstream choice you will make in subsequent lessons.
Two problem formulations
A visual search query always starts with an image, but what the system returns, and how it is judged, depends entirely on which problem you are solving. Conflating the two formulations is one of the most common mistakes candidates make, and it leads to architectures that look reasonable on the surface but fail the business objective.
Image-to-image search
Pinterest Lens in discovery mode and Google Lens in explore mode both implement image-to-image search. The system retrieves images that are perceptually similar to the query. Relevance is judged by visual coherence: does the result share color palette, texture, composition, or scene structure with the query? The embedding model is trained with a contrastive visual loss that pulls visually similar pairs together in embedding space and pushes dissimilar pairs apart. The index corpus consists of web-crawled images spanning every visual category.
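The pull-together, push-apart behavior of a contrastive loss can be sketched with the classic pairwise margin formulation. This is one common form, shown here as an assumption for illustration; the lesson does not say which loss Pinterest or Google actually use. The function names, the margin value, and the toy two-dimensional embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so distances reflect direction only."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(pairs, margin=1.0):
    """Mean pairwise contrastive loss.

    pairs: iterable of (embedding_a, embedding_b, is_similar).
    Similar pairs are penalized by their squared distance (pull together).
    Dissimilar pairs are penalized only when closer than `margin` (push apart).
    """
    pairs = list(pairs)
    total = 0.0
    for emb_a, emb_b, is_similar in pairs:
        dist = euclidean(l2_normalize(emb_a), l2_normalize(emb_b))
        if is_similar:
            total += dist ** 2
        else:
            total += max(0.0, margin - dist) ** 2
    return total / len(pairs)


# Toy embeddings: the similar pair is aligned, the dissimilar pair is orthogonal.
well_separated = [
    ([1.0, 0.0], [2.0, 0.0], True),
    ([1.0, 0.0], [0.0, 3.0], False),
]
# Same vectors with the labels swapped, so the geometry contradicts the labels.
mislabeled = [
    ([1.0, 0.0], [0.0, 3.0], True),
    ([1.0, 0.0], [2.0, 0.0], False),
]

loss_good = contrastive_loss(well_separated)
loss_bad = contrastive_loss(mislabeled)
print(loss_good, loss_bad)
```

The loss is zero when similar pairs coincide and dissimilar pairs sit farther apart than the margin, and it grows when the embedding geometry disagrees with the labels. Training minimizes this quantity over many labeled pairs, which is what shapes the embedding space the index later searches.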
Image-to-product search
Amazon StyleSnap and Google Lens in shopping mode implement image-to-product search. The system retrieves purchasable catalog items that match the object depicted in the query photo. Relevance is judged by whether the user can ...