Visual Search: Data Strategy & Embedding Generation
Explore how to design the data and embedding pipeline for large-scale visual search systems. Understand trade-offs in embedding architecture choices, multi-modal re-ranking, deduplication, and upstream content filtering. Gain skills to align system design with latency, resource, and quality constraints in ML interviews.
With the problem formulation locked down and the latency budget decomposed in the previous lesson, every downstream decision in your visual search system now hinges on two questions: what data enters the pipeline, and how that data is transformed into embeddings. Picture a concrete interview scenario where you are asked to design the data and embedding layer for a Pinterest Lens–style system indexing billions of images. The quality of your embedding space determines retrieval recall, and the cleanliness of your data pipeline determines whether that recall is trustworthy.
This lesson covers four interconnected design decisions. You will select an embedding architecture by reasoning through trade-offs rather than picking a “best” model. You will design multi-modal signals that combine visual and text metadata for re-ranking. You will build a deduplication stage that protects result diversity at billion scale. And you will place NSFW and policy filtering upstream in the pipeline as a safety gate, not a downstream patch.
At L5+ interviews, candidates are expected to justify each pipeline stage with trade-off reasoning. Simply listing components is not enough.
Embedding architecture comparison
Choosing an embedding architecture is the single highest-leverage decision in the data layer. The architecture determines the structure of your vector space, which in turn controls what your ANN index can and cannot retrieve. Three dominant options exist for visual search, and each carries a distinct trade-off profile.
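Before comparing the options, it may help to see why the vector space matters so much. The sketch below is a toy illustration, not a production design: it does exact cosine-similarity search over a handful of hand-written 3-dimensional vectors, where a real system would use an ANN index over model-generated embeddings. The function names (l2_normalize, top_k_cosine) and the toy vectors are assumptions introduced here for illustration only.

```python
import math
from typing import List, Sequence, Tuple


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        # Degenerate all-zero vector: return zeros rather than dividing by zero.
        return [0.0] * len(vec)
    return [x / norm for x in vec]


def top_k_cosine(
    query: Sequence[float],
    index: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Exact (brute-force) top-k search by cosine similarity.

    An ANN index approximates this ranking; it cannot surface neighbors
    that the embedding space itself does not place close to the query.
    """
    q = l2_normalize(query)
    scored = []
    for item_id, vec in enumerate(index):
        unit = l2_normalize(vec)
        scored.append((item_id, sum(a * b for a, b in zip(q, unit))))
    # Highest similarity first.
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings": items 0 and 1 point in nearly the same
# direction, items 2 and 3 are orthogonal to item 0.
toy_index = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
toy_query = [1.0, 0.05, 0.0]

results = top_k_cosine(toy_query, toy_index, k=2)
for item_id, score in results:
    print(item_id, round(score, 4))
```

The retrieval step only ever sees geometry: whichever items the encoder places near the query are the ones that come back. That is why the choice of encoder, examined next, bounds what any index built on top of it can retrieve.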
ResNet (CNN-based)
ResNet leverages convolutional layers that encode strong
The limitation is generalization. ResNet embeddings trained on one visual domain (say, fashion) struggle when the query distribution shifts to home décor or food. There is also no native text understanding, so cross-modal retrieval requires a separate text encoder and a learned alignment ...