Visual Search: Data Strategy & Embedding Generation

Explore how to design the data and embedding pipeline for large-scale visual search systems. Understand trade-offs in embedding architecture choices, multi-modal re-ranking, deduplication, and upstream content filtering. Gain skills to align system design with latency, resource, and quality constraints in ML interviews.

We'll cover the following...

Embedding architecture comparison
Multi-modal signals for re-ranking
- Designing the two-stage pipeline
Duplicate and near-duplicate handling
- Building the deduplication pipeline
NSFW and policy filtering as upstream gate
Preparing for model architecture design

With the problem formulation locked down and the latency budget decomposed from the previous lesson, every downstream decision in your visual search system now hinges on two questions: what data enters the pipeline, and how that data is transformed into embeddings. Picture a concrete interview scenario where you are asked to design the data and embedding layer for a Pinterest Lens–style system indexing billions of images. The quality of your embedding space determines retrieval recall, and the cleanliness of your data pipeline determines whether that recall is trustworthy.

This lesson covers four interconnected design decisions. You will select an embedding architecture by reasoning through trade-offs rather than picking a “best” model. You will design multi-modal signals that combine visual and text metadata for re-ranking. You will build a deduplication stage that protects result diversity at billion scale. And you will place NSFW and policy filtering upstream in the pipeline as a safety gate, not a downstream patch.
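The ordering of these stages can be sketched as follows. This is a minimal illustration under stated assumptions, not the lesson's actual pipeline: the `Image` fields, the exact-hash deduplication (a stand-in for real near-duplicate detection), and the placeholder `embed` function are all hypothetical. The point it shows is that the policy gate runs first, so flagged items are never embedded or indexed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Image:
    image_id: str
    content_hash: str      # hypothetical stand-in for a perceptual hash
    policy_flagged: bool   # hypothetical output of an upstream NSFW/policy classifier


def policy_gate(images):
    """Upstream safety gate: drop flagged items before any further work."""
    return [img for img in images if not img.policy_flagged]


def deduplicate(images):
    """Keep the first image per hash (exact match only, for illustration)."""
    seen, kept = set(), []
    for img in images:
        if img.content_hash not in seen:
            seen.add(img.content_hash)
            kept.append(img)
    return kept


def embed(img):
    """Placeholder embedding; a real system would call a vision model here."""
    return [float(len(img.image_id))]


def build_index(images):
    """Gate first, then deduplicate, then embed only what survives."""
    safe = policy_gate(images)
    unique = deduplicate(safe)
    return {img.image_id: embed(img) for img in unique}


images = [
    Image("a", "h1", False),
    Image("b", "h1", False),   # duplicate of "a"
    Image("c", "h2", True),    # policy-flagged
    Image("d", "h3", False),
]
index = build_index(images)
```

Because the gate sits in front of `embed`, a flagged image costs no embedding compute and cannot surface from the index later.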

In L5+ interviews, candidates are expected to justify each pipeline stage with trade-off reasoning. Simply listing components is not enough.

Embedding architecture comparison

Choosing an embedding architecture is the single highest-leverage decision in the data layer. The architecture determines the structure of your vector space, which in turn controls what your ANN index can and cannot retrieve. Three dominant options exist for visual search, and each carries a distinct trade-off profile.
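To make the link between the vector space and retrieval concrete, here is a minimal sketch of exact cosine-similarity search, which is the ranking an ANN index approximates. The three-dimensional vectors and item names are invented for illustration; real embeddings have hundreds or thousands of dimensions and come from a trained encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query, catalog, k):
    """Exact (brute-force) cosine top-k over a dict of item_id -> embedding."""
    q = l2_normalize(query)
    scored = []
    for item_id, vec in catalog.items():
        v = l2_normalize(vec)
        scored.append((sum(a * b for a, b in zip(q, v)), item_id))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Hypothetical toy embeddings, not produced by any real model.
catalog = {
    "red_dress": [0.9, 0.1, 0.0],
    "red_skirt": [0.8, 0.3, 0.1],
    "oak_table": [0.0, 0.2, 0.9],
}
result = top_k([1.0, 0.2, 0.0], catalog, k=2)
print(result)
```

If the encoder places visually similar items far apart, no index on top of it can recover them, since retrieval only ever sees these distances.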

ResNet (CNN-based)

ResNet leverages convolutional layers that encode strong inductive biases (built-in assumptions a model architecture makes about the structure of data, such as translation invariance in CNNs, which reduce the amount of training data needed to learn useful representations). Weight sharing across spatial positions, combined with pooling, gives the network approximate translation invariance: it recognizes an object largely regardless of where it appears in the frame. Fine-tuning cost is low because pretrained ResNet checkpoints are widely available and converge quickly on domain-specific catalogs. Inference latency sits around 10 ms on a single GPU, making it attractive for latency-constrained mobile inference paths.
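The inductive bias can be seen in a toy one-dimensional example. This is a simplified sketch, not ResNet itself: the kernel values and signals are made up, and a single filter with global max pooling stands in for a full convolutional stack. Strictly speaking, the convolution is shift-equivariant (the response moves with the pattern), and it is the pooling step that makes the final feature invariant to position.

```python
def conv1d(signal, kernel):
    """Valid cross-correlation: slide the same kernel over every position."""
    n = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(n)
    ]


def global_max_pool(feature_map):
    """Collapse the feature map to its strongest response, discarding position."""
    return max(feature_map)


kernel = [1.0, 2.0, 1.0]       # hypothetical "learned" filter
pattern = [1.0, 2.0, 1.0]      # the object the filter responds to

left = pattern + [0.0] * 5     # pattern at the start of the signal
right = [0.0] * 5 + pattern    # same pattern shifted five positions

fm_left = conv1d(left, kernel)
fm_right = conv1d(right, kernel)

# The feature maps differ (the peak moves), but the pooled features match.
print(fm_left, global_max_pool(fm_left))
print(fm_right, global_max_pool(fm_right))
```

Because the same weights are reused at every position, the model does not need separate training examples for each location of an object, which is part of why fine-tuning converges quickly.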

The limitation is generalization. ResNet embeddings trained on one visual domain (say, fashion) struggle when the query distribution shifts to home décor or food. There is also no native text understanding, so cross-modal retrieval requires a separate text encoder and a learned alignment ...
