Visual Search: Model Architecture
Explore the design of visual search systems focusing on configuring CLIP encoders for retrieval, creating cross-modal re-ranking functions that combine visual and text relevance, and implementing attribute extraction heads for faceted filtering. Understand trade-offs in zero-shot versus fine-tuning strategies, latency-aware design, and scalable production deployment. This lesson equips you with practical architectural decisions for building robust, efficient visual search applications.
With the data pipeline, embedding architecture, deduplication, and NSFW filtering already designed, the next interview question lands squarely on model architecture. The interviewer wants to know how you move from raw CLIP embeddings to a production visual search system that retrieves, re-ranks, and filters results. Pinterest Lens uses CLIP-style encoders for visual search, and Google Lens layers attribute extraction on top of retrieval. Both systems face the same three architectural decisions this lesson resolves. First, how do you configure the CLIP encoder for retrieval, choosing between zero-shot and fine-tuned variants? Second, how do you design a cross-modal re-ranking scoring function that fuses visual similarity with text relevance? Third, how do you attach attribute extraction heads for faceted filtering without degrading retrieval quality? Interviewers at L5 and above expect candidates to reason about the
CLIP encoder configuration for retrieval
CLIP’s contrastive learning objective trains an image encoder and a text encoder simultaneously, pulling matching image-text pairs together in a shared latent space while pushing non-matching pairs apart. This shared space enables zero-shot retrieval, where a query image can be matched against text descriptions (or vice versa) without any task-specific training.
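To make the retrieval step concrete, here is a minimal sketch of nearest-neighbor lookup in the shared space using cosine similarity. The three-dimensional vectors and their captions are made-up stand-ins for real encoder outputs, and the helper names (l2_normalize, cosine_top_k) are hypothetical, not part of any CLIP library.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_top_k(query, candidates, k=2):
    """Return the k candidates most similar to the query as (index, score) pairs."""
    q = l2_normalize(query)
    scored = []
    for idx, cand in enumerate(candidates):
        c = l2_normalize(cand)
        score = sum(a * b for a, b in zip(q, c))
        scored.append((score, idx))
    # Highest cosine similarity first.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(idx, score) for score, idx in scored[:k]]


# Toy embeddings (assumed values, not real CLIP outputs).
image_query = [0.9, 0.1, 0.2]
text_embeddings = [
    [1.0, 0.0, 0.1],  # hypothetical caption: "red sneaker"
    [0.0, 1.0, 0.0],  # hypothetical caption: "wooden table"
    [0.7, 0.3, 0.3],  # hypothetical caption: "running shoe"
]

top = cosine_top_k(image_query, text_embeddings, k=2)
for idx, score in top:
    print(f"candidate {idx}: cosine similarity {score:.3f}")
```

In a production system the brute-force loop would typically be replaced by an approximate nearest-neighbor index over pre-normalized vectors, but the scoring rule stays the same.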
However, even in a well-trained CLIP model, visual and textual embeddings cluster into distinct regions of the shared space. Cosine similarity across modalities is systematically lower than within-modality similarity. This modality gap means that a query image's nearest neighbors among text embeddings may not be the most semantically relevant matches, and that raw cross-modal scores are not directly comparable to image-to-image scores.
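One way to check for this gap on your own embeddings is to compare average within-modality similarity against average cross-modality similarity, and to measure the distance between the two modality centroids. The sketch below does this on synthetic vectors that were chosen by hand to show the effect. The numbers are illustrative assumptions, not measurements from a real model, and the function names are hypothetical.

```python
import math
from itertools import combinations, product


def normalize(vec):
    """Return the unit-length version of vec."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))


def mean_within(embs):
    """Average cosine similarity over all distinct pairs in one modality."""
    pairs = list(combinations(embs, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def mean_cross(embs_a, embs_b):
    """Average cosine similarity over all pairs drawn from two modalities."""
    pairs = list(product(embs_a, embs_b))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def centroid_gap(embs_a, embs_b):
    """Euclidean distance between the centroids of the normalized embeddings."""
    def centroid(embs):
        unit = [normalize(e) for e in embs]
        return [sum(col) / len(unit) for col in zip(*unit)]
    ca, cb = centroid(embs_a), centroid(embs_b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


# Synthetic embeddings (assumed): images cluster near one axis, text near another.
image_embs = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.0, 0.0, 0.2]]
text_embs = [[0.1, 1.0, 0.0], [0.2, 0.9, 0.1], [0.0, 1.0, 0.2]]

within_image = mean_within(image_embs)
within_text = mean_within(text_embs)
cross = mean_cross(image_embs, text_embs)
gap = centroid_gap(image_embs, text_embs)

print(f"within image: {within_image:.3f}")
print(f"within text:  {within_text:.3f}")
print(f"cross-modal:  {cross:.3f}")
print(f"centroid gap: {gap:.3f}")
```

If cross-modal similarity sits well below both within-modality averages on real data, that is a signal that image-to-text scores should be calibrated or re-ranked rather than compared directly with image-to-image scores.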
Zero-shot vs. fine-tuned retrieval
The choice between zero-shot and fine-tuned CLIP determines both retrieval quality and engineering cost. Each configuration occupies a different point on the accuracy-latency-cost frontier.
Zero-shot CLIP uses the pretrained model directly, requiring no domain-specific training. Deployment is fast, but retrieval quality suffers on specialized catalogs such as fashion or ...