Visual Search: Model Architecture

Explore the design of visual search systems, focusing on three components: configuring CLIP encoders for retrieval, creating cross-modal re-ranking functions that combine visual and text relevance, and implementing attribute extraction heads for faceted filtering. Understand the trade-offs between zero-shot and fine-tuned strategies, latency-aware design, and scalable production deployment. This lesson equips you to make practical architectural decisions for building robust, efficient visual search applications.

We'll cover the following...

CLIP encoder configuration for retrieval
- Zero-shot vs. fine-tuned retrieval
Cross-modal re-ranking design
- Scoring function architecture
  - MLP fusion
  - Gradient-boosted tree fusion
Attribute extraction for faceted filtering
- Multi-task classification heads
- Precomputation strategy
Preparing for serving and indexing

With the data pipeline, embedding architecture, deduplication, and NSFW filtering already designed, the next interview question lands squarely on model architecture. The interviewer wants to know how you move from raw CLIP embeddings to a production visual search system that retrieves, re-ranks, and filters results. Pinterest Lens uses CLIP-style encoders for visual search, and Google Lens layers attribute extraction on top of retrieval. Both systems face the same three architectural decisions this lesson resolves. First, how do you configure the CLIP encoder for retrieval, choosing between zero-shot and fine-tuned variants? Second, how do you design a cross-modal re-ranking scoring function that fuses visual similarity with text relevance? Third, how do you attach attribute extraction heads for faceted filtering without degrading retrieval quality?

Interviewers at L5 and above expect candidates to reason about the modality gap: the systematic difference in cosine similarity between cross-modal pairs (image-to-text) and within-modality pairs (image-to-image), even in a well-trained dual-encoder model. They also probe for awareness of alignment drift during fine-tuning and latency budgets when defending these choices.

CLIP encoder configuration for retrieval

CLIP’s contrastive learning objective trains an image encoder and a text encoder simultaneously, pulling matching image-text pairs together in a shared latent space while pushing non-matching pairs apart. This shared space enables zero-shot retrieval, where a query image can be matched against text descriptions (or vice versa) without any task-specific training.
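The contrastive objective described above can be sketched as a symmetric cross-entropy over a batch similarity matrix. This is a minimal NumPy illustration, not CLIP's actual training code: the function names are hypothetical, the embeddings are assumed to come from the two encoders with row i of each matrix forming a matching pair, and the temperature of 0.07 is an assumed fixed value (in practice it is typically learned).

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N matching image-text pairs."""
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature  # (N, N); entry [i, j] compares image i with text j
    diag = np.arange(logits.shape[0])   # the matching pairs sit on the diagonal
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()  # each image picks its text
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()  # each text picks its image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch: perfectly aligned pairs versus the same pairs with texts shifted by one row.
aligned = np.eye(4)
loss_aligned = clip_contrastive_loss(aligned, aligned)
loss_misaligned = clip_contrastive_loss(aligned, np.roll(aligned, 1, axis=0))
```

Minimizing this loss raises the diagonal similarities (matching pairs) relative to every other entry in the same row and column (non-matching pairs), which is the pull-together, push-apart behavior the paragraph describes.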

However, even in a well-trained CLIP model, visual and textual embeddings cluster into distinct subspaces. Cosine similarity across modalities is systematically lower than within-modality similarity. This modality gap means that a query image’s nearest neighbors in the text embedding space may not reflect the most semantically relevant matches.
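One way to check for the modality gap on your own data is to compare average within-modality and cross-modal cosine similarities. The sketch below uses synthetic embeddings that are deliberately clustered around two different directions to mimic the effect; the function names, dimensions, and offsets are all illustrative assumptions, and real measurements would use embeddings produced by the image and text encoders.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def mean_offdiag(sim):
    """Average of a square similarity matrix, excluding self-similarity on the diagonal."""
    n = sim.shape[0]
    return float((sim.sum() - np.trace(sim)) / (n * (n - 1)))

def modality_gap_report(image_emb, text_emb):
    """Summarize within-modality vs. cross-modal cosine similarity."""
    img, txt = l2_normalize(image_emb), l2_normalize(text_emb)
    return {
        "image_image": mean_offdiag(img @ img.T),
        "text_text": mean_offdiag(txt @ txt.T),
        "image_text": float((img @ txt.T).mean()),
        # Distance between modality centroids, another common summary of the gap.
        "centroid_distance": float(np.linalg.norm(img.mean(axis=0) - txt.mean(axis=0))),
    }

# Synthetic stand-ins: each modality is noise around its own offset direction (assumed).
rng = np.random.default_rng(0)
dim, n = 64, 200
image_offset = np.zeros(dim)
image_offset[0] = 3.0
text_offset = np.zeros(dim)
text_offset[1] = 3.0
images = rng.normal(scale=0.3, size=(n, dim)) + image_offset
texts = rng.normal(scale=0.3, size=(n, dim)) + text_offset

report = modality_gap_report(images, texts)
for name, value in report.items():
    print(f"{name}: {value:.3f}")
```

If the cross-modal average sits well below both within-modality averages, raw cosine scores from image-to-image and image-to-text comparisons are not on the same scale, which matters when a re-ranker combines them.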

Zero-shot vs. fine-tuned retrieval

The choice between zero-shot and fine-tuned CLIP determines both retrieval quality and engineering cost. Each configuration occupies a different point on the accuracy-latency-cost frontier.

Zero-shot CLIP uses the pretrained model directly, requiring no domain-specific training. Deployment is fast, but retrieval quality suffers on specialized catalogs such as fashion or ...
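With a frozen pretrained encoder, zero-shot retrieval reduces to nearest-neighbor search over precomputed embeddings. The following is a minimal sketch under stated assumptions: the catalog and query vectors are made-up stand-ins for encoder outputs, `top_k_by_cosine` is a hypothetical helper, and the brute-force scan stands in for whatever approximate nearest-neighbor index a production system would use.

```python
import numpy as np

def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def top_k_by_cosine(query_emb, catalog_emb, k=3):
    """Return (catalog_index, cosine_score) for the k most similar catalog items."""
    query = l2_normalize(query_emb)
    catalog = l2_normalize(catalog_emb)
    scores = catalog @ query            # cosine similarity of every item to the query
    top = np.argsort(-scores)[:k]       # indices sorted by descending score
    return [(int(i), float(scores[i])) for i in top]

# Made-up 3-dimensional embeddings; real ones would come from the frozen encoder.
catalog = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.0, 0.0])

results = top_k_by_cosine(query, catalog, k=2)
print(results)
```

Because no weights change, the catalog embeddings can be computed once offline, and only the query needs to be encoded at request time.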
