Video Recommendation: Candidate Retrieval Architecture

Explore the candidate retrieval architecture in video recommendation systems using the two-tower model. Understand how decoupled user and video embeddings enable retrieval over a billion-scale video catalog within single-digit milliseconds. Learn training methods with contrastive loss, negative sampling strategies, and ANN indexing trade-offs, culminating in an efficient serving pipeline that delivers relevant candidate videos quickly.

When a user opens a video app, the system has less than ten milliseconds to sift through a catalog of over a billion videos and surface a few hundred worth scoring. Computing a relevance score for every single item by brute force is out of the question. The candidate retrieval stage exists to solve exactly this problem, acting as the first and widest funnel in the recommendation pipeline. It aggressively narrows the search space from the full catalog down to a few hundred candidates, all within that single-digit millisecond budget.

This lesson builds directly on the content embeddings and ANN index infrastructure established previously. Here, the focus shifts to the architecture that learns those embeddings jointly for users and videos. In ML system design interviews at companies like YouTube, TikTok, and Netflix, interviewers expect you to articulate why decoupled encoding through a two-tower model is the key enabler of real-time retrieval at scale. Three core topics structure this lesson: the two-tower model architecture, negative sampling strategies (in-batch and hard negatives) for contrastive training, and ANN indexing trade-offs that govern real-time serving.

Two-tower model architecture

The two-tower model (also called a dual encoder) is an architecture that uses two separate neural networks to independently encode queries and items into a shared embedding space, enabling efficient similarity-based retrieval. It is the backbone of modern candidate retrieval systems. Rather than scoring every user-video pair through a single monolithic network, it splits the problem into two independent encoding paths.
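
The practical payoff of two independent encoding paths is that video embeddings can be computed ahead of time, while each request only needs one user embedding and a similarity lookup. The following is a minimal sketch of that idea in plain Python, not a production implementation. The random vectors stand in for trained tower outputs, the exhaustive scan in `top_k` stands in for an ANN index, and the names `EMBED_DIM`, `l2_normalize`, `dot`, and `top_k` are illustrative assumptions rather than part of any library.

```python
import math
import random

EMBED_DIM = 8  # toy size; the lesson cites 128 to 256 dimensions in practice


def l2_normalize(vec):
    """Scale a vector to unit length so the dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Similarity score between a user embedding and a video embedding."""
    return sum(x * y for x, y in zip(a, b))


def top_k(user_embedding, video_embeddings, k):
    """Exhaustive scan over precomputed video embeddings.

    This stands in for the ANN index; a real system would not scan the catalog.
    """
    scored = [(dot(user_embedding, emb), vid) for vid, emb in video_embeddings.items()]
    scored.sort(reverse=True)
    return [vid for _, vid in scored[:k]]


rng = random.Random(0)

# Offline step: the video tower runs once per video and the results are stored.
# Random vectors are placeholders for real video tower outputs.
video_embeddings = {
    f"video_{i}": l2_normalize([rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)])
    for i in range(1000)
}

# Online step: the user tower runs once per request (again a placeholder vector).
user_embedding = l2_normalize([rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)])

candidates = top_k(user_embedding, video_embeddings, k=5)
print(candidates)
```

Because the score is a plain dot product between two independently produced vectors, nothing about a video needs to be recomputed when a new user request arrives. That property is what allows the exhaustive scan above to be swapped for an approximate nearest neighbor lookup.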

User tower and video tower

The user tower is a neural network that consumes user-specific features and outputs a fixed-dimensional embedding vector, typically between 128 and 256 dimensions. Its input features include the following:

  • Watch history embeddings: A sequence of embeddings representing recently watched videos, often aggregated through mean pooling or a lightweight attention layer.

  • Demographic features: Attributes such as age bucket, country, and language preference.

  • Contextual signals: Real-time context like device type, time of day, and day of week.
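
As a rough illustration of how these three feature groups could be combined into one fixed-size vector, the sketch below mean-pools a watch history, one-hot encodes demographic and device attributes, encodes the hour of day cyclically, and applies a single linear projection. Everything here is an assumption made for illustration: the vocabularies, the toy dimensions, the cyclic hour encoding, and the randomly initialized `weights`, which stand in for a trained (and typically deeper) user tower.

```python
import math
import random

HISTORY_DIM = 4  # toy size for each watched-video embedding
OUTPUT_DIM = 8   # toy size; the lesson cites 128 to 256 dimensions in practice

# Hypothetical vocabularies, for illustration only.
AGE_BUCKETS = ["18-24", "25-34", "35-44", "45+"]
COUNTRIES = ["US", "IN", "BR", "OTHER"]
DEVICES = ["mobile", "tv", "desktop"]

INPUT_DIM = HISTORY_DIM + len(AGE_BUCKETS) + len(COUNTRIES) + len(DEVICES) + 2


def mean_pool(vectors, dim):
    """Average a sequence of embeddings; return zeros for an empty history."""
    if not vectors:
        return [0.0] * dim
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def one_hot(value, vocabulary):
    """Encode a categorical value; unknown values map to all zeros."""
    return [1.0 if value == item else 0.0 for item in vocabulary]


def build_user_features(watch_history, age_bucket, country, device, hour_of_day):
    """Concatenate watch history, demographic, and contextual features."""
    angle = 2.0 * math.pi * hour_of_day / 24.0  # cyclic encoding of time of day
    return (
        mean_pool(watch_history, HISTORY_DIM)
        + one_hot(age_bucket, AGE_BUCKETS)
        + one_hot(country, COUNTRIES)
        + one_hot(device, DEVICES)
        + [math.sin(angle), math.cos(angle)]
    )


def project(features, weights):
    """Single linear layer: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


rng = random.Random(0)
# Random weights are a placeholder for parameters learned during training.
weights = [[rng.uniform(-0.1, 0.1) for _ in range(INPUT_DIM)] for _ in range(OUTPUT_DIM)]

watch_history = [[rng.gauss(0.0, 1.0) for _ in range(HISTORY_DIM)] for _ in range(3)]
features = build_user_features(watch_history, "25-34", "US", "mobile", hour_of_day=21)
user_embedding = project(features, weights)
print(len(features), len(user_embedding))
```

Mean pooling is used here because it is the simpler of the two aggregation options the list mentions; a lightweight attention layer would replace `mean_pool` with a weighted average whose weights are learned.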

The video tower is a separate neural network that processes video-specific features and produces an embedding of the same dimensionality. Its inputs include:

  • Content embeddings: Dense representations from pretrained vision or multimodal models that capture the semantic content of the video.

  • Metadata features: Categorical attributes such as ...