Video Recommendation: Candidate Retrieval Architecture
Explore the candidate retrieval architecture in video recommendation systems using the two-tower model. Understand how decoupled user and video embeddings enable retrieval over billions of videos within a few milliseconds. Learn training methods with contrastive loss, negative sampling strategies, and ANN indexing trade-offs, culminating in an efficient serving pipeline that delivers relevant candidate videos quickly.
When a user opens a video app, the system has less than ten milliseconds to sift through a catalog of over a billion videos and surface a few hundred worth scoring. Computing a relevance score for every item by brute force is out of the question. The candidate retrieval stage exists to solve exactly this problem, acting as the first and widest funnel in the recommendation pipeline. It narrows the search space from over a billion videos to a few hundred candidates, all within single-digit millisecond latency.
This lesson builds directly on the content embeddings and ANN index infrastructure established previously. Here, the focus shifts to the architecture that learns those embeddings jointly for users and videos. In ML system design interviews at companies like YouTube, TikTok, and Netflix, interviewers expect you to articulate why decoupled encoding through a two-tower model is the key enabler of real-time retrieval at scale. Three core topics structure this lesson: the two-tower model architecture, negative sampling strategies (in-batch and hard negatives) for contrastive training, and ANN indexing trade-offs that govern real-time serving.
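To make the decoupling argument concrete, here is a minimal sketch of what serving looks like once the two towers are separate: video embeddings are computed ahead of time, and each request only needs one user embedding plus a similarity search. Everything in it is an assumption for illustration: random vectors stand in for real tower outputs, the names (`retrieve_top_k`, `EMBED_DIM`, `NUM_VIDEOS`) are hypothetical, embeddings are L2-normalized so the dot product behaves like cosine similarity, and an exact brute-force scan over a toy catalog stands in for the ANN index a production system would use.

```python
import numpy as np

rng = np.random.default_rng(0)

EMBED_DIM = 128       # hypothetical; within the 128-256 range used in this lesson
NUM_VIDEOS = 10_000   # toy catalog size, far smaller than a real one


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot product equals cosine similarity."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


# Offline step (assumption): the video tower has already been run over the
# catalog. Random vectors stand in for its outputs here.
video_embeddings = l2_normalize(
    rng.standard_normal((NUM_VIDEOS, EMBED_DIM)).astype(np.float32)
)


def retrieve_top_k(user_embedding, video_embeddings, k=200):
    """Exact top-k by dot product; a stand-in for an ANN index lookup."""
    scores = video_embeddings @ user_embedding       # shape: (NUM_VIDEOS,)
    top_k = np.argpartition(-scores, k - 1)[:k]      # unordered k best
    top_k = top_k[np.argsort(-scores[top_k])]        # sort best-first
    return top_k, scores[top_k]


# Online step (assumption): the user tower runs once per request.
user_embedding = l2_normalize(rng.standard_normal(EMBED_DIM).astype(np.float32))
candidate_ids, candidate_scores = retrieve_top_k(user_embedding, video_embeddings)
```

The point of the sketch is the shape of the computation: no per-video neural network call happens at request time, only one vector-matrix product. At real catalog sizes even that product is too slow, which is why the scan is replaced by an approximate index.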
Two-tower model architecture
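The two-tower model consists of two separate neural networks, a user tower and a video tower, that each map their inputs to an embedding of the same dimensionality.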
User tower and video tower
The user tower is a neural network that consumes user-specific features and outputs a fixed-dimensional embedding vector, typically between 128 and 256 dimensions. Its input features include the following:
Watch history embeddings: A sequence of embeddings representing recently watched videos, often aggregated through mean pooling or a lightweight attention layer.
Demographic features: Attributes such as age bucket, country, and language preference.
Contextual signals: Real-time context like device type, time of day, and day of week.
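The feature flow described above can be sketched as a small forward pass. This is an illustrative NumPy sketch under stated assumptions, not a reference implementation: the vocabulary sizes, layer widths, two-layer MLP, ReLU activation, and final L2 normalization are all hypothetical choices, the weights are random rather than trained, and watch history is aggregated with mean pooling (the simpler of the two options mentioned).

```python
import numpy as np

rng = np.random.default_rng(0)

ITEM_DIM = 64     # size of each watched-video embedding (hypothetical)
CAT_DIM = 8       # embedding size per categorical feature (hypothetical)
HIDDEN_DIM = 256  # hidden layer width (hypothetical)
OUT_DIM = 128     # output embedding size, within the 128-256 range

# Hypothetical vocabulary sizes for demographic and contextual features.
VOCAB = {"age_bucket": 8, "country": 200, "language": 50,
         "device": 4, "hour": 24, "weekday": 7}


def init_params():
    """Random, untrained parameters: one embedding table per feature plus an MLP."""
    params = {name: rng.standard_normal((size, CAT_DIM)) * 0.1
              for name, size in VOCAB.items()}
    in_dim = ITEM_DIM + CAT_DIM * len(VOCAB)
    params["w1"] = rng.standard_normal((in_dim, HIDDEN_DIM)) * np.sqrt(2.0 / in_dim)
    params["b1"] = np.zeros(HIDDEN_DIM)
    params["w2"] = rng.standard_normal((HIDDEN_DIM, OUT_DIM)) * np.sqrt(2.0 / HIDDEN_DIM)
    params["b2"] = np.zeros(OUT_DIM)
    return params


def user_tower(params, watch_history, categorical):
    """Map user features to a unit-length embedding of size OUT_DIM."""
    # Mean-pool the sequence of watched-video embeddings.
    # Note: an empty history is not handled in this sketch.
    pooled = watch_history.mean(axis=0)
    # Look up one embedding per demographic or contextual feature.
    cat_vecs = [params[name][categorical[name]] for name in VOCAB]
    x = np.concatenate([pooled, *cat_vecs])
    h = np.maximum(x @ params["w1"] + params["b1"], 0.0)  # ReLU
    out = h @ params["w2"] + params["b2"]
    return out / max(np.linalg.norm(out), 1e-12)          # L2 normalize


params = init_params()
watch_history = rng.standard_normal((20, ITEM_DIM))  # 20 recent videos
categorical = {"age_bucket": 3, "country": 42, "language": 7,
               "device": 1, "hour": 21, "weekday": 5}
user_embedding = user_tower(params, watch_history, categorical)
```

Swapping mean pooling for a lightweight attention layer would change only the line that computes `pooled`; the rest of the tower stays the same.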
The video tower is a separate neural network that processes video-specific features and produces an embedding of the same dimensionality. Its inputs include:
Content embeddings: Dense representations from pretrained vision or multimodal models that capture the semantic content of the video.
Metadata features: Categorical attributes such as ...