Neural Architectures for Ranking and Retrieval

Understand and apply the dominant neural architectures for ranking and retrieval in machine learning systems. Learn how two-tower models efficiently retrieve candidates from a corpus of billions of items, while Wide & Deep and DCN architectures score and reorder the resulting shortlist. Gain insights into their deployment, training strategies, and trade-offs so that you can design scalable recommendation and search systems.

Logistic regression and gradient-boosted decision trees provide strong baselines for tabular ranking problems, but they hit a ceiling when a system must retrieve relevant items from a corpus of millions or billions of candidates. These classical models operate on hand-crafted features and cannot learn the dense, semantic representations needed to match users to items at massive scale. Large-scale recommendation and search systems at MAANG companies typically address this through a two-stage paradigm. A retrieval stage narrows billions of candidates down to hundreds using lightweight models, and a ranking stage applies more expressive models to score and reorder that shortlist. This lesson covers the three dominant neural architectures interviewers expect you to diagram and justify: two-tower models for retrieval, Wide & Deep for ranking, and Deep & Cross Network (DCN) for ranking.

Consider this interview prompt: “Design a video recommendation system that serves 2 billion users with sub-200 ms request latency.” The architectures in this lesson give you the core building blocks for that design. You need a retrieval model fast enough to retrieve candidates from a billion-scale video index within the latency budget and a ranking model expressive enough to score shortlisted candidates using CTR and other engagement or quality signals. The following sections explain each architecture and where it fits in the retrieval-to-ranking pipeline.

Two-tower models for candidate retrieval

The two-tower model, also called a dual encoder or bi-encoder, is the industry standard for the retrieval stage. The core idea is straightforward: train two separate neural networks, one for users and one for items, so that each produces a dense embedding vector of the same dimensionality.

Architecture and training

The user tower takes in user features such as watch history, demographics, and contextual signals like time of day, passes them through an embedding layer and two to three fully connected layers, and outputs a fixed-size user embedding vector. The item tower follows the same structure but ingests item features like item ID, title embedding, category, and popularity signals.
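The forward pass described above can be sketched in plain Python. This is a minimal illustration under several assumptions, not a production implementation: the names `Tower`, `init_layer`, and `dense` and all dimensions are hypothetical, the embedding-lookup layer is omitted (features are assumed to be already numeric), each tower has only two fully connected layers, and the weights are random rather than trained. The output is L2-normalized so that the dot product of a user vector and an item vector equals their cosine similarity.

```python
import math
import random


def init_layer(rng, in_dim, out_dim):
    """Return (weights, bias) for one fully connected layer (random, untrained)."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


def dense(x, weights, bias, relu):
    """Apply one fully connected layer, optionally followed by ReLU."""
    out = [sum(xi * wi for xi, wi in zip(x, row)) + b
           for row, b in zip(weights, bias)]
    return [max(0.0, v) for v in out] if relu else out


class Tower:
    """Hypothetical tower: features -> hidden layer -> fixed-size embedding."""

    def __init__(self, rng, in_dim, hidden_dim, embed_dim):
        self.hidden = init_layer(rng, in_dim, hidden_dim)
        self.out = init_layer(rng, hidden_dim, embed_dim)

    def embed(self, features):
        h = dense(features, *self.hidden, relu=True)
        z = dense(h, *self.out, relu=False)
        # L2-normalize so a dot product between towers is a cosine similarity.
        norm = math.sqrt(sum(v * v for v in z)) or 1.0
        return [v / norm for v in z]


rng = random.Random(0)
# The towers may take different input sizes but must share the embedding size.
user_tower = Tower(rng, in_dim=6, hidden_dim=8, embed_dim=4)
item_tower = Tower(rng, in_dim=5, hidden_dim=8, embed_dim=4)

user_vec = user_tower.embed([0.2, 0.0, 1.0, 0.5, 0.0, 0.3])  # made-up user features
item_vec = item_tower.embed([1.0, 0.0, 0.7, 0.1, 0.4])       # made-up item features
score = sum(u * v for u, v in zip(user_vec, item_vec))
print(len(user_vec), len(item_vec), round(score, 3))
```

The point of the design is that the two towers interact only through the final dot product, so the user and item embeddings can be computed independently of each other.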

During training, the model maximizes the similarity (dot product or cosine) between positive user-item pairs while pushing negative pairs apart. The loss is typically a sampled softmax, a training loss that approximates the full softmax over all items by sampling a subset of negatives, which keeps training tractable when the item corpus contains millions of entries. A common way to obtain those negatives is in-batch sampling, where the other items in the same mini-batch serve as negative examples.
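To make the in-batch negatives idea concrete, here is a small sketch of a softmax cross-entropy loss over a mini-batch, written in plain Python. It assumes that row i of the user batch and row i of the item batch form a positive pair, so every other item in the batch acts as a negative for user i. The function name `in_batch_softmax_loss` and the `temperature` parameter are illustrative choices, and corrections that production systems often apply (for example, adjusting for item popularity) are left out.

```python
import math


def in_batch_softmax_loss(user_embs, item_embs, temperature=1.0):
    """Mean cross-entropy where item i is the positive for user i and
    all other items in the batch serve as negatives."""
    losses = []
    for i, user in enumerate(user_embs):
        # Dot-product score of this user against every item in the batch.
        logits = [sum(a * b for a, b in zip(user, item)) / temperature
                  for item in item_embs]
        # Numerically stable log-sum-exp over the batch.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


users = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
aligned_items = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # item i matches user i
shuffled_items = [[0.0, 1.0], [-1.0, 0.0], [1.0, 0.0]]  # positives misaligned

loss_aligned = in_batch_softmax_loss(users, aligned_items)
loss_shuffled = in_batch_softmax_loss(users, shuffled_items)
print(loss_aligned < loss_shuffled)
```

The loss is lower when each user's embedding points toward its own positive item and away from the other items in the batch, which is the behavior training is meant to encourage.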

Attention: The two towers typically share no parameters, so the training pairs are the only signal that aligns their embedding spaces. If the model is trained with only random negatives, which are easy to tell apart from positives, the embedding space degrades, and retrieval recall drops significantly. Hard negative mining, which uses items that are similar but not relevant, is essential for production-quality two-tower models.
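One simple way to picture hard negative mining is to score every candidate with the current model and keep the highest-scoring items that the user did not engage with. The sketch below follows that heuristic. It is an assumption made for illustration rather than a method prescribed by this lesson: the function name `mine_hard_negatives`, the item IDs, and the embeddings are all made up, and "not in the positive set" is used as a rough stand-in for "not relevant," which in practice can mislabel items the user simply has not seen yet.

```python
def mine_hard_negatives(user_emb, item_embs, positive_ids, k):
    """Return the k non-positive item IDs that score highest for this user."""
    scored = [
        (sum(u * v for u, v in zip(user_emb, emb)), item_id)
        for item_id, emb in item_embs.items()
        if item_id not in positive_ids
    ]
    # Highest similarity first: these are the negatives the model finds hardest.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item_id for _, item_id in scored[:k]]


# Hypothetical item embeddings keyed by made-up item IDs.
item_embs = {
    "cooking_101": [0.9, 0.1],
    "cooking_202": [0.8, 0.3],
    "car_review": [-0.7, 0.6],
    "news_clip": [0.1, -0.9],
}
user_emb = [1.0, 0.0]
hard_negatives = mine_hard_negatives(user_emb, item_embs, {"cooking_101"}, k=2)
print(hard_negatives)  # ['cooking_202', 'news_clip']
```

A random negative such as `car_review` is already far from the user embedding and teaches the model little, whereas `cooking_202` sits close to the positive and forces the model to learn a finer distinction.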
...