Video Recommendation: Ranking & Re-Ranking

Explore how to design precision ranking systems for video recommendations that balance multiple objectives like watch time, user satisfaction, freshness, and content diversity. Understand multi-head model architectures such as Wide & Deep and Deep & Cross Network, and implement re-ranking strategies for diversity, policy compliance, and fairness. This lesson equips you to create scalable, fair, and effective ranking systems in real-world scenarios.

The retrieval stage handed us a few hundred candidate videos from a two-tower ANN lookup optimized for recall across billions of items. Now the system must do something fundamentally different: score and order those candidates with high precision against multiple competing business and user objectives. This is the ranking stage, the precision layer of the recommendation funnel. The core interview question this lesson answers is direct and deceptively hard: how do you design a ranking system that simultaneously optimizes watch time, user satisfaction, content freshness, and diversity without letting any single objective dominate? YouTube’s production ranker faces exactly this challenge, and the design patterns it uses generalize across most large-scale recommendation systems. This lesson walks through multi-objective ranking formulation, the Wide & Deep and DCN architectures that power it, re-ranking for diversity and policy compliance, and fairness as a core design constraint.

Multi-objective ranking formulation

A single-score ranker that predicts only one thing, say expected watch time, will over-optimize for that signal at the expense of everything else. Users end up in autoplay rabbit holes consuming increasingly sensational content. Satisfaction drops. Creators who game watch time get disproportionate exposure. The system needs to predict multiple signals and combine them.

In a multi-objective ranking setup, the model produces several prediction heads from a shared backbone. Each head targets a distinct user engagement or satisfaction signal.

  • P(click): The probability that a user will click on the video thumbnail, capturing immediate interest.

  • E[watch time | click]: The expected watch duration given that the user clicks, measuring depth of engagement.

  • P(like) and P(share): Probabilities of explicit positive feedback, serving as proxies for genuine satisfaction.

  • P(not-dislike): The probability that the user does not actively dislike the content. Its complement, P(dislike) = 1 − P(not-dislike), acts as a penalty that pushes low-quality recommendations down the list.
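The shared-backbone layout described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production ranker: the class name `MultiHeadRanker`, the single hidden layer, the head names, and the random untrained weights are all assumptions made for the example. It shows only the structural idea that one shared representation feeds several task-specific output heads.

```python
import math
import random

# Hypothetical head names mirroring the signals listed above.
HEADS = ("p_click", "watch_time", "p_like", "p_share", "p_not_dislike")


def sigmoid(z: float) -> float:
    """Numerically stable logistic function for probability heads."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def softplus(z: float) -> float:
    """Smooth non-negative activation, used so watch time cannot go below 0."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


class MultiHeadRanker:
    """Toy model: one shared hidden layer, one linear head per objective."""

    def __init__(self, n_features: int, hidden: int = 8, seed: int = 0):
        rng = random.Random(seed)
        # Shared backbone weights (assumed untrained, random for illustration).
        self.w_shared = [
            [rng.gauss(0.0, 0.5) for _ in range(n_features)] for _ in range(hidden)
        ]
        # One weight vector per head, all reading the same hidden representation.
        self.w_heads = {
            name: [rng.gauss(0.0, 0.5) for _ in range(hidden)] for name in HEADS
        }

    def backbone(self, features):
        # ReLU over a linear projection of the input features.
        return [
            max(0.0, sum(w * x for w, x in zip(row, features)))
            for row in self.w_shared
        ]

    def predict(self, features):
        h = self.backbone(features)
        out = {}
        for name, w in self.w_heads.items():
            z = sum(wi * hi for wi, hi in zip(w, h))
            # Regression head for watch time, probability heads for the rest.
            out[name] = softplus(z) if name == "watch_time" else sigmoid(z)
        return out


preds = MultiHeadRanker(n_features=4).predict([0.2, -0.1, 0.7, 0.3])
for name, value in preds.items():
    print(f"{name}: {value:.3f}")
```

In a trained system each head would have its own loss (for example, a binary cross-entropy loss for the probability heads and a regression loss for watch time), with gradients from every head flowing back into the shared layers.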

The final ranking score combines these predictions through scalarization, a technique that reduces multiple objective values into a single scalar score, typically via a weighted sum or weighted product, so that candidates can be sorted in a single ordered list. For example, a simplified scoring function might look like $\text{score} = w_1 \cdot P(\text{click}) \cdot E[\text{watch time}] + w_2 \cdot P(\text{like}) + w_3 \cdot P(\text{share}) - w_4 \cdot P(\text{dislike})$ ...
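The simplified weighted-sum score above can be tried out on a handful of candidates. The sketch below is illustrative only: the `HeadPredictions` container, the candidate values, and the weights are all made up for the example, and P(dislike) is assumed to be available directly (it can be derived as 1 − P(not-dislike)).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HeadPredictions:
    """Per-candidate outputs of the prediction heads (hypothetical container)."""
    video_id: str
    p_click: float
    watch_time: float  # expected seconds watched, given a click
    p_like: float
    p_share: float
    p_dislike: float   # assumed to equal 1 - P(not-dislike)


def scalarize(p: HeadPredictions, w1: float, w2: float, w3: float, w4: float) -> float:
    """Weighted-sum scalarization following the simplified formula above."""
    return (
        w1 * p.p_click * p.watch_time
        + w2 * p.p_like
        + w3 * p.p_share
        - w4 * p.p_dislike
    )


def rank(candidates, w1, w2, w3, w4):
    """Sort candidates by descending scalarized score."""
    return sorted(
        candidates,
        key=lambda p: scalarize(p, w1, w2, w3, w4),
        reverse=True,
    )


# Made-up predictions for three candidate videos.
candidates = [
    HeadPredictions("a", p_click=0.30, watch_time=120.0, p_like=0.05, p_share=0.01, p_dislike=0.02),
    HeadPredictions("b", p_click=0.50, watch_time=60.0, p_like=0.01, p_share=0.00, p_dislike=0.20),
    HeadPredictions("c", p_click=0.10, watch_time=280.0, p_like=0.10, p_share=0.05, p_dislike=0.01),
]

# Watch time only: every other weight set to zero.
watch_only = [p.video_id for p in rank(candidates, w1=1.0, w2=0.0, w3=0.0, w4=0.0)]

# Multi-objective: illustrative weights that also reward likes/shares and penalize dislikes.
balanced = [p.video_id for p in rank(candidates, w1=1.0, w2=50.0, w3=100.0, w4=200.0)]

print("watch-time only:", watch_only)
print("multi-objective:", balanced)
```

With these made-up numbers, the watch-time-only ordering puts the frequently disliked video "b" in second place, while the multi-objective weights move it to the bottom and promote "c", which has stronger like and share signals. The weights control the trade-off between objectives, so in practice they would be tuned rather than hand-picked as they are here.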