Video Recommendation: Evaluation, Serving & Trade-Offs
Explore evaluation techniques, including offline metrics like Recall@K and NDCG@K and online A/B testing to measure real user impact. Understand how the two-stage serving pipeline delivers recommendations within a millisecond-scale latency budget, and how to handle metric cannibalization by balancing watch time with user satisfaction. This lesson helps you design and articulate a scalable video recommendation system from evaluation through serving.
A ranked feed means nothing if you cannot prove it works, serve it in under 200 milliseconds, and prevent it from slowly eroding the trust of the people it serves. The previous lesson established multi-objective ranking with Wide & Deep and DCN architectures, re-ranking for diversity and policy compliance, and fairness as a core design concern. Now the question shifts from what to rank to three interview-critical problems that separate strong candidates from average ones. How do you evaluate a recommendation system both offline and online? How does the two-stage serving pipeline operate end to end under production latency constraints? And what happens when optimizing one metric systematically cannibalizes another?
Picture this scenario: an interviewer asks you to design the evaluation and serving layer for a YouTube-scale video recommendation system. Your answer needs to span metrics, infrastructure, and trade-off reasoning, all in a coherent narrative. This lesson walks through each of those layers and closes with an L4/L5/Staff+ answer comparison so you can calibrate exactly how deep to go at your target level.
Offline evaluation with NDCG and Recall
Before any model touches live traffic, offline evaluation acts as the first quality gate. Think of it as a dress rehearsal: you test against historical data to catch obvious failures before they reach real users.
Recall@K for candidate generation
The candidate generation stage retrieves a shortlist from a corpus of millions. Recall@K measures what fraction of truly relevant videos appears in that top-K retrieved set. If the retrieval model misses good candidates, the downstream ranker never gets a chance to surface them. Low Recall@K starves the entire pipeline.
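As a minimal sketch of the definition above, Recall@K can be computed for a single user as the overlap between the top-K retrieved IDs and the set of relevant IDs, divided by the size of the relevant set. The function name `recall_at_k` and the video IDs are illustrative assumptions, not part of any particular library.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant items that appear in the top-k retrieved list."""
    relevant = set(relevant)
    if not relevant:
        # No relevant items for this user: define recall as 0.0 (an assumption;
        # some pipelines skip such users instead).
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical example: retrieval output (best first) and held-out relevant videos.
retrieved = ["v7", "v2", "v9", "v4", "v1"]
relevant = {"v2", "v4", "v8"}

recall_3 = recall_at_k(retrieved, relevant, k=3)  # only v2 is in the top 3
recall_5 = recall_at_k(retrieved, relevant, k=5)  # v2 and v4 are in the top 5
```

In practice this per-user value is usually averaged over many users, and K is chosen to match the number of candidates the retrieval stage hands to the ranker.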
NDCG@K for ranking quality
Once candidates are retrieved, the ranking model orders them, and NDCG@K (Normalized Discounted Cumulative Gain) serves as a primary metric for evaluating that order. NDCG discounts each item's gain by its position, so a relevant video at position 1 contributes far more to the score than the same video at position 10, mirroring the natural decay of user attention during scrolling.
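The position discount can be made concrete with a short sketch. It assumes the common linear-gain form, DCG@K = sum of gain_i / log2(i + 1) over positions i = 1..K, normalized by the DCG of the ideal (descending) ordering of the same gains. Some teams use an exponential gain (2^rel - 1) instead, and the function names here are illustrative.

```python
import math


def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k positions (linear gain)."""
    # enumerate is 0-based, so position i + 1 gets discount log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(gains, k):
    """DCG normalized by the DCG of the ideal ordering of the same gains."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0


# Graded relevance of ranked videos, listed in the order the model showed them.
perfect = ndcg_at_k([3, 2, 1, 0], k=4)  # already in ideal order
buried = ndcg_at_k([0, 0, 3], k=3)      # the only relevant video sits at position 3
```

Moving the single relevant video from position 1 to position 3 halves the score in this example, because the discount at position 3 is log2(4) = 2.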
Ground-truth relevance labels in video recommendation are typically derived from implicit feedback, such as watch time buckets, likes, and shares, rather than explicit star ratings. This makes label construction itself a design decision.
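To illustrate why label construction is a design decision, here is one hypothetical mapping from implicit feedback to a graded relevance label. The bucket thresholds, the engagement bonus, and the 0 to 4 scale are all assumptions chosen for illustration; a real system would tune them against its own data.

```python
def relevance_label(watch_fraction, liked=False, shared=False):
    """Map implicit feedback to a graded relevance label in the range 0..4.

    watch_fraction: share of the video the user watched, from 0.0 to 1.0.
    All thresholds below are hypothetical and exist only for illustration.
    """
    if watch_fraction < 0.1:
        base = 0  # near-immediate abandon
    elif watch_fraction < 0.5:
        base = 1  # partial watch
    elif watch_fraction < 0.9:
        base = 2  # mostly watched
    else:
        base = 3  # effectively completed
    # Explicit engagement adds one grade, capped at the top of the scale.
    bonus = 1 if (liked or shared) else 0
    return min(base + bonus, 4)


skipped = relevance_label(0.05)
mostly_watched = relevance_label(0.6)
completed_and_liked = relevance_label(0.95, liked=True)
```

Changing any threshold changes what the offline metrics reward, so a choice like this shapes which models look good before a single user sees them.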
Attention: Offline metrics measure correlation with historical behavior. They cannot capture novelty, diversity, or long-term satisfaction effects. A model that perfectly replicates past clicks may still bore users with repetitive content....