Video Recommendation: Evaluation, Serving & Trade-Offs

Explore evaluation techniques including offline metrics like Recall@K and NDCG@K, plus online A/B testing to measure real user impact. Understand how the two-stage serving pipeline delivers results within a millisecond-scale latency budget, and how to handle metric cannibalization by balancing watch time against user satisfaction. This lesson helps you design and articulate a scalable video recommendation system from evaluation through serving.

We'll cover the following...

Offline evaluation with NDCG and Recall
- Recall@K for candidate generation
- NDCG@K for ranking quality
Online evaluation and A/B testing
- Primary and guardrail metrics
- Interleaving experiments
Two-stage serving pipeline architecture
- Stage 1: Candidate generation
- Stage 2: Ranking and re-ranking
  - Re-ranking layer
Metric cannibalization as a design challenge
L4, L5, and Staff+ answer comparison
Summary

A ranked feed means nothing if you cannot prove it works, serve it in under 200 milliseconds, and prevent it from slowly eroding the trust of the people it serves. The previous lesson established multi-objective ranking with Wide & Deep and DCN architectures, re-ranking for diversity and policy compliance, and fairness as a core design concern. Now the question shifts from what to rank to three interview-critical problems that separate strong candidates from average ones. How do you evaluate a recommendation system both offline and online? How does the two-stage serving pipeline operate end to end under production latency constraints? And what happens when optimizing one metric systematically cannibalizes another?

Picture this scenario: an interviewer asks you to design the evaluation and serving layer for a YouTube-scale video recommendation system. Your answer needs to span metrics, infrastructure, and trade-off reasoning, all in a coherent narrative. This lesson walks through each of those layers and closes with an L4/L5/Staff+ answer comparison so you can calibrate exactly how deep to go at your target level.

Offline evaluation with NDCG and Recall

Before any model touches live traffic, offline evaluation acts as the first quality gate. Think of it as a dress rehearsal: you test against historical data to catch obvious failures before they reach real users.

Recall@K for candidate generation

The candidate generation stage retrieves a shortlist from a corpus of millions. Recall@K measures what fraction of truly relevant videos appears in that top-K retrieved set. If the retrieval model misses good candidates, the downstream ranker never gets a chance to surface them. Low Recall@K starves the entire pipeline.
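As a minimal sketch of the definition above, Recall@K can be computed for a single user as follows. The function name `recall_at_k`, the video IDs, and the choice to return 0.0 for a user with no relevant videos are illustrative assumptions, not part of any particular library.

```python
from typing import Hashable, Sequence, Set


def recall_at_k(retrieved: Sequence[Hashable], relevant: Set[Hashable], k: int) -> float:
    """Fraction of the relevant videos that appear in the top-k retrieved list."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant:
        # Assumed convention: a user with no relevant videos scores 0.0.
        # Some evaluation setups exclude such users from the average instead.
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant) / len(relevant)


# Hypothetical example: the retrieval model returns five candidates,
# and the user actually engaged with three videos in the held-out data.
retrieved = ["v7", "v2", "v9", "v4", "v1"]
relevant = {"v2", "v4", "v8"}

print(recall_at_k(retrieved, relevant, k=3))  # only v2 is in the top 3
print(recall_at_k(retrieved, relevant, k=5))  # v2 and v4 are in the top 5
```

In practice this per-user value would typically be averaged over a held-out set of users. Note that v8 is never retrieved, so no choice of ranker downstream can recover it.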

NDCG@K for ranking quality

Once candidates are retrieved, the ranking model orders them, and NDCG@K (Normalized Discounted Cumulative Gain) serves as a primary metric for evaluating that order. NDCG applies a logarithmic position discount, so a relevant video at position 1 contributes far more to the score than the same video at position 10, mirroring the natural decay of user attention during scrolling. Dividing by the score of the ideal ordering normalizes the result to a value between 0 and 1.
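The position sensitivity described above can be sketched in a few lines. This version assumes the common log2 discount and a linear gain (the raw relevance label); an exponential gain of 2^rel - 1 is another widely used variant. It also assumes the ideal ordering is built from the same list being scored. The function names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions.

    Position 1 is divided by log2(2) = 1, position 2 by log2(3), and so on.
    """
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """DCG of the given order divided by the DCG of the best possible order."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # assumed convention when nothing in the list is relevant
    return dcg_at_k(relevances, k) / ideal_dcg


# Relevance labels listed in the order the ranker placed the videos.
print(ndcg_at_k([3, 2, 1, 0], k=4))  # already in ideal order
print(ndcg_at_k([0, 0, 3], k=3))     # the only relevant video is buried at position 3
```

The second call shows the penalty for burying a good video: the same label earns half as much at position 3 (discount log2(4) = 2) as it would at position 1.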

Ground-truth relevance labels in video recommendation are typically derived from implicit feedback, such as watch time buckets, likes, and shares, rather than explicit star ratings. This makes label construction itself a design decision.
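To make the "label construction is a design decision" point concrete, here is one hypothetical way to map implicit feedback to a graded relevance label. The bucket thresholds, the 0 to 4 scale, and the one-point bonus for a like or share are all assumptions chosen for illustration; a real system would tune them against its own data.

```python
def relevance_label(watch_fraction: float, liked: bool = False, shared: bool = False) -> int:
    """Map implicit feedback for one impression to a graded label in 0..4.

    watch_fraction is the share of the video the user watched (0.0 to 1.0).
    All thresholds below are hypothetical.
    """
    if not 0.0 <= watch_fraction <= 1.0:
        raise ValueError("watch_fraction must be between 0.0 and 1.0")

    # Watch time buckets: longer completion implies stronger relevance.
    if watch_fraction >= 0.8:
        label = 3
    elif watch_fraction >= 0.5:
        label = 2
    elif watch_fraction >= 0.1:
        label = 1
    else:
        label = 0

    # Explicit engagement signals add one grade, capped at the top of the scale.
    if liked or shared:
        label = min(label + 1, 4)
    return label


print(relevance_label(0.9, liked=True))   # long watch plus a like
print(relevance_label(0.6))               # moderate watch, no engagement
print(relevance_label(0.05, shared=True)) # barely watched but shared
```

Each choice here changes what the offline metric rewards. For example, a raw watch-fraction threshold favors short videos that are easy to finish, which is one reason teams often revisit these definitions.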

Attention: Offline metrics such as Recall@K and NDCG@K measure agreement with logged historical behavior. On their own, they do not capture novelty, diversity, or long-term satisfaction effects. A model that perfectly replicates past clicks may still bore users with repetitive content.

...
