Social Feed Ranking: Evaluation & Experimentation
Explore comprehensive evaluation methods for social feed ranking systems to prevent metric cannibalization and ensure fairness. Understand guardrail metrics, interleaving for fast ranking comparison, creator equity measurement, and fairness evaluations. This lesson equips you to design layered experiment frameworks that address engagement, creator diversity, and bias risks critical for senior ML system design interviews.
With a multi-task MMoE architecture, constrained scalarization, and a long-term value head in place, the ranking model is ready to score content. But how do you prove it actually works without introducing hidden regressions? This is the question that separates competent ML engineers from Staff+ candidates in system design interviews. Facebook learned this lesson the hard way when its engagement-optimized feed eroded meaningful social interactions, prompting a major ranking overhaul in 2018. The problem was not the model itself but an evaluation framework that did not catch slow-moving damage to platform health.
This lesson covers the four pillars of a robust evaluation strategy for social feed ranking. First, guardrail metrics that prevent metric cannibalization during A/B testing. Second, interleaving as a fast screening method for ranking comparison. Third, creator equity as a core experiment outcome. Fourth, fairness evaluation as a hard design constraint. Mastering these pillars gives you a layered evaluation narrative that interviewers expect at senior levels.
Metric cannibalization in A/B testing
Understanding the failure mode
Metric cannibalization occurs when optimizing for a primary metric such as clicks or session time systematically worsens a secondary metric: unfollow rate rises, or 28-day retention or content diversity falls. Standard A/B tests are particularly vulnerable to this because they typically run for one to two weeks, long enough to capture short-term engagement lifts but too short to observe the slow erosion of connection quality, which can take four weeks or more to become visible.
Consider a concrete scenario. A new ranking model increases click-through rate by 3%, and the experiment looks like a clear win after two weeks. However, over four weeks the creator mute rate climbs by 8%. Users are clicking more on sensational content but quietly disconnecting from creators they used to value. The short experiment window never surfaces this net negative.
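The scenario above can be sketched as a simple check. This is a minimal illustration, not a production method: the names MetricReading and find_cannibalization, the 1% tolerance, and all control and treatment values are hypothetical, chosen only to reproduce the 3% click-through lift and the 8% mute-rate increase. The week-two mute-rate reading is likewise an assumed value.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class MetricReading:
    """One metric measured in control and treatment (hypothetical structure)."""
    name: str
    control: float
    treatment: float
    higher_is_better: bool


def relative_change(control: float, treatment: float) -> float:
    """Relative change of treatment versus control, e.g. 0.03 for +3%."""
    return (treatment - control) / control


def find_cannibalization(
    primary: MetricReading,
    secondaries: List[MetricReading],
    tolerance: float = 0.01,  # assumed: allow up to 1% relative degradation
) -> List[str]:
    """Return secondary metrics that degraded past tolerance while primary improved."""
    primary_delta = relative_change(primary.control, primary.treatment)
    improved = primary_delta > 0 if primary.higher_is_better else primary_delta < 0
    if not improved:
        return []
    flagged = []
    for metric in secondaries:
        delta = relative_change(metric.control, metric.treatment)
        # Express the change as "harm": positive means the metric got worse.
        harm = -delta if metric.higher_is_better else delta
        if harm > tolerance:
            flagged.append(metric.name)
    return flagged


# Illustrative values only: CTR is up 3% in both readings.
ctr = MetricReading("click_through_rate", 0.100, 0.103, higher_is_better=True)
# Assumed week-two reading: mute rate is nearly flat (+0.5%).
mute_week2 = MetricReading("creator_mute_rate", 0.0200, 0.0201, higher_is_better=False)
# Week-four reading: mute rate is up 8%.
mute_week4 = MetricReading("creator_mute_rate", 0.0200, 0.0216, higher_is_better=False)

flagged_week2 = find_cannibalization(ctr, [mute_week2])
flagged_week4 = find_cannibalization(ctr, [mute_week4])
```

With these inputs the two-week reading flags nothing, while the four-week reading flags the creator mute rate. The same model and the same check give opposite verdicts depending only on how long the experiment runs.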
Attention: Metric cannibalization is the single most common evaluation failure in social feed ranking. If your interview answer only mentions engagement metrics, you are leaving a critical gap.
Designing guardrail metrics
Guardrail metrics solve this problem by acting as hard constraints that must not regress beyond a ...