Slice-Based Evaluation

Explore how slice-based evaluation helps you identify hidden performance issues in machine learning models by analyzing metrics on specific user segments. Understand how to define and use slicing functions, compute metrics with statistical confidence, and enforce per-slice thresholds to ensure robust system quality before deployment. This lesson guides you through institutionalizing slice-based evaluation as an essential practice for scalable, fair, and reliable ML systems.

We'll cover the following...

Defining slices with slicing functions
- What slicing functions are
- Where slices come from
Computing per-slice metrics and setting thresholds
- Running the per-slice evaluation loop
- Setting asymmetric thresholds
  - The slice performance matrix
Institutionalizing slice-based evaluation
Conclusion

An Airbnb search ranking model reports an overall NDCG@10 of 0.78 during a routine model review. The number looks healthy, the team signs off, and the model ships to production. Two weeks later, a product manager in Jakarta notices that search results for Southeast Asian listings have become nearly unusable, with local NDCG@10 sitting at 0.51. The aggregate metric never revealed the problem because 80% of evaluation traffic originates from North America and Europe, where the model performs well. Because the aggregate is effectively a traffic-weighted average, strong performance on the majority distribution drowns out severe degradation on minority segments.
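The arithmetic behind this masking effect can be checked in a few lines. The sketch below assumes a simplified two-segment split that the scenario does not state explicitly: 80% of evaluation queries come from the majority regions, and the remaining 20% all score the Southeast Asian NDCG@10 of 0.51. Only the values 0.78, 0.51, and 80% come from the scenario; the variable names and the two-segment simplification are illustrative.

```python
# Assumption: the aggregate NDCG@10 is a traffic-weighted mean over two segments.
# Only 0.78, 0.51, and the 80% share come from the scenario; the rest is illustrative.
majority_share = 0.80
minority_share = 1.0 - majority_share
aggregate_ndcg = 0.78
minority_ndcg = 0.51

# Solve: aggregate = majority_share * majority_ndcg + minority_share * minority_ndcg
implied_majority_ndcg = (
    aggregate_ndcg - minority_share * minority_ndcg
) / majority_share

# Recombine the two segments to confirm they reproduce the reported aggregate.
recombined = majority_share * implied_majority_ndcg + minority_share * minority_ndcg

print(f"Implied majority NDCG@10: {implied_majority_ndcg:.4f}")
print(f"Recombined aggregate:     {recombined:.4f}")
```

Under these assumptions the majority segments would need to score only about 0.85 for the aggregate to land at 0.78, so a gap of more than 0.3 between segments leaves the headline number looking routine.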

This failure pattern is not hypothetical. It recurs across recommendation systems, ad ranking, fraud detection, and search whenever teams rely on a single number to summarize model quality. The previous lesson on error analysis showed how to categorize failures and trace them to root causes. Slice-based evaluation formalizes that thinking into a repeatable pipeline that catches performance gaps before they reach production.

The core practice involves defining meaningful data subsets (called slices), computing metrics independently for each slice, and enforcing per-slice performance thresholds as release gates. In MAANG-level system design interviews, proactively proposing slice-aware evaluation when designing systems for heterogeneous user populations signals that you understand production ML realities, not just model training.
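As a rough illustration of those three steps, the sketch below defines slices as predicates over evaluation records, computes a mean metric per slice, and reports which slices fall below their gate. Everything in it is hypothetical: the record layout, the per-query `ndcg` values, the slice names, the threshold values, and the helper names `evaluate_slices` and `failing_slices` are assumptions made for illustration, not a prescribed implementation.

```python
from statistics import mean

# Hypothetical per-query evaluation records; the ndcg values are made up.
records = [
    {"region": "NA", "ndcg": 0.86},
    {"region": "NA", "ndcg": 0.84},
    {"region": "EU", "ndcg": 0.85},
    {"region": "EU", "ndcg": 0.83},
    {"region": "SEA", "ndcg": 0.52},
    {"region": "SEA", "ndcg": 0.50},
]

# Step 1: a slicing function maps one record to True (in the slice) or False.
slicing_functions = {
    "north_america": lambda r: r["region"] == "NA",
    "europe": lambda r: r["region"] == "EU",
    "southeast_asia": lambda r: r["region"] == "SEA",
}

# Hypothetical per-slice minimums used as release gates.
thresholds = {"north_america": 0.75, "europe": 0.75, "southeast_asia": 0.70}


def evaluate_slices(records, slicing_functions):
    """Step 2: compute the mean metric independently for each non-empty slice."""
    results = {}
    for name, in_slice in slicing_functions.items():
        scores = [r["ndcg"] for r in records if in_slice(r)]
        if scores:
            results[name] = mean(scores)
    return results


def failing_slices(slice_metrics, thresholds):
    """Step 3: list the slices whose metric is below its threshold."""
    return sorted(
        name for name, value in slice_metrics.items() if value < thresholds[name]
    )


slice_metrics = evaluate_slices(records, slicing_functions)
overall = mean(r["ndcg"] for r in records)
blocked = failing_slices(slice_metrics, thresholds)

print(f"Overall NDCG: {overall:.3f}")
for name, value in slice_metrics.items():
    print(f"  {name}: {value:.3f}")
print("Release blocked by:", blocked)
```

In this toy data the overall mean stays above 0.70 while the `southeast_asia` slice fails its gate, which is the situation a single aggregate number would hide. A real pipeline would also attach sample sizes and confidence intervals to each slice before gating on it.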

The following diagram illustrates how a single healthy-looking aggregate metric can conceal two critically failing slices:

Defining slices with slicing functions

What slicing functions are