
Slice-Based Evaluation

Explore how slice-based evaluation helps you identify hidden performance issues in machine learning models by analyzing metrics on specific user segments. Understand how to define and use slicing functions, compute metrics with statistical confidence, and enforce per-slice thresholds to ensure robust system quality before deployment. This lesson guides you through institutionalizing slice-based evaluation as an essential practice for scalable, fair, and reliable ML systems.

An Airbnb search ranking model reports an overall NDCG@10 of 0.78 during a routine model review. The number looks healthy, the team signs off, and the model ships to production. Two weeks later, a product manager in Jakarta notices that Southeast Asian listings have become nearly unusable, with local NDCG@10 sitting at 0.51. The aggregate metric never revealed the problem because 80% of evaluation traffic originates from North America and Europe, where the model performs well. Strong performance on the majority distribution mathematically drowns out severe degradation on minority segments.
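The masking effect is plain arithmetic: an aggregate computed as a mean over queries equals the traffic-weighted mean of the per-slice scores. The sketch below reproduces the scenario's numbers under stated assumptions. Only the 80% majority share, the 0.78 aggregate, and the 0.51 Southeast Asia score come from the scenario; the 10% shares and the other two per-region scores are hypothetical values chosen so the totals agree.

```python
# Illustrative numbers: only the 80% majority share, the 0.78 aggregate, and
# the 0.51 Southeast Asia score come from the scenario; the rest are assumed.
slices = {
    "north_america_europe": {"share": 0.80, "ndcg_at_10": 0.82},
    "southeast_asia": {"share": 0.10, "ndcg_at_10": 0.51},
    "rest_of_world": {"share": 0.10, "ndcg_at_10": 0.73},
}


def aggregate_metric(slices):
    """Traffic-weighted mean of the per-slice metric values."""
    total_share = sum(s["share"] for s in slices.values())
    weighted_sum = sum(s["share"] * s["ndcg_at_10"] for s in slices.values())
    return weighted_sum / total_share


overall = aggregate_metric(slices)
print(f"aggregate NDCG@10 = {overall:.2f}")  # prints 0.78
```

With these assumed shares, the Southeast Asia slice contributes only about 0.05 to the aggregate, so even a large drop in that slice moves the headline number by a few hundredths.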

This failure pattern is not hypothetical. It recurs across recommendation systems, ad ranking, fraud detection, and search whenever teams rely on a single number to summarize model quality. The previous lesson on error analysis showed how to categorize failures and trace them to root causes. Slice-based evaluation formalizes that thinking into a repeatable pipeline that catches performance gaps before they reach production.

The core practice involves defining meaningful data subsets (called slices), computing metrics independently for each slice, and enforcing per-slice performance thresholds as release gates. In MAANG-level system design interviews, proactively proposing slice-aware evaluation when designing systems for heterogeneous user populations signals that you understand production ML realities, not just model training.
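The three steps above (define slices, compute metrics per slice, enforce thresholds) can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production pipeline: the record fields (`device`, `pred`, `label`), the `gate_release` helper, and the 0.80 threshold are all hypothetical choices made for the example.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]


def precision(records: List[Record]) -> float:
    """Fraction of predicted positives that are truly positive."""
    predicted_pos = [r for r in records if r["pred"] == 1]
    if not predicted_pos:
        return float("nan")  # undefined when nothing was predicted positive
    return sum(r["label"] == 1 for r in predicted_pos) / len(predicted_pos)


def gate_release(
    records: List[Record],
    slices: Dict[str, Callable[[Record], bool]],
    min_precision: float,
) -> Tuple[Dict[str, float], List[str]]:
    """Compute precision per slice and list the slices below the threshold."""
    report = {
        name: precision([r for r in records if belongs(r)])
        for name, belongs in slices.items()
    }
    # `not (p >= threshold)` also treats NaN (an empty slice) as a failure.
    failing = sorted(n for n, p in report.items() if not (p >= min_precision))
    return report, failing


# Hypothetical evaluation set: desktop is perfect, mobile is half wrong.
records = (
    [{"device": "desktop", "pred": 1, "label": 1}] * 8
    + [{"device": "mobile", "pred": 1, "label": 1}] * 2
    + [{"device": "mobile", "pred": 1, "label": 0}] * 2
)
slices = {
    "overall": lambda r: True,
    "desktop": lambda r: r["device"] == "desktop",
    "mobile": lambda r: r["device"] == "mobile",
}

report, failing = gate_release(records, slices, min_precision=0.80)
print(report)
print("blocking slices:", failing)
```

In this toy data the overall precision clears the 0.80 bar while the mobile slice does not, so a gate that checked only the aggregate would let the release through.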

The following diagram illustrates how a single healthy-looking aggregate metric can conceal two critically failing slices:

Aggregate precision of 0.91 masks underperforming slices below the 0.80 threshold

With this failure mode clearly visible, the next question becomes how to systematically define the slices that matter.

Defining slices with slicing functions

What slicing functions are

A slicing function is a programmatic predicate that takes a single data point as input and returns a boolean indicating whether that data point belongs to a particular evaluation subset. Slicing functions partition the evaluation dataset into meaningful subgroups. Think of it like a SQL WHERE clause ...
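As a rough illustration of that definition, a slicing function can be written as an ordinary Python predicate over one example. The field names (`region`, `account_age_days`), the 30-day cutoff, and the `apply_slices` helper below are hypothetical and exist only for this sketch.

```python
from typing import Callable, Dict, List

Example = Dict[str, object]
SlicingFunction = Callable[[Example], bool]


def is_southeast_asia(x: Example) -> bool:
    """True when the example comes from the Southeast Asia region."""
    return x["region"] == "southeast_asia"


def is_new_user(x: Example) -> bool:
    """True when the account is younger than 30 days (assumed cutoff)."""
    return x["account_age_days"] < 30


def apply_slices(
    dataset: List[Example], slicing_functions: List[SlicingFunction]
) -> Dict[str, List[Example]]:
    """Filter the dataset once per slicing function, keyed by function name."""
    return {
        sf.__name__: [x for x in dataset if sf(x)] for sf in slicing_functions
    }


dataset = [
    {"region": "southeast_asia", "account_age_days": 10},
    {"region": "southeast_asia", "account_age_days": 400},
    {"region": "north_america", "account_age_days": 5},
    {"region": "europe", "account_age_days": 900},
]

subsets = apply_slices(dataset, [is_southeast_asia, is_new_user])
for name, rows in subsets.items():
    print(name, len(rows))
```

Because each predicate is evaluated independently, one example can fall into several slices at once (the first row here matches both), and an example can match none.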