Class Imbalance and Data Sampling

Explore strategies to handle class imbalance in machine learning systems, including oversampling, undersampling, hard negative mining, and stratified sampling. Understand their impact on model calibration, evaluation validity, and business outcomes. Gain insight into production challenges and best practices to design more reliable and effective ML systems.

We'll cover the following...

Oversampling and undersampling strategies
- Oversampling techniques
- Undersampling and hybrid approaches
Hard negative mining for retrieval systems
- Why negative quality drives model quality
- The mining pipeline and its failure modes
Stratified sampling for rare event detection
Choosing sampling strategies in interviews

A fraud detection model can report 99.9% accuracy on an imbalanced dataset while never predicting the fraud class. In ad click prediction, a model that predicts “no click” for every impression still reaches 98% accuracy when the click-through rate is 2%. These failures are common in highly imbalanced classification problems. They can occur when a model is trained with standard cross-entropy on a severely skewed label distribution without class weighting, resampling, or an appropriate decision threshold. The majority class can dominate the training signal, and the model may learn a trivial decision rule that predicts the majority label for most inputs.
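
The gap between accuracy and usefulness is easy to reproduce. The following minimal sketch uses a made-up dataset (10 fraud cases in 10,000 transactions, chosen only to match the 99.9% figure) and a hypothetical helper, `accuracy_and_recall`, to score a model that always predicts the majority class.

```python
def accuracy_and_recall(y_true, y_pred, positive=1):
    """Return (accuracy, recall on the positive class)."""
    pairs = list(zip(y_true, y_pred))
    correct = sum(t == p for t, p in pairs)
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    actual_pos = sum(t == positive for t in y_true)
    accuracy = correct / len(y_true)
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, recall


# Illustrative data: 10 fraud cases among 10,000 transactions (0.1% positive).
y_true = [1] * 10 + [0] * 9_990

# Trivial model: always predict the majority (non-fraud) class.
y_pred = [0] * len(y_true)

accuracy, recall = accuracy_and_recall(y_true, y_pred)
print(f"accuracy={accuracy:.3f}, recall={recall:.3f}")
# prints "accuracy=0.999, recall=0.000"
```

Accuracy looks excellent while recall on the fraud class is zero, which is why imbalanced problems are usually judged on metrics tied to the positive class rather than on accuracy alone.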

Class imbalance is common in production ML systems. Fraud detection often has very low positive rates, sometimes well below 1%. Ad click-through rates are often in the low single digits. Rare disease diagnosis can involve extremely low prevalence rates, depending on the condition and population. In these settings, an unweighted model can often reduce loss by favoring the majority class and underpredicting rare positives.

The challenge extends beyond training accuracy. Sampling strategy directly affects model calibration: resampling changes the class prior the model sees during training, so its predicted probabilities no longer reflect true event likelihoods unless they are corrected. It affects evaluation validity, because naive test splits can contain too few positives to measure anything meaningful. It also affects downstream business metrics, because a fraud model that flags too many or too few transactions has real financial consequences.
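
One common way to undo the calibration shift is a prior correction applied at inference time. The sketch below is not taken from this lesson. It assumes that every positive was kept, that negatives were dropped uniformly at random, and that the model is well calibrated on the resampled data. The function name and the `keep_rate` parameter are hypothetical.

```python
def correct_for_negative_downsampling(p_sampled, keep_rate):
    """Map a probability learned on downsampled negatives back to the
    original class prior.

    p_sampled: model output on the resampled distribution, in [0, 1].
    keep_rate: fraction of negatives retained in training, in (0, 1].
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    return keep_rate * p_sampled / (keep_rate * p_sampled + 1.0 - p_sampled)


# Illustrative numbers: a true 1% positive rate, with 10% of negatives kept.
true_rate = 0.01
keep_rate = 0.10

# Positive rate the model sees after downsampling (about 9.2%).
sampled_rate = true_rate / (true_rate + keep_rate * (1.0 - true_rate))

# Applying the correction recovers the original rate (up to float rounding).
corrected = correct_for_negative_downsampling(sampled_rate, keep_rate)
print(f"sampled={sampled_rate:.4f}, corrected={corrected:.4f}")
```

With `keep_rate=1.0` the function returns its input unchanged, which is a quick sanity check that no correction is applied when nothing was downsampled.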

This lesson covers three pillars that address class imbalance in system design. The first is the mechanics of oversampling and undersampling, along with their calibration trade-offs. The second is hard negative mining for retrieval and ranking systems. The third is stratified sampling for rare event detection and evaluation integrity.

Oversampling and undersampling strategies

Handling class imbalance starts with two fundamental levers. Oversampling increases the representation of the minority class in the training set, while undersampling reduces the representation of the majority class. Both aim to rebalance the effective class ratio the model sees during training, but they achieve this through very different mechanisms with distinct failure modes.
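
As a rough illustration of the undersampling lever, the sketch below draws a random subset of the majority class without replacement, using only the standard library. The helper name `random_undersample`, the dataset sizes, and the 1:10 target ratio are assumptions made for the example, not recommendations.

```python
import random


def random_undersample(majority, target_size, seed=0):
    """Keep a random subset of majority examples, without replacement."""
    if target_size > len(majority):
        raise ValueError("target_size exceeds the number of majority examples")
    rng = random.Random(seed)  # seeded so the subset is reproducible
    return rng.sample(majority, k=target_size)


# Illustrative data: 100 fraud cases and 100,000 legitimate transactions.
fraud = [f"fraud_{i}" for i in range(100)]
legit = [f"legit_{i}" for i in range(100_000)]

# Keep 1,000 legitimate examples so the training ratio becomes 1:10.
legit_kept = random_undersample(legit, target_size=1_000)

rate_before = len(fraud) / (len(fraud) + len(legit))
rate_after = len(fraud) / (len(fraud) + len(legit_kept))
print(f"positive rate before={rate_before:.4f}, after={rate_after:.4f}")
```

No example is duplicated, but 99% of the legitimate transactions are discarded, so whatever information they carried is no longer available to the model.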

Oversampling techniques

The simplest form of oversampling is random duplication of minority class examples. If you have 100 fraud cases and 100,000 legitimate transactions, you copy those 100 fraud cases until the ratio is more balanced. This is easy to implement but creates a direct overfitting risk: the model can memorize the repeated examples rather than learn generalizable fraud patterns.
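
Random duplication can be sketched in a few lines of standard-library Python. The helper name `random_oversample` and the target size of 10,000 are hypothetical choices for illustration.

```python
import random


def random_oversample(minority, target_size, seed=0):
    """Grow the minority set to target_size by duplicating random examples."""
    if target_size < len(minority):
        raise ValueError("target_size is smaller than the minority set")
    rng = random.Random(seed)  # seeded so the duplicates are reproducible
    # Keep every original example, then add copies drawn with replacement.
    extra = rng.choices(minority, k=target_size - len(minority))
    return list(minority) + extra


# Illustrative data: the 100 fraud cases from the example above.
fraud = [f"fraud_{i}" for i in range(100)]

# Duplicate until there are 10,000 fraud rows (1:10 against 100,000 legitimate).
oversampled = random_oversample(fraud, target_size=10_000)

print(len(oversampled), len(set(oversampled)))
# prints "10000 100"
```

The row count grows to 10,000 while the number of distinct fraud cases stays at 100, which makes the memorization risk concrete. Duplication of this kind is normally applied only to the training split, after the data has been divided, so that copies of one example do not appear in both training and evaluation sets.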

...
