Class Imbalance and Data Sampling
Explore strategies to handle class imbalance in machine learning systems, including oversampling, undersampling, hard negative mining, and stratified sampling. Understand their impact on model calibration, evaluation validity, and business outcomes. Gain insight into production challenges and best practices to design more reliable and effective ML systems.
A fraud detection model can report 99.9% accuracy on a dataset where only 0.1% of transactions are fraudulent while never predicting the fraud class. In ad click prediction, a model that predicts “no click” for every impression can still reach 98% accuracy when only 2% of impressions are clicked. In both cases the accuracy simply equals the share of the majority class. These failures are common in highly imbalanced classification problems. They can occur when a model is trained with standard cross-entropy on a severely skewed label distribution without class weighting, resampling, or an appropriate decision threshold. The majority class can dominate the training signal, and the model may learn a trivial decision rule that predicts the majority label for most inputs.
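The accuracy trap described above can be reproduced in a few lines. This is a minimal sketch on a made-up label vector (100 fraud cases in 100,000 transactions, an assumption chosen for illustration); the helper names `accuracy` and `recall` are hypothetical and not taken from any library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model predicted as positive."""
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    actual_pos = sum(1 for t in y_true if t == positive)
    return true_pos / actual_pos if actual_pos else 0.0


# Illustrative data: 0.1% positive rate (assumed, not from a real dataset).
y_true = [1] * 100 + [0] * 99_900

# A trivial "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy={acc:.3f} recall={rec:.3f}")  # accuracy=0.999 recall=0.000
```

The model catches no fraud at all, yet accuracy alone would suggest it is nearly perfect, which is why recall, precision, or similar class-aware metrics are usually reported alongside it.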
Class imbalance is common in production ML systems. Fraud detection often has very low positive rates, sometimes well below 1%. Ad click-through rates are often in the low single digits. Rare disease diagnosis can involve extremely low prevalence rates, depending on the condition and population. In these settings, an unweighted model can often reduce loss by favoring the majority class and underpredicting rare positives.
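One way to see why the majority class dominates is to compute the cross-entropy of a model that ignores its inputs and outputs the base rate for every example. The sketch below is illustrative arithmetic only; the function name `constant_predictor_log_loss` and the 0.1% positive rate are assumptions.

```python
import math


def constant_predictor_log_loss(positive_rate):
    """Average binary cross-entropy (in nats) of a model that outputs
    the base rate for every input and uses no features at all."""
    p = positive_rate
    if not 0.0 < p < 1.0:
        raise ValueError("positive_rate must be strictly between 0 and 1")
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


balanced_loss = constant_predictor_log_loss(0.5)   # equals ln 2
skewed_loss = constant_predictor_log_loss(0.001)   # 0.1% positives (assumed)

print(f"balanced: {balanced_loss:.4f}")  # balanced: 0.6931
print(f"skewed:   {skewed_loss:.4f}")    # skewed:   0.0079
```

On the skewed data, the feature-blind model already achieves a loss close to zero, so the remaining gradient signal from the rare positives is small compared with the balanced case.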
The challenge extends beyond training accuracy. Sampling strategy directly affects model calibration: a model trained on resampled data sees an artificial class ratio, so its predicted probabilities may no longer reflect true event likelihoods. It affects evaluation validity, because a naive test split can contain too few positives to measure anything meaningful. It also affects downstream business metrics, because a fraud model that flags too many or too few transactions has real financial consequences.
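The calibration effect can be made concrete for one specific case: random undersampling of the majority class. If each negative is kept with probability `keep_rate`, the odds the model learns are inflated by a factor of `1 / keep_rate`, and a standard prior-correction formula maps the score back. The sketch below assumes negatives were dropped uniformly at random and that the model is well calibrated on the resampled data; the function name is hypothetical.

```python
def correct_undersampled_probability(p_sampled, keep_rate):
    """Map a probability learned on negatively-undersampled data back to
    the original class distribution.

    p_sampled: model output on the resampled distribution, in [0, 1].
    keep_rate: fraction of negatives kept during undersampling, in (0, 1].
    """
    if not 0.0 < keep_rate <= 1.0:
        raise ValueError("keep_rate must be in (0, 1]")
    if not 0.0 <= p_sampled <= 1.0:
        raise ValueError("p_sampled must be in [0, 1]")
    # True odds = keep_rate * sampled odds, rewritten in probability form.
    return keep_rate * p_sampled / (keep_rate * p_sampled + 1.0 - p_sampled)


# Assumed scenario: 1% true prevalence, negatives thinned to a 50/50 training set.
keep_rate = 1 / 99
corrected = correct_undersampled_probability(0.5, keep_rate)
print(f"{corrected:.4f}")  # 0.0100
```

A score of 0.5 on the balanced training distribution corresponds to roughly the 1% base rate on real traffic, so using uncorrected scores as probabilities would overstate risk by a wide margin.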
This lesson covers three pillars for handling class imbalance in system design: oversampling and undersampling mechanics, with their calibration trade-offs; hard negative mining for retrieval and ranking systems; and stratified sampling for rare event detection and evaluation integrity.
Oversampling and undersampling strategies
Handling class imbalance starts with two fundamental levers. Oversampling increases the representation of the minority class in the training set, while undersampling reduces the representation of the majority class. Both aim to rebalance the effective class ratio the model sees during training, but they achieve this through very different mechanisms with distinct failure modes.
Oversampling techniques
The simplest form of oversampling is random duplication of minority class examples. If you have 100 fraud cases and 100,000 legitimate transactions, you copy those 100 fraud cases until the ratio is more balanced. This rebalances the training set but creates a direct overfitting risk: the model can memorize the repeated examples rather than learn generalizable fraud patterns.
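Random duplication with the numbers from this example can be sketched as follows. The function `random_oversample`, its `target_ratio` parameter, and the placeholder records are assumptions made for illustration, not part of any particular library.

```python
import random


def random_oversample(minority, majority, target_ratio, seed=0):
    """Duplicate minority examples (sampling with replacement) until
    len(minority) / len(majority) is approximately target_ratio.
    The original minority examples are always kept."""
    if not minority:
        raise ValueError("minority must contain at least one example")
    rng = random.Random(seed)
    target_count = round(len(majority) * target_ratio)
    n_extra = max(0, target_count - len(minority))
    extra = rng.choices(minority, k=n_extra)
    return minority + extra


# Placeholder records standing in for real transactions.
fraud = [("fraud", i) for i in range(100)]
legit = [("legit", i) for i in range(100_000)]

oversampled = random_oversample(fraud, legit, target_ratio=0.1)
n_unique = len(set(oversampled))
print(len(oversampled), n_unique)  # 10000 100
```

The rebalanced set contains 10,000 fraud rows but still only 100 distinct fraud records, which is the source of the memorization risk. Applying the duplication only to the training split, after the train/test split is made, avoids copies of the same record appearing on both sides.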
...