Ad CTR Prediction: Evaluation & Fairness
Explore how to evaluate ad click-through rate prediction models through offline metrics like AUC and log loss, conduct online A/B tests to measure real business impact, and analyze fairness to ensure equitable ad delivery. This lesson also covers ethical design considerations and practical mitigation strategies for bias in production systems.
In a MAANG system design interview, you have just walked through your CTR model architecture, explained your calibration strategy, and described your multi-task learning setup. The interviewer leans forward and asks, “Great, how would you actually know this model works in production?” This question separates candidates who can build models from those who can ship them. The answer requires a two-phase evaluation paradigm: offline metrics validate model quality before any user sees a prediction, while online experiments measure whether that quality translates into real business impact. A model with excellent ranking ability can still hemorrhage revenue if its probability estimates are poorly calibrated, and a perfectly calibrated model can still cause legal liability if it delivers ads unfairly across demographic groups.
This lesson walks through that full arc, moving from offline metrics to online experimentation, then into fairness analysis and ethical design considerations that Staff+ candidates are expected to raise proactively.
Offline evaluation metrics
Offline evaluation answers a focused question before deployment: does this model rank impressions well enough, and estimate click probabilities reliably enough, to enter a live auction? Three metrics form the standard toolkit for CTR prediction systems.
Core metrics for CTR models
AUC-ROC measures ranking discrimination. It equals the probability that the model scores a randomly chosen positive example (a click) higher than a randomly chosen negative example (a non-click). Because AUC depends only on the ordering of scores, any monotonic rescaling of the predictions leaves it unchanged. That makes it useful for assessing whether the model can separate clicks from non-clicks, but insufficient for verifying that the predicted probabilities themselves are trustworthy.
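To make the pairwise definition concrete, here is a minimal sketch that computes AUC by comparing every click against every non-click. The function name `pairwise_auc` and the toy labels and scores are illustrative assumptions, not part of any library or of this lesson's dataset. The quadratic loop is for clarity only; a real pipeline would typically use a rank-based implementation.

```python
from itertools import product


def pairwise_auc(labels, scores):
    """AUC as P(score of a random click > score of a random non-click).

    Ties between a click and a non-click count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one click and one non-click")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p, n in product(pos, neg)
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = click, 0 = non-click.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.30, 0.10, 0.25, 0.20, 0.28, 0.05]

auc_raw = pairwise_auc(labels, scores)

# Squaring is monotonic on [0, 1], so the ordering (and AUC) is unchanged
# even though every predicted probability is now much smaller.
auc_squared = pairwise_auc(labels, [s ** 2 for s in scores])
```

In this toy example the clicks win 5 of the 9 click versus non-click comparisons, and the squared scores produce exactly the same AUC, which is why AUC alone cannot tell you whether the probabilities are usable in an auction.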
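Log loss (cross-entropy loss) measures how well a model’s predicted probabilities match the true labels, penalizing confident predictions that turn out to be wrong most heavily. A model with strong AUC but poor log loss ranks impressions correctly yet is overconfident or underconfident in its estimates. Because the predicted click probability feeds into how each impression is valued in the auction, that miscalibration directly distorts auction economics.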
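The gap between ranking quality and probability quality can be sketched with a small example. The code below assumes hypothetical toy predictions and applies a monotonic transform (a fourth root) to inflate them: the ordering of impressions is identical, so AUC would not change, but log loss gets noticeably worse. The helper name `log_loss` and the clipping constant `eps` are choices made for this illustration.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)


# Hypothetical toy data: 2 clicks out of 10 impressions.
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
reasonable = [0.6, 0.1, 0.2, 0.1, 0.5, 0.1, 0.1, 0.1, 0.1, 0.1]

# Monotonic inflation: same ordering, but every estimate is pushed upward.
inflated = [p ** 0.25 for p in reasonable]

# Sorting impression indices by score gives the same order for both lists.
order_reasonable = sorted(range(len(labels)), key=reasonable.__getitem__)
order_inflated = sorted(range(len(labels)), key=inflated.__getitem__)

loss_reasonable = log_loss(labels, reasonable)
loss_inflated = log_loss(labels, inflated)
```

Here the inflated predictions average well above the observed 20% click rate, and their log loss is several times higher than that of the original predictions, even though the two lists rank the impressions identically.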
Calibration curves (reliability diagrams) provide a visual diagnostic. You bin predictions into deciles, plot the mean predicted probability against the observed click rate within each bin, and check alignment with the 45-degree diagonal. This is how you verify that the Platt ...