Ad CTR Prediction: Evaluation & Fairness

Explore how to evaluate ad click-through rate prediction models through offline metrics like AUC and log loss, conduct online A/B tests to measure real business impact, and analyze fairness to ensure equitable ad delivery. This lesson also covers ethical design considerations and practical mitigation strategies for bias in production systems.

We'll cover the following...

Offline evaluation metrics
- Core metrics for CTR models
Online evaluation and A/B testing
- Revenue-focused A/B tests
- Guardrail metrics
Fairness in ad delivery systems
- Demographic parity
- Counterfactual fairness
Ethical design and systemic bias
Bridging to serving and trade-offs

In a MAANG system design interview, you have just walked through your CTR model architecture, explained your calibration strategy, and described your multi-task learning setup. The interviewer leans forward and asks, “Great, how would you actually know this model works in production?” This question separates candidates who can build models from those who can ship them. The answer requires a two-phase evaluation paradigm: offline metrics validate model quality before any user sees a prediction, while online experiments measure whether that quality translates into real business impact. A model with excellent ranking ability can still hemorrhage revenue if its probability estimates are poorly calibrated, and a perfectly calibrated model can still expose the company to legal liability if it delivers ads unfairly across demographic groups.

This lesson walks through that full arc, moving from offline metrics to online experimentation, then into fairness analysis and ethical design considerations that Staff+ candidates are expected to raise proactively.

Offline evaluation metrics

Offline evaluation answers a focused question before deployment: are this model’s rankings and probability estimates good enough for it to enter a live auction? Three complementary measures form the standard toolkit for CTR prediction systems.

Core metrics for CTR models

AUC-ROC measures ranking discrimination. It is the probability that the model scores a randomly chosen positive example (a click) higher than a randomly chosen negative example (a non-click). Because AUC depends only on the ordering of the scores, any strictly increasing transformation of the predictions (for example, dividing every score by 10) leaves it unchanged. That makes it useful for assessing whether the model can separate clicks from non-clicks but insufficient for verifying that the predicted probabilities themselves are trustworthy.
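The pairwise definition of AUC translates almost directly into code. The sketch below is a minimal, brute-force illustration under stated assumptions: the helper name `pairwise_auc` and the toy labels and scores are hypothetical, and tied scores count as half a win, a common convention. It also checks the invariance claim by dividing every score by 10.

```python
from itertools import product


def pairwise_auc(labels, scores):
    """AUC as P(score of a random click > score of a random non-click); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one click and one non-click")
    wins = 0.0
    for p, n in product(pos, neg):  # every (click, non-click) pair
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = click, 0 = non-click.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.6, 0.65, 0.1, 0.8]
squashed = [s / 10 for s in scores]  # strictly increasing transform: same ordering, very different probabilities

print(pairwise_auc(labels, scores))    # 8/9 ≈ 0.889
print(pairwise_auc(labels, squashed))  # identical: AUC ignores the absolute values
```

The double loop costs one comparison per (click, non-click) pair, so it only suits small samples. At production scale a rank-based computation such as scikit-learn’s `sklearn.metrics.roc_auc_score` gives the same quantity far more efficiently.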

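Log loss (cross-entropy loss) measures how well a model’s predicted probabilities match the observed labels: LogLoss = -(1/N) Σ [y_i log(p_i) + (1 - y_i) log(1 - p_i)], where y_i is 1 for a click and 0 otherwise and p_i is the predicted click probability. Unlike AUC, it penalizes probabilities that are systematically too high or too low, so a model with strong AUC but poor log loss ranks well yet is overconfident or underconfident in its estimates. In an auction that weighs bids by predicted CTR, those biased estimates directly distort which ads win and how they are priced.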
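A small experiment shows why log loss catches problems that AUC misses. In this sketch (the `log_loss` helper and the data are hypothetical), doubling every prediction keeps the ordering of examples, and therefore the AUC, identical, yet the inflated model predicts an average CTR of 0.36 against an observed 0.2 and pays for it with a higher log loss.

```python
import math


def log_loss(labels, probs, eps=1e-15):
    """Mean binary cross-entropy: -(1/N) * sum(y*log(p) + (1-y)*log(1-p))."""
    if len(labels) != len(probs) or not labels:
        raise ValueError("need equal-length, non-empty inputs")
    total = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clip so log(0) can never occur
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)


# Hypothetical toy data: 2 clicks in 10 impressions (observed CTR = 0.2).
labels = [1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
baseline = [0.4, 0.2, 0.1, 0.1, 0.3, 0.2, 0.1, 0.2, 0.1, 0.1]
inflated = [2 * p for p in baseline]  # same ordering (same AUC), but every probability is too high

print(f"baseline log loss: {log_loss(labels, baseline):.4f}")  # 0.3316
print(f"inflated log loss: {log_loss(labels, inflated):.4f}")  # 0.3382
print(f"inflated mean pCTR: {sum(inflated) / len(inflated):.2f} vs observed {sum(labels) / len(labels):.2f}")
```

The hand-rolled function only makes the formula explicit; `sklearn.metrics.log_loss` computes the same metric and is the more practical choice for real evaluation pipelines.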

Calibration curves (reliability diagrams) provide a visual diagnostic. You bin predictions into deciles, plot the mean predicted probability against the observed click rate within each bin, and check alignment with the 45-degree diagonal. This is how you verify that the Platt ...
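The decile procedure can be sketched as a table instead of a plot. This is a minimal version under stated assumptions: `reliability_table` is a hypothetical helper, “deciles” is read as equal-frequency bins (ten bins holding the same number of predictions, which copes better with the skewed, low-probability distributions typical of CTR than fixed-width bins), and the toy data is constructed to be perfectly calibrated. Rows where the mean prediction equals the observed click rate are the points that would sit on the 45-degree diagonal.

```python
def reliability_table(labels, probs, n_bins=10):
    """Equal-frequency bins -> list of (mean predicted CTR, observed CTR, count)."""
    if len(labels) != len(probs):
        raise ValueError("labels and probs must have the same length")
    order = sorted(range(len(probs)), key=lambda i: probs[i])  # indices by ascending prediction
    n = len(order)
    rows = []
    for b in range(n_bins):
        idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        if not idx:  # happens only when there are fewer examples than bins
            continue
        mean_pred = sum(probs[i] for i in idx) / len(idx)
        observed = sum(labels[i] for i in idx) / len(idx)
        rows.append((mean_pred, observed, len(idx)))
    return rows


# Hypothetical toy data: 10 groups of 10 impressions; group g is predicted at g/10
# and exactly g of its 10 impressions are clicks, so the predictions are calibrated.
probs, labels = [], []
for g in range(10):
    probs += [g / 10] * 10
    labels += [1] * g + [0] * (10 - g)

inflated = [1.1 * p for p in probs]  # same ranking, but over-predicts CTR by 10%

for name, preds in [("calibrated", probs), ("inflated", inflated)]:
    rows = reliability_table(labels, preds)
    worst = max(abs(m - o) for m, o, _ in rows)
    print(f"{name}: largest |mean predicted - observed| across bins = {worst:.2f}")
```

To draw the actual reliability diagram, plot each row’s mean prediction against its observed rate next to the line y = x. scikit-learn’s `sklearn.calibration.calibration_curve` performs a similar binning, with `strategy="quantile"` selecting equal-frequency bins.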
