Ad CTR Prediction: Model Architecture
Explore key design decisions for ad click-through rate prediction models. Understand how to choose between Wide & Deep, Deep & Cross Network, and DLRM architectures based on memorization and generalization trade-offs. Learn calibration techniques like Platt scaling and isotonic regression to ensure auction accuracy. Discover multi-task learning approaches to optimize CTR, conversion rates, and long-term user value, improving both revenue and user experience.
In the previous lesson, you built four feature families (user profile, ad creative, context, and interaction features), compressed sparse IDs into dense embeddings, and split computation between real-time and pre-computed feature stores. Those features now arrive at the model’s input layer as a mix of sparse categorical embeddings and dense numerical vectors. The design question that follows is one of the most common in MAANG ML system design interviews: given this heterogeneous input under a sub-100 ms latency budget, which model architecture should you choose?
This lesson walks through three design decisions that determine whether your ad CTR prediction system actually works in production. First, you will compare three industry-standard architectures: Wide & Deep, Deep & Cross Network (DCN), and
Architecture comparison
The three architectures below represent the dominant approaches at Google, Meta, and LinkedIn for CTR prediction. Each handles the tension between memorization (recalling specific feature co-occurrences) and generalization (predicting clicks for unseen combinations) differently.
Wide & Deep
Google introduced Wide & Deep in 2016 to combine two complementary learning strategies in a single model. The wide component is a linear model that operates on raw features and manually engineered cross features. If user X has historically clicked ads from advertiser Y, the wide component memorizes that specific co-occurrence through a cross feature like user_id × advertiser_id. The deep component is a standard feed-forward network that takes dense embeddings as input and learns to generalize across unseen feature combinations. Both components are jointly trained end-to-end with a combined logistic loss.
The limitation is clear: the wide component’s power depends entirely on the quality of hand-crafted cross features. Feature engineering becomes a bottleneck as the feature space grows.
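The wide and deep paths described above can be sketched as a forward pass in NumPy. This is a minimal illustration under stated assumptions, not the published implementation: the names `WideAndDeep` and `cross_bucket`, the hashing scheme for the user_id × advertiser_id cross, the layer sizes, and the random initialization are all hypothetical choices made for the example. It shows only how the two logits are summed before a single sigmoid, which is what allows one logistic loss to train both components jointly.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def cross_bucket(user_id, advertiser_id, num_buckets):
    # Hypothetical hashing scheme: map the (user, advertiser) pair to one
    # slot of the wide weight vector. Real systems use a proper hash function.
    return (user_id * 1_000_003 + advertiser_id) % num_buckets


class WideAndDeep:
    """Forward pass only; all sizes and initial values are illustrative."""

    def __init__(self, num_cross_buckets, num_users, num_advertisers,
                 emb_dim=8, hidden=16, seed=0):
        rng = np.random.default_rng(seed)
        self.num_cross_buckets = num_cross_buckets
        # Wide part: one linear weight per hand-crafted cross feature.
        self.wide_w = np.zeros(num_cross_buckets)
        # Deep part: embeddings followed by a small feed-forward network.
        self.user_emb = rng.normal(0.0, 0.1, (num_users, emb_dim))
        self.adv_emb = rng.normal(0.0, 0.1, (num_advertisers, emb_dim))
        self.W1 = rng.normal(0.0, 0.1, (2 * emb_dim, hidden))
        self.b1 = np.zeros(hidden)
        self.w2 = rng.normal(0.0, 0.1, hidden)
        self.bias = 0.0

    def predict(self, user_id, advertiser_id):
        # Memorization: look up the weight of this exact co-occurrence.
        bucket = cross_bucket(user_id, advertiser_id, self.num_cross_buckets)
        wide_logit = self.wide_w[bucket]
        # Generalization: dense embeddings through a ReLU layer.
        x = np.concatenate([self.user_emb[user_id], self.adv_emb[advertiser_id]])
        h = np.maximum(0.0, x @ self.W1 + self.b1)
        deep_logit = h @ self.w2
        # Both logits are summed before one sigmoid, so a single logistic
        # loss would send gradients to the wide and deep parts together.
        return sigmoid(wide_logit + deep_logit + self.bias)


def log_loss(p, clicked):
    # Logistic loss for one impression; clicked is 0 or 1.
    return -(clicked * np.log(p) + (1 - clicked) * np.log(1.0 - p))
```

In this sketch, raising the wide weight for one (user, advertiser) bucket changes the prediction only for pairs that hash to that bucket, while the embeddings in the deep path are shared across every pair that involves the same user or advertiser. That difference is the memorization versus generalization split in miniature, and it also makes the limitation visible: the wide path only knows about crosses that someone chose to define.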
Deep & Cross Network (DCN)
DCN replaces the manual cross features with an explicit