Loss Functions and Optimization Choices

Explore the principles of selecting loss functions and optimization methods for machine learning models. Understand how classification and ranking losses, including cross-entropy, focal loss, BPR, LambdaRank, and LambdaMART, impact model performance and business metrics. Discover calibration techniques like Platt scaling and isotonic regression to ensure reliable probability outputs in production ML systems.

We'll cover the following...

Cross-entropy and focal loss
- Standard cross-entropy as the default
- Focal loss for imbalanced settings
Ranking losses for ordered retrieval
- Pairwise ranking with BPR
- Metric-aligned optimization with LambdaRank and LambdaMART
Model calibration in production
- Why calibration matters
- Diagnosing and fixing miscalibration
  - Reliability diagrams and metrics
  - Post-hoc calibration techniques
Summary

In an ads ranking interview, you confidently describe your multi-task model architecture, define the task heads, and then the interviewer asks: “Which loss function does each head optimize, and why?” The wrong answer here is not just a theoretical misstep. A model that maximizes AUC but produces miscalibrated click probabilities can distort bid prices in a second-price auction, silently costing millions in revenue before anyone notices. Loss function selection is a production design decision with direct consequences for ranking quality, calibration, and business metrics.

This lesson covers three topics that come up repeatedly in production ML systems and ML system design interviews. Classification losses, specifically cross-entropy and focal loss, handle per-item prediction tasks. Ranking losses, including BPR, LambdaRank, and LambdaMART, optimize the relative ordering of items for retrieval and search. Finally, calibration techniques ensure that the probabilities your model outputs actually mean what they claim. Each choice encodes assumptions about data distribution, task structure, and serving requirements.

Cross-entropy and focal loss

Standard cross-entropy as the default

Binary cross-entropy computes the negative log-likelihood of the correct class. For a single example with a true label $y \in \{0, 1\}$ and predicted probability $p$, the loss is $L = -[y \log(p) + (1-y) \log(1-p)]$. This formulation produces smooth gradients that push predicted probabilities toward the true label, making it the industry standard for CTR prediction heads, binary classification, and multi-class tasks (via its categorical extension).
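The formula above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function name `binary_cross_entropy` and the clamping constant `eps` are assumptions added here to keep the logarithm finite, and real training code would normally use a framework's numerically stable, batched version.

```python
import math

def binary_cross_entropy(y: int, p: float, eps: float = 1e-12) -> float:
    """Negative log-likelihood of label y (0 or 1) under predicted probability p."""
    # Clamp p away from 0 and 1 so log() stays finite.
    # eps is an illustrative choice, not a value from the lesson.
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# A confident correct prediction is penalized lightly,
# while a confident wrong prediction is penalized heavily.
print(binary_cross_entropy(1, 0.9))  # about 0.105
print(binary_cross_entropy(1, 0.1))  # about 2.303
```

The asymmetry between the two printed values is what drives predictions toward the true label: the loss grows without bound as the probability assigned to the correct class approaches zero.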

Cross-entropy works well when positive and negative examples are roughly balanced. The gradient signal from each example contributes meaningfully to parameter updates, and the resulting model outputs approximate true posterior probabilities under ideal conditions.

Note: Cross-entropy’s probabilistic interpretation is why it remains the default for any task where you need calibrated probability outputs, such as predicting P(click) in an ads system.

Focal loss for imbalanced settings

Production systems frequently encounter extreme class imbalance. In fraud detection, fewer than 1 in 10,000 transactions may be fraudulent. In dense object detection, a single image generates thousands of anchor boxes, and only a handful overlap with actual objects. Standard cross-entropy weights every example equally, so although each easy negative contributes only a small loss, their sheer number lets them dominate the total gradient. The model learns to predict “not fraud” or “background” with high confidence but struggles to identify the rare positives that matter.
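A quick back-of-the-envelope calculation makes the imbalance problem concrete. The numbers below are hypothetical, chosen only to mirror the 1-in-10,000 fraud ratio mentioned above: 9,999 easy negatives that the model already scores at 0.01, and one fraudulent transaction that it scores at 0.10.

```python
import math

# Hypothetical batch: 9,999 easy negatives and 1 hard positive.
n_negatives = 9_999
p_negative = 0.01  # assumed predicted P(fraud) for each easy negative
p_positive = 0.10  # assumed predicted P(fraud) for the one true fraud case

# Per-example cross-entropy: -log(1 - p) for a negative, -log(p) for a positive.
loss_per_negative = -math.log(1.0 - p_negative)
loss_positive = -math.log(p_positive)

# Sum the many small negative losses and compare with the single large one.
total_negative_loss = n_negatives * loss_per_negative
negative_share = total_negative_loss / (total_negative_loss + loss_positive)

print(f"loss per easy negative:      {loss_per_negative:.4f}")
print(f"loss for the hard positive:  {loss_positive:.4f}")
print(f"share of loss from negatives: {negative_share:.1%}")
```

Under these assumed scores, the single positive has a per-example loss a couple of hundred times larger than any one negative, yet the negatives together still account for well over 95% of the summed loss.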

Focal loss is a modified cross-entropy loss that adds a modulating factor $(1 - p_t)^\gamma$ ...
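As a rough sketch of how the modulating factor behaves, the snippet below assumes the commonly used form $FL(p_t) = -(1 - p_t)^\gamma \log(p_t)$, where $p_t$ is the probability the model assigns to the true class. The function name `focal_loss`, the value $\gamma = 2$, and the clamping constant `eps` are illustrative assumptions, and the optional class-balancing weight used in some variants is left out.

```python
import math

def focal_loss(y: int, p: float, gamma: float = 2.0, eps: float = 1e-12) -> float:
    """Focal loss for one example, assuming FL = -(1 - p_t)^gamma * log(p_t)."""
    # Clamp p so log() stays finite (eps is an illustrative choice).
    p = min(max(p, eps), 1.0 - eps)
    # p_t is the probability assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    # The modulating factor shrinks toward 0 as the example becomes easy (p_t near 1).
    return -((1.0 - p_t) ** gamma) * math.log(p_t)

# Easy negative (true label 0, predicted 0.01) versus hard positive (true label 1, predicted 0.10).
easy_ce = focal_loss(0, 0.01, gamma=0.0)  # gamma = 0 reduces to plain cross-entropy
easy_fl = focal_loss(0, 0.01, gamma=2.0)
hard_ce = focal_loss(1, 0.10, gamma=0.0)
hard_fl = focal_loss(1, 0.10, gamma=2.0)

print(f"easy negative: CE={easy_ce:.6f}  focal={easy_fl:.8f}")
print(f"hard positive: CE={hard_ce:.4f}  focal={hard_fl:.4f}")
```

With these assumed inputs, the easy negative's loss is scaled by a factor of about $10^{-4}$, while the hard positive keeps most of its loss, so the rare positives carry far more of the total gradient than they do under plain cross-entropy.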
