Code Generation: Evaluation & Privacy

Explore how to systematically evaluate code generation models using functional benchmarks, production metrics, and A/B testing to measure real user value. Understand privacy risks of memorization from training data and apply techniques like data deduplication, differential privacy, and output filtering to mitigate leakage. This lesson prepares you to design robust, privacy-conscious code generation systems that balance accuracy with legal and security requirements.

We'll cover the following...

Functional correctness benchmarks
Production proxy metrics and their limits
A/B testing developer retention
- Designing the experiment
- Practical challenges and guardrails
Memorization risk and training data leakage
- Differential privacy in code model training
  - The privacy-utility trade-off
- Data deduplication as the first line of defense
Summary

When GitHub Copilot first shipped at scale, it didn’t take long for developers to notice something unsettling: the model would occasionally emit entire blocks of code copied verbatim from open-source repositories, complete with license headers and even API keys. This wasn’t a theoretical concern. It triggered real legal scrutiny and forced the industry to treat privacy evaluation as a core design requirement alongside accuracy. For a Staff-level candidate designing an inline code completion product serving millions of developers, the interview question is never just “how good is the model?” It is always “how do you prove it works, and how do you guarantee it doesn’t leak?”

These two pillars, evaluation and privacy, are tightly coupled in production code generation systems. A model that achieves impressive benchmark scores but reproduces copyrighted training code creates legal and security liability that can block a launch entirely. This lesson walks through both pillars systematically, starting with offline benchmarks, moving through production proxy metrics and A/B testing, and then addressing memorization risk with differential privacy and deduplication.

Functional correctness benchmarks

The gold standard for offline evaluation of code generation is functional correctness. A generated code snippet is considered correct if and only if it passes all unit tests for the target problem. This is a binary, unambiguous signal. Unlike style-based or similarity-based metrics, it directly measures whether the code does what it should.
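To make the "passes all unit tests" criterion concrete, here is a minimal sketch of a correctness check. The function name `passes_all_tests` and the convention of expressing tests as plain `assert` statements are assumptions made for illustration, not part of any benchmark's official harness. It also calls `exec` directly, which is unsafe for untrusted model output; a real harness would run each candidate in a sandboxed process with a timeout.

```python
def passes_all_tests(candidate_source: str, test_source: str) -> bool:
    """Return True only if the candidate passes every assertion in test_source.

    Hypothetical helper for illustration. It is NOT sandboxed: never run
    untrusted generated code this way in production.
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function(s)
        exec(test_source, namespace)       # run the assert-based unit tests
    except Exception:
        # Syntax errors, runtime errors, and failed assertions all count as incorrect.
        return False
    return True


tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0"
good_candidate = "def add(a, b):\n    return a + b"
bad_candidate = "def add(a, b):\n    return a - b"

good_result = passes_all_tests(good_candidate, tests)
bad_result = passes_all_tests(bad_candidate, tests)
```

The result is deliberately binary: a candidate that fails a single assertion is scored the same as one that does not parse.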

Two public benchmarks dominate this space. HumanEval is a benchmark of 164 hand-written Python programming problems, each with a function signature, docstring, and a set of unit tests used to verify functional correctness. It measures performance using pass@k, the probability that at least one of $k$ generated samples passes all tests. MBPP (Mostly Basic Python Problems) is a complementary benchmark of approximately 974 crowd-sourced Python tasks with broader coverage but noisier, less carefully curated test cases. It also uses pass@k as its primary metric.
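In practice, pass@k is usually estimated rather than measured by drawing exactly $k$ samples. A commonly used unbiased estimator generates $n \ge k$ samples per problem, counts the number $c$ that pass, and computes $1 - \binom{n-c}{k} / \binom{n}{k}$. The sketch below assumes that estimator; the function names `pass_at_k` and `mean_pass_at_k` are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets made up only of failing samples;
    # it is 0 whenever fewer than k samples failed.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, given (n, c) pairs per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Three hypothetical problems, 10 samples each, with 3, 0, and 10 passing.
per_problem = [(10, 3), (10, 0), (10, 10)]
score_k1 = mean_pass_at_k(per_problem, k=1)
score_k5 = mean_pass_at_k(per_problem, k=5)
```

Sampling more than $k$ completions per problem reduces the variance of the estimate, which is why reported pass@1 numbers are often computed from many samples per problem.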

Attention: Public benchmarks carry contamination risk. If the model’s training data includes solutions to HumanEval or MBPP problems, reported scores become unreliable. Interviewers expect you to flag this.
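One simple way to probe for contamination is to measure n-gram overlap between benchmark reference solutions and training documents. The sketch below is an assumption-laden illustration, not a method this lesson prescribes: the function names, the whitespace tokenizer, and the default window of 5 tokens are all hypothetical choices.

```python
def token_ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of whitespace-token n-grams in text."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_score(benchmark_solution: str, training_doc: str, n: int = 5) -> float:
    """Fraction of the solution's n-grams that also occur in the training document."""
    target = token_ngrams(benchmark_solution, n)
    if not target:
        return 0.0  # solution shorter than n tokens: nothing to compare
    return len(target & token_ngrams(training_doc, n)) / len(target)


solution = "def add(a, b): return a + b"
leaked_doc = "# utils\ndef add(a, b): return a + b\nprint(add(1, 2))"
clean_doc = "def mul(x, y): return x * y"

leaked_score = contamination_score(solution, leaked_doc)
clean_score = contamination_score(solution, clean_doc)
```

A score near 1.0 suggests the solution appears almost verbatim in the training document. Exact-overlap checks like this miss paraphrased or reformatted solutions, so a low score does not prove a benchmark is clean.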

Both benchmarks share significant limitations that candidates must acknowledge in a design interview:

Narrow language scope: Both cover only Python, while production code completion systems serve dozens of languages.
...
