Code Generation: Problem Framing & Requirements
Explore how to frame AI-powered code generation system design, focusing on four completion surfaces and key constraints like latency and privacy. Understand primary business metrics such as accepted completion rate, productivity lift, and user retention. Learn to navigate the 200ms latency budget critical for inline suggestions and address privacy modes essential for enterprise use. Gain the ability to scope your solution appropriately by seniority level to prepare for ML system design interviews.
In the previous case study, we explored how retrieval-augmented generation systems are evaluated for factual grounding. Code generation introduces an entirely different set of pressures. When a developer is typing at full speed and your model’s suggestion arrives 50 milliseconds too late, it lands after the cursor has already moved on. The suggestion is dead on arrival. This is the defining tension of AI-powered code completion: the system must be fast enough to feel invisible, accurate enough to earn trust, and private enough to handle proprietary source code that represents millions of dollars in intellectual property.
Suppose you walk into an interview and hear: “Design the ML system behind a code completion product like GitHub Copilot.” This is not a modeling question. It is a full system design problem that spans model serving under extreme latency budgets, context retrieval from IDE state, privacy-sensitive data handling, and multi-surface product requirements. Tools like Cursor have expanded the surface area beyond simple autocomplete into agentic code generation, making the scoping decision even more consequential.
Before any architecture diagram or model selection, the first step is precise problem framing. You need to define what the system does, how success is measured, and what constraints are non-negotiable. That framing is the focus of this lesson.
Four completion surfaces
A code generation system does not serve a single use case. It supports multiple interaction patterns, each with fundamentally different requirements. Explicitly enumerating these surfaces in an interview, and then scoping your deep dive to one of them, demonstrates design maturity.
The four surfaces break down as follows:
Inline suggestions: The model predicts the next tokens as the developer types, rendering them as ghost text in the editor. This is the highest-frequency surface and demands the lowest latency because the developer never explicitly asks for help. The system must infer intent from what has been typed so far, with no explicit request to guide it.
Chat-based generation: The developer describes intent in natural language and receives a multi-line or multi-file code block. Latency tolerance stretches to several seconds, but output quality and correctness expectations are much stricter because the developer explicitly requested a solution.
Test generation: Given a function or class, the model produces unit tests. This requires understanding function signatures, edge cases, and testing frameworks specific to the project’s stack.
Documentation generation: The model produces docstrings, comments, or README content from code context. This demands summarization and ...
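One way to make this enumeration concrete in an interview is to write the surfaces down as a small table of requirements. The sketch below is illustrative only: the names Surface, SurfaceProfile, PROFILES, is_still_useful, and tightest_surface are hypothetical, the 200 ms figure is the inline budget cited in this lesson, and the 5000 ms chat figure is an assumed placeholder for "several seconds." Surfaces for which the lesson states no latency figure are left as None rather than guessed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Surface(Enum):
    INLINE = "inline suggestions"
    CHAT = "chat-based generation"
    TESTS = "test generation"
    DOCS = "documentation generation"


@dataclass(frozen=True)
class SurfaceProfile:
    model_input: str
    model_output: str
    # None means the lesson gives no latency figure for this surface.
    latency_budget_ms: Optional[int]


# Assumption: 5000 ms is a placeholder for "several seconds", not a measured value.
PROFILES = {
    Surface.INLINE: SurfaceProfile("code typed so far", "ghost text", 200),
    Surface.CHAT: SurfaceProfile(
        "natural-language intent", "multi-line or multi-file code", 5000
    ),
    Surface.TESTS: SurfaceProfile("function or class", "unit tests", None),
    Surface.DOCS: SurfaceProfile(
        "code context", "docstrings, comments, or README content", None
    ),
}


def is_still_useful(surface: Surface, elapsed_ms: float) -> bool:
    """Return False when a response arrived after the surface's known budget.

    Surfaces without a stated budget are never rejected in this sketch.
    """
    budget = PROFILES[surface].latency_budget_ms
    return budget is None or elapsed_ms <= budget


def tightest_surface() -> Surface:
    """Return the surface with the smallest stated latency budget."""
    stated = {
        surface: profile.latency_budget_ms
        for surface, profile in PROFILES.items()
        if profile.latency_budget_ms is not None
    }
    return min(stated, key=stated.get)
```

Under these assumptions, a response that takes 250 ms would be discarded on the inline surface but would be perfectly acceptable in chat, which is why the choice of surface drives most downstream serving decisions.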