The 6-Step ML SD Framework
Explore the comprehensive 6-step framework for machine learning system design interviews that guides you from problem scoping to monitoring. Understand how to clarify ambiguous prompts, develop data strategies, design models with justified trade-offs, evaluate performance offline and online, deploy systems according to production constraints, and maintain them through continuous monitoring and iteration.
An interviewer asks, “Design a fraud detection system for a payments platform.” You have 45 minutes to turn that prompt into a coherent design. Without a plan, many candidates fall into a common failure mode: they jump between model architectures, data pipelines, and evaluation metrics without a clear sequence, making the design hard to evaluate. The fix is not necessarily more ML knowledge; it is a structured approach.
The 6-step ML System Design framework gives you that structure. It is a sequential, interviewer-tested scaffold that ensures completeness, logical flow, and clear communication across any prompt you encounter. The six steps are as follows:
Problem clarification and scope
Data strategy
Model design
Evaluation (offline and online)
Serving and deployment
Monitoring and iteration
Each step builds on the one before it, and together they form a repeatable mental model you can apply whether the prompt is about recommendation systems, ad-click prediction, or retrieval-augmented generation.
This is not a rigid script. It is a thinking tool that signals senior-level design fluency to interviewers at MAANG companies. The rest of this lesson unpacks each step, explains why it sits where it does, and shows you when to break the rules.
The following diagram illustrates how the six steps connect as a pipeline, with a feedback loop from monitoring back to data strategy that reflects the iterative nature of production ML systems.
Steps 1–3: From ambiguity to architecture
The first part of the framework turns a vague system-design prompt into a concrete ML model design. Each step narrows the solution space before you make implementation decisions, so later decisions are based on explicit requirements and constraints.
Step 1: Problem clarification and scope
Every ML system design prompt is intentionally ambiguous. “Design a recommendation system” could mean home page feed ranking, email digest personalization, or related-item suggestions. Your first job is to convert that ambiguity into a precise problem statement.
Ask about the target user, the business objective, scale expectations, and latency constraints. If the interviewer says “recommend videos,” clarify whether the goal is to maximize watch time, increase content diversity, or reduce churn. These distinctions change everything downstream, from the loss function to the serving architecture.
Practical tip: Spend 3–5 minutes on scoping. Interviewers are specifically listening for whether you narrow the problem before jumping to solutions. Candidates who skip this step often end up designing the wrong system entirely.
Step 2: Data strategy
Models are only as good as the data that feeds them. This step belongs immediately after scoping because every model decision, from architecture to training procedure, depends on what signals you can actually obtain.
Cover these areas in your answer:
Data sources: Identify available logs, user profiles, contextual signals, and any third-party data. For the fraud-detection example, transaction logs, device fingerprints, and merchant history are all potential sources.
Labeling approach: Distinguish between explicit feedback (user ratings, reported fraud) and implicit feedback, meaning behavioral signals such as clicks, dwell time, or purchase completions that indirectly indicate user preference without the user explicitly providing a rating. Implicit labels are abundant but noisy; explicit labels are clean but sparse.
Feature engineering: Decide which raw signals become model features. Aggregations like “number of transactions in the last hour” or embeddings of categorical fields, such as merchant category, are typical choices.
Data freshness: Determine whether features need real-time computation or whether daily batch updates suffice. A fraud system likely needs near-real-time features, while a weekly email recommender does not.
Addressing data quality here prevents cascading problems in model design and evaluation.
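As an illustration of the “number of transactions in the last hour” aggregation mentioned above, here is a minimal Python sketch. The function name `txn_count_in_window`, the half-open window convention, and the sample timestamps are assumptions made for the example, not part of any particular feature platform.

```python
from bisect import bisect_right
from datetime import datetime, timedelta


def txn_count_in_window(sorted_times, as_of, window=timedelta(hours=1)):
    """Count transactions with a timestamp in (as_of - window, as_of].

    sorted_times must be sorted ascending. Only events at or before
    as_of are counted, so the feature never looks into the future.
    """
    lo = bisect_right(sorted_times, as_of - window)  # first event after the window start
    hi = bisect_right(sorted_times, as_of)           # one past the last event <= as_of
    return hi - lo


# Hypothetical transaction timestamps for a single card.
history = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 40),
    datetime(2024, 1, 1, 9, 55),
    datetime(2024, 1, 1, 10, 30),
]

# Feature value as of 10:00 counts the 9:40 and 9:55 transactions.
count = txn_count_in_window(history, as_of=datetime(2024, 1, 1, 10, 0))
```

Computing the feature relative to an explicit `as_of` time is one way to keep training and serving consistent: the same function can be replayed over historical events for training and called on live events at serving time.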
Step 3: Model design
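With a clear problem and a data strategy in hand, you can now reason about architectures. Interviewers do not want a laundry list of model names. They want trade-off reasoning: why a given architecture fits the problem, the available data, and the latency budget.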
Use a YouTube-style recommendation pipeline as a grounding example. In a common design, candidate generation uses a two-tower model that embeds users and videos in the same vector space, enabling approximate nearest-neighbor search to retrieve likely relevant videos. The ranking stage can then use a deeper model with richer features to score the smaller set returned by candidate generation. If latency is tight or the available training data is mostly structured/tabular, a gradient-boosted tree can be a practical ranking baseline.
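The candidate-generation idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the embeddings are hand-written stand-ins for the outputs of the user and video towers, the item names are hypothetical, and the search is a brute-force scan rather than an approximate nearest-neighbor index.

```python
import heapq


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_candidates(user_vec, item_vecs, k):
    """Return the ids of the k items whose embeddings score highest against user_vec.

    Brute-force scan over all items; a production system would typically
    replace this with an approximate nearest-neighbor index.
    """
    return heapq.nlargest(
        k, item_vecs, key=lambda item_id: dot(user_vec, item_vecs[item_id])
    )


# Hypothetical embeddings standing in for the outputs of the two towers.
user_vec = [0.9, 0.1]
item_vecs = {
    "cooking_101": [1.0, 0.0],
    "knife_skills": [0.8, 0.2],
    "car_review": [0.0, 1.0],
    "baking_bread": [0.7, 0.1],
}

candidates = retrieve_candidates(user_vec, item_vecs, k=2)
```

Because users and videos live in the same vector space, retrieval reduces to a similarity lookup, which keeps the first stage cheap enough to scan a large catalog before the heavier ranking model runs on the shortlist.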
Attention: Jumping straight to a transformer or deep learning architecture without justifying why it outperforms a simpler baseline is a red flag for interviewers. Always propose a baseline first, then explain what additional complexity buys you.
The table below summarizes all six steps, what to cover in each, why each occupies its position, and what interviewers are listening for.
ML System Design Interview Framework
Step | What to Cover | Why This Position | Interviewer Signal |
Problem Clarification | Scope, constraints, business metrics | Must precede all design decisions | Structured thinking and ambiguity tolerance |
Data Strategy | Sources, labels, feature engineering | Models depend on available data | Practical data intuition |
Model Design | Architecture, trade-offs, baselines | Requires data understanding first | Trade-off reasoning, not just naming architectures |
Evaluation | Offline metrics (AUC, NDCG), online metrics (A/B tests, business KPIs) | Validates model before deployment | Alignment between offline metrics and business objectives |
Serving and Deployment | Latency, throughput, scalability, feature stores | Bridges model to production | Systems thinking |
Monitoring and Iteration | Concept drift, data drift, alerting, retraining | Ensures long-term reliability | Production maturity mindset |
With the first three steps covered, the next section walks through how you validate, deploy, and maintain the system you have designed.
Steps 4–6: From validation to production
The second part of the framework connects the proposed model design to a production system that can serve users reliably at scale. Candidates often lose points in this part because they rush through evaluation, treat deployment as an afterthought, or omit monitoring.
Step 4: Evaluation
Offline evaluation
Offline metrics measure model quality on held-out data before anything reaches production. The choice of metric depends on the task. Classification problems like fraud detection use precision, recall, and AUC, while ranking problems typically use NDCG.
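For a concrete sense of the classification metrics, the following Python sketch computes precision and recall from binary labels and predictions. The label vectors are made up for illustration, and the zero-denominator convention (returning 0.0) is an assumption of this sketch.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels where 1 means fraud."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # fraud correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # legitimate but flagged
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # fraud that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical held-out labels and thresholded model predictions.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]

precision, recall = precision_recall(y_true, y_pred)
```

In a fraud setting the two metrics pull in different directions: raising the decision threshold usually improves precision (fewer legitimate payments blocked) at the cost of recall (more fraud slipping through), which is why the trade-off should be tied back to the business objective set in Step 1.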
Online evaluation
Offline metrics alone are insufficient. A model that maximizes click-through rate in offline tests may hurt long-term user retention, which is the metric the business actually cares about. Online evaluation through A/B tests and interleaving experiments measures real-world impact.
Note: Interviewers specifically probe whether you understand the gap between offline metrics and business objectives. It is not enough to say, “We evaluate the model with offline AUC and validate it with an online A/B test.” Explain how you would monitor for metric divergence, such as higher offline AUC paired with lower retention, engagement, or long-term satisfaction in the online test. Then describe how you would investigate the gap, check segment-level impacts, review labels and objectives, and roll back or adjust the model if the product metrics regress.
This alignment question is one of the most reliable signals interviewers use to separate mid-level from senior candidates.
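One way to make the online comparison concrete is a two-proportion z-test on a binary product metric such as retention. The sketch below is a simplified illustration: the user counts are invented, and a real experiment would also need a pre-registered sample size, guardrail metrics, and segment-level checks.

```python
import math


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z, two_sided_p_value) under H0: both arms share one rate."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value


# Hypothetical experiment: retained users out of users exposed in each arm.
z, p_value = two_proportion_z_test(
    successes_a=1000, n_a=10_000,  # control: current production model
    successes_b=1100, n_b=10_000,  # treatment: candidate model
)
```

A negative `z` with a small p-value on retention, paired with a higher offline AUC, is exactly the kind of divergence described above and would be a reason to investigate labels and objectives before rolling out further.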
Step 5: Serving and deployment
A model that cannot meet production constraints is a model that never ships. This step covers the infrastructure that bridges training to serving.
Serving pattern: Real-time inference suits latency-sensitive applications like Uber’s ETA prediction, where sub-100 ms response times are non-negotiable. Batch inference works for offline recommendations or periodic risk scoring.
Scalability: Horizontal scaling distributes load across replicas. Model distillation, a technique where a smaller “student” model is trained to approximate the predictions of a larger “teacher” model, is common when a large ranking model must serve millions of queries per second; it reduces inference cost and latency while retaining most of the accuracy.
Feature stores: A centralized feature store ensures that the features computed during training match those computed during serving, eliminating a common source of training-serving skew.
Canary deployments: Rolling out a new model to a small percentage of traffic before full deployment catches production issues early without exposing all users to risk.
Interviewers are listening for systems thinking here, not just ML knowledge.
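The canary pattern can be sketched as deterministic, hash-based traffic splitting. In this Python illustration, the function name `route_model`, the 100-bucket scheme, and the user id format are assumptions; production systems often delegate this to a load balancer or an experimentation platform.

```python
import hashlib


def route_model(user_id, canary_percent):
    """Return 'canary' for a stable slice of roughly canary_percent% of users."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # deterministic bucket in [0, 100)
    return "canary" if bucket < canary_percent else "stable"


# Hypothetical rollout: send about 5% of users to the new model.
assignments = [route_model(f"user-{i}", canary_percent=5) for i in range(10_000)]
canary_share = assignments.count("canary") / len(assignments)
```

Hashing the user id rather than sampling per request keeps each user on the same model version across requests, which makes canary metrics easier to interpret and avoids users flipping between models mid-session.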
Step 6: Monitoring and iteration
Inadequate monitoring is a primary failure mode in production ML systems. Models degrade silently as user behavior shifts, data distributions change, or upstream pipelines break.
Cover concept drift (the relationship between features and labels changes over time), data drift (input feature distributions shift), automated alerting on key metrics, retraining pipelines triggered by drift detection, and shadow scoring where a new model runs alongside the production model without serving results.
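Data drift on a single numeric feature can be quantified with the population stability index (PSI), one of several possible drift statistics. The sketch below is illustrative: the bin edges, the sample values, and the `eps` floor for empty bins are assumptions, and any alerting threshold would need to be tuned per feature.

```python
import math


def population_stability_index(expected, actual, edges, eps=1e-6):
    """PSI between two samples of one numeric feature, using fixed sorted bin edges."""

    def bin_fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(1 for e in edges if v >= e)  # index of the bin containing v
            counts[idx] += 1
        total = len(values)
        # Floor each fraction at eps so empty bins do not produce log(0).
        return [max(c / total, eps) for c in counts]

    exp_frac = bin_fractions(expected)
    act_frac = bin_fractions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_frac, act_frac))


# Hypothetical transaction amounts: training window vs. a shifted live window.
training_amounts = [10, 20, 30, 40, 50, 60, 70, 80]
live_amounts = [60, 70, 80, 90, 100, 110, 120, 130]
edges = [25, 50, 75]

psi_same = population_stability_index(training_amounts, training_amounts, edges)
psi_shifted = population_stability_index(training_amounts, live_amounts, edges)
```

PSI is zero when the two binned distributions match and grows as they diverge, so a scheduled job could compute it per feature and raise an alert or trigger retraining when it crosses a chosen threshold.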
This step closes the loop. The dashed arrow in the pipeline diagram points from monitoring back to data strategy because monitoring insights, such as discovering that a new fraud pattern has emerged, feed directly into the next iteration of data collection and feature engineering.
Practical tip: Mentioning shadow scoring and automated retraining pipelines signals production maturity. Many candidates stop at “we retrain periodically,” which is too vague to impress.
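Shadow scoring can be sketched as a thin wrapper around the serving call. In this Python illustration, `score_with_shadow`, the callable model interface, and the in-memory log are all hypothetical; a real system would more likely score the shadow model asynchronously and write to a logging pipeline.

```python
def score_with_shadow(features, prod_model, shadow_model, shadow_log):
    """Serve the production score; record the shadow score without ever returning it."""
    prod_score = prod_model(features)
    try:
        shadow_score = shadow_model(features)
        shadow_log.append(
            {"features": features, "prod": prod_score, "shadow": shadow_score}
        )
    except Exception as exc:
        # A shadow failure must never affect the live response.
        shadow_log.append(
            {"features": features, "prod": prod_score, "shadow_error": repr(exc)}
        )
    return prod_score


# Hypothetical models represented as plain callables.
log = []
served = score_with_shadow(
    {"amount": 120.0},
    prod_model=lambda f: 0.2,
    shadow_model=lambda f: 0.9,
    shadow_log=log,
)
```

Comparing the logged production and shadow scores on the same live traffic gives an early read on how a candidate model would behave, with no user ever seeing its output.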
The following quiz tests whether you can place a real design decision in the correct framework step:
Lesson Quiz
In an Airbnb search ranking system, you discover that using clicks as positive labels introduces significant noise because many users click on listings but never book. Which framework step should address this issue with label quality?
Problem clarification and scope
Data strategy
Model design
Evaluation
When to deviate from the canonical order
The six steps are a default sequence, not a mandate. Experienced candidates sometimes reorder or merge steps, and doing so deliberately can demonstrate stronger judgment than following the framework rigidly.
Consider this prompt: “Design a real-time ad-click prediction system that serves 1 million QPS.” In this case, serving constraints drive the design. Start by scoping the problem briefly, then cover serving and deployment constraints before selecting the model architecture. Latency, throughput, and cost constraints at that scale eliminate entire classes of models. For example, a transformer ensemble with 500 ms inference latency can be ruled out on serving grounds alone, before you evaluate anything else about the architecture.
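A rough capacity estimate shows why latency dominates at this scale. The sketch below applies Little's law (requests in flight = arrival rate × time in system); the per-replica concurrency of 8 and the 10 ms comparison point are hypothetical numbers chosen only to illustrate the arithmetic.

```python
import math


def replicas_needed(qps, latency_ms, concurrent_per_replica):
    """Estimate replica count from Little's law, ignoring headroom and burstiness."""
    in_flight = qps * latency_ms / 1000  # average number of requests in flight
    return math.ceil(in_flight / concurrent_per_replica)


# Hypothetical comparison at 1 million QPS with 8 concurrent requests per replica.
slow_model = replicas_needed(1_000_000, latency_ms=500, concurrent_per_replica=8)
fast_model = replicas_needed(1_000_000, latency_ms=10, concurrent_per_replica=8)
```

Under these assumptions the 500 ms model needs 50 times as many replicas as the 10 ms model, which is the kind of back-of-the-envelope argument that justifies discussing serving before architecture.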
A different scenario arises with generative AI prompts such as “design a retrieval-augmented generation system.” Evaluation for generative outputs is unusually complex because there is no single ground-truth answer. Discussing evaluation criteria early, before choosing an architecture, helps align you and the interviewer on what “good” means.
The key principle in both cases is communication. Always signal your deviation explicitly. Saying something like “I’d like to discuss serving constraints first because they’ll heavily constrain our model choices. Does that work for you?” demonstrates the judgment and communication maturity that interviewers at Staff+ levels are specifically evaluating.
Most candidates never deviate. The ones who do it well stand out.
Bringing the framework together
The 6-step framework is a communication tool as much as a thinking tool. It gives the interviewer a mental map of where you are in your answer and where you are headed. Problem clarification converts ambiguity into a precise scope. Data strategy identifies the signals and labels your model will consume. Model design selects an architecture justified by trade-offs. Evaluation validates performance offline and online against business objectives. Serving and deployment bridges the model to production under real-world constraints. Monitoring and iteration ensures the system stays healthy long after launch.
Mastery comes not from memorizing these steps but from understanding why each depends on the ones before it and what trade-offs live within each. In the next lesson, you will layer concrete time management on top of this framework, learning how to budget minutes across the six steps within a 45-minute interview window so that no single step consumes disproportionate time.