The 6-Step ML SD Framework

Explore the comprehensive 6-step framework for machine learning system design interviews that guides you from problem scoping to monitoring. Understand how to clarify ambiguous prompts, develop data strategies, design models with justified trade-offs, evaluate performance offline and online, deploy systems according to production constraints, and maintain them through continuous monitoring and iteration.

We'll cover the following...

Steps 1–3: From ambiguity to architecture
Steps 4–6: From validation to production
When to deviate from the canonical order
Bringing the framework together

An interviewer asks, “Design a fraud detection system for a payments platform.” You have 45 minutes to turn that prompt into a coherent design. Without a plan, many candidates fall into a common failure mode: they jump between model architectures, data pipelines, and evaluation metrics without a clear sequence, making the design hard to evaluate. The fix is not necessarily more ML knowledge; it is a structured approach.

The 6-step ML System Design framework gives you that structure. It is a sequential, interviewer-tested scaffold that ensures completeness, logical flow, and clear communication across any prompt you encounter. The six steps are as follows:

1. Problem clarification and scope
2. Data strategy
3. Model design
4. Evaluation (offline and online)
5. Serving and deployment
6. Monitoring and iteration

Each step builds on the one before it, and together they form a repeatable mental model you can apply whether the prompt is about recommendation systems, ad-click prediction, or retrieval-augmented generation.

This is not a rigid script. It is a thinking tool that signals senior-level design fluency to interviewers at MAANG companies. The rest of this lesson unpacks each step, explains why it sits where it does, and shows you when to break the rules.

The following diagram illustrates how the six steps connect as a pipeline, with a feedback loop from monitoring back to data strategy that reflects the iterative nature of production ML systems.

Steps 1–3: From ambiguity to architecture

The first part of the framework turns a vague system-design prompt into a concrete ML model design. Each step narrows the solution space before you make implementation decisions, so later decisions are based on explicit requirements and constraints.

Step 1: Problem clarification and scope

Every ML system design prompt is intentionally ambiguous. “Design a recommendation system” could mean home page feed ranking, email digest personalization, or related-item suggestions. Your first job is to convert that ambiguity into a precise problem statement.

Ask about the target user, the business objective, scale expectations, and latency constraints. If the interviewer says “recommend videos,” clarify whether the goal is to maximize watch time, increase content diversity, or reduce churn. These distinctions change everything downstream, from the loss function to the serving architecture.

Practical tip: Spend 3–5 minutes on scoping. Interviewers are specifically listening for whether you narrow the problem before jumping to solutions. Candidates who skip this step often end up designing the wrong system entirely.

Step 2: Data strategy

Models are only as good as the data that feeds them. This step belongs immediately after scoping because every model decision, from architecture to training procedure, depends on what signals you can actually obtain.

Cover these areas in your answer:

Data sources: Identify available logs, user profiles, contextual signals, and any third-party data. For the fraud-detection example, transaction logs, device fingerprints, and merchant history are all potential sources.
Labeling approach: Distinguish between explicit feedback (user ratings, reported fraud) and implicit feedback (behavioral signals such as clicks, dwell time, or purchase completions that indirectly indicate user preference without the user explicitly providing a rating). Implicit labels are abundant but noisy; explicit labels are clean but sparse.
Feature engineering: Decide which raw signals become model features. Aggregations like “number of transactions in the last hour” or embeddings of categorical fields, such as merchant category, are typical choices.
Data freshness: Determine whether features need real-time computation or whether daily batch updates suffice. A fraud system likely needs near-real-time features, while a weekly email recommender does not.

Addressing data quality here prevents cascading problems in model design and evaluation.
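As a concrete illustration of the "number of transactions in the last hour" aggregation, here is a minimal sketch in Python. The function name, the `(card_id, unix_ts)` event shape, and the sample events are assumptions made for this example, not part of any particular feature pipeline. It assumes events arrive in time order and counts only earlier transactions, so the feature never includes the transaction being scored.

```python
from collections import deque
from typing import Deque, Dict, List, Tuple

WINDOW_SECONDS = 3600  # the "last hour" window from the text


def txn_counts_last_hour(events: List[Tuple[str, int]]) -> List[int]:
    """For each (card_id, unix_ts) event, given in time order, return how many
    earlier transactions the same card made within the preceding hour."""
    recent: Dict[str, Deque[int]] = {}
    counts: List[int] = []
    for card_id, ts in events:
        window = recent.setdefault(card_id, deque())
        # Evict timestamps that have fallen out of the one-hour window.
        while window and window[0] <= ts - WINDOW_SECONDS:
            window.popleft()
        counts.append(len(window))  # excludes the current transaction
        window.append(ts)
    return counts


# Hypothetical sample events: (card_id, seconds since some start time)
events = [("card_a", 0), ("card_a", 600), ("card_b", 900),
          ("card_a", 3000), ("card_a", 4000)]
print(txn_counts_last_hour(events))  # [0, 1, 0, 2, 2]
```

The same logic applies whether the feature is computed in a streaming job (near-real-time freshness) or replayed over historical logs to build training data; using one definition for both is what keeps the two consistent.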

Step 3: Model design

With a clear problem and a data strategy in hand, you can now reason about architectures. Interviewers do not want a laundry list of model names. They want trade-off reasoning: the practice of explicitly comparing competing design choices along dimensions such as latency, accuracy, data requirements, and engineering complexity to justify a decision.

Use a YouTube-style recommendation pipeline as a grounding example. In a common design, candidate generation uses a two-tower model that embeds users and videos in the same vector space, enabling approximate nearest-neighbor search to retrieve likely relevant videos. The ranking stage can then use a deeper model with richer features to score the smaller set returned by candidate generation. If latency is tight or the available training data is mostly structured/tabular, a gradient-boosted tree can be a practical ranking baseline.
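The retrieval half of that design can be sketched in a few lines. The example below assumes the two towers have already produced embeddings; the vectors, video IDs, and function names are hypothetical. It uses exact brute-force dot-product search for clarity, whereas a production system would typically swap in an approximate nearest-neighbor index to stay within latency budgets.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot-product similarity between two embeddings of equal length."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(user_vec: Vector, video_vecs: Dict[str, Vector],
                   k: int) -> List[Tuple[str, float]]:
    """Score every video against the user embedding and keep the k best.
    Exact search over all items; shown only to illustrate the idea."""
    scored = [(vid, dot(user_vec, vec)) for vid, vec in video_vecs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical tower outputs (in practice these come from trained networks).
user_embedding = [1.0, 0.0, 0.5]
video_embeddings = {
    "cooking_101": [0.9, 0.1, 0.4],
    "car_review": [0.0, 1.0, 0.0],
    "baking_tips": [0.7, 0.0, 0.6],
    "news_recap": [0.1, 0.2, 0.1],
}

candidates = retrieve_top_k(user_embedding, video_embeddings, k=2)
print([vid for vid, _ in candidates])  # ['cooking_101', 'baking_tips']
```

The ranking stage would then rescore only these candidates with a richer model, which is why the retrieval step can afford to be cheap and approximate.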

Attention: Jumping straight to a transformer or deep learning architecture without justifying why it outperforms a simpler baseline is a red flag for interviewers. Always propose a baseline first, then explain what additional complexity buys you.

The table below summarizes all six steps, what to cover in each, why each occupies its position, and what interviewers are listening for.

ML System Design Interview Framework

Step	What to Cover	Why This Position	Interviewer Signal
Problem Clarification	Scope, constraints, business metrics	Must precede all design decisions	Structured thinking and ambiguity tolerance
Data Strategy	Sources, labels, feature engineering	Models depend on available data	Practical data intuition
Model Design	Architecture, trade-offs, baselines	Requires data understanding first	Trade-off reasoning, not just naming architectures
Evaluation	Offline metrics (AUC, NDCG), online metrics (A/B tests, business KPIs)	Validates model before deployment	Alignment between offline metrics and business objectives
Serving and Deployment	Latency, throughput, scalability, feature stores	Bridges model to production	Systems thinking
Monitoring and Iteration	Concept drift, data drift, alerting, retraining	Ensures long-term reliability	Production maturity mindset

With the first three steps covered, the next section walks through how you validate, deploy, and maintain the system you have designed.

Steps 4–6: From validation to production

The second part of the framework connects the proposed model design to a production system that can serve users reliably at scale. Candidates often lose points in this part because they rush through evaluation, treat deployment as an afterthought, or omit monitoring.

Step 4: Evaluation

Offline evaluation

Offline metrics measure model quality on held-out data before anything reaches production. The choice of metric depends on the task. Classification problems like fraud detection use precision, recall, and AUC (area under the ROC curve), a metric that measures a classifier's ability to distinguish between positive and negative classes across all decision thresholds, where 1.0 is perfect and 0.5 is random. Ranking problems like search or recommendations use NDCG (normalized discounted cumulative gain), a ranking metric that evaluates how well a system places relevant items near the top of a ranked list, accounting for the position of each relevant result.
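To make both metrics concrete, the sketch below computes them from scratch using only the standard library. The function names and sample inputs are illustrative. AUC is computed as the fraction of positive-negative pairs ranked correctly (ties count as half), and NDCG here uses linear gain; some definitions use an exponential gain of 2^rel - 1 instead, so treat this as one common variant rather than the only one.

```python
import math
from typing import List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """Probability that a random positive outscores a random negative.
    Pairwise O(P*N) version, fine for a demonstration."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def dcg(relevances: List[float], k: int) -> float:
    """Discounted cumulative gain over the top k positions (linear gain)."""
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: List[float], k: int) -> float:
    """DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Fraud-style example: 3 of the 4 positive/negative pairs are ordered correctly.
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.4, 0.1]))  # 0.75

# Ranking example: relevance grades in the order the system returned them.
print(ndcg_at_k([3, 2, 1, 0], k=4))  # 1.0 (already ideally ordered)
print(ndcg_at_k([0, 1, 2, 3], k=4) < 1.0)  # True (worst ordering)
```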

Online evaluation

Offline metrics alone are insufficient. A model that maximizes click-through rate in offline tests may hurt long-term user retention, which is often the metric the business actually cares about. Online evaluation through A/B tests and interleaving experiments measures real-world impact.
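One simple way to read an A/B test on a business metric is a two-proportion z-test. The sketch below is a minimal version using only the standard library; the function name and the retention counts are made up for illustration, and a real experiment would also need a pre-registered sample size and guardrail metrics.

```python
import math
from typing import Tuple


def two_proportion_z_test(success_a: int, n_a: int,
                          success_b: int, n_b: int) -> Tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in rates B minus A."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF via the error function.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value


# Hypothetical counts: users retained after 7 days in control (A)
# versus the new model (B), 20,000 users per arm.
z, p = two_proportion_z_test(success_a=1000, n_a=20_000,
                             success_b=1100, n_b=20_000)
print(f"z = {z:.2f}, p = {p:.3f}")
```

Running the test on retention rather than click-through rate is the point: the statistic is the same, but the metric being compared is the one tied to the business objective.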

Note: Interviewers specifically probe whether you understand the gap between offline metrics and business objectives. It is not enough to say, “We evaluate the model with offline AUC and validate it with an online A/B test.” Explain how you would monitor for metric divergence, such as higher offline AUC paired with lower retention, engagement, or long-term satisfaction in the online test. Then describe how you would investigate the gap, check segment-level impacts, review labels and objectives, and roll back or adjust the model if the product metrics regress.

This alignment question is one of the most reliable signals interviewers use to separate mid-level from senior candidates.

Step 5: Serving and deployment

A model that cannot meet production constraints is a model that never ships. This step covers the infrastructure that bridges training to serving.

Serving pattern: Real-time inference suits latency-sensitive applications like Uber’s ETA prediction, where sub-100 ms response times are non-negotiable. Batch inference works for offline recommendations or periodic risk scoring.
Scalability: Horizontal scaling distributes load across replicas. Model distillation, a technique where a smaller "student" model is trained to approximate the predictions of a larger "teacher" model, reduces inference cost and latency while retaining most of the accuracy. It is common when a large ranking model must serve millions of queries per second.
Feature stores: A centralized feature store helps ensure that the features computed during training match those computed during serving, removing a common source of training-serving skew.
Canary deployments: Rolling out a new model to a small percentage of traffic before full deployment catches production issues early without exposing all users to risk.

Interviewers are listening for systems thinking here, not just ML knowledge.
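The canary rollout described above is often implemented as deterministic, hash-based traffic splitting. The sketch below shows one way to do that; the function name, the salt string, and the bucket granularity are assumptions for illustration. Hashing the user ID keeps each user on the same model across requests, and changing the salt reshuffles assignments for a new rollout.

```python
import hashlib


def route_to_canary(user_id: str, canary_percent: float,
                    salt: str = "model-v2-rollout") -> bool:
    """Deterministically decide whether a user is served by the canary model."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) % 10_000   # bucket in [0, 9999]
    return bucket < canary_percent * 100    # e.g. 5.0% -> buckets 0..499


# Hypothetical user IDs; the observed share should land close to 5%.
users = [f"user_{i}" for i in range(20_000)]
share = sum(route_to_canary(u, canary_percent=5.0) for u in users) / len(users)
print(f"canary share: {share:.2%}")
```

Ramping the rollout is then a matter of raising `canary_percent`: users already in the canary stay there because their bucket does not change.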

Step 6: Monitoring and iteration

Inadequate monitoring is a primary failure mode in production ML systems. Models degrade silently as user behavior shifts, data distributions change, or upstream pipelines break.

Cover concept drift (the relationship between features and labels changes over time), data drift (input feature distributions shift), automated alerting on key metrics, retraining pipelines triggered by drift detection, and shadow scoring where a new model runs alongside the production model without serving results.
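Data drift on a single numeric feature can be quantified with the Population Stability Index (PSI), one of several common choices. The sketch below is a minimal standard-library version; the function name, bin edges, and sample values are assumptions for illustration. Thresholds such as 0.1 and 0.25 are widely quoted rules of thumb for "some" and "significant" shift, not universal constants.

```python
import math
from typing import List, Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        edges: Sequence[float], eps: float = 1e-6) -> float:
    """Population Stability Index between a reference sample and a live
    sample, using shared, sorted bin edges. Larger means more drift."""
    def bin_shares(values: Sequence[float]) -> List[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            idx = sum(v >= e for e in edges)  # number of edges at or below v
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e_shares, a_shares = bin_shares(expected), bin_shares(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(e_shares, a_shares))


# Hypothetical transaction amounts at training time versus in production.
train_amounts = [10, 20, 30, 40, 50, 60, 70, 80]
live_amounts = [60, 70, 80, 90, 95, 100, 55, 85]
edges = [25, 50, 75]

print(psi(train_amounts, train_amounts, edges))  # 0.0 (no drift)
print(psi(train_amounts, live_amounts, edges) > 0.25)  # True (large shift)
```

A check like this, run per feature on a schedule, is the kind of signal that can feed automated alerting or trigger a retraining pipeline.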

This step closes the loop. The dashed arrow in the pipeline diagram points from monitoring back to data strategy because monitoring insights, such as discovering that a new fraud pattern has emerged, feed directly into the next iteration of data collection and feature engineering.

Practical tip: Mentioning shadow scoring and automated retraining pipelines signals production maturity. Many candidates stop at “we retrain periodically,” which is too vague to impress.
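Shadow scoring can be sketched as a thin wrapper around two models. Everything below is hypothetical (the class name, the threshold, and the toy rule-based "models"); it only illustrates the pattern in which the production score is the sole one returned to the caller, while the shadow score is logged for later comparison. A real system would usually run the shadow model asynchronously so it adds no latency to the served request.

```python
from typing import Callable, Dict, List, Tuple

Features = Dict[str, float]
Model = Callable[[Features], float]


class ShadowScorer:
    """Serves production scores while logging a shadow model's scores."""

    def __init__(self, production: Model, shadow: Model,
                 threshold: float = 0.5) -> None:
        self.production = production
        self.shadow = shadow
        self.threshold = threshold
        self.log: List[Tuple[float, float]] = []  # (production, shadow)
        self.shadow_errors = 0

    def score(self, features: Features) -> float:
        prod_score = self.production(features)
        try:
            self.log.append((prod_score, self.shadow(features)))
        except Exception:
            # A failing shadow model must never affect the served response.
            self.shadow_errors += 1
        return prod_score  # only the production score is ever served

    def disagreement_rate(self) -> float:
        """Share of logged requests where the two models decide differently."""
        if not self.log:
            return 0.0
        flips = sum((p >= self.threshold) != (s >= self.threshold)
                    for p, s in self.log)
        return flips / len(self.log)


# Toy stand-ins for real models: flag large transactions as likely fraud.
scorer = ShadowScorer(
    production=lambda f: 0.9 if f["amount"] > 1000 else 0.1,
    shadow=lambda f: 0.9 if f["amount"] > 500 else 0.1,
)
served = [scorer.score({"amount": a}) for a in (100, 700, 1500, 300)]
print(served)                      # [0.1, 0.1, 0.9, 0.1]
print(scorer.disagreement_rate())  # 0.25
```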


When to deviate from the canonical order

The six steps are a default sequence, not a mandate. Experienced candidates sometimes reorder or merge steps, and doing so deliberately can demonstrate stronger judgment than following the framework rigidly.

Consider this prompt: “Design a real-time ad-click prediction system that serves 1 million QPS.” In this case, serving constraints drive the design. Start by scoping the problem briefly, then cover serving and deployment constraints before selecting the model architecture. Latency, throughput, and cost constraints at that scale eliminate entire classes of models. For example, a transformer ensemble with 500 ms inference latency can be ruled out on latency and cost alone, before you evaluate anything else about the architecture.
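A quick back-of-envelope calculation shows why. By Little's law, the average number of in-flight requests equals the arrival rate times the time each request spends in the system. The sketch below applies that to the 1 million QPS prompt; the per-replica concurrency of 8 and the 5 ms comparison model are hypothetical numbers chosen for illustration, and the estimate ignores batching, headroom, and network overhead.

```python
import math


def replicas_needed(qps: int, latency_ms: int,
                    concurrency_per_replica: int) -> int:
    """Little's law: in-flight requests = arrival rate * time in system.
    Divide by per-replica concurrency to estimate the replica count."""
    in_flight = qps * latency_ms / 1000
    return math.ceil(in_flight / concurrency_per_replica)


QPS = 1_000_000
CONCURRENCY = 8  # hypothetical: requests one replica handles concurrently

for name, latency_ms in [("500 ms transformer ensemble", 500),
                         ("5 ms lightweight model", 5)]:
    print(f"{name}: {replicas_needed(QPS, latency_ms, CONCURRENCY):,} replicas")
# 500 ms transformer ensemble: 62,500 replicas
# 5 ms lightweight model: 625 replicas
```

Under these assumptions the slower model needs 100 times the fleet, which is the kind of arithmetic that justifies discussing serving constraints before model choice.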

A different scenario arises with generative AI prompts such as “design a retrieval-augmented generation system.” Evaluation for generative outputs is unusually complex because there is no single ground-truth answer. Discussing evaluation criteria early, before choosing an architecture, helps align you and the interviewer on what “good” means.

The key principle in both cases is communication. Always signal your deviation explicitly. Saying something like “I’d like to discuss serving constraints first because they’ll heavily constrain our model choices. Does that work for you?” demonstrates the judgment and communication maturity that interviewers at Staff+ levels are specifically evaluating.

Most candidates never deviate. The ones who do it well stand out.

Bringing the framework together

The 6-step framework is a communication tool as much as a thinking tool. It gives the interviewer a mental map of where you are in your answer and where you are headed. Problem clarification converts ambiguity into a precise scope. Data strategy identifies the signals and labels your model will consume. Model design selects an architecture justified by trade-offs. Evaluation validates performance offline and online against business objectives. Serving and deployment bridges the model to production under real-world constraints. Monitoring and iteration ensures the system stays healthy long after launch.

Mastery comes not from memorizing these steps but from understanding why each depends on the ones before it and what trade-offs live within each. In the next lesson, you will layer concrete time management on top of this framework, learning how to budget minutes across the six steps within a 45-minute interview window so that no single step consumes disproportionate time.
