Serving Architectures

Explore the mechanics and trade-offs of synchronous, asynchronous, and streaming serving architectures for ML systems. Understand how to choose the right pattern based on latency, throughput, freshness, and cost constraints. Learn to minimize training-serving skew and design scalable serving solutions that align with business needs and user expectations.

We'll cover the following...

Synchronous real-time inference
- Where synchronous serving is non-negotiable
- Infrastructure and cost implications
Asynchronous batch inference
Streaming near-real-time inference
- Architecture and data flow
- Real-world applications
- Trade-offs and operational challenges
Decision guide for choosing a paradigm
Conclusion

You are designing a fraud detection system for a payments company. Each transaction needs a risk score before the system decides whether to approve, block, or review it. Should the system compute that score synchronously at request time, precompute it in a batch job, or update it through a streaming pipeline that processes events continuously? This architectural decision affects how quickly the system can detect suspicious activity, how much infrastructure the system needs, and whether predictions use fresh enough features to reflect current fraud patterns. In an ML system design interview, the serving architecture is one of the most important design choices because it shapes user-facing latency, infrastructure cost, and data freshness at the same time.

Three canonical paradigms exist for serving ML predictions: synchronous (real-time) inference, asynchronous (batch) inference, and streaming (near-real-time) inference. Each paradigm carries distinct trade-offs, and production systems at MAANG scale rarely rely on just one. A shared risk cuts across all three patterns: training-serving skew, the divergence between the data distribution or feature computation logic used during model training and what the model encounters during production inference, which degrades prediction quality. The choice of serving pattern directly affects how severe training-serving skew can become, making it a first-order concern in any design discussion.
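To make training-serving skew concrete, here is a minimal sketch of one common source of it: feature logic that differs between the offline and online paths. The feature names (`amount`, `txn_count_24h`), the transformations, and the helper functions are hypothetical and chosen only for illustration; they are not taken from any particular production system.

```python
import math
from typing import Dict, List

def compute_features(txn: Dict[str, float]) -> List[float]:
    """Hypothetical feature definition shared by training and serving."""
    return [
        math.log1p(txn["amount"]),      # log-scaled transaction amount
        txn["txn_count_24h"] / 24.0,    # average transactions per hour
    ]

def serving_features_divergent(txn: Dict[str, float]) -> List[float]:
    """Hypothetical reimplementation that forgot the log transform."""
    return [txn["amount"], txn["txn_count_24h"] / 24.0]

def max_feature_skew(offline: List[List[float]], online: List[List[float]]) -> float:
    """Largest absolute gap between offline and online values for the same events."""
    return max(
        abs(a - b)
        for off_row, on_row in zip(offline, online)
        for a, b in zip(off_row, on_row)
    )

events = [
    {"amount": 120.0, "txn_count_24h": 3.0},
    {"amount": 15.5, "txn_count_24h": 12.0},
]

offline_rows = [compute_features(e) for e in events]

# Same code path on both sides: the feature logic itself adds no skew.
shared_skew = max_feature_skew(offline_rows, [compute_features(e) for e in events])

# Divergent serving code: the model now sees inputs it was never trained on.
divergent_skew = max_feature_skew(
    offline_rows, [serving_features_divergent(e) for e in events]
)

print(f"shared path skew: {shared_skew}")
print(f"divergent path skew: {divergent_skew:.2f}")
```

Sharing a single feature definition removes only the logic-related part of the skew. Differences in data distribution or feature freshness between training and serving would still need separate monitoring.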

This lesson covers the mechanics and trade-offs of each paradigm and provides a structured decision framework. The next lesson on Retrieval and Ranking Pipelines will build directly on these patterns to show how multi-stage candidate generation and ranking funnels operate at scale.

Synchronous real-time inference

Synchronous inference follows a straightforward request-response cycle. The client sends a request, the model server computes a prediction, and the response returns within the same HTTP round trip, typically with a p99 latency target somewhere under 50 to 100 ms.
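The request-response cycle described above can be sketched roughly as follows. This is an illustrative outline only: the `score` function is a hypothetical stand-in for a real model, the feature names and weights are invented, and the 100 ms budget is an assumed value taken from the upper end of the range mentioned above. A real deployment would sit behind an HTTP or RPC server rather than a plain function call.

```python
import math
import time
from typing import Dict, List, Tuple

P99_BUDGET_MS = 100.0  # assumed latency budget for illustration

def score(features: Dict[str, float]) -> float:
    """Hypothetical stand-in model: logistic function of a weighted sum."""
    z = 0.002 * features["amount"] + 0.5 * features["is_new_device"] - 2.0
    return 1.0 / (1.0 + math.exp(-z))

def handle_request(features: Dict[str, float]) -> Tuple[float, float]:
    """Compute the prediction inline and return it with the measured latency."""
    start = time.perf_counter()
    risk = score(features)  # the caller blocks until this returns
    latency_ms = (time.perf_counter() - start) * 1000.0
    return risk, latency_ms

def p99(latencies_ms: List[float]) -> float:
    """Nearest-rank 99th percentile of observed latencies."""
    ordered = sorted(latencies_ms)
    rank = max(1, (99 * len(ordered) + 99) // 100)  # integer ceil(0.99 * n)
    return ordered[rank - 1]

# Simulate a burst of requests and check the tail latency against the budget.
request = {"amount": 250.0, "is_new_device": 1.0}
results = [handle_request(request) for _ in range(1000)]
risk_score = results[0][0]
observed_p99 = p99([latency for _, latency in results])
within_budget = observed_p99 <= P99_BUDGET_MS

print(f"risk={risk_score:.3f} p99={observed_p99:.4f} ms within_budget={within_budget}")
```

Tracking the tail (p99) rather than the mean matters here because the slowest requests are the ones that users and downstream systems actually wait on.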

Where synchronous serving is non-negotiable

Certain applications cannot tolerate stale predictions. Ad ranking at Google or Meta requires bid decisions before the page renders. Search ranking at Airbnb must return results instantly as the user types. Fraud scoring at Stripe must approve or decline a transaction before the ...
