Serving Architectures

Explore the mechanics and trade-offs of synchronous, asynchronous, and streaming serving architectures for ML systems. Understand how to choose the right pattern based on latency, throughput, freshness, and cost constraints. Learn to minimize training-serving skew and design scalable serving solutions that align with business needs and user expectations.

You are designing a fraud detection system for a payments company. Each transaction needs a risk score before the system decides whether to approve, block, or review it. Should the system compute that score synchronously at request time, precompute it in a batch job, or update it through a streaming pipeline that processes events continuously? This architectural decision affects how quickly the system can detect suspicious activity, how much infrastructure the system needs, and whether predictions use fresh enough features to reflect current fraud patterns. In an ML system design interview, the serving architecture is one of the most important design choices because it shapes user-facing latency, infrastructure cost, and data freshness at the same time.

Three canonical paradigms exist for serving ML predictions: synchronous (real-time) inference, asynchronous (batch) inference, and streaming (near-real-time) inference. Each paradigm carries distinct trade-offs, and production systems at MAANG scale rarely rely on just one. A shared risk cuts across all three patterns: training-serving skew, the divergence between the data distribution or feature computation logic used during model training and what the model encounters during production inference, which degrades prediction quality. The choice of serving pattern directly affects how severe this skew can become, making it a first-order concern in any design discussion.
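One common source of training-serving skew is feature logic that is implemented twice, once offline and once online. The following is a minimal sketch of that failure mode and one way to avoid it, not a description of any particular production system. All names here (amount_feature, build_training_row, build_serving_row, build_serving_row_skewed, max_feature_gap) and the amount_cents field are hypothetical and chosen only for illustration.

```python
import math


def amount_feature(amount_cents):
    """Shared feature logic (assumed): log-scaled transaction amount in dollars."""
    return math.log1p(amount_cents / 100.0)


def build_training_row(record):
    """Offline path: builds the feature row used to train the model."""
    return [amount_feature(record["amount_cents"])]


def build_serving_row(request):
    """Online path that reuses the training logic, so the features match."""
    return [amount_feature(request["amount_cents"])]


def build_serving_row_skewed(request):
    """Online path with a deliberate bug: it forgets the cents-to-dollars
    conversion, so its output diverges from what the model saw in training."""
    return [math.log1p(request["amount_cents"])]


def max_feature_gap(row_a, row_b):
    """Largest absolute difference between two feature rows of equal length."""
    return max(abs(a - b) for a, b in zip(row_a, row_b))


event = {"amount_cents": 2500}
gap_shared = max_feature_gap(build_training_row(event), build_serving_row(event))
gap_skewed = max_feature_gap(build_training_row(event), build_serving_row_skewed(event))
print(gap_shared, gap_skewed)
```

Routing both paths through one function is only one mitigation. A comparison like max_feature_gap, run on logged serving features against recomputed offline features, is a simple way to detect drift between the two paths.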

This lesson covers the mechanics and trade-offs of each paradigm and provides a structured decision framework. The next lesson on Retrieval and Ranking Pipelines will build directly on these patterns to show how multi-stage candidate generation and ranking funnels operate at scale.

Synchronous real-time inference

Synchronous inference follows a straightforward request-response cycle. The client sends a request, the model server computes a prediction, and the response returns within the same HTTP round-trip, typically targeting a p99 latency below roughly 50 to 100 ms.
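The request-response cycle and the p99 target can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real model server. The scoring function is a hand-written logistic formula standing in for a trained model, and the names score_transaction, handle_request, p99, budget_ms, amount, and is_new_device are all hypothetical.

```python
import math
import time


def score_transaction(features):
    """Hypothetical stand-in for a model: a fixed logistic risk score."""
    z = 0.002 * features["amount"] + 1.5 * features["is_new_device"] - 3.0
    return 1.0 / (1.0 + math.exp(-z))


def handle_request(features, budget_ms=100.0):
    """Compute the prediction inside the request and time it against a budget."""
    start = time.perf_counter()
    risk = score_transaction(features)  # the caller blocks until this returns
    latency_ms = (time.perf_counter() - start) * 1000.0
    return {
        "risk": risk,
        "latency_ms": latency_ms,
        "within_budget": latency_ms <= budget_ms,
    }


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    ordered = sorted(latencies_ms)
    rank = -(-99 * len(ordered) // 100)  # integer ceil(0.99 * n)
    return ordered[rank - 1]


# Serve a batch of identical requests one at a time and summarize tail latency.
responses = [handle_request({"amount": 1000, "is_new_device": 1}) for _ in range(200)]
print(responses[0]["risk"], p99([r["latency_ms"] for r in responses]))
```

Tracking p99 rather than the mean matters here because the caller waits on every request, so the slowest one percent of responses determines whether the latency target is actually met.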

Where synchronous serving is non-negotiable

Certain applications cannot tolerate stale predictions. Ad ranking at Google or Meta requires bid decisions before the page renders. Search ranking at Airbnb must return results instantly as the user types. Fraud scoring at Stripe must approve or decline a transaction before the ...