
Feature Pipelines: Batch vs. Real-Time vs. Streaming

Explore the three paradigms of feature pipelines in machine learning system design: batch, streaming, and request-time real-time computation. Understand their trade-offs in data freshness, latency, cost, and operational complexity. Learn key architectures like Lambda and Kappa that combine these pipelines, and build a decision framework to choose the right approach for features based on business impact and system requirements.

When a fraud detection model for a payments platform needs the number of transactions on a card in the last five minutes, a feature computed 12 hours earlier is too stale to use. But when a streaming service recommends the next series to watch, user preference profiles computed overnight may be sufficient. The difference between these scenarios points to a key ML system design decision: whether features should be computed online, offline, or through a hybrid serving path. In an ML system design interview, defaulting to “real-time everything” can show weak cost awareness, while defaulting to “batch everything” can miss latency-sensitive requirements. A strong answer explains what the choice depends on: freshness requirements, latency budget, cost, reliability, and feature availability.

This lesson walks through the three paradigms for feature computation (batch, streaming, and real-time) as a spectrum that trades freshness against cost and complexity. You will learn the mechanics of each, see how Lambda and Kappa architectures combine them in production, and build a decision framework you can apply in any system design discussion.

Batch pipelines for throughput

Batch feature pipelines run as scheduled jobs, typically hourly or daily, that process large volumes of historical data to compute features in bulk. The dominant tools are Apache Spark and Hadoop MapReduce, which distribute computation across clusters to scan terabytes efficiently.

The throughput advantage is significant. A single Spark job can compute aggregate features like 30-day purchase counts, average session durations, or user-level embedding vectors across hundreds of millions of records. Because these jobs operate on bounded datasets (finite, well-defined collections of data with clear start and end points, as opposed to unbounded, continuously arriving event streams), they are straightforward to debug, backfill, and unit test.
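
To make the idea of a batch aggregate over a bounded dataset concrete, here is a minimal sketch in plain Python that computes a 30-day purchase count per user. The record layout, the `purchase_counts_30d` function name, and the sample data are assumptions made for illustration; a production job would typically express the same aggregation in a distributed engine such as Spark.

```python
from collections import Counter
from datetime import date, timedelta

# Hypothetical bounded input: (user_id, purchase_date) records.
PURCHASES = [
    ("u1", date(2024, 5, 5)),
    ("u1", date(2024, 5, 20)),
    ("u1", date(2024, 3, 1)),   # outside the 30-day window for the run below
    ("u2", date(2024, 5, 25)),
]

def purchase_counts_30d(purchases, as_of, window_days=30):
    """Count purchases per user in the window (as_of - window_days, as_of]."""
    window_start = as_of - timedelta(days=window_days)
    counts = Counter()
    for user_id, purchased_on in purchases:
        # Keep only records that fall inside the window ending at as_of.
        if window_start < purchased_on <= as_of:
            counts[user_id] += 1
    return dict(counts)

# One "batch run" with an explicit cutoff date.
features = purchase_counts_30d(PURCHASES, as_of=date(2024, 5, 31))
print(features)
```

Because the input is bounded and the cutoff date is an explicit argument, rerunning the function with an earlier `as_of` is one simple way to backfill, and the same function can be unit tested against a small fixed dataset.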

The freshness limitation is equally clear. Features are only as current as the last completed batch run. A daily pipeline means features can be up to 24 hours stale. For use cases where the decision surface changes slowly, such as content recommendation or credit scoring, this staleness is acceptable and even desirable from a cost perspective.
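
The staleness bound can be checked with simple arithmetic. The sketch below compares a pipeline's worst-case feature age with a freshness requirement. The function names are hypothetical, and the optional `job_runtime` term is an added assumption: with a runtime of zero, the bound reduces to the schedule interval, matching the 24-hour figure for a daily pipeline.

```python
from datetime import timedelta

def worst_case_staleness(schedule_interval, job_runtime=timedelta(0)):
    """Upper bound on feature age just before the next run's results land.

    Assumption: each run reads data up to its start time and publishes
    results job_runtime later, so the oldest served feature reflects data
    that is schedule_interval + job_runtime old.
    """
    return schedule_interval + job_runtime

def meets_freshness(schedule_interval, max_staleness, job_runtime=timedelta(0)):
    """Return True if the batch schedule can satisfy the freshness requirement."""
    return worst_case_staleness(schedule_interval, job_runtime) <= max_staleness

daily = timedelta(hours=24)

# Fraud-style requirement: features must be at most five minutes old.
fraud_ok = meets_freshness(daily, max_staleness=timedelta(minutes=5))

# Recommendation-style requirement: two-day-old features are acceptable.
recs_ok = meets_freshness(daily, max_staleness=timedelta(days=2))

print(fraud_ok, recs_ok)
```

Under these assumptions, a daily schedule fails the five-minute requirement and passes the two-day one, which is the same distinction the fraud and recommendation examples draw.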

Practical tip: Batch compute on spot instances or scheduled clusters costs significantly less per record than always-on streaming infrastructure. In interviews, explicitly stating “batch is sufficient here because freshness requirements are relaxed” demonstrates cost awareness that interviewers value.

Operational simplicity rounds out the case for batch. When a feature computation bug is discovered, ...