We just shipped a major update to Grokking Modern System Design Interview — including new lessons on event-driven architectures, stream processing patterns, and designing low-latency data pipelines. If you’re preparing for system design interviews or building real-time infrastructure at work, this is the most comprehensive resource available.
Speaking of real-time systems: in this week’s newsletter, we break down streaming intelligence — the architecture pattern that combines continuous data processing with on-the-fly ML inference. You’ll learn why batch pipelines fall short for time-sensitive decisions, how to design a streaming intelligence pipeline from ingestion to action, and the techniques top engineering teams use to maintain sub-second latency without sacrificing model accuracy.
Many intelligent applications fail to deliver instant responsiveness because of outdated data architectures. Systems built on traditional batch processing collect data first and analyze it later, introducing delays of hours between an event occurring and the system acting on it.
Consider the example of fraud detection in the financial services industry. A system relying on batch processing might collect transaction data throughout the day and run a model overnight. If a fraudulent transaction occurs at 9 a.m., it might not be flagged until 3 a.m. the next day. Within those 18 hours, a malicious actor could execute hundreds more fraudulent transactions, resulting in preventable financial losses.
Streaming intelligence closes this gap by combining real-time data processing with immediate AI inference, enabling applications to respond to events as they occur.
This newsletter explores the architecture and design principles for building responsive, intelligent systems. It covers several key topics:
The core components of a streaming intelligence pipeline.
Techniques for low-latency event processing and online inference.
Strategies for optimizing sub-second decision-making.
Methods to ensure consistency and avoid model performance degradation.
The following diagram illustrates the fundamental difference in response time between these two approaches. Batch and stream users represent the same users, with actions processed either in batches later or in real-time.
Understanding this architectural shift is the first step toward building truly responsive applications. The next sections explore the high-level System Design that enables this.
A responsive streaming intelligence system is built on an architecture designed for continuous data flow and immediate decision-making. This architecture integrates four key stages: event ingestion, stream processing, online inference, and a decision service. Each component is optimized for low latency and high throughput to ensure insights are generated and acted upon in real time.
This architecture can be illustrated with an e-commerce personalization use case.
Event ingestion: Every user action, such as a product click, an “add to cart” event, or a search query, is captured as an event. These events are published to a high-throughput, distributed messaging system like Apache Kafka, which durably stores them and makes them available to downstream consumers.
Stream processing: A stream processing engine, such as Apache Flink, subscribes to these event streams and transforms raw events into model-ready features in real time, for example by aggregating a user’s recent clicks and searches into a session profile.
Online inference: Once the stream processor generates features, it triggers an online inference service. This service hosts a trained machine learning model (e.g., for product recommendations) and exposes it via a low-latency API, often using gRPC or REST. The service takes the real-time features as input and returns a prediction, such as a ranked list of recommended products.
Decision service: The prediction from the inference service is sent to a decision service. This final layer applies business logic to the model’s output, such as filtering out out-of-stock items or applying a promotional discount, before serving the final, personalized content back to the user’s application via an API.
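The four stages above can be sketched end to end in a few lines. This is a minimal illustration, not a production implementation: the message queue is a plain list, the model is a stand-in, and all names (`Event`, `infer`, `decide`, the product IDs) are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Event:
    user_id: str
    action: str        # e.g., "click", "add_to_cart"
    product_id: str

def ingest(event: Event, queue: list) -> None:
    """Event ingestion: publish the event to a message queue (here, a list)."""
    queue.append(event)

def extract_features(event: Event) -> dict:
    """Stream processing: turn a raw event into model-ready features."""
    return {"user_id": event.user_id,
            "last_action": event.action,
            "last_product": event.product_id}

def infer(features: dict) -> list:
    """Online inference: a stand-in model returning ranked product IDs."""
    return [features["last_product"], "p-42", "p-7"]

def decide(ranked: list, out_of_stock: set) -> list:
    """Decision service: apply business rules (filter out-of-stock items)."""
    return [p for p in ranked if p not in out_of_stock]

# Wire the stages together for a single event.
queue = []
ingest(Event("u1", "click", "p-99"), queue)
features = extract_features(queue.pop(0))
recommendations = decide(infer(features), out_of_stock={"p-7"})
print(recommendations)  # ['p-99', 'p-42']
```

In a real deployment, each function would be a separate service communicating over Kafka topics and RPC calls, but the data flow is the same.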
The interplay between these components enables the platform to react instantly. A user’s click becomes an event that flows through the stream processor and inference layer, resulting in an updated user experience within milliseconds.
This architectural flow diagram visualizes how these components work together to deliver rapid recommendations.
Next, we will examine the technologies that power the ingestion and processing layers of this architecture.
Building a reliable streaming intelligence pipeline starts with scalable event ingestion and low-latency stream processing. Modern systems rely on tools such as Apache Kafka and Apache Flink, as well as managed cloud services such as Amazon Kinesis, to handle these stages.
Note: The choice between a self-managed platform like Kafka and a managed service like Kinesis often depends on your team’s operational expertise and the level of control you require over the infrastructure.
Once events are ingested, the real-time processing begins. Stream processors like Flink use specific techniques to analyze data in motion. One important concept is the processing window, which groups an unbounded stream of events into finite intervals so that aggregations and pattern matching can be computed over them.
Here are a few common processing techniques.
Time-bound aggregation: Calculating metrics over specific time intervals. For example, aggregating the total transaction value per customer over a tumbling (non-overlapping) one-hour window.
Sliding window pattern detection: Identifying sequences of events. A classic example is detecting a pattern of “add to cart,” “view checkout,” and “abandon cart” within a 10-minute sliding window to trigger a follow-up email.
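The tumbling-window aggregation described above can be sketched as follows. Events carry an epoch-second timestamp; each window covers `window_s` seconds, windows do not overlap, and every event falls into exactly one window. The event tuples and amounts are illustrative.

```python
from collections import defaultdict

def tumbling_window_sum(events, window_s=3600):
    """Aggregate total transaction value per (customer, window) bucket."""
    totals = defaultdict(float)
    for customer, ts, amount in events:
        window_start = (ts // window_s) * window_s  # align to window boundary
        totals[(customer, window_start)] += amount
    return dict(totals)

events = [
    ("alice", 1000, 20.0),   # falls in the window starting at t=0
    ("alice", 2500, 30.0),   # same window
    ("alice", 4000, 10.0),   # next window (starts at t=3600)
    ("bob",   1200, 99.0),
]
print(tumbling_window_sum(events))
# {('alice', 0): 50.0, ('alice', 3600): 10.0, ('bob', 0): 99.0}
```

A real stream processor computes the same thing incrementally and emits each window’s result when the window closes, rather than scanning a finished list.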
Effective stream processing also depends on event ordering and state management. Maintaining the correct sequence of events is critical for many algorithms. The ability to reliably manage state, such as the current transaction count, enables the system to build a contextual understanding of events over time.
The following illustration shows how these processing windows function within a typical pipeline.
Once events are processed into features, the next step is to use them for decision-making via online inference.
Online inference involves running a machine learning model on incoming events in real time, allowing the system to make predictions immediately. Instead of waiting for a batch job, model inference is triggered by events as they are processed. This allows the application to respond with model-driven decisions in milliseconds, creating a more interactive user experience.
The most common architecture for online inference involves deploying trained models as microservices. These services expose a high-performance API endpoint that the stream processor can call. When the stream processor generates features from an event, it sends them to the inference service, which returns a prediction.
This pattern is versatile and can be applied to many use cases.
Fraud scoring: In fraud detection, before a payment transaction is authorized, its features (amount, location, user history) are sent to a fraud model. The model returns a risk score in real time, allowing the system to block the transaction if the score is too high.
Instant recommendations: When a user clicks on a product, the stream processor can immediately call a recommendation model. The model uses the click event and the user’s historical data to generate a fresh set of personalized recommendations, which are displayed on the user’s next page view.
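The inference-as-a-microservice pattern behind both use cases can be sketched with the fraud-scoring example. In production, the model would sit behind a gRPC or REST endpoint; here the service boundary is a plain function, and the scoring rule is a hypothetical stand-in for a trained model (the threshold and feature names are illustrative).

```python
BLOCK_THRESHOLD = 0.8

def fraud_inference_service(features: dict) -> float:
    """Stand-in model: risk rises with amount and distance from home."""
    score = 0.0
    if features["amount"] > 1000:
        score += 0.5
    if features["km_from_home"] > 500:
        score += 0.4
    return min(score, 1.0)

def authorize(transaction: dict) -> str:
    """Stream-processor side: build features, call the service, act on it."""
    features = {"amount": transaction["amount"],
                "km_from_home": transaction["km_from_home"]}
    risk = fraud_inference_service(features)
    return "BLOCK" if risk >= BLOCK_THRESHOLD else "ALLOW"

print(authorize({"amount": 2500, "km_from_home": 800}))  # BLOCK
print(authorize({"amount": 40, "km_from_home": 3}))      # ALLOW
```

The key property is that `authorize` returns synchronously within the transaction path, so the decision happens before the payment completes, not in a nightly batch.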
Practical tip: For extremely low latency, consider using optimized model formats like ONNX, which let you run models on hardware-accelerated inference runtimes and reduce per-request overhead.
Building a scalable and robust serving infrastructure is critical for online inference. The inference services must be highly available and able to handle a high volume of requests without compromising latency. This often involves deploying multiple instances of the service behind a load balancer and using performance monitoring tools to ensure response times stay within their latency SLOs.
The sequence diagram below illustrates this real-time decision-making flow for fraud detection. It shows how a user transaction is captured, published to a messaging system, processed to generate features, sent to an online fraud detection model, and then routed based on the model’s output. High-risk alerts are triggered and reporting dashboards are updated, all in milliseconds.
While this architecture is effective, achieving sub-second latency in mission-critical systems requires careful optimization.
For critical applications like payment authorization or real-time bidding, low latency is essential. Achieving sub-second decision latency requires a System Design approach that optimizes every component in the data path. This involves making strategic architectural choices to minimize network overhead and processing delays, as well as writing efficient code.
One of the most effective strategies is co-location, which involves placing the inference service physically close to the data source or the stream processing cluster. This reduces network latency, a significant bottleneck in distributed systems. For example, deploying the fraud detection model in the same data center as the payment gateway can reduce response time by critical milliseconds.
Caching is another useful technique. Frequently accessed data, such as user feature vectors or model parameters, can be stored in a low-latency in-memory cache, such as Redis. This avoids costly lookups to slower, persistent storage during inference. For example, a personalization system might cache a user’s profile so that it can be retrieved instantly when a new event arrives.
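The caching idea can be sketched with a minimal in-memory cache with a time-to-live, playing the role Redis plays in the text. All names are hypothetical, and `load_profile_from_db` stands in for the slow persistent-store lookup the cache avoids.

```python
import time

class FeatureCache:
    """Tiny TTL cache: entries expire `ttl_s` seconds after being written."""

    def __init__(self, ttl_s: float):
        self.ttl_s = ttl_s
        self._store = {}   # key -> (value, expiry_timestamp)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None or entry[1] < time.monotonic():
            return None    # miss, or entry has expired
        return entry[0]

    def put(self, key, value):
        self._store[key] = (value, time.monotonic() + self.ttl_s)

def load_profile_from_db(user_id: str) -> dict:
    """Stand-in for a slow lookup against persistent storage."""
    return {"user_id": user_id, "segment": "frequent_buyer"}

cache = FeatureCache(ttl_s=60.0)

def get_profile(user_id: str) -> dict:
    profile = cache.get(user_id)
    if profile is None:                      # cache miss: pay the DB cost once
        profile = load_profile_from_db(user_id)
        cache.put(user_id, profile)
    return profile

get_profile("u1")          # first call: miss, loads from "DB"
print(get_profile("u1"))   # second call: served from memory
```

The TTL matters: it bounds how stale a cached feature vector can be, which is the trade-off you accept in exchange for avoiding the lookup on every event.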
However, these optimizations often involve trade-offs.
Micro-batching: Instead of processing one event at a time, the system can group a small number of events into a micro-batch. This can improve throughput but introduces a small amount of latency. The key is to find the right batch size that balances throughput and latency for your specific use case.
Feature lookup vs. pre-computation: Some features may require lookups to external databases. To reduce latency, these features can be pre-computed and joined with the event stream earlier in the pipeline, but this can increase the complexity of the stream processing logic.
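The micro-batching trade-off above can be made concrete with a short sketch: group up to `batch_size` events before each model call, trading a little per-event latency for fewer, cheaper calls. `score_batch` is a hypothetical stand-in for one vectorized model invocation.

```python
def score_batch(events: list) -> list:
    """Stand-in for one batched model call (amortizes per-call overhead)."""
    return [len(e) * 0.1 for e in events]   # dummy per-event scores

def micro_batches(stream, batch_size: int):
    """Yield consecutive micro-batches of at most `batch_size` events."""
    batch = []
    for event in stream:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:                # flush the final partial batch
        yield batch

stream = ["e1", "e2", "e3", "e4", "e5"]
results = [score_batch(b) for b in micro_batches(stream, batch_size=2)]
print(results)   # three model calls instead of five
```

A production version would also flush on a timer, so a half-full batch never waits indefinitely; that timeout is exactly where the latency cost appears.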
Finally, consistent monitoring and maintenance of latency SLOs are essential. Use distributed tracing and performance monitoring tools to identify bottlenecks and ensure that the system consistently meets its performance requirements.
The following table compares several common optimization strategies.
| Technique | Latency benefit | Complexity | Best use case |
| --- | --- | --- | --- |
| Co-location | High | Medium | Geographically sensitive apps |
| Caching | High | Medium | Frequently accessed data |
| Micro-batching | Low (slightly increases latency) | Low | Throughput-sensitive applications |
| Pre-aggregation | Medium | High | Complex feature calculations |
Optimizing for speed is crucial. It is also important to ensure that the data used for real-time decisions is consistent with the data used for offline analysis and model training.
A significant challenge in operating streaming AI systems is maintaining consistency between features generated in the real-time pipeline and those used in the offline environment for model training. Discrepancies between these two data flows can lead to training-serving skew, where the model encounters feature values in production that differ from those it was trained on, silently degrading its accuracy.
This problem often arises because the data processing logic for the real-time path is developed separately from the logic for the batch path. Even small differences in implementation can cause feature values to diverge over time. To address this, a best practice is to establish a unified feature pipeline.
One technique for ensuring alignment is to use the immutable event log from a streaming platform (such as Kafka) as the source of truth for both pipelines. The real-time pipeline consumes from the log as events arrive. The batch pipeline can periodically replay the same log to rebuild the feature store for offline analytics and model retraining. This ensures that both systems operate on the exact same source data, minimizing the risk of inconsistency.
For example, a feature store can be rebuilt daily by rerunning the Flink jobs over the last 24 hours of Kafka logs. This ensures that offline features align perfectly with what the online system would have generated.
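The replay pattern can be sketched as follows: the same feature logic consumes an immutable event log once as events arrive (the online path) and once during a periodic rebuild (the offline path), so both produce identical features. The log here is a plain list standing in for a Kafka topic, and the names are illustrative.

```python
def build_features(log):
    """Shared feature logic: transaction count per user over the log."""
    counts = {}
    for user_id, _amount in log:
        counts[user_id] = counts.get(user_id, 0) + 1
    return counts

# The immutable event log is the single source of truth for both paths.
event_log = [("alice", 20.0), ("bob", 5.0), ("alice", 7.5)]

online_features = build_features(event_log)    # consumed as events arrived
offline_features = build_features(event_log)   # nightly replay of the same log

print(online_features == offline_features)  # True: both paths agree exactly
```

The guarantee comes from two things working together: the log is append-only and replayable, and both paths run the same `build_features` logic, so there is no second implementation to drift.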
This schematic shows how event logs can be used to synchronize both environments.
Maintaining data consistency is a key requirement. Next, we will cover how to prevent the closely related problem of training-serving skew.
Training-serving skew occurs when the feature logic used during model training differs from the logic used in live inference, causing the model to perform poorly in production. To prevent this, it’s important to use the same transformation and feature definitions for both batch training and real-time serving. Modern stream processing frameworks allow a single job to compute features on historical data for training and on live events for inference, ensuring consistency between training and serving.
Attention: Even with unified logic, it’s important to have robust monitoring in place to detect data drift. Periodically compare the statistical distributions (mean, variance, etc.) of features generated in production with those from the training set to catch any unexpected divergence.
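The drift check suggested in the note above can be sketched as a simple statistical comparison: flag a feature whose production mean or standard deviation has moved beyond a relative tolerance of its training-time values. The tolerance and sample data are illustrative; real systems often use distribution-level tests instead.

```python
import statistics

def drifted(train_values, prod_values, rel_tol=0.25):
    """Flag drift if mean or stdev shifted more than rel_tol vs. training."""
    t_mean, p_mean = statistics.mean(train_values), statistics.mean(prod_values)
    t_sd, p_sd = statistics.pstdev(train_values), statistics.pstdev(prod_values)
    mean_shift = abs(p_mean - t_mean) / (abs(t_mean) or 1.0)
    sd_shift = abs(p_sd - t_sd) / (t_sd or 1.0)
    return mean_shift > rel_tol or sd_shift > rel_tol

train = [10, 12, 11, 9, 10, 11]          # feature values at training time
stable_prod = [11, 10, 12, 10, 9, 11]    # similar distribution in production
shifted_prod = [25, 27, 24, 26, 28, 25]  # distribution clearly moved

print(drifted(train, stable_prod))   # False
print(drifted(train, shifted_prod))  # True: trigger an alert / retraining
```

Running a check like this on a schedule turns silent feature divergence into an actionable alert before model accuracy visibly degrades.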
By creating a single source of truth for feature definitions, you eliminate a major source of production model failures and ensure your intelligence systems perform as expected. The visual below illustrates how shared feature logic can be integrated into both workloads.
While these architectures provide significant benefits, they also introduce new operational challenges that teams must be prepared to handle.
Operationalizing streaming AI systems introduces architectural complexities and trade-offs compared to traditional batch systems. Production deployments require careful consideration of state management, event handling, and processing guarantees.
One of the primary challenges is managing state in a distributed environment. Stream processors need to maintain a fault-tolerant state to perform calculations like windowed aggregations. Another common issue is handling late, missing, or out-of-order events, which can corrupt the state and lead to incorrect calculations if not handled correctly.
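The late-event problem above is typically handled with watermarks. The sketch below is a minimal illustration, not any specific framework's API: a window only rejects events once the watermark (the maximum event time seen, minus an allowed lateness) has passed the window's end; events arriving after that are dropped rather than corrupting the closed window's state.

```python
ALLOWED_LATENESS = 5  # seconds an event may lag behind the stream's frontier

def process(events, window_end):
    """Return (accepted, dropped_late) for a window ending at `window_end`."""
    accepted, dropped = [], []
    watermark = 0
    for ts, payload in events:
        watermark = max(watermark, ts - ALLOWED_LATENESS)
        if ts <= window_end and watermark > window_end:
            dropped.append(payload)   # window already closed: event is too late
        elif ts <= window_end:
            accepted.append(payload)  # in-window and on time
        # events with ts > window_end belong to later windows (ignored here)
    return accepted, dropped

# "c" arrives out of order but in time; "e" arrives after "d" advanced the
# watermark past the window, so it is dropped as late.
events = [(1, "a"), (3, "b"), (2, "c"), (20, "d"), (4, "e")]
accepted, dropped = process(events, window_end=10)
print(accepted, dropped)  # ['a', 'b', 'c'] ['e']
```

Production engines like Flink additionally offer side outputs for late events, so they can be logged or reconciled instead of silently discarded.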
Engineers must also make trade-offs in processing guarantees. For example, choosing between at-least-once and exactly-once semantics: exactly-once processing prevents duplicate or missed updates but adds coordination overhead and latency, while at-least-once is cheaper and faster but requires downstream consumers to tolerate duplicates.
What is the primary function of the “co-location” optimization strategy in a critical inference flow?
To ensure data consistency between offline and online stores
To reduce the cost of storage by compressing data
To increase the throughput by processing events in large batches
To reduce network latency by placing services physically close together
Scaling these systems can reveal operational issues like event lag, where the processing pipeline cannot keep up with the rate of incoming data, or slow recovery times after a failure. Addressing these challenges requires a deep understanding of the underlying tools and a pragmatic approach to System Design, choosing the right trade-offs for your business context.
The shift from batch processing to streaming intelligence represents a change in how data-driven applications are built. By merging real-time data pipelines with low-latency AI inference, it is possible to create responsive systems capable of making intelligent decisions at the right time.
Building these systems requires a methodical approach to architecture and performance optimization, along with a clear understanding of operational trade-offs. The primary goal is to build systems that are fast, reliable, and consistent. The principles and patterns discussed here provide a foundation for designing the next generation of intelligent applications. If you want to go deeper into the patterns behind merging real-time data and AI inference, explore our expert-led courses.