
Stream Processing

Learn about stream processing and its role in handling unbounded data flows in large-scale systems. Understand how stream processing reduces delays in data analysis compared to batch methods, and explore real-world technologies like Apache Kafka and Apache Flink. Gain insight into processing data instantly or in small batches based on business needs.

In the word counting example for the MapReduce algorithm, we noticed that the entire text is stored somewhere, and the mapper machines load and process chunks of the data. The data is more or less organized in batches: the processing system loads the batches, does the job, and eventually produces some form of output data.

Now, the input is somewhat bounded. We assumed that we had all the text of English literature. All we had to do was run the MapReduce algorithm on top of the data and gather the results.

Now the important question.

What if the data is unbounded?

How to handle unbounded data

In a real-life system where we need data processing, data is almost always unbounded. Let’s quickly discuss an example.

Assume the engineers at Instagram decide to analyze user behavior on Instagram videos. They want to track how users use the app and interact with content. The standard practice here is to trigger events from the app to the backend system based on user actions. Examples of such events could be a CLICK_EVENT, PAUSE_EVENT, or RESUME_EVENT.

Instagram is a widespread product. Users are continuously using Instagram and watching videos, which means the number of events coming to the Instagram backend will be unbounded. How would Instagram engineers process the events in this case?

One way of doing this would be to run batch processing on top of the data at a certain interval, say every day. Every day, the batch processing pipeline is triggered at some point: the event data from the past 24 hours, held in some storage, is processed by the pipeline to generate insights.
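The daily batch job described above can be sketched as follows. This is a minimal illustration, assuming events are plain dictionaries already collected in storage; the function names and event shapes are hypothetical, not Instagram's actual pipeline.

```python
from collections import Counter

# Hypothetical event records accumulated in storage over the past 24 hours.
stored_events = [
    {"user": "alice", "type": "CLICK_EVENT"},
    {"user": "bob", "type": "PAUSE_EVENT"},
    {"user": "alice", "type": "RESUME_EVENT"},
    {"user": "carol", "type": "CLICK_EVENT"},
]

def run_daily_batch(events):
    """Process a full day's worth of events in one pass and
    return a simple insight: counts per event type."""
    return Counter(e["type"] for e in events)

insights = run_daily_batch(stored_events)
print(insights)
```

The key property to notice is that `run_daily_batch` only runs after all the day's events are collected, which is exactly where the delay discussed next comes from.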

Batch processing example for the Instagram use case

Based on the system needs, the batch job could run hourly or daily. But one significant downside is the delay, and for some business use cases, such a delay is not acceptable.

Enter stream processing.

What is stream processing?

A stream is a series of data incrementally flowing through the system.

In the example from the previous section, the continuous flow of video events to the Instagram system is a stream.

Note: The idea of stream processing is to process an event or an item from a stream of data as soon as the event appears in the stream. After processing, the stream processor will potentially publish the processed data into a new stream. Consumers interested in the processed data can now directly consume the new stream and act accordingly.

If you think about it, the idea is very intuitive. We wanted to get rid of the delay in batch processing. So, the obvious thing to do here is to process the data as soon as it is available.
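The contrast with batch processing can be made concrete with a short sketch. Here a Python generator stands in for an unbounded event stream (an assumption for illustration; a real stream never ends), and each event is handled the moment it appears rather than after a day of accumulation.

```python
def event_stream():
    """Stand-in for an unbounded stream; here we just yield a few events.
    In a real system, this source would never be exhausted."""
    for event in ["CLICK_EVENT", "PAUSE_EVENT", "RESUME_EVENT"]:
        yield event

processed = []  # stand-in for the new stream the processor publishes to

def process(event):
    """Handle one event as soon as it arrives, with no batching delay."""
    result = f"processed:{event}"
    processed.append(result)
    return result

for event in event_stream():
    process(event)
```

Each item is processed and republished immediately, which is the essence of the stream processing idea from the note above.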

How soon should we process the data?

An important part of the stream processing system is to figure out how soon we would want to process data. This depends on business use cases and system capabilities. Let’s briefly discuss a few possibilities.

  • Process data instantaneously: In this case, when an event appears in the stream, it is processed immediately without any delay.

  • Process data within seconds: If the requirement is more lenient, we might consider processing each event after a few seconds of delay.

  • Process data in small batches: Sometimes, a small batch of data is processed at a time. For example, we might process data in batches covering 1 or 5 minutes of events.

Generally, the first two options are popularly known as real-time processing. The last one is near-real-time processing.
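The third option, near-real-time micro-batching, can be sketched by grouping timestamped events into fixed windows. This is a minimal illustration, assuming events arrive as `(timestamp_in_seconds, payload)` pairs; the function name and window size are hypothetical.

```python
def micro_batches(events, window_seconds=60):
    """Group timestamped events into fixed windows (e.g. 1-minute batches),
    so each small batch can be processed shortly after its window closes."""
    batches = {}
    for ts, payload in events:
        window = ts // window_seconds  # index of the window this event falls in
        batches.setdefault(window, []).append(payload)
    return batches

events = [(5, "CLICK"), (42, "PAUSE"), (75, "RESUME"), (130, "CLICK")]
print(micro_batches(events))
# {0: ['CLICK', 'PAUSE'], 1: ['RESUME'], 2: ['CLICK']}
```

Shrinking `window_seconds` moves the system toward real-time processing; growing it moves back toward classic batch processing.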

Stream processing mechanism

Let’s briefly discuss how stream processing is done in real life. The idea is pretty similar to what we have seen with batch processing.

Data items are continuously put into a queue that a stream processing framework listens to. One example of such a queue is Apache Kafka, one of the most widely used technologies in the industry. An example of a stream processing framework is Apache Flink.

As data is put on the queue, the framework continuously reads data from the queue and runs some actions. As soon as there is a result, the result is then put into a new stream for other systems to consume.
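The read-process-republish loop described above can be sketched with in-memory queues. These stand in for Kafka topics purely for illustration (an assumption; a real deployment would use a Kafka client and a framework like Flink, whose APIs are not shown here), and a sentinel value ends the otherwise endless loop for the demo.

```python
import queue

# In-memory queues as stand-ins for the input and output streams.
input_stream = queue.Queue()
output_stream = queue.Queue()

for e in ["CLICK_EVENT", "PAUSE_EVENT"]:
    input_stream.put(e)
input_stream.put(None)  # sentinel marking the end of this demo

def run_processor():
    """Continuously read events from the input stream, run some action,
    and publish the result onto a new stream for downstream consumers."""
    while True:
        event = input_stream.get()
        if event is None:  # demo-only: real streams have no end
            break
        output_stream.put(f"enriched:{event}")

run_processor()
```

Downstream systems would then consume `output_stream` and act on the results, exactly as the diagram below shows.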

Stream processing example

Note that an algorithm like MapReduce can also be used in stream processing. It’s just that in the case of stream processing, the framework will probably choose a very small time window (like half a minute or five minutes) and run the algorithm on top of it.
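Here is a minimal sketch of that idea: a map and reduce step applied to one small buffered window of events. The window contents and function names are hypothetical, chosen to mirror the word-counting example from earlier.

```python
from collections import Counter
from functools import reduce

def map_phase(event):
    """Map step: emit a (event_type, 1) pair for each event."""
    return (event, 1)

def reduce_phase(acc, pair):
    """Reduce step: sum the counts per event type."""
    key, count = pair
    acc[key] += count
    return acc

# One small time window of buffered events (say, 30 seconds' worth).
window = ["CLICK_EVENT", "CLICK_EVENT", "PAUSE_EVENT"]
counts = reduce(reduce_phase, map(map_phase, window), Counter())
print(counts)
```

The framework would rerun this computation for every window, emitting one result per window into the output stream.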

Key takeaways

  • Stream processing enables us to take action on data as soon as the data is available.
  • Based on business use cases, we can decide how soon we want to take those actions.