
Batch Processing

Explore the concept of batch processing in distributed systems to understand how large volumes of data are processed offline using jobs running on multiple nodes. This lesson helps you grasp why batch processing is essential for handling massive datasets and how it differs from online request-response systems, equipping you with foundational knowledge to design scalable data processing pipelines.

Modern systems are more data-driven than ever. It is so obvious that we sometimes fail to notice.

Practically every interaction you have with a system is driven by data. So, how do we process such large amounts of data?

Processing massive amounts of data

Over the last decade, internet access has become the norm. Today, almost 60% of the world’s population has access to the internet, which is why we have witnessed a surge in online businesses all over the globe.

Many companies have started to see larger numbers of users in their systems. And when a system has a massive user base, all these users collectively produce a massive amount of data.

But in the past, it was impossible to leverage all this data in a meaningful way. Think of user interactions on an app. A fairly large system with millions of daily active users can dig up very useful insights from those interactions. But due to the unprecedented volume of data, it wasn’t obvious back then how best to use it.

Eventually, technologies were developed that addressed this question. Now, systems can comfortably process terabytes of data using these technologies. On top of this data, fields like data science and machine learning have emerged stronger than ever.

In short, data is now the core of any system. And because the internet is accessible to so many people, there are many more large-scale systems in the world today than there were a decade ago. This is why it’s important that we discuss how to process massive volumes of data in the context of distributed systems.

With this brief story in mind, let’s discuss batch processing.

What is batch processing?

So far in this course, we have given examples of systems where a request is sent from a client and the request is received on a server. The server then executes the request, connects to some database to fetch data, and finally, sends back a response. The user usually expects to be served that response as quickly as possible. A system like this is popularly known as an online system.

On the other hand, we also have batch processing systems which can be defined as the following:

A batch processing system takes a large amount of input data, runs a job to process it, and produces some output data.

Designing Data-Intensive Applications, Martin Kleppmann

There are three important parts to this definition:

  • Takes a large amount of input data: This means that it does not make sense to run batch processing on top of a small amount of data that can be easily handled by the request-response pattern.
  • Runs a job: This means it could be run periodically or based on some external trigger. In the context of batch processing, every time the system processes a batch of data, we say the system is running a job. A job may run for a few minutes, hours, or even days.
  • Produces some output data: This means there is an outcome when the batch job finishes. Generally, the outcome is the generation of a set of output data.

One important aspect of batch processing is that the job is executed across many processes, typically on a cluster of nodes.
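To make the definition concrete, here is a minimal sketch of a batch job in Python: it takes a (notionally large) input, splits it into partitions, processes the partitions with a pool of worker processes, and merges the partial results into one output. The comma-separated event format is an assumption for illustration; a real job would run across a cluster of nodes rather than a single machine’s process pool.

```python
from collections import Counter
from multiprocessing import Pool

def process_partition(lines):
    """Worker: count events per type within one partition of the input.
    Assumes a hypothetical "event_type,payload" line format."""
    counts = Counter()
    for line in lines:
        event_type = line.split(",")[0]
        counts[event_type] += 1
    return counts

def run_batch_job(input_lines, workers=4):
    """One batch job: split the input, process partitions in parallel
    worker processes, and merge the partial outputs."""
    size = max(1, len(input_lines) // workers)
    partitions = [input_lines[i:i + size]
                  for i in range(0, len(input_lines), size)]
    with Pool(workers) as pool:
        partial_counts = pool.map(process_partition, partitions)
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total
```

The same split/process/merge shape is what frameworks like MapReduce or Spark apply at cluster scale, with the framework handling partitioning and fault tolerance for you.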

An example of batch processing

Let’s solidify our understanding using an example. Assume we want to build a user behavior tracking system for Pinterest using batch processing.

Batch processing example

In the diagram, we see an event-processing system for Pinterest. The flow is as follows:

  • Interactions of the clients are logged as events and sent to the servers.

  • Servers read the events, parse them using a predefined schema, and persist them in some storage—for example, an AWS S3 bucket.

  • There is a batch processing system that wakes up every hour or every day (whichever meets the business requirements), reads the events from the event storage, and processes the data using multiple processes and nodes.

  • Finally, the batch processing system generates output data that is persisted in some other database, such as MySQL or Cassandra.
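The steps above can be sketched as a single run of the periodic job. In this simplified sketch, the event store and output database are plain Python objects standing in for S3 and MySQL/Cassandra, and the `user_id`/`action` event fields are assumptions for illustration; in production, a scheduler (e.g., cron or Airflow) would trigger this run every hour or day.

```python
import json
from collections import Counter

def run_periodic_job(event_store, output_db):
    """One run of the batch job: read raw events, aggregate them,
    and persist the results.

    event_store: iterable of JSON strings (stand-in for objects in S3).
    output_db:   dict (stand-in for a MySQL or Cassandra table).
    """
    counts = Counter()
    for raw in event_store:                    # read events from storage
        event = json.loads(raw)                # parse with the known schema
        counts[(event["user_id"], event["action"])] += 1

    for (user_id, action), n in counts.items():
        # stand-in for an INSERT/UPSERT into the output database
        output_db[(user_id, action)] = n
    return output_db
```

At scale, the aggregation step would be distributed across many nodes, as in the earlier definition, but the read → process → write shape of the pipeline stays the same.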

The above is a high-level example of the role of batch processing in real-world systems. In the industry, it is a common practice to implement a batch processing pipeline to process a huge amount of data.

Key takeaways

  • In many cases, we do not need to implement an online request-response system. The offline approach works well in such cases.
  • Batch processing is a method where a batch of data is processed by a single node or a cluster of nodes (or processes).