Batch Processing

Get to know the significance of data processing in distributed systems.

Modern systems are data driven more than ever. It is so obvious that we sometimes fail to notice.

Basically, every interaction you have with a system is driven by data. So, how do we process such large amounts of data?

Processing massive amounts of data

During the last decade, access to the internet has become a norm. Today, almost 60% of the world’s population has access to the internet. This is why we have witnessed a surge in online businesses all over the globe.

Many companies have started to see a larger numbers of users in their systems. And when a system has a massive user base, all these users collectively produce a massive amount of data.

But in the past, it was impossible to leverage all the data in a meaningful way. Think of user interaction on an app. A fairly large system with millions of active users every day can dig up very useful insight from user interaction. But due to the unprecedented volume of data, back then it wasn’t obvious how to best use it.

Eventually, technologies were developed that addressed this question. Now, systems can comfortably process terabytes of data using these technologies. On top of data, fields like data science and machine learning emerged stronger than ever.

In short, data is now the core of any system. And because the internet is accessible to many people, there are now many more large-scale systems in the world today as compared to a decade ago. This is why its important we discuss how to process massive volumes of data in the context of distributed systems.

With this brief story in mind, let’s discuss batch processing.

What is batch processing?

So far in this course, we have given examples of systems where a request is sent from a client and the request is received on a server. The server then executes the request, connects to some database to fetch data, and finally, sends back a response. The user usually expects to be served that response as quickly as possible. A system like this is popularly known as an online system.

On the other hand, we also have batch processing systems which can be defined as the following:

A batch processing system takes a large amount of input data, runs a job to process it, and produces some output data.

Designing Data-Intensive Applications, M Kleppmann

There are three important parts to this definition:

  • Takes a large amount of input data: This means that it does not make sense to run batch processing on top of a small amount of data that can be easily handled by request-response pattern.
  • Runs a job: This means it could be run periodically or based on some external trigger. In the context of batch processing, every time the system processes a bunch of data, we call that the system is running a job. This job runs for a few minutes to hours, or even for days.
  • Produces some output data: This means there is an outcome when the batch job finishes. Generally, the outcome is the generation of a set of output data.

One important aspect of batch processing is the execution of the job by using many processes, typically using a cluster of nodes.

An example of batch processing

Let’s solidify our understanding using an example. Assume we want to build a user behavior tracking system for Pinterest using batch processing.

Create a free account to view this lesson.

By signing up, you agree to Educative's Terms of Service and Privacy Policy