Introduction

Let's study distributed data processing systems.

This chapter will examine distributed systems used to process large amounts of data that would be impossible or very inefficient to process using only a single machine.

Categories of distributed data processing systems

Distributed data processing systems can be classified into the following two main categories:

Batch processing systems

Batch processing systems group individual data items into groups called batches, which are processed one at a time. In many cases, these groups can be quite large (e.g., all items for a day), so the main goal for these systems is usually to provide high throughput, but sometimes at the cost of higher latency.

Stream processing systems

Stream processing systems receive and process data continuously as a stream of data items. As a result, the main goal for these systems is to provide very low latency sometimes at the cost of decreased throughput.

Get hands-on with 1200+ tech skills courses.