Introduction to Big Data Processing Systems

Explore what we’ll be studying in the upcoming chapters regarding big data processing.


It is not an overstatement to say that data runs our world. From calculating accurate travel times in a maps application by taking dynamic traffic information into account, to personalized recommendations across services such as shopping and music playlists, data must be harnessed to extract the right information.

What we will learn

We have selected three big data processing papers to discuss in the following few chapters: MapReduce, Spark, and Kafka.

Why did we choose these systems?

There are hundreds of data processing engines out there, and choosing just a few was hard. We picked seminal papers that have stood the test of time.

The start of the big data processing era

MapReduce showed us how commodity servers can collectively process gigantic amounts of data. Another important aspect of MapReduce was its simple programming model for end programmers. Traditionally, it has been challenging to use parallel and distributed computing to speed up processing. MapReduce asks programmers to write just two functions (namely, Map and Reduce), and the system takes care of running them over the data, even under different kinds of failures. Google first demonstrated the success of MapReduce in its web crawling and indexing system. The Hadoop project (an open-source implementation of MapReduce and related technologies) ushered in an era where anyone could process large datasets economically and efficiently.
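To make the two-function programming model concrete, here is a minimal single-process sketch of a word-count job. The function names and the toy driver are illustrative only; a real MapReduce system shards the map and reduce calls across many machines and handles failures transparently.

```python
from collections import defaultdict

def map_fn(document):
    # Map phase: emit a (word, 1) pair for every word in the document.
    for word in document.split():
        yield (word, 1)

def reduce_fn(word, counts):
    # Reduce phase: fold all partial counts for one word into a total.
    return word, sum(counts)

def run_job(documents):
    # Shuffle phase (done by the framework): group intermediate
    # values by key so each reduce call sees all values for one key.
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            groups[key].append(value)
    return dict(reduce_fn(k, v) for k, v in groups.items())

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(run_job(docs))  # → {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```

The programmer writes only `map_fn` and `reduce_fn`; everything in `run_job` is the framework's responsibility, which is exactly why the model scales so well across clusters.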

Focusing on latency

The original focus of MapReduce was on increasing throughput, though many use cases also demand lower latency. The Spark system reduces latency by keeping and processing datasets in cluster RAM. While full datasets are often too big to fit in cluster RAM, Spark relies on the working-set principle: at any given time, applications actively process only a subset of all the data, and that working set should fit in cluster RAM. Over time the working set changes, as new data becomes active and old data becomes inactive (and can therefore be pushed to the persistent store).

Enabling real-time processing

While Spark’s processing engine enables low-latency processing, end-to-end latency also includes the time spent collecting data, possibly from many geographically dispersed sources around the world. Kafka was specifically designed to quickly gather and disseminate data between producers and consumers. The Spark processing engine can use Kafka as a data source or sink.
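The producer/consumer decoupling that Kafka provides can be sketched in-process with a thread-safe queue standing in for a topic (this is a toy illustration, not the Kafka API; the real system persists messages and scales across brokers):

```python
import queue
import threading

topic = queue.Queue()  # stands in for a Kafka topic
results = []

def producer(source_id, events):
    # Dispersed data sources publish to the topic independently.
    for e in events:
        topic.put((source_id, e))

def consumer(n_events):
    # A downstream processing engine pulls from the topic at its own pace.
    for _ in range(n_events):
        results.append(topic.get())
        topic.task_done()

producers = [
    threading.Thread(target=producer, args=(i, [f"evt{j}" for j in range(3)]))
    for i in range(2)
]
c = threading.Thread(target=consumer, args=(6,))
for t in producers:
    t.start()
c.start()
for t in producers:
    t.join()
c.join()
print(len(results))  # → 6
```

Producers and the consumer never interact directly; the topic buffers data between them, which is what lets collection and processing proceed at different rates.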

We hope our selection of big data systems offers many important lessons in system design. Let’s dive in!
