Introduction

Learn the evolution and history behind Spark, the ubiquitous and unified big data processing platform.

Getting started with Spark

Spark has become the ubiquitous platform for data processing and has largely displaced the traditional MapReduce framework; in fact, some technologists would go so far as to declare MapReduce dead. In numerous benchmarks and performance studies, Spark has outperformed MapReduce by as much as two orders of magnitude, up to 100x for workloads that fit in memory. Below, we briefly recount the history behind Spark's dominance in the big data space.

History

The big data movement began in earnest with Google’s ambition to index the World Wide Web and make it searchable for users at lightning speed. The result was three internal systems:

  • Google File System (GFS): A fault-tolerant distributed file system running on clusters of cheap commodity hardware.

  • Bigtable: A scalable store of structured data on top of GFS.

  • MapReduce: A new parallel programming paradigm for processing large amounts of data distributed across GFS and Bigtable (a minimal sketch of the paradigm follows this list).
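
To make the paradigm concrete, here is a conceptual word-count sketch written with plain Scala collections rather than the actual Hadoop API; the object name, input data, and explicit three-phase structure are illustrative assumptions, but they mirror how a MapReduce job emits key-value pairs, groups them by key, and aggregates each group in parallel.

```scala
// A conceptual word-count sketch of the MapReduce paradigm using plain
// Scala collections (not the Hadoop API). Real MapReduce runs each phase
// in parallel across a cluster; here the phases are just sequential steps.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val documents = Seq("spark is fast", "mapreduce is batch", "spark is unified")

    // Map phase: each input record emits (key, value) pairs.
    val mapped: Seq[(String, Int)] =
      documents.flatMap(_.split(" ")).map(word => (word, 1))

    // Shuffle phase: group all emitted values by key.
    val shuffled: Map[String, Seq[Int]] =
      mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2)) }

    // Reduce phase: aggregate the values for each key.
    val counts: Map[String, Int] =
      shuffled.map { case (word, ones) => (word, ones.sum) }

    counts.foreach(println) // e.g. (spark,2), (is,3), (fast,1), ...
  }
}
```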

Google’s work was proprietary, but the papers coming out of the effort led to Hadoop, an open-source implementation of Google’s ideas developed by Yahoo engineers. The Hadoop project was later donated to Apache.

Although MapReduce works well for batch processing, it is cumbersome and complex, has a steep learning curve, and is slow. Its central weakness is that it writes intermediate results to disk, which drags down the overall computation. Consider the scenario where one MR job’s output is fed into a second job as input: the first job dumps its output to disk upon completion, and the second job then reads that same data back from disk. All of this disk I/O slows down the overall workflow.
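
Spark sidesteps this bottleneck by keeping intermediate results in memory. Below is a minimal sketch, assuming a local SparkSession and a made-up two-step pipeline, of how the output of one computation can be cached and consumed directly by the next with no disk round-trip in between.

```scala
// A minimal sketch of two chained computations in Spark. Unlike a pair of
// MapReduce jobs, the intermediate result never round-trips through disk:
// it is cached in memory and fed straight into the second computation.
// The input data and the filter condition are illustrative assumptions.
import org.apache.spark.sql.SparkSession

object ChainedJobsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("chained-jobs-sketch")
      .master("local[*]") // assumption: run locally for illustration
      .getOrCreate()
    val sc = spark.sparkContext

    val lines = sc.parallelize(Seq("spark is fast", "mapreduce writes to disk"))

    // "Job 1": tokenize and count words, then cache the result in memory
    // rather than dumping it to the distributed file system.
    val counts = lines
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
      .cache()

    // "Job 2": consumes job 1's output directly from memory.
    val singletons = counts.filter { case (_, n) => n == 1 }

    singletons.collect().foreach(println)
    spark.stop()
  }
}
```

The cache() call marks the key design difference: where chained MapReduce jobs materialize intermediate data on disk, Spark holds it in executor memory and hands it straight to the next stage of the pipeline.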