Getting started with Spark
Spark has become the ubiquitous platform for big data processing, largely displacing the traditional MapReduce framework; some technologists go so far as to declare MapReduce dead. In numerous benchmarks and performance studies, Spark has outperformed MapReduce by up to two orders of magnitude. Below, we briefly recount the history behind Spark's dominance in the big data space.
The big data movement began in earnest with Google's ambition to index the World Wide Web and make it searchable for users at lightning speed. The results were:
Google File System (GFS): A fault-tolerant distributed file system running on clusters of cheap commodity hardware.
Bigtable: A scalable store of structured data on top of GFS.
MapReduce: A new parallel programming paradigm that allows for processing large amounts of data distributed across GFS and Bigtable.
Google's work was proprietary, but the papers that came out of the effort led to Hadoop, an open-source implementation of Google's ideas developed largely by engineers at Yahoo. The Hadoop project was later donated to the Apache Software Foundation. Although MapReduce works well for batch processing, it is cumbersome and complex, has a steep learning curve, and is slow for multi-stage workflows. Its key weakness is that it writes intermediate results to disk, which slows down the overall computation. Consider the scenario where one MapReduce job's output is fed into a second job as input: the first job dumps its output to disk upon completion, and the second job then reads that input back from disk. This disk I/O slows down the overall workflow.
The shortcomings of MapReduce didn't go unnoticed. Spark was started as a research project in 2009 at the University of California, Berkeley, with a paper on the findings published the following year. The same team later founded Databricks, a company that focuses on Spark-based machine learning and analytics solutions. Spark has since become the de facto platform for processing data across a wide variety of use cases, ranging from batch processing to interactive ad-hoc analysis. Scala is the default language for Spark, but Spark also works with Python, Java, and R.
Differences with MapReduce
So what does Spark do better than MapReduce that has led the big data community to abandon it? To answer this question, we'll look at the use cases where the MapReduce paradigm falls short:
Iterative Jobs: Certain machine-learning algorithms make multiple passes over a data set to compute results. Each pass can be expressed as a distinct MapReduce job, so each job reads its input from disk and then dumps its output to disk for the next job to read. When disk I/O is involved, job execution time increases greatly compared to accessing the same data from main memory.
Interactive Analysis: Users can run ad-hoc SQL queries on large datasets using tools such as Hive or Pig. If a user issues multiple queries against the same dataset, each query may translate under the hood to a separate MapReduce job that reads the dataset from disk and operates on it. Multiple MapReduce jobs repeatedly reading the same dataset from disk is inefficient and increases query latency.
Rich APIs: Spark offers rich APIs that can succinctly express an operation that would otherwise take dozens of lines of MapReduce code. The user and developer experience is simpler when working with Spark than with MapReduce.
The first two use cases are cited in the original Spark paper as the primary drivers for designing a data-parallel processing framework to replace MapReduce. Spark retains many of MapReduce's features, such as fault tolerance, locality-aware scheduling, and load balancing, while efficiently reusing data by caching it in memory across the cluster. This avoids costly round-trips to disk: memory access is far faster than disk access, which explains much of Spark's performance advantage over MapReduce.
Use cases of Spark
Some prominent use cases of Spark include:
Parallel processing of large data sets across a cluster.
Analyzing large data sets and social networks.
Building, training, and evaluating machine learning models.
Creating data pipelines from various data sources.