Getting Started with Spark

Apache Spark

Apache Spark is a distributed computation engine and a stack of tools for big data. It includes capabilities for stream processing (Spark Streaming), SQL querying (Spark SQL), machine learning (Spark MLlib), and graph processing (GraphX).

Spark is developed in Scala but also offers bindings for Java, Python, SQL, and R.
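As a quick taste of the Python binding, here is a minimal sketch that starts a local session and runs a SQL query; it assumes `pyspark` is installed (for example, via `pip install pyspark`), and the application name and data are illustrative.

```python
from pyspark.sql import SparkSession

# Start a local Spark session; "local[*]" uses all available cores.
spark = (SparkSession.builder
         .master("local[*]")
         .appName("getting-started")   # arbitrary app name for this sketch
         .getOrCreate())

# Create a small DataFrame and query it with SQL.
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 40").show()

spark.stop()
```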

Spark favors in-memory processing: intermediate results are kept in memory rather than written to disk between steps, which makes it many times faster than the corresponding Hadoop MapReduce workloads.

MapReduce and Spark comparison

With the advent of Spark, the MapReduce framework took a backseat for several reasons:

  • Iterative jobs: Certain machine learning algorithms make multiple passes over a dataset to compute their results. Each pass can be expressed as a distinct MapReduce job, but every job reads its input from disk and writes its output back to disk for the next job to consume. When disk I/O is involved, execution time increases many times over compared to accessing the same data from main memory (see the caching sketch after this list).
  • Interactive analysis: Users can run ad-hoc queries on large datasets using tools such as Hive or Pig. If a user issues multiple queries against the same dataset, each query may translate to a separate MapReduce job that reads that same dataset from disk all over again. Re-reading the same data from disk for every job is inefficient and increases query latency.
  • Rich APIs: Spark's rich APIs can succinctly express an operation that would otherwise take many lines of code when written in MapReduce, so the user and developer experience is considerably simpler with Spark than with MapReduce (see the word-count sketch below).
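To make the iterative and interactive points concrete, here is a minimal caching sketch. It assumes a local PySpark installation and a hypothetical `access_logs.txt` file; `cache()` keeps the dataset in memory after the first action, so later passes reuse memory-resident data instead of paying the disk-read cost that every MapReduce job pays.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[*]")
         .appName("caching-demo")
         .getOrCreate())

# Hypothetical input path; any sizable text dataset works here.
logs = spark.read.text("access_logs.txt")

# Mark the dataset for in-memory caching. The cache is filled by the
# first action below; later passes read from memory, not disk.
logs.cache()

total    = logs.count()                                       # pass 1: reads disk, fills cache
errors   = logs.filter(logs.value.contains("ERROR")).count()  # pass 2: served from memory
warnings = logs.filter(logs.value.contains("WARN")).count()   # pass 3: served from memory
print(total, errors, warnings)

spark.stop()
```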
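As a rough illustration of the API gap, here is the classic word count in a few lines of PySpark; the input path `input.txt` is hypothetical. An equivalent MapReduce program would need a full mapper class, reducer class, and driver.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode, split

spark = (SparkSession.builder
         .master("local[*]")
         .appName("word-count")
         .getOrCreate())

# Read lines, split each into words, and count occurrences per word.
lines = spark.read.text("input.txt")  # hypothetical input file
counts = (lines
          .select(explode(split(lines.value, r"\s+")).alias("word"))
          .groupBy("word")
          .count())
counts.show()

spark.stop()
```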
