Getting Started with Spark
Explore Apache Spark's core features including its fast in-memory processing and resilient distributed datasets. Understand why Spark outperforms Hadoop MapReduce and how to start using it for big data applications like streaming, machine learning, and querying.
Apache Spark
Apache Spark is a general-purpose computation engine and a stack of tools for big data. It has built-in capabilities for stream processing (Spark Streaming), querying data (Spark SQL), machine learning (MLlib), and graph processing (GraphX).
Spark itself is developed in Scala, but it also provides APIs for Python, Java, SQL, and R.
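Spark programs are built from lazy transformations (such as map and filter) that are only executed when an action (such as a count) demands a result. The sketch below mimics that idea in plain Python; it is NOT the PySpark API, and `LazyDataset` and its methods are invented stand-ins for illustration only:

```python
# Conceptual sketch of Spark's lazy-transformation model in plain Python.
# NOTE: this is NOT real PySpark -- LazyDataset is an invented stand-in.

class LazyDataset:
    def __init__(self, data):
        self._data = data
        self._transforms = []          # recorded, not yet executed

    def map(self, fn):                 # transformation: just record it
        self._transforms.append(("map", fn))
        return self

    def filter(self, pred):            # transformation: just record it
        self._transforms.append(("filter", pred))
        return self

    def count(self):                   # action: now run the whole pipeline
        items = iter(self._data)
        for kind, fn in self._transforms:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return sum(1 for _ in items)

numbers = LazyDataset(range(10))
evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(evens_squared.count())  # prints 5 (0, 4, 16, 36, 64)
```

Nothing is computed until `count()` is called; the same deferred-execution idea is what lets Spark plan and optimize a whole chain of transformations before touching the data.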
Spark keeps working data in memory wherever possible, which makes it many times faster than the equivalent Hadoop MapReduce jobs, especially for iterative and interactive workloads.
MapReduce and Spark comparison
With the advent of Spark, the MapReduce framework took a backseat for several reasons:
- Iterative jobs: Certain machine learning algorithms make multiple passes over a dataset to compute their results. Each pass can be expressed as a distinct MapReduce job, but every job reads its input from disk and writes its output back to disk, so repeated passes pay a heavy I/O penalty. Spark avoids this by caching the dataset in memory and reusing it across passes.
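The cost difference described above can be sketched with a toy simulation in plain Python. This is not real Hadoop or Spark code; `read_from_disk` and `one_pass` are invented stand-ins that merely count how often the dataset is "loaded" under each execution style:

```python
# Toy illustration (not actual Hadoop/Spark code): count how often the
# dataset is read from simulated "disk" when an iterative algorithm makes
# several passes, MapReduce-style versus with an in-memory cache.

disk_reads = 0

def read_from_disk():
    """Pretend to load the dataset from distributed storage; count each read."""
    global disk_reads
    disk_reads += 1
    return list(range(1_000))

def one_pass(data):
    """One 'iteration' of a toy algorithm over the dataset."""
    return sum(x * x for x in data)

PASSES = 10

# MapReduce style: every job re-reads its input from disk.
disk_reads = 0
for _ in range(PASSES):
    one_pass(read_from_disk())
mapreduce_reads = disk_reads

# Spark style: load once, cache in memory, iterate over the cached data.
disk_reads = 0
cached = read_from_disk()          # loosely analogous to caching an RDD
for _ in range(PASSES):
    one_pass(cached)
spark_reads = disk_reads

print(mapreduce_reads, spark_reads)  # prints: 10 1
```

With ten passes, the MapReduce-style loop touches storage ten times while the cached version touches it once; on real clusters, where each read scans a large distributed dataset, that difference dominates the runtime of iterative algorithms.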