Getting Started with Spark
Explore Apache Spark's core features including its fast in-memory processing and resilient distributed datasets. Understand why Spark outperforms Hadoop MapReduce and how to start using it for big data applications like streaming, machine learning, and querying.
Apache Spark
Apache Spark is a general-purpose computation engine and a stack of tools for big data. It has built-in capabilities for stream processing (Spark Streaming), querying data (Spark SQL), machine learning (MLlib), and graph processing (GraphX).
Spark itself is developed in Scala, but it also provides APIs for Python, Java, SQL, and R.
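Spark programs are built from lazy transformations (such as map and filter) that are only executed when an action (such as a count) demands a result. The sketch below mimics that idea in plain Python; it is NOT the PySpark API, and `LazyDataset` and its methods are invented stand-ins for illustration only:

```python
# Conceptual sketch of Spark's lazy-transformation model in plain Python.
# NOTE: this is NOT real PySpark -- LazyDataset is an invented stand-in.

class LazyDataset:
    def __init__(self, data):
        self._data = data
        self._transforms = []          # recorded, not yet executed

    def map(self, fn):                 # transformation: just record it
        self._transforms.append(("map", fn))
        return self

    def filter(self, pred):            # transformation: just record it
        self._transforms.append(("filter", pred))
        return self

    def count(self):                   # action: now run the whole pipeline
        items = iter(self._data)
        for kind, fn in self._transforms:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return sum(1 for _ in items)

numbers = LazyDataset(range(10))
evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(evens_squared.count())  # prints 5 (0, 4, 16, 36, 64)
```

Nothing is computed until `count()` is called; the same deferred-execution idea is what lets Spark plan and optimize a whole chain of transformations before touching the data.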
Spark keeps working data in memory wherever possible, which makes it many times faster than the equivalent Hadoop MapReduce jobs, especially for iterative and interactive workloads.
MapReduce and Spark comparison
With the advent of Spark, the MapReduce framework took a backseat for several reasons:
- Iterative jobs: Certain machine learning algorithms make multiple passes over a dataset to compute their results. Each pass can be expressed as a distinct MapReduce job, but every job reads its input from disk and writes its output back to disk, so repeated passes pay a heavy I/O penalty. Spark avoids this by caching the dataset in memory and reusing it across passes.
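The cost difference described above can be sketched with a toy simulation in plain Python. This is not real Hadoop or Spark code; `read_from_disk` and `one_pass` are invented stand-ins that merely count how often the dataset is "loaded" under each execution style:

```python
# Toy illustration (not actual Hadoop/Spark code): count how often the
# dataset is read from simulated "disk" when an iterative algorithm makes
# several passes, MapReduce-style versus with an in-memory cache.

disk_reads = 0

def read_from_disk():
    """Pretend to load the dataset from distributed storage; count each read."""
    global disk_reads
    disk_reads += 1
    return list(range(1_000))

def one_pass(data):
    """One 'iteration' of a toy algorithm over the dataset."""
    return sum(x * x for x in data)

PASSES = 10

# MapReduce style: every job re-reads its input from disk.
disk_reads = 0
for _ in range(PASSES):
    one_pass(read_from_disk())
mapreduce_reads = disk_reads

# Spark style: load once, cache in memory, iterate over the cached data.
disk_reads = 0
cached = read_from_disk()          # loosely analogous to caching an RDD
for _ in range(PASSES):
    one_pass(cached)
spark_reads = disk_reads

print(mapreduce_reads, spark_reads)  # prints: 10 1
```

With ten passes, the MapReduce-style loop touches storage ten times while the cached version touches it once; on real clusters, where each read scans a large distributed dataset, that difference dominates the runtime of iterative algorithms.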