Evaluation of Spark
Let's evaluate how Spark fulfills its promised functionalities.
Spark can be used efficiently for many data processing use cases. Because Spark does its data processing in memory, it should provide low latency. Other functionalities that Spark provides include fault tolerance, data locality, persistent in-memory data, and memory management. Let's discuss how well Spark provides each of these functionalities.
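Before that, here is a minimal sketch of the in-memory idea (a Scala example using the RDD API; the file path, app name, and filter strings are placeholders, not taken from this lesson): a dataset is cached once, and a second action reuses the in-memory partitions instead of re-reading the input.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object InMemoryExample {
  def main(args: Array[String]): Unit = {
    // Local configuration for illustration only; a real cluster would differ.
    val sc = new SparkContext(
      new SparkConf().setAppName("InMemoryExample").setMaster("local[*]"))

    // The input path is a placeholder. cache() keeps the partitions in memory
    // after the first action materializes them.
    val lines = sc.textFile("hdfs://.../logs.txt").cache()

    // The first count() reads from storage and populates the cache;
    // the second action then works entirely from memory.
    val errors = lines.filter(_.contains("ERROR")).count()
    val warnings = lines.filter(_.contains("WARN")).count()
    println(s"errors=$errors, warnings=$warnings")

    sc.stop()
  }
}
```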
Note: All the computational results and timings stated in the text below are taken from the paper Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. The experiments were run on 100 GB of data using approximately 25 to 100 machines (depending on the experiment), each with 4 cores and 15 GB of RAM.
Latency
When we use Spark to run an algorithm that requires more than one iteration, for example, the K-means algorithm or logistic regression, the speed-up Spark achieves becomes clear. Suppose we perform the same task with Hadoop (an open-source implementation of the MapReduce framework). In that case, it runs slower, even if we use HadoopBinMem (HadoopBM), which converts the data into a binary format and stores it in a replicated instance of in-memory HDFS, for the following reasons:
Overheads: The first overhead that makes Hadoop slower than Spark is the signaling overhead of Hadoop's heartbeat protocol, which the framework uses to exchange status messages between the master and the workers when launching tasks.
Deserialization cost: Hadoop also takes time to process text and convert binary records into Java objects usable in memory. This overhead occurs in all cases, whether the data lies in an in-memory HDFS instance on the local machine or in an in-memory local file.
Spark stores RDD elements as Java objects directly in memory to avoid all these overheads.
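Where this shows up in the API is the storage level chosen when persisting an RDD. The sketch below (illustrative Scala; the input path is a placeholder, not taken from this lesson) contrasts the default in-memory storage of deserialized Java objects with serialized storage, which is more compact but must be deserialized on every pass over the data.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object StorageLevelExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("StorageLevelExample").setMaster("local[*]"))

    // The input path is a placeholder.
    val base = sc.textFile("hdfs://.../records.txt")

    // Default behavior (equivalent to cache()): partitions are kept as live,
    // deserialized Java objects, so repeated passes read them directly.
    val asObjects = base.map(_.split(",")).persist(StorageLevel.MEMORY_ONLY)

    // Serialized storage is more compact, but every pass must deserialize the
    // bytes back into objects first, which is the kind of cost described above.
    val asBytes = base.map(_.split(",")).persist(StorageLevel.MEMORY_ONLY_SER)

    println(asObjects.count())
    println(asBytes.count())
    sc.stop()
  }
}
```

Deserialized MEMORY_ONLY storage is the default for cache() precisely because iterative jobs rescan the same partitions many times, so avoiding repeated deserialization usually matters more than saving memory.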
For the K-means algorithm run for 10 iterations on 100 machines, Spark completes the first iteration in 82 seconds. Hadoop is somewhat slower than Spark because of its heartbeat protocol and finishes its first iteration in 115 seconds. HadoopBinMem is the slowest, at 182 seconds, because it has to perform an additional MapReduce job to convert the data into binary format and write it to an instance of in-memory HDFS.
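For context, an iterative job of the kind benchmarked here looks roughly like the following logistic regression sketch, loosely modeled on the example in the RDD paper; the input path, feature dimensionality, and parsing logic are illustrative assumptions. The key point is that the parsed dataset is cached once and then rescanned in every iteration, which is exactly where keeping data in memory pays off.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import scala.util.Random

object IterativeLogisticRegression {
  // Hypothetical record type: a feature vector plus a +1/-1 label.
  case class Point(x: Array[Double], y: Double)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("IterativeLogisticRegression").setMaster("local[*]"))
    val dims = 10        // assumed feature dimensionality
    val iterations = 10  // matches the 10-iteration setup described above

    // Parse once and cache: every iteration below rescans this same RDD,
    // so keeping it in memory avoids re-reading and re-parsing the input.
    val points = sc.textFile("hdfs://.../points.txt").map { line =>
      val parts = line.split(" ").map(_.toDouble)
      Point(parts.init, parts.last)
    }.cache()

    var w = Array.fill(dims)(Random.nextDouble())
    for (_ <- 1 to iterations) {
      // One full pass over the cached dataset per iteration.
      val gradient = points.map { p =>
        val dot = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
        val scale = (1.0 / (1.0 + math.exp(-p.y * dot)) - 1.0) * p.y
        p.x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }

    println(w.mkString(" "))
    sc.stop()
  }
}
```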