Spark Differentiation

Learn what factors and features differentiate the Spark framework from other processing engines.

Spark can be defined as a unified engine for processing large-scale distributed data, whether on-premises in a data center or in the cloud. Some of the key characteristics and differentiators of the Spark framework are as follows:

  • Speed: Spark takes advantage of hardware advances, multithreading, and multiprocessing to execute workloads efficiently. Spark builds a directed acyclic graph (DAG) of the computation for each query; the graph is decomposed into tasks that execute in parallel across the cluster. Spark’s physical execution engine, known as Tungsten, uses whole-stage code generation for compact code execution. We’ll study these concepts in later lessons. Spark also keeps intermediate results in memory instead of writing them to disk, which reduces disk I/O and boosts performance. A minimal sketch after this list illustrates lazy DAG construction and in-memory caching.

  • Usability: Spark provides a simple programming model. At the core of the API is a data structure called the Resilient Distributed Dataset (RDD), upon which higher-level abstractions such as the DataFrame are built.

  • Modularity: Spark supports executing all kinds of workloads (written in any of the supported languages), ranging from batch processing to streaming, under a single execution engine. This characterizes Spark as a unified big data processing engine.

  • Extensibility: Unlike Hadoop, Spark decouples storage and computation. Spark focuses on its fast, parallel computation engine rather than on storage. It can read from a variety of data sources, e.g., HDFS, Hive, Cassandra, MongoDB, and MySQL. The Spark developer community maintains a number of connectors to external data sources, performance monitors, and other systems.
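
Below is a minimal PySpark sketch of the first two points above. Transformations only build up the DAG, `cache()` asks Spark to keep an intermediate result in memory, and nothing runs until an action such as `count()` or `show()` is called. The file path and column names (`events.parquet`, `status`, `user_id`) are hypothetical, chosen only for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("lazy-dag-demo").getOrCreate()

# Transformations are lazy: these lines only extend the DAG, nothing executes yet.
# The file path and column names are hypothetical.
events = spark.read.parquet("events.parquet")
errors = events.filter(col("status") == "ERROR").select("user_id", "status")

# Ask Spark to keep this intermediate result in memory after it is first computed.
errors.cache()

# Actions trigger execution: the DAG is split into tasks that run in parallel.
print(errors.count())                       # first action: computes and caches `errors`
errors.groupBy("user_id").count().show()    # second action: reuses the in-memory cache
```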

MapReduce is ideal for large-scale batch processing jobs but falls short for other workloads such as streaming, machine learning, or interactive SQL-like queries. To fill this gap, several other projects such as Apache Storm, Apache Impala, Apache Drill, Apache Mahout, and Apache Giraph came into being, each with its own configuration and APIs, creating a steep learning curve and added complexity. Each specialized engine was customized for a specific type of workload. Spark challenged this state of affairs by replacing separate batch, graph, stream, and query engines like Storm, Impala, Dremel, and Pregel with a unified stack of components that addresses diverse workloads under a single fast, distributed engine. Thus, Spark can be thought of as a unified engine for processing big data workloads.

Spark is able to process various types of workloads by providing components and libraries suited to each kind of workload. The four components of Spark are:

  • Spark SQL: This module is ideal for working with structured data stored in RDBMS tables or in file formats such as CSV, JSON, Parquet, and Avro. Once the data has been read in, we can run SQL-like queries against it, and the result of a query comes back as a Spark DataFrame (see the first sketch after this list).

  • Spark MLlib: Spark ships with a library of the most commonly used machine learning algorithms for building models. These algorithms build on top of DataFrames, a higher-level API (a short sketch follows this list).

  • Spark Structured Streaming: Spark supports computations on streaming data from sources such as Kafka, Kinesis, HDFS, or cloud storage. The functionality is built atop the Spark SQL engine and the DataFrame API. Spark views streaming data as new rows being appended to an ever-growing table that developers can query as if it were a static table (see the streaming sketch after this list).

  • GraphX: This module allows Spark to manipulate graphs such as social networks, network topology graphs, or connection points and routes.
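
To make these components concrete, here are three short PySpark sketches, one each for Spark SQL, MLlib, and Structured Streaming. They reuse a single SparkSession; the file path, field names, training values, and the socket host/port are assumptions made purely for illustration.

Spark SQL: read a (hypothetical) JSON file, register it as a temporary view, and get a DataFrame back from a SQL query.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("components-demo").getOrCreate()

# Spark SQL: read structured data and query it with SQL.
# "people.json" and its name/age fields are hypothetical.
people = spark.read.json("people.json")
people.createOrReplaceTempView("people")
adults = spark.sql("SELECT name, age FROM people WHERE age >= 18")
adults.show()  # the query result is itself a DataFrame
```

Spark MLlib: the algorithms operate on DataFrames of (label, features) rows; the tiny in-memory training set below is made up.

```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.linalg import Vectors

# MLlib builds on the DataFrame API: training data is a DataFrame of rows.
training = spark.createDataFrame(
    [(1.0, Vectors.dense(0.0, 1.1, 0.1)),
     (0.0, Vectors.dense(2.0, 1.0, -1.0)),
     (1.0, Vectors.dense(0.0, 1.2, -0.5))],
    ["label", "features"])

lr = LogisticRegression(maxIter=10, regParam=0.01)
model = lr.fit(training)  # training runs as parallel Spark jobs
```

Structured Streaming: the classic word count over a socket source, where the stream is treated as rows continually appended to an unbounded table.

```python
from pyspark.sql.functions import explode, split

# Structured Streaming: the socket source on localhost:9999 is an assumption.
lines = (spark.readStream.format("socket")
         .option("host", "localhost").option("port", 9999).load())

words = lines.select(explode(split(lines.value, " ")).alias("word"))
counts = words.groupBy("word").count()

# Each batch of arriving lines is appended to the logical table and the counts are updated.
query = counts.writeStream.outputMode("complete").format("console").start()
query.awaitTermination()
```

In each sketch the same DataFrame abstraction and the same execution engine do the work, which is the sense in which Spark is a unified engine.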

These four components are distinct from Spark’s core fault-tolerant engine. Generally, a developer writes code in Java, R, Scala, SQL, or Python, which is converted into Java bytecode and executed in JVMs across the cluster.