Resilient Distributed Datasets

Explore the concept of Resilient Distributed Datasets (RDDs) in Spark, including their immutability, fault tolerance, and distribution across clusters. Understand how to create RDDs from collections, files, and DataFrames, and grasp the importance of transformations, actions, and lineage for efficient big data management.

RDDs

The fundamental abstraction in Spark is the Resilient Distributed Dataset (RDD). It is a read-only (immutable) collection of objects or records, partitioned across the cluster so that it can be operated on in parallel. A partition can be reconstructed if the node hosting it fails. RDDs are a lower-level API: DataFrames and Datasets ultimately compile down to RDDs. The records or objects within an RDD are plain Java, Python, or Scala objects, so anything can be stored in them, in any format.
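
As a minimal sketch of this idea (assuming a local SparkSession named `spark` and a hypothetical app name `rdd-basics`), an RDD can be created from an in-memory collection and operated on in parallel across its partitions:

```scala
import org.apache.spark.sql.SparkSession

// Assumes a local deployment; on a cluster the master URL would differ.
val spark = SparkSession.builder()
  .appName("rdd-basics")
  .master("local[*]")
  .getOrCreate()

// Create an RDD from a local collection, split into 4 partitions.
val numbers = spark.sparkContext.parallelize(1 to 100, numSlices = 4)

// Transformations (filter, map) are lazy; the action (sum) triggers
// parallel computation across the partitions.
val total = numbers.filter(_ % 2 == 0).map(_ * 2).sum()
println(s"partitions = ${numbers.getNumPartitions}, total = $total")
```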

RDDs are a low-level API, so the Spark authors discourage working with them directly unless we need fine-grained control. In using RDDs, we sacrifice the optimizations and pre-built functionality that come with structured APIs such as DataFrames and Datasets. For example, with the structured APIs data is compressed and stored in an optimized binary format, something that has to be achieved manually when working with RDDs.
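
To illustrate the relationship, the sketch below (assuming a local SparkSession and made-up sample data) drops from a DataFrame to its underlying RDD of `Row` objects; past that point, serialization and per-record handling are entirely up to us:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("df-to-rdd")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

// A DataFrame benefits from Catalyst optimizations and an efficient
// binary representation managed by Spark.
val df = Seq(("alice", 34), ("bob", 41)).toDF("name", "age")

// Dropping down to the underlying RDD yields plain Row objects that we
// must interpret and process ourselves.
val rowRdd = df.rdd
val names = rowRdd.map(row => row.getString(0).toUpperCase).collect()
names.foreach(println)
```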

The following are the properties of RDDs:

  • Resilient: An RDD is fault-tolerant and can recompute missing or damaged partitions caused by node failures. This self-healing is made possible by an RDD lineage graph, which we'll cover in more depth later. Essentially, an RDD remembers the steps that produced its current state and can retrace them to rebuild a lost partition (a small lineage sketch follows below) ...
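
As a quick preview of the lineage graph (again assuming a local SparkSession and a hypothetical app name), `toDebugString` prints the chain of parent RDDs that Spark would replay to recompute a lost partition:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("rdd-lineage")
  .master("local[*]")
  .getOrCreate()

// Build a small chain of transformations.
val base     = spark.sparkContext.parallelize(1 to 10)
val doubled  = base.map(_ * 2)
val filtered = doubled.filter(_ > 5)

// toDebugString shows the lineage: each transformation and its parent RDDs.
println(filtered.toDebugString)
```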