Resilient Distributed Datasets (RDDs)
Explore the fundamentals of Resilient Distributed Datasets (RDDs) in PySpark. Understand their key features such as immutability, fault tolerance, partitioning, and lazy evaluation. Learn how to create RDDs from collections, external datasets, and existing RDDs, and discover their role in parallel processing within a cluster environment.
What is an RDD?
An RDD (Resilient Distributed Dataset) is the fundamental building block of PySpark and its core, low-level data structure. The name RDD captures three important properties:
- Resilient: Ability to withstand failures
- Distributed: Spanning multiple machines
- Datasets: A collection of partitioned data, e.g., arrays, tables, and tuples
An RDD is a fault-tolerant, immutable, distributed collection of elements that can be operated on in parallel. Once created, an RDD can't be changed, which is why it is immutable. An RDD is split into logical partitions, each of which can be computed on a different node of the cluster, which is what makes it distributed. We can think of an RDD as a list in Python, except that the RDD is spread across multiple nodes in the cluster. In other words, an RDD can't be modified while a list can, and a list can't be distributed, so it must be processed on a single machine.
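For instance, here's a minimal sketch of how an RDD behaves like a distributed, read-only list. It assumes a local Spark installation; the `local[*]` master, app name, and variable names are illustrative choices:

```python
from pyspark.sql import SparkSession

# Illustrative local setup; in a real cluster the master would differ.
spark = SparkSession.builder.master("local[*]").appName("rdd-basics").getOrCreate()
sc = spark.sparkContext

numbers = [1, 2, 3, 4, 5]        # an ordinary Python list, living on the driver
rdd = sc.parallelize(numbers)    # a distributed collection, split into partitions

# Transformations never modify `rdd` in place; they return a new RDD.
doubled = rdd.map(lambda x: x * 2)

print(rdd.getNumPartitions())    # the data is spread across partitions
print(doubled.collect())         # [2, 4, 6, 8, 10]; `rdd` itself is unchanged
```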
RDD features
In addition to the main properties of fault tolerance, immutability, and distribution, RDDs have the following features:
- In-memory computation: RDDs can cache data in memory, allowing faster iterative computations by persisting intermediate results.
- Lazy evaluation: Transformations on RDDs are lazily evaluated, meaning computations are postponed until an action is triggered, optimizing execution plans (both of these features are illustrated in the sketch after this list).
- Transformations and actions: RDDs support two types of operations: transformations and actions, which we'll discuss in the next lesson.
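The following sketch illustrates the first two features, again assuming a local Spark setup with illustrative names: transformations only build an execution plan, and cache() keeps computed partitions in memory for reuse:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-features").getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize(range(1, 1001))

# Lazy evaluation: these transformations only build an execution plan;
# nothing runs on the cluster yet.
squares = rdd.map(lambda x: x * x)
evens = squares.filter(lambda x: x % 2 == 0)

# In-memory computation: mark the RDD to be cached once computed,
# so later actions reuse it instead of recomputing.
evens.cache()

print(evens.count())   # first action: triggers the computation and fills the cache
print(evens.take(5))   # second action: served from the cached partitions
```

Without the cache() call, the second action would recompute the map and filter steps from scratch.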
How to create an RDD
There are three main ways to create an RDD: parallelizing an existing collection, loading an external dataset, and transforming an existing RDD.
Parallelizing an existing collection
In this method, we use the parallelize() function on an existing iterable or collection in our driver program.
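Here is a minimal sketch of this method, assuming a local Spark setup; the fruits collection and the partition count of 2 are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("rdd-create").getOrCreate()
sc = spark.sparkContext

# Parallelize a local collection into an RDD; the second argument
# (the number of partitions) is optional.
fruits = ["apple", "banana", "cherry", "date"]
rdd = sc.parallelize(fruits, 2)

print(rdd.getNumPartitions())  # 2
print(rdd.collect())           # ['apple', 'banana', 'cherry', 'date']
```

If the partition count is omitted, Spark chooses a default based on the cluster configuration.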