RDD Operations
Learn the basics of RDD operations.
We'll cover the following...
Introduction to RDD operations
There are two types of RDD operations:

- Transformations: These are RDD operations that create a new dataset from an existing one.
- Actions: These are RDD operations that return a value to the driver program after running a computation on the dataset.
Let’s understand RDD operations through an example:
```python
from pyspark import SparkContext
sc = SparkContext("local", "RDD Operations Example")

print("Create a Python list")
data = [1, 2, 3, 4, 5]

print("Create an RDD from the Python list")
rdd = sc.parallelize(data)

print("Apply a map transformation to square each element in the RDD")
rdd2 = rdd.map(lambda x: x ** 2)

print("Apply a reduce action to sum up all the elements in the rdd2 RDD")
result = rdd2.reduce(lambda x, y: x + y)

print(f'Print final result: {result}')
```
Let’s understand the code:
- Line 1: Import the `SparkContext` class from the `pyspark` module.
- Line 2: Create a `SparkContext` with the name "RDD Operations Example."
- Line 5: Create a Python list named `data` with some elements.
- Line 8: Use the `parallelize()` method of the `SparkContext` to create an RDD from the Python list `data`. The `parallelize()` method distributes the data across the cluster, allowing for parallel processing. The resulting RDD is assigned to the variable `rdd`.
- Line 11: The `map()` transformation is applied to the RDD `rdd`. The lambda function `lambda x: x ** 2` is used to square each element of the RDD. The resulting RDD, `rdd2`, contains the squared values of the original RDD.
- Line 14: The `reduce()` action is applied to the RDD `rdd2`. The lambda function, `lambda x, y: x + y`, is used to sum up the elements of the RDD. The `reduce()` operation aggregates the values by repeatedly applying the lambda function to pairs of elements until only a single value remains, which is returned to the driver program.