The parallelize() method of the SparkContext is used to create a Resilient Distributed Dataset (RDD) from an iterable or a collection.
sparkContext.parallelize(iterable, numSlices)
iterable: This is an iterable or a collection from which an RDD has to be created.

numSlices: This is an optional parameter that indicates the number of slices to cut the RDD into. The number of slices can be set manually through this parameter; otherwise, Spark sets it to the default parallelism inferred from the cluster.

This method returns an RDD.
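When numSlices is omitted, the fallback value can be checked through the SparkContext. The following is a minimal sketch, assuming a local SparkSession; the actual numbers depend on the machine and cluster configuration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('educative-answers').getOrCreate()
sc = spark.sparkContext

# The value parallelize() falls back to when numSlices is omitted
print("Default parallelism - ", sc.defaultParallelism)

rdd = sc.parallelize(range(100))        # numSlices omitted
print("Partitions (default) - ", rdd.getNumPartitions())

rdd = sc.parallelize(range(100), 4)     # numSlices set to 4
print("Partitions (numSlices=4) - ", rdd.getNumPartitions())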
Let’s look at the code below:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('educative-answers') \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

collection = [("James", "Smith", "USA", "CA"),
              ("Michael", "Rose", "USA", "NY"),
              ("Robert", "Williams", "USA", "CA"),
              ("Maria", "Jones", "USA", "FL")]

sc = spark.sparkContext

# numSlices omitted - Spark decides the number of slices
rdd = sc.parallelize(collection)
rdd_elements = rdd.collect()
print("RDD with default slices - ", rdd_elements)
print("Number of partitions - ", rdd.getNumPartitions())

print("-" * 8)

# numSlices passed explicitly
numSlices = 8
rdd = sc.parallelize(collection, numSlices)
rdd_elements = rdd.collect()
print("RDD with 8 slices - ", rdd_elements)
print("Number of partitions - ", rdd.getNumPartitions())
In the code above, a Spark session named educative-answers is created, and an RDD is built from the collection using the parallelize() method. Here, the number of slices is set by Spark. The elements of the RDD are retrieved using the collect() method, as an RDD is distributed in nature, and the number of partitions is printed using getNumPartitions(). Another RDD is then created from the same collection by passing numSlices to the parallelize() method. Here, the number of slices is set by us.
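To see how numSlices affects the way the rows are spread out, the contents of each partition can be inspected with glom(), which collects the elements of every partition into a separate list. A small sketch, assuming the same collection and sc from the code above:

# Group the elements of each partition into a list so the split is visible
print("Default slices - ", sc.parallelize(collection).glom().collect())
print("2 slices - ", sc.parallelize(collection, 2).glom().collect())
# More slices than elements leaves some partitions empty
print("8 slices - ", sc.parallelize(collection, 8).glom().collect())

With only four rows and numSlices set to 8, several partitions come back as empty lists, which shows that numSlices controls how the data is partitioned rather than how much data there is.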