How to create an RDD using parallelize() in PySpark
The parallelize() method of the Spark context creates a Resilient Distributed Dataset (RDD) from an iterable or a collection.
Syntax
sparkContext.parallelize(iterable, numSlices)
Parameters
- iterable: The iterable or collection from which the RDD is created.
- numSlices: An optional parameter that sets the number of slices (partitions) to cut the RDD into. If it is omitted, Spark uses the default parallelism, which is inferred from the cluster.
Return value
This method returns an RDD.
Code example
Let’s look at the code below:
import pyspark
from pyspark.sql import SparkSession
# Create (or reuse) a Spark session
spark = SparkSession.builder.appName('educative-answers').config("spark.some.config.option", "some-value").getOrCreate()
# Sample collection to distribute
collection = [("James","Smith","USA","CA"),
    ("Michael","Rose","USA","NY"),
    ("Robert","Williams","USA","CA"),
    ("Maria","Jones","USA","FL")
]
# Get the Spark context from the session
sc = spark.sparkContext
# Let Spark choose the number of slices
rdd = sc.parallelize(collection)
# Bring the distributed elements back to the driver
rdd_elements = rdd.collect()

print("RDD with default slices - ", rdd_elements)

print("Number of partitions - ", rdd.getNumPartitions())

print("-" * 8)
# Number of slices to request
numSlices = 8
# Create the RDD again with an explicit number of slices
rdd = sc.parallelize(collection, numSlices)

rdd_elements = rdd.collect()

print("RDD with 8 slices - ", rdd_elements)

print("Number of partitions - ", rdd.getNumPartitions())
Code explanation
- Line 4: A Spark session with the app name educative-answers is created.
- Lines 6-10: The collection (or iterable) is defined.
- Line 12: The Spark context object is obtained from the Spark session.
- Line 14: An RDD is constructed from the collection using the parallelize() method. Here, Spark chooses the number of slices.
- Lines 16 and 28: The elements of the RDD are retrieved using the collect() method. Because an RDD is distributed, collect() gathers its elements back to the driver.
- Lines 18 and 30: The elements of the RDD are printed.
- Lines 20 and 32: The number of partitions of the created RDD is retrieved by getNumPartitions().
- Line 24: The number of slices is defined.
- Line 26: An RDD is constructed from the collection using the parallelize() method. Here, we set the number of slices explicitly.
Copyright ©2026 Educative, Inc. All rights reserved