The parallelize() method of the SparkContext is used to create a Resilient Distributed Dataset (RDD) from an iterable or a collection.
sparkContext.parallelize(iterable, numSlices)
iterable: This is an iterable or a collection from which an RDD has to be created.

numSlices: This is an optional parameter that indicates the number of slices to cut the RDD into. The number of slices can be set manually through this parameter; otherwise, Spark sets it to the default parallelism inferred from the cluster.

This method returns an RDD.
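When numSlices is omitted, the fallback value can be checked through the SparkContext. The following is a minimal sketch, assuming a local SparkSession; the actual numbers depend on the machine and cluster configuration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('educative-answers').getOrCreate()
sc = spark.sparkContext

# The value parallelize() falls back to when numSlices is omitted
print("Default parallelism - ", sc.defaultParallelism)

rdd = sc.parallelize(range(100))        # numSlices omitted
print("Partitions (default) - ", rdd.getNumPartitions())

rdd = sc.parallelize(range(100), 4)     # numSlices set to 4
print("Partitions (numSlices=4) - ", rdd.getNumPartitions())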
Let’s look at the code below:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName('educative-answers') \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

collection = [("James", "Smith", "USA", "CA"),
              ("Michael", "Rose", "USA", "NY"),
              ("Robert", "Williams", "USA", "CA"),
              ("Maria", "Jones", "USA", "FL")]

sc = spark.sparkContext

# numSlices omitted - Spark decides the number of slices
rdd = sc.parallelize(collection)
rdd_elements = rdd.collect()
print("RDD with default slices - ", rdd_elements)
print("Number of partitions - ", rdd.getNumPartitions())

print("-" * 8)

# numSlices passed explicitly
numSlices = 8
rdd = sc.parallelize(collection, numSlices)
rdd_elements = rdd.collect()
print("RDD with 8 slices - ", rdd_elements)
print("Number of partitions - ", rdd.getNumPartitions())
In the code above, a Spark session named educative-answers is created, and an RDD is built from the collection using the parallelize() method. Here, the number of slices is set by Spark. The elements of the RDD are retrieved using the collect() method, as an RDD is distributed in nature, and the number of partitions is printed using getNumPartitions(). Another RDD is then created from the same collection by passing numSlices to the parallelize() method. Here, the number of slices is set by us.
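To see how numSlices affects the way the rows are spread out, the contents of each partition can be inspected with glom(), which collects the elements of every partition into a separate list. A small sketch, assuming the same collection and sc from the code above:

# Group the elements of each partition into a list so the split is visible
print("Default slices - ", sc.parallelize(collection).glom().collect())
print("2 slices - ", sc.parallelize(collection, 2).glom().collect())
# More slices than elements leaves some partitions empty
print("8 slices - ", sc.parallelize(collection, 8).glom().collect())

With only four rows and numSlices set to 8, several partitions come back as empty lists, which shows that numSlices controls how the data is partitioned rather than how much data there is.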