How to create an RDD using parallelize() in PySpark

Abhilash


The parallelize() method of the SparkContext is used to create a Resilient Distributed Dataset (RDD) from an iterable or a collection.

Syntax

sparkContext.parallelize(iterable, numSlices)

Parameters

  • iterable: The iterable or collection from which the RDD is created.
  • numSlices: An optional parameter that indicates the number of slices to cut the RDD into. The number of slices can be provided manually by setting this parameter; otherwise, Spark sets it to the default parallelism inferred from the cluster (see the sketch after this list).
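
A minimal sketch of this behavior (assuming a local session; the app name numslices-demo is just a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('numslices-demo').getOrCreate()
sc = spark.sparkContext

# numSlices omitted: the partition count falls back to sc.defaultParallelism,
# which Spark infers from the cluster (in local mode, typically the core count).
rdd_default = sc.parallelize(range(10))
print(rdd_default.getNumPartitions() == sc.defaultParallelism)  # True

# numSlices given: the RDD is cut into exactly that many slices.
rdd_four = sc.parallelize(range(10), 4)
print(rdd_four.getNumPartitions())  # 4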

Return value

This method returns an RDD.
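
As a quick check (reusing the sc from the sketch above), the returned object is an instance of pyspark.RDD:

from pyspark import RDD

rdd = sc.parallelize([1, 2, 3])
print(isinstance(rdd, RDD))  # True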

Code example

Let’s look at the code below:

main.py
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('educative-answers').config("spark.some.config.option", "some-value").getOrCreate()

collection = [("James","Smith","USA","CA"),
    ("Michael","Rose","USA","NY"),
    ("Robert","Williams","USA","CA"),
    ("Maria","Jones","USA","FL")
  ]

sc = spark.sparkContext

rdd = sc.parallelize(collection)

rdd_elements = rdd.collect()

print("RDD with default slices - ", rdd_elements)

print("Number of partitions - ", rdd.getNumPartitions())

print("-" * 8)

numSlices = 8

rdd = sc.parallelize(collection, numSlices)

rdd_elements = rdd.collect()

print("RDD with default slices - ", rdd_elements)

print("Number of partitions - ", rdd.getNumPartitions())

Code explanation

  • Line 4: A Spark session with the app name educative-answers is created.
  • Lines 6-10: The collection (or iterable) is defined.
  • Line 12: The SparkContext object is obtained from the Spark session.
  • Line 14: An RDD is constructed from the collection using the parallelize() method. Here, the number of slices is chosen by Spark.
  • Lines 16 and 28: The elements of the RDD are retrieved using the collect() method, since an RDD is distributed in nature.
  • Lines 18 and 30: The elements of the RDD are printed.
  • Lines 20 and 32: The number of partitions of the created RDD is retrieved with getNumPartitions().
  • Line 24: The number of slices is set to 8.
  • Line 26: An RDD is constructed from the collection using the parallelize() method. Here, the number of slices is set explicitly.
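
To see how the elements land in the partitions, a short follow-up sketch (continuing with the sc and collection from the example above) uses glom(), which gathers each partition into its own list:

rdd = sc.parallelize(collection, 2)

# glom() groups the elements of each partition into a list, making the
# partitioning produced by numSlices directly visible.
print(rdd.glom().collect())

# Two inner lists, one per partition, e.g.:
# [[('James', 'Smith', 'USA', 'CA'), ('Michael', 'Rose', 'USA', 'NY')],
#  [('Robert', 'Williams', 'USA', 'CA'), ('Maria', 'Jones', 'USA', 'FL')]]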
