Trusted answers to developer questions

Related Tags

python
apache spark
pyspark

How to create a SparkSession on PySpark

Muhammad Muzammil


A SparkSession is the entry point to PySpark functionality: it provides access to the underlying features we need to programmatically create a PySpark Resilient Distributed Dataset (RDD) or DataFrame.

In a PySpark application, we can create as many SparkSession objects as we like by calling SparkSession.builder.getOrCreate() or SparkSession.newSession(). Multiple session objects are useful when we want to keep PySpark tables (which are relational entities) logically isolated from one another.

Create a SparkSession

To create a SparkSession in Python, we'll need the following:

  • The builder attribute constructs a new SparkSession.
  • The getOrCreate() method returns an existing SparkSession if there is one; otherwise, it creates a new session.
  • The appName() method sets the application name.
  • The master() method sets the master URL (for example, the address of the cluster to connect to). When running in standalone mode, we pass local[x], where x is an integer greater than 0. When we use RDDs, DataFrames, or Datasets, x sets the number of partitions to be created. Ideally, x should equal the number of CPU cores on the machine.

Code example

Let's look at the code below:

from pyspark.sql import SparkSession
from dotenv import load_dotenv
def create_spark_session():
    """Create a Spark Session"""
    _ = load_dotenv()
    return (
        SparkSession
        .builder
        .appName("SparkApp")
        .master("local[5]")
        .getOrCreate()
    )
spark = create_spark_session()
print('Session Started')
print('Code Executed Successfully')

Code explanation

  • Lines 1–2: We import SparkSession and load_dotenv.
  • Lines 3–12: We define a function that creates a PySpark session.
    • Line 3: We define the function.
    • Line 5: We load environment variables from a .env file.
    • Lines 6–12: We build and return the PySpark session.
      • Line 9: We set the name of the session.
      • Line 10: We run locally with 5 threads as logical cores on our machine.
  • Line 13: We call the function to create a PySpark session.
  • Lines 14–15: We print confirmation that the session started and the code executed successfully.


Copyright ©2022 Educative, Inc. All rights reserved
