How to create a SparkSession in PySpark
A SparkSession is the entry point to PySpark: it provides access to the underlying Spark functionality for programmatically creating a Resilient Distributed Dataset (RDD) and a DataFrame.
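For instance, once a session exists, both abstractions are a single call away. The following is a minimal sketch, assuming PySpark is installed locally (the app name and sample data here are our own):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DemoApp").master("local[2]").getOrCreate()

# Create a DataFrame from in-memory rows
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()

# Create an RDD through the session's underlying SparkContext
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
print(rdd.collect())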
In a PySpark application, we can create as many SparkSession objects as we like through `SparkSession.builder` or by calling `newSession()` on an existing session. Multiple session objects are useful when we want to keep PySpark tables (which are relational entities) logically isolated from one another, as the sketch below shows.
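Here is a minimal sketch of that isolation, assuming PySpark is installed locally (the app name, view name, and sample data are our own): a temporary view registered through one session is invisible to a sibling session created with `newSession()`.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("IsolationDemo").master("local[2]").getOrCreate()
other = spark.newSession()  # shares the SparkContext, but not temporary views

# Register a temporary view through the first session
spark.createDataFrame([(1,)], ["id"]).createOrReplaceTempView("people")

# The view appears in the first session's catalog only
print([t.name for t in spark.catalog.listTables()])  # ['people']
print([t.name for t in other.catalog.listTables()])  # []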
Create a SparkSession
To create a SparkSession in Python, we'll need the following methods:
- The `builder` attribute constructs a `SparkSession`. (In PySpark, `builder` is an attribute of `SparkSession`, not a method.)
- The `getOrCreate()` method returns an existing `SparkSession` if there is one; otherwise, it creates a new session (see the sketch after this list).
- The `appName()` method sets the application name.
- The `master()` method takes the master URL as an argument (when run on a cluster). When running in standalone mode, we use `local[x]`, where `x` is an integer greater than 0. It determines how many worker threads, and by default how many partitions, are created when we utilize RDDs, DataFrames, and Datasets. Ideally, `x` should match the number of CPU cores.
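To see `getOrCreate()` in action, here is a small sketch (the app name is our own) showing that a second call reuses the session created by the first:

from pyspark.sql import SparkSession

first = SparkSession.builder.appName("FirstApp").master("local[2]").getOrCreate()
second = SparkSession.builder.getOrCreate()  # no new session is created

print(first is second)  # True: the existing session is reused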
Code example
Let's look at the code below:
from pyspark.sql import SparkSession
from dotenv import load_dotenv
def create_spark_session():
"""Create a Spark Session"""
_ = load_dotenv()
return (
SparkSession
.builder
.appName("SparkApp")
.master("local[5]")
.getOrCreate()
)
spark = create_spark_session()
print('Session Started')
print('Code Executed Successfully')

Creating a SparkSession
Code explanation
- Lines 1–2: We import `SparkSession` from PySpark and `load_dotenv` from the python-dotenv package.
- Lines 3–12: We define a function that creates a PySpark session.
  - Line 3: We define the function.
  - Line 5: We load environment variables from a .env file, if one exists.
  - Lines 6–12: We build and return the PySpark session.
  - Line 9: We set the application name.
  - Line 10: We run Spark locally with 5 worker threads, which should ideally match the number of logical cores on our machine.
- Line 13: We call the function to create a PySpark session.
- Lines 14–15: We print that the session has started and that the code executed successfully.
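As a follow-up sketch that reuses the `spark` object created above, we can verify the session and release its resources once we're done:

print(spark.version)              # the Spark version backing the session
print(spark.sparkContext.master)  # 'local[5]' in this example

spark.stop()  # shut the session down cleanly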