How to create a SparkSession in PySpark
A SparkSession is the entry point to PySpark: it provides access to the underlying Spark functionality for programmatically creating a Resilient Distributed Dataset (RDD) and a DataFrame.
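For instance, once a session exists, both abstractions are a single call away. The following is a minimal sketch, assuming PySpark is installed locally (the app name and sample data here are our own):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("DemoApp").master("local[2]").getOrCreate()

# Create a DataFrame from in-memory rows
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "letter"])
df.show()

# Create an RDD through the session's underlying SparkContext
rdd = spark.sparkContext.parallelize([1, 2, 3, 4])
print(rdd.collect())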
In a PySpark application, we can create as many SparkSession objects as we like through `SparkSession.builder` or by calling `newSession()` on an existing session. Multiple session objects are useful when we want to keep PySpark tables (which are relational entities) logically isolated from one another, as the sketch below shows.
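Here is a minimal sketch of that isolation, assuming PySpark is installed locally (the app name, view name, and sample data are our own): a temporary view registered through one session is invisible to a sibling session created with `newSession()`.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("IsolationDemo").master("local[2]").getOrCreate()
other = spark.newSession()  # shares the SparkContext, but not temporary views

# Register a temporary view through the first session
spark.createDataFrame([(1,)], ["id"]).createOrReplaceTempView("people")

# The view appears in the first session's catalog only
print([t.name for t in spark.catalog.listTables()])  # ['people']
print([t.name for t in other.catalog.listTables()])  # []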
Create a SparkSession
To create a SparkSession in Python, we'll need the following methods:
- The `builder` attribute constructs a `SparkSession`. (In PySpark, `builder` is an attribute of `SparkSession`, not a method.)
- The `getOrCreate()` method returns an existing `SparkSession` if there is one; otherwise, it creates a new session (see the sketch after this list).
- The `appName()` method sets the application name.
- The `master()` method takes the master URL as an argument (when run on a cluster). When running in standalone mode, we use `local[x]`, where `x` is an integer greater than 0. It determines how many worker threads, and by default how many partitions, are created when we utilize RDDs, DataFrames, and Datasets. Ideally, `x` should match the number of CPU cores.
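To see `getOrCreate()` in action, here is a small sketch (the app name is our own) showing that a second call reuses the session created by the first:

from pyspark.sql import SparkSession

first = SparkSession.builder.appName("FirstApp").master("local[2]").getOrCreate()
second = SparkSession.builder.getOrCreate()  # no new session is created

print(first is second)  # True: the existing session is reused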
Code example
Let's look at the code below:
from pyspark.sql import SparkSession
from dotenv import load_dotenv
def create_spark_session():
"""Create a Spark Session"""
_ = load_dotenv()
return (
SparkSession
.builder
.appName("SparkApp")
.master("local[5]")
.getOrCreate()
)
spark = create_spark_session()
print('Session Started')
print('Code Executed Successfully')

Creating a SparkSession
Code explanation
- Lines 1–2: We import `SparkSession` from PySpark and `load_dotenv` from the python-dotenv package.
- Lines 3–12: We define a function that creates a PySpark session.
  - Line 3: We define the function.
  - Line 5: We load environment variables from a .env file, if one exists.
  - Lines 6–12: We build and return the PySpark session.
  - Line 9: We set the application name.
  - Line 10: We run Spark locally with 5 worker threads, which should ideally match the number of logical cores on our machine.
- Line 13: We call the function to create a PySpark session.
- Lines 14–15: We print that the session has started and that the code executed successfully.
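As a follow-up sketch that reuses the `spark` object created above, we can verify the session and release its resources once we're done:

print(spark.version)              # the Spark version backing the session
print(spark.sparkContext.master)  # 'local[5]' in this example

spark.stop()  # shut the session down cleanly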