PySpark DataFrames
Learn the PySpark DataFrame API and its basic operations.
PySpark DataFrames
A PySpark DataFrame is a distributed collection of data organized into named columns, similar to a table in a relational database or a DataFrame in R/Python. PySpark DataFrames are an abstraction on top of RDDs and provide a more concise and efficient way to handle structured data. Not only are they easy to understand, but their operations are also optimized compared to RDDs, because DataFrame queries run through Spark's built-in query optimizer (Catalyst). DataFrames are immutable, which means that any transformation operation on a DataFrame creates a new DataFrame.
PySpark DataFrames support a wide range of operations, such as filtering, grouping, joining, and aggregation, making it easier to perform complex data operations. They can be manipulated both through SQL queries and through DataFrame expression methods. PySpark DataFrames are implemented in the pyspark.sql module, which provides the DataFrame class.
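To make these ideas concrete, here is a minimal sketch (the column names, sample rows, and app name are made up for this example; DataFrame creation itself is covered in the next part and is used here only to have something to query). It filters and aggregates with expression methods, runs an equivalent SQL query, and shows that the original DataFrame is unchanged by the transformation:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("dataframe_demo").getOrCreate()

# A small DataFrame with made-up columns for illustration.
people = spark.createDataFrame(
    [("Alice", "HR", 34), ("Bob", "IT", 45), ("Charlie", "IT", 29)],
    ["name", "dept", "age"],
)

# Expression-method style: filter, group, and aggregate.
it_avg = people.filter(people.dept == "IT") \
               .groupBy("dept") \
               .agg(F.avg("age").alias("avg_age"))
it_avg.show()

# SQL style: register a temporary view and query it.
people.createOrReplaceTempView("people")
spark.sql(
    "SELECT dept, AVG(age) AS avg_age FROM people WHERE dept = 'IT' GROUP BY dept"
).show()

# Transformations return new DataFrames; `people` itself is unchanged.
print(people.count())  # still 3 rows
```

Both styles are compiled into the same optimized execution plan, so the choice between SQL and expression methods is largely a matter of readability.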
Creating PySpark DataFrames
To use PySpark DataFrames, we first need a SparkSession object, which is the entry point to PySpark SQL. In the PySpark shell, a SparkSession is created automatically and is available as spark; otherwise, we create one ourselves. There are three methods available for creating PySpark DataFrames:
From existing RDDs
To create a PySpark DataFrame from an existing RDD, we can use the createDataFrame() method provided by the SparkSession object. This method allows us to pass an RDD along with the schema (column names) to create the DataFrame.
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pyspark_sql").getOrCreate()

print("Create a sample RDD")
rdd = spark.sparkContext.parallelize([(1, "Alice"), (2, "Bob"), (3, "Charlie")])

print("Create a PySpark DataFrame from RDD")
df = spark.createDataFrame(rdd, ["id", "name"])

print("Print the contents of the DataFrame")
df.show()
```
Let’s understand the code above:
- Line 1: Import the SparkSession class from the pyspark.sql module.
- Line 2: Create a SparkSession with the name "pyspark_sql" using the builder pattern and the getOrCreate() method.
- Line 5: Use the parallelize() method of the sparkContext attribute of the SparkSession to parallelize a Python list of tuples, [(1, "Alice"), (2, "Bob"), (3, "Charlie")], into an RDD.
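As a variation on the snippet above (a minimal sketch; the field types chosen here are assumptions for this example), the schema passed to createDataFrame() can also be an explicit StructType instead of a plain list of column names, which fixes the column types rather than leaving Spark to infer them:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("pyspark_sql").getOrCreate()

rdd = spark.sparkContext.parallelize([(1, "Alice"), (2, "Bob"), (3, "Charlie")])

# Explicit schema: column names plus types, instead of just ["id", "name"].
schema = StructType([
    StructField("id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
])

df = spark.createDataFrame(rdd, schema)
df.printSchema()
df.show()
```

Supplying the schema explicitly avoids the extra pass Spark would otherwise make over the data to infer column types, and it surfaces type mismatches early.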