How is a query written using the pyspark.sql module?

We can use a PySpark SparkSession to create DataFrames. These DataFrames can be registered as temporary views (tables), after which data can be fetched and added with SQL queries through the pyspark.sql module.

Syntax

filtered = session.sql(query)

The syntax above runs a SQL query on registered tables in PySpark. The variables in the code indicate the following:

  1. session: The SparkSession instance, which provides the sql() function.
  2. query: A string containing the SQL query, including the name(s) of the table(s) to be queried.
  3. filtered: The DataFrame returned by the query; it holds the result of applying the SQL query to the given table.
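
As a minimal, self-contained sketch of how these three pieces fit together (the view name fruits, the column names, and the filter below are our own illustrations, not part of the original example):

from pyspark.sql import SparkSession

# Build a session and register a view named "fruits" (hypothetical)
session = SparkSession.builder.master("local[1]").appName("SyntaxDemo").getOrCreate()
session.createDataFrame([("Apples", 10)], ["fruit", "quantity"]) \
    .createOrReplaceTempView("fruits")

query = "SELECT fruit FROM fruits WHERE quantity > 5"  # any valid Spark SQL string
filtered = session.sql(query)  # filtered is a new DataFrame holding the result
filtered.show()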

Code example

Let's look at the code below:

import pyspark
from pyspark.sql import SparkSession
print("PySpark version:", pyspark.__version__)
session = SparkSession.builder \
    .master("local[1]") \
    .appName("Pyspark") \
    .getOrCreate()
print(session)

# Creating a DataFrame (columns get the default names _1 and _2)
df = session.createDataFrame([("Apples", 10), ("Mangos", 20), ("Lemons", 3)])
df.show()
# Spark SQL query
df.createOrReplaceTempView("table1")
filtered = session.sql("SELECT _1 FROM table1")
filtered.show()  # displays the result of the SQL query

Code explanation

  • Lines 4-8: We create and print a SparkSession.
  • Line 11: We create a DataFrame with two columns and three rows. Since no column names are provided, Spark assigns the default names _1 and _2.
  • Line 14: We create a temporary view of the DataFrame. This view acts like a table against which SQL queries can be run.
  • Line 15: We run a SQL query that selects the first column (_1) from the temporary view.
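
As a further sketch extending the idea above, naming the columns when creating a DataFrame avoids the default _1/_2 names, and a single query string can reference several registered views at once. The view names fruits and prices, the price data, and the join below are our own illustrations, not part of the original example:

from pyspark.sql import SparkSession

session = SparkSession.builder.master("local[1]").appName("JoinDemo").getOrCreate()

# Explicit column names replace the default _1 and _2
fruits = session.createDataFrame(
    [("Apples", 10), ("Mangos", 20), ("Lemons", 3)], ["fruit", "quantity"])
prices = session.createDataFrame(
    [("Apples", 1.5), ("Mangos", 2.0), ("Lemons", 0.5)], ["fruit", "price"])

fruits.createOrReplaceTempView("fruits")
prices.createOrReplaceTempView("prices")

# One SQL string can query, join, and filter multiple views
session.sql("""
    SELECT f.fruit, f.quantity * p.price AS total_cost
    FROM fruits f
    JOIN prices p ON f.fruit = p.fruit
    WHERE f.quantity > 5
""").show()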
