We can use a PySpark session, an instance of the SparkSession class, to generate DataFrames. These DataFrames can be registered as tables (temporary views), after which data can be fetched and added through SQL queries using the pyspark.sql module.
filtered = session.sql(query)
The syntax above performs an SQL query on a table using PySpark. The variables in the code indicate the following:

session: This is the instance of SparkSession, which provides the sql() function.
query: This is a string-based SQL query. It contains the name of the table(s) to which we apply the SQL query.
filtered: This is the resulting DataFrame after applying the SQL query to the given table.

Let's look at the code below:
import pyspark
from pyspark.sql import SparkSession

print("PySpark version:", pyspark.__version__)

# Create a local SparkSession
session = SparkSession.builder \
    .master("local[1]") \
    .appName("Pyspark") \
    .getOrCreate()
print(session)

# Creating a DataFrame
df = session.createDataFrame([("Apples", 10), ("Mangos", 20), ("Lemons", 3)])
df.show()

# Spark SQL query: register the DataFrame as a temporary view,
# then apply an SQL command to it
df.createOrReplaceTempView("table1")
filtered = session.sql("SELECT _1 FROM table1")  # select the first column
filtered.show()
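Since no column names were given, Spark auto-generates positional names, so filtered.show() prints a single _1 column containing Apples, Mangos, and Lemons. As a brief follow-up sketch, we can also name the columns explicitly and filter rows with a WHERE clause; the table2 view and the fruit/quantity column names below are illustrative assumptions, not part of the original example:

# Hypothetical follow-up (assumes the session from above); the names
# "table2", "fruit", and "quantity" are illustrative.
df2 = session.createDataFrame(
    [("Apples", 10), ("Mangos", 20), ("Lemons", 3)],
    ["fruit", "quantity"],  # explicit column names instead of _1, _2
)
df2.createOrReplaceTempView("table2")

# Fetch only the rows matching a condition
result = session.sql("SELECT fruit, quantity FROM table2 WHERE quantity > 5")
result.show()

Naming the columns explicitly keeps the SQL readable and avoids relying on positional names like _1.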