What is the left anti join in PySpark?
The left anti join in PySpark uses the same join API as the other join types, but it returns only the rows from the left DataFrame that have no matching rows in the right DataFrame; the result contains only the left DataFrame's columns.
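The idea can be illustrated without Spark. This plain-Python sketch, using hypothetical toy data, keeps only the left-side records whose key does not appear on the right side, which is exactly what a left anti join does:

```python
# Toy data (hypothetical, not from the example below): (key, value) pairs.
left = [("a", 1), ("b", 2), ("c", 3)]
right_keys = {"a", "b"}  # join keys that exist on the right side

# A left anti join keeps the left rows whose key has NO match on the right.
anti = [(k, v) for k, v in left if k not in right_keys]
print(anti)  # [('c', 3)]
```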
Syntax
DataFrame.join(<right_Dataframe>, on=None, how="leftanti")
OR
DataFrame.join(<right_Dataframe>, on=None, how="left_anti")
Parameters
The DataFrame on which `join` is called represents the left side (or left DataFrame) of the join operation.

- `<right_Dataframe>`: The right side (or right DataFrame) of the join operation.
- `on`: A column name, or a list of column names, to join on.
- `how`: The type of join to perform; for a left anti join, pass either "leftanti" or "left_anti".
Example
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('edpresso').getOrCreate()

data = [("Sam", "USA", "cs150", 23),
        ("Jolie", "UK", "mech421", 19),
        ("Gaby", "Canada", "botany456", 26),
        ("Celeste", "Australia", "cs150", 22)]
columns = ["student_name", "country", "course_id", "age"]
df_1 = spark.createDataFrame(data=data, schema=columns)

data = [("Computer Science", "cs150"),
        ("Mechanical Engineering", "mech421")]

columns = ["course_name", "course_id"]
df_2 = spark.createDataFrame(data=data, schema=columns)

df_left_anti = df_1.join(df_2, on="course_id", how="leftanti")

df_left_anti.show(truncate=False)
Explanation
- Lines 1–2: We import `pyspark` and `SparkSession`.
- Line 4: We create a SparkSession with the application name `edpresso`.
- Lines 6–9: We define the dummy data for the first DataFrame.
- Line 10: We define the columns for the first DataFrame.
- Line 11: We create the first Spark DataFrame, `df_1`, with the dummy data in lines 6–9 and the columns in line 10.
- Lines 13–17: We create the second DataFrame, `df_2`.
- Line 19: We apply the left anti join between the `df_1` and `df_2` DataFrames.
- Line 21: We display the output.