What is the left anti join in PySpark?

The left anti join in PySpark returns only the rows from the left DataFrame that have no matching rows in the right DataFrame. The result contains only the left DataFrame's columns; no columns from the right DataFrame are included.
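Conceptually, a left row is kept only when its join key never appears on the right. A minimal plain-Python sketch of this behavior (the lists and variable names here are illustrative, not part of the PySpark API):

```python
# Plain-Python sketch of left-anti-join semantics (illustrative only).
left = [("a", 1), ("b", 2), ("c", 3)]   # (key, value) rows on the left
right = [("a", 10), ("b", 20)]          # (key, value) rows on the right

right_keys = {key for key, _ in right}

# Keep the left rows whose key has no match on the right;
# only the left-side columns survive.
anti = [row for row in left if row[0] not in right_keys]
print(anti)  # [('c', 3)]
```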

Syntax

DataFrame.join(<right_Dataframe>, on=None, how="leftanti")

OR

DataFrame.join(<right_Dataframe>, on=None, how="left_anti")

Parameters

The DataFrame on which join is called represents the left side (or left DataFrame) of the join operation.

  • <right_Dataframe> - This represents the right side (or right DataFrame) of the join operation.
  • on - A column name, or a list of column names, to join on.
  • how - The type of join operation. For a left anti join, this is "leftanti" or "left_anti".
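When on is a list of column names, a left row is excluded only if all of the listed columns match some right row, i.e. the columns act as a composite key. A hedged plain-Python sketch of that matching rule (the dicts and names below are made up for illustration):

```python
# Sketch of anti-join matching on a list of columns (composite key).
# Illustrative only; analogous to on=["dept", "num"] in DataFrame.join.
left = [
    {"dept": "cs", "num": 150, "student": "Sam"},
    {"dept": "cs", "num": 101, "student": "Gaby"},
]
right = [{"dept": "cs", "num": 150, "name": "Computer Science"}]

on = ["dept", "num"]
right_keys = {tuple(r[c] for c in on) for r in right}

# A left row is dropped only when every column in `on` matches a right row.
anti = [row for row in left if tuple(row[c] for c in on) not in right_keys]
print([row["student"] for row in anti])  # ['Gaby']
```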

Example

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('edpresso').getOrCreate()

# Left DataFrame: student records
data = [("Sam", "USA", "cs150", 23),
        ("Jolie", "UK", "mech421", 19),
        ("Gaby", "Canada", "botany456", 26),
        ("Celeste", "Australia", "cs150", 22)]
columns = ["student_name", "country", "course_id", "age"]
df_1 = spark.createDataFrame(data=data, schema=columns)

# Right DataFrame: course records
data = [("Computer Science", "cs150"),
        ("Mechanical Engineering", "mech421")]
columns = ["course_name", "course_id"]
df_2 = spark.createDataFrame(data=data, schema=columns)

# Keep only the students whose course_id has no match in df_2
df_left_anti = df_1.join(df_2, on="course_id", how="leftanti")
df_left_anti.show(truncate=False)

Explanation

  • We import pyspark and SparkSession.
  • We create a SparkSession with the application name edpresso.
  • We define the dummy data and the column names for the first DataFrame, then create the first Spark DataFrame, df_1, from them.
  • We create the second DataFrame, df_2, from the course data in the same way.
  • We apply the left anti join between df_1 and df_2 on the course_id column.
  • We display the output: only the rows from df_1 whose course_id does not appear in df_2 are returned.
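The example above can be cross-checked in plain Python: of the four students, only Gaby's course_id (botany456) is missing from df_2, so hers is the only row the left anti join returns.

```python
# Plain-Python cross-check of the PySpark example above.
students = [("Sam", "USA", "cs150", 23),
            ("Jolie", "UK", "mech421", 19),
            ("Gaby", "Canada", "botany456", 26),
            ("Celeste", "Australia", "cs150", 22)]
courses = [("Computer Science", "cs150"),
           ("Mechanical Engineering", "mech421")]

course_ids = {cid for _, cid in courses}

# Left anti join on course_id: keep students with no matching course.
result = [s for s in students if s[2] not in course_ids]
print(result)  # [('Gaby', 'Canada', 'botany456', 26)]
```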