What is the left anti join in PySpark?

The left anti join in PySpark returns only the rows from the left DataFrame that have no matching rows in the right DataFrame. The result contains only the left DataFrame's columns; no columns from the right DataFrame are included.
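Conceptually, a left row is kept only when its join key never appears on the right. A minimal plain-Python sketch of this behavior (the lists and variable names here are illustrative, not part of the PySpark API):

```python
# Plain-Python sketch of left-anti-join semantics (illustrative only).
left = [("a", 1), ("b", 2), ("c", 3)]   # (key, value) rows on the left
right = [("a", 10), ("b", 20)]          # (key, value) rows on the right

right_keys = {key for key, _ in right}

# Keep the left rows whose key has no match on the right;
# only the left-side columns survive.
anti = [row for row in left if row[0] not in right_keys]
print(anti)  # [('c', 3)]
```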

Syntax

DataFrame.join(<right_Dataframe>, on=None, how="leftanti")

OR

DataFrame.join(<right_Dataframe>, on=None, how="left_anti")

Parameters

The DataFrame on which join is called represents the left side (or left DataFrame) of the join operation.

  • <right_Dataframe> - This represents the right side (or right DataFrame) of the join operation.
  • on - A column name, or a list of column names, to join on.
  • how - The type of join operation. For a left anti join, this is "leftanti" or "left_anti".
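When on is a list of column names, a left row is excluded only if all of the listed columns match some right row, i.e. the columns act as a composite key. A hedged plain-Python sketch of that matching rule (the dicts and names below are made up for illustration):

```python
# Sketch of anti-join matching on a list of columns (composite key).
# Illustrative only; analogous to on=["dept", "num"] in DataFrame.join.
left = [
    {"dept": "cs", "num": 150, "student": "Sam"},
    {"dept": "cs", "num": 101, "student": "Gaby"},
]
right = [{"dept": "cs", "num": 150, "name": "Computer Science"}]

on = ["dept", "num"]
right_keys = {tuple(r[c] for c in on) for r in right}

# A left row is dropped only when every column in `on` matches a right row.
anti = [row for row in left if tuple(row[c] for c in on) not in right_keys]
print([row["student"] for row in anti])  # ['Gaby']
```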

Example

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('edpresso').getOrCreate()

# Left DataFrame: student records
data = [("Sam", "USA", "cs150", 23),
        ("Jolie", "UK", "mech421", 19),
        ("Gaby", "Canada", "botany456", 26),
        ("Celeste", "Australia", "cs150", 22)]
columns = ["student_name", "country", "course_id", "age"]
df_1 = spark.createDataFrame(data=data, schema=columns)

# Right DataFrame: course records
data = [("Computer Science", "cs150"),
        ("Mechanical Engineering", "mech421")]
columns = ["course_name", "course_id"]
df_2 = spark.createDataFrame(data=data, schema=columns)

# Keep only the students whose course_id has no match in df_2
df_left_anti = df_1.join(df_2, on="course_id", how="leftanti")
df_left_anti.show(truncate=False)

Explanation

  • We import pyspark and SparkSession.
  • We create a SparkSession with the application name edpresso.
  • We define the dummy data and the column names for the first DataFrame, then create the first Spark DataFrame, df_1, from them.
  • We create the second DataFrame, df_2, from the course data in the same way.
  • We apply the left anti join between df_1 and df_2 on the course_id column.
  • We display the output: only the rows from df_1 whose course_id does not appear in df_2 are returned.
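The example above can be cross-checked in plain Python: of the four students, only Gaby's course_id (botany456) is missing from df_2, so hers is the only row the left anti join returns.

```python
# Plain-Python cross-check of the PySpark example above.
students = [("Sam", "USA", "cs150", 23),
            ("Jolie", "UK", "mech421", 19),
            ("Gaby", "Canada", "botany456", 26),
            ("Celeste", "Australia", "cs150", 22)]
courses = [("Computer Science", "cs150"),
           ("Mechanical Engineering", "mech421")]

course_ids = {cid for _, cid in courses}

# Left anti join on course_id: keep students with no matching course.
result = [s for s in students if s[2] not in course_ids]
print(result)  # [('Gaby', 'Canada', 'botany456', 26)]
```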