The left anti join in PySpark is similar to a regular join, but it returns only the columns of the left DataFrame, and only for the records that have no match in the right DataFrame.
DataFrame.join(<right_Dataframe>, on=None, how="leftanti")
OR
DataFrame.join(<right_Dataframe>, on=None, how="left_anti")
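In the syntax above, the on argument can also be a join expression (a Column) instead of a column name, which is useful when the key columns on the two sides have different names; how can equally be passed positionally. The snippet below is a minimal, self-contained sketch with made-up DataFrames (emp, dept, and their columns are illustrative and not part of the example further down):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("left_anti_sketch").getOrCreate()

emp = spark.createDataFrame([(1, "Ann"), (2, "Bob")], ["dept_id", "emp_name"])
dept = spark.createDataFrame([(1, "HR")], ["id", "dept_name"])

# Keep only the employees whose dept_id has no matching id in dept (here: Bob).
no_dept = emp.join(dept, emp["dept_id"] == dept["id"], "left_anti")
no_dept.show()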
The DataFrame above represents the left side (or left DataFrame) of the join operation.

<right_Dataframe> - It represents the right side (or right DataFrame) of the join operation.
on - The column name or a list of column names to join on.
how - This indicates the type of the join operation. As we perform a left anti join, this value is set to "leftanti" or "left_anti".

The code below demonstrates the left anti join:

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('edpresso').getOrCreate()

data = [("Sam", "USA", "cs150", 23),
        ("Jolie", "UK", "mech421", 19),
        ("Gaby", "Canada", "botany456", 26),
        ("Celeste", "Australia", "cs150", 22)]

columns = ["student_name", "country", "course_id", "age"]
df_1 = spark.createDataFrame(data=data, schema=columns)

data = [("Computer Science", "cs150"),
        ("Mechanical Engineering", "mech421")]
columns = ["course_name", "course_id"]
df_2 = spark.createDataFrame(data=data, schema=columns)

df_left_anti = df_1.join(df_2, on="course_id", how="leftanti")
df_left_anti.show(truncate=False)
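If you run the example above, only one record survives the join: Gaby's, since botany456 is the only course_id in df_1 with no match in df_2. As a quick sanity check, here is a small sketch that assumes the example above has already been run in the same session:

# Collect the anti-join result and confirm that only the unmatched student remains.
rows = df_left_anti.collect()
assert [row["student_name"] for row in rows] == ["Gaby"]
assert df_left_anti.count() == 1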
In the main example above:

Lines 1–2: We import pyspark and SparkSession.
Line 4: A Spark session is created with the application name edpresso.
Lines 6–12: We create df_1 with the dummy data in lines 6–9 and the columns in line 11.
Lines 14–17: Similarly, df_2 is created.
Line 19: A left anti join is performed on the df_1 and df_2 datasets.
Line 20: The result is displayed.
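A left anti join keeps exactly those left-side rows that a left outer join would leave with nulls on the right side. The sketch below (assuming the df_1 and df_2 DataFrames from the example above, and relying on course_name never being null in df_2) produces the same rows in an equivalent way:

from pyspark.sql import functions as F

# Left outer join, keep the rows whose right-side column stayed null (no match),
# then drop the right-side column to mirror the left anti join result.
equivalent = (
    df_1.join(df_2, on="course_id", how="left")
        .where(F.col("course_name").isNull())
        .select("course_id", "student_name", "country", "age")
)
equivalent.show(truncate=False)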