
Related Tags

pyspark
left anti join
python
community creator

What is the left anti join in PySpark?

Abhilash

The left anti join in PySpark works like an ordinary join, but it returns only the rows of the left DataFrame that have no matching rows in the right DataFrame, and the result contains only the left DataFrame's columns.
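
As a quick illustration, if the left DataFrame holds ids 1, 2, and 3 and the right DataFrame holds ids 2, 3, and 4, a left anti join on id keeps only the left row with id 1. A minimal sketch (the application name and table contents are made up for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('left_anti_sketch').getOrCreate()

left = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "left_val"])
right = spark.createDataFrame([(2, "x"), (3, "y"), (4, "z")], ["id", "right_val"])

# Keeps only the left rows whose id has no match in right: here, only id 1.
left.join(right, on="id", how="leftanti").show()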

Syntax

DataFrame.join(<right_Dataframe>, on=None, how="leftanti")

OR

DataFrame.join(<right_Dataframe>, on=None, how="left_anti")

Parameters

The DataFrame on which join() is invoked acts as the left side (or left DataFrame) of the join operation.

  • <right_Dataframe> - The right side (or right DataFrame) of the join operation.
  • on - The column name, or a list of column names, to join on; a column expression also works when the key columns are named differently (see the sketch after this list).
  • how - The type of join operation. For a left anti join, pass "leftanti" or "left_anti".
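
When the key column has the same name in both DataFrames, on can simply be that name (or a list of names). When the key columns are named differently, on can be a column expression instead. A minimal, self-contained sketch (the DataFrames and the cid column name are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('left_anti_on_expr').getOrCreate()

students = spark.createDataFrame([("Sam", "cs150"), ("Gaby", "botany456")],
                                 ["student_name", "course_id"])
courses = spark.createDataFrame([("Computer Science", "cs150")],
                                ["course_name", "cid"])

# The join condition is a column expression because the key columns have different names;
# "left_anti" and "leftanti" are interchangeable.
unmatched = students.join(courses, students["course_id"] == courses["cid"], "left_anti")
unmatched.show(truncate=False)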

Example

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('edpresso').getOrCreate()

data = [("Sam", "USA", "cs150", 23),
        ("Jolie", "UK", "mech421", 19),
        ("Gaby", "Canada", "botany456", 26),
        ("Celeste", "Australia", "cs150", 22)]
columns = ["student_name", "country", "course_id", "age"]
df_1 = spark.createDataFrame(data=data, schema=columns)

data = [("Computer Science", "cs150"),
        ("Mechanical Engineering", "mech421")
       ]
columns = ["course_name", "course_id"]
df_2 = spark.createDataFrame(data=data, schema=columns)

df_left_anti = df_1.join(df_2, on="course_id", how="leftanti")

df_left_anti.show(truncate=False)
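
Only botany456 appears in df_1 without a matching course_id in df_2, so the result should contain just Gaby's row. A quick sanity check in the same session (a sketch that reuses the DataFrames above):

# Exactly one row should survive the left anti join: Gaby, course_id "botany456".
rows = df_left_anti.collect()
assert [row["course_id"] for row in rows] == ["botany456"]
assert rows[0]["student_name"] == "Gaby"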

Explanation

  • Lines 1–2: We import pyspark and SparkSession.
  • Line 4: We create a SparkSession with the application name edpresso.
  • Lines 6–9: We define the dummy data for the first DataFrame.
  • Line 10: We define the columns for the first DataFrame.
  • Line 11: We create the first Spark DataFrame, df_1, from the dummy data in lines 6–9 and the columns in line 10.
  • Lines 13–17: We create the second DataFrame, df_2, in the same way.
  • Line 19: We apply the left anti join between df_1 and df_2.
  • Line 21: We display the output.
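
For comparison, the same non-matched rows can be produced with a left outer join followed by a null filter on a right-side column; a sketch that reuses df_1 and df_2 from the example above:

from pyspark.sql import functions as F

# A left outer join keeps every df_1 row; non-matched rows get NULL in df_2's columns.
df_no_match = (df_1.join(df_2, on="course_id", how="left")
                   .filter(F.col("course_name").isNull())
                   .select("student_name", "country", "course_id", "age"))
df_no_match.show(truncate=False)

The dedicated left anti join is usually the cleaner choice, since it never brings the right DataFrame's columns into the result in the first place.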
