How to drop duplicate columns in PySpark
Duplicate columns in a DataFrame increase its memory footprint and store redundant data. They can be dropped from a Spark DataFrame in two steps:
- Determine which columns are duplicates
- Drop those duplicate columns
Determining duplicate columns
Two columns are duplicates if they contain the same data, so the first step is to identify the list of duplicate columns.
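One straightforward way to find duplicates is to collect the DataFrame to the driver and compare column values in plain Python. Below is a minimal sketch, assuming the DataFrame is small enough to fit in driver memory; `find_duplicate_columns` is a hypothetical helper name, and the sample rows mimic what `df.collect()` would return:

```python
def find_duplicate_columns(rows, cols):
    """Return columns whose values match an earlier column.

    rows: list of dicts, e.g. [r.asDict() for r in df.collect()]
    cols: column names in order, e.g. df.columns
    """
    # Gather each column's values into a list for direct comparison.
    values = {c: [r[c] for r in rows] for c in cols}
    dups = []
    for i, c1 in enumerate(cols):
        if c1 in dups:
            continue  # already marked as a duplicate of an earlier column
        for c2 in cols[i + 1:]:
            if c2 not in dups and values[c1] == values[c2]:
                dups.append(c2)
    return dups

# Rows as they might come back from df.collect() on data like the example below.
rows = [
    {"firstname": "James", "firstname_dup": "James", "country": "USA"},
    {"firstname": "Maria", "firstname_dup": "Maria", "country": "Australia"},
]
print(find_duplicate_columns(rows, ["firstname", "firstname_dup", "country"]))
# ['firstname_dup']
```

With PySpark, this would be called as `find_duplicate_columns([r.asDict() for r in df.collect()], df.columns)`. Collecting every value is expensive on large DataFrames; there, a sampled or hash-based comparison per column would be preferable.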
Dropping duplicate columns
The drop() method can be used to drop one or more columns of a Spark DataFrame.
Alternatively, instead of dropping the duplicate columns, we can select only the non-duplicate columns.
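The select-based alternative boils down to building the list of columns to keep and passing it to `select()`. A minimal sketch, with hypothetical column names mirroring the example DataFrame below (the Spark call itself is shown as a comment so the snippet runs without a Spark session):

```python
# Hypothetical column lists for illustration.
all_cols = ["firstname", "firstname_dup", "lastname", "country"]
dup_cols = ["firstname_dup"]

# Columns to keep: everything not flagged as a duplicate,
# preserving the original column order.
keep_cols = [c for c in all_cols if c not in dup_cols]
print(keep_cols)  # ['firstname', 'lastname', 'country']

# With a real DataFrame, this list feeds straight into select():
# new_df = df.select(keep_cols)   # equivalent to df.drop(*dup_cols)
```

Both approaches produce the same result; `drop()` reads more naturally when the duplicate list is short, while `select()` makes the kept columns explicit.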
Note: To learn more about dropping columns, refer to how to drop multiple columns from a PySpark DataFrame.
Code example
Let’s look at the code below:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('edpresso').getOrCreate()

data = [("James", "James", "Smith", "USA", "CA", "USA"),
    ("Michael", "Michael", "Rose", "Russia", "Novogrod", "Russia"),
    ("Robert", "Robert", "Williams", "Canada", "Ontario", "Canada"),
    ("Maria", "Maria", "Jones", "Australia", "Perth", "Australia")]

columns = ["firstname", "firstname_dup", "lastname", "country", "state", "country_duplicate"]

df = spark.createDataFrame(data=data, schema=columns)

dup_cols = ["country_duplicate", "firstname_dup"]

new_df = df.drop(*dup_cols)

print("-" * 8)
print("Dataframe after removing the duplicate columns")
new_df.show(truncate=False)
Code explanation
- Lines 1-2: pyspark and SparkSession are imported.
- Line 4: A Spark session is created.
- Lines 6-13: A DataFrame with duplicate columns is created.
- Line 15: The list of duplicate columns is defined.
- Line 17: A new DataFrame without the duplicate columns is obtained by dropping them.
Copyright ©2025 Educative, Inc. All rights reserved