How to drop duplicate columns in PySpark
Duplicate columns in a DataFrame increase its memory footprint and store redundant data. They can be dropped from a Spark DataFrame in two steps:
- Determine which columns are duplicates
- Drop those duplicate columns
Determining duplicate columns
Two columns are duplicates if they contain the same data, so the first step is to identify the list of duplicate columns.
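One straightforward way to find duplicates is to collect the DataFrame to the driver and compare column values in plain Python. Below is a minimal sketch, assuming the DataFrame is small enough to fit in driver memory; `find_duplicate_columns` is a hypothetical helper name, and the sample rows mimic what `df.collect()` would return:

```python
def find_duplicate_columns(rows, cols):
    """Return columns whose values match an earlier column.

    rows: list of dicts, e.g. [r.asDict() for r in df.collect()]
    cols: column names in order, e.g. df.columns
    """
    # Gather each column's values into a list for direct comparison.
    values = {c: [r[c] for r in rows] for c in cols}
    dups = []
    for i, c1 in enumerate(cols):
        if c1 in dups:
            continue  # already marked as a duplicate of an earlier column
        for c2 in cols[i + 1:]:
            if c2 not in dups and values[c1] == values[c2]:
                dups.append(c2)
    return dups

# Rows as they might come back from df.collect() on data like the example below.
rows = [
    {"firstname": "James", "firstname_dup": "James", "country": "USA"},
    {"firstname": "Maria", "firstname_dup": "Maria", "country": "Australia"},
]
print(find_duplicate_columns(rows, ["firstname", "firstname_dup", "country"]))
# ['firstname_dup']
```

With PySpark, this would be called as `find_duplicate_columns([r.asDict() for r in df.collect()], df.columns)`. Collecting every value is expensive on large DataFrames; there, a sampled or hash-based comparison per column would be preferable.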
Dropping duplicate columns
The drop() method can be used to drop one or more columns of a Spark DataFrame.
Alternatively, instead of dropping the duplicate columns, we can select only the non-duplicate columns.
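The select-based alternative boils down to building the list of columns to keep and passing it to `select()`. A minimal sketch, with hypothetical column names mirroring the example DataFrame below (the Spark call itself is shown as a comment so the snippet runs without a Spark session):

```python
# Hypothetical column lists for illustration.
all_cols = ["firstname", "firstname_dup", "lastname", "country"]
dup_cols = ["firstname_dup"]

# Columns to keep: everything not flagged as a duplicate,
# preserving the original column order.
keep_cols = [c for c in all_cols if c not in dup_cols]
print(keep_cols)  # ['firstname', 'lastname', 'country']

# With a real DataFrame, this list feeds straight into select():
# new_df = df.select(keep_cols)   # equivalent to df.drop(*dup_cols)
```

Both approaches produce the same result; `drop()` reads more naturally when the duplicate list is short, while `select()` makes the kept columns explicit.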
Note: To learn more about dropping columns, refer to how to drop multiple columns from a PySpark DataFrame.
Code example
Let’s look at the code below:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('edpresso').getOrCreate()

data = [("James", "James", "Smith", "USA", "CA", "USA"),
    ("Michael", "Michael", "Rose", "Russia", "Novogrod", "Russia"),
    ("Robert", "Robert", "Williams", "Canada", "Ontario", "Canada"),
    ("Maria", "Maria", "Jones", "Australia", "Perth", "Australia")]

columns = ["firstname", "firstname_dup", "lastname", "country", "state", "country_duplicate"]

df = spark.createDataFrame(data=data, schema=columns)

dup_cols = ["country_duplicate", "firstname_dup"]

new_df = df.drop(*dup_cols)

print("-" * 8)
print("Dataframe after removing the duplicate columns")
new_df.show(truncate=False)
Code explanation
- Lines 1-2: pyspark and SparkSession are imported.
- Line 4: A Spark session is created.
- Lines 6-13: A DataFrame with duplicate columns is created.
- Line 15: The list of duplicate columns is defined.
- Line 17: A new DataFrame without the duplicate columns is obtained by dropping them.
Copyright ©2025 Educative, Inc. All rights reserved