How to drop multiple columns from a PySpark DataFrame
Overview
The drop() method in PySpark removes one or more columns from a DataFrame.
Syntax
dataframe.drop(*cols)
Parameters
cols: The names of the columns to be removed, passed as separate arguments.
Return value
The method returns a new DataFrame after deleting the specified columns.
Example
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('edpresso').getOrCreate()

data = [("James","Smith","USA","CA"),
    ("Michael","Rose","USA","NY"),
    ("Robert","Williams","USA","CA"),
    ("Maria","Jones","USA","FL")
]

columns = ["firstname","lastname","country","state"]
df = spark.createDataFrame(data = data, schema = columns)
print("Initial dataframe")
df.show(truncate=False)

cols_to_remove = ["country", "firstname"]

new_df = df.drop(*cols_to_remove)
print("-" * 8)
print("Dataframe after removing the columns")
new_df.show(truncate=False)
Explanation
- Line 4: A Spark session with the app name edpresso is created.
- Lines 6–10: We define data for the DataFrame.
- Line 12: The columns of the DataFrame are defined.
- Line 13: A DataFrame is created using the createDataFrame() method.
- Lines 14–15: The original or initial DataFrame is printed.
- Line 17: The columns to be removed are defined as cols_to_remove.
- Line 19: The columns are dropped by invoking the drop() method and passing the unpacked cols_to_remove list.
- Lines 20–22: The new DataFrame with the columns removed is printed.