How to drop multiple columns from a PySpark DataFrame

Overview

The drop() method in PySpark removes one or more columns from a DataFrame.

Syntax

dataframe.drop(*cols)

Parameters

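cols - The names of the columns to remove, passed as separate arguments. To pass a list of names, unpack it with the * operator. Names that do not exist in the DataFrame are ignored.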
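
The star in the signature means the method takes a variable number of arguments. The sketch below illustrates this with plain Python and no Spark; drop_names is a hypothetical stand-in written for this example, not part of PySpark, and it only mimics how the arguments are received.

```python
def drop_names(all_names, *cols):
    """Hypothetical stand-in for drop(): keep the names not listed in cols."""
    return [name for name in all_names if name not in cols]


names = ["firstname", "lastname", "country", "state"]
cols_to_remove = ["country", "firstname"]

# The star spreads the list into separate arguments, so these two calls match.
unpacked = drop_names(names, *cols_to_remove)
explicit = drop_names(names, "country", "firstname")

# Without the star, the whole list arrives as one argument and no name matches it.
not_unpacked = drop_names(names, cols_to_remove)

print(unpacked)
print(not_unpacked)
```

The stand-in silently keeps every name when the star is forgotten. PySpark's drop() is stricter and typically rejects a list argument with a type error, so the star is needed when the column names are stored in a list.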

Return value

The method returns a new DataFrame without the specified columns. The original DataFrame is not modified.

Example

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('edpresso').getOrCreate()
data = [("James", "Smith", "USA", "CA"),
        ("Michael", "Rose", "USA", "NY"),
        ("Robert", "Williams", "USA", "CA"),
        ("Maria", "Jones", "USA", "FL")
        ]
columns = ["firstname", "lastname", "country", "state"]
df = spark.createDataFrame(data=data, schema=columns)
print("Initial dataframe")
df.show(truncate=False)
cols_to_remove = ["country", "firstname"]
new_df = df.drop(*cols_to_remove)
print("-" * 8)
print("Dataframe after removing the columns")
new_df.show(truncate=False)

Explanation

Line 3: A Spark session with the app name edpresso is created.
Lines 4–8: We define the data for the DataFrame.
Line 9: The column names of the DataFrame are defined.
Line 10: A DataFrame is created using the createDataFrame() method.
Lines 11–12: The initial DataFrame is printed.
Line 13: The names of the columns to be removed are stored in the cols_to_remove list.
Line 14: The columns are dropped by invoking the drop() method with cols_to_remove unpacked using the * operator. The result is stored in new_df.
Lines 15–17: A separator is printed, followed by the new DataFrame with the columns removed.
