How to save a PySpark DataFrame to a CSV file
The df.write.csv() method writes a DataFrame to a CSV file. Options for the write operation can be specified via the df.write.option() method.
Syntax
df.write.option("option_name", "option_value").csv(file_path)
Parameter
file_path: The path where the CSV file is to be created.
Example
import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('answer').getOrCreate()
data = [("James","Educative","Engg","USA"),
("Michael","Google",None,"Asia"),
("Robert",None,"Marketing","Russia"),
("Maria","Netflix","Finance","Ukraine"),
(None, None, None, None)
]
columns = ["emp name","company","department","country"]
df = spark.createDataFrame(data=data, schema=columns)
csv_file_path = "data.csv"
df.write.option("header", True).option("delimiter",",").csv(csv_file_path)
Follow the instructions below to inspect the generated CSV file:
- Use the ls command to view the data.csv directory.
- Use the cd data.csv command to move into the directory.
- Use the ls command to view the generated .csv file.
- To inspect the data contained in the generated file, use the cat command.
- Use the cat *.csv syntax. The * sign stands in for the filename with a .csv extension. We may also copy and paste the exact filename here.
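Note that Spark writes the output as a directory named data.csv containing one part file per partition, which is why the steps above treat data.csv as a directory. The shell sketch below mimics that layout with a mock part file (the part filename is illustrative; Spark generates its own names) so the ls and cat commands can be tried without running Spark:

```shell
# Create a mock "data.csv" directory like the one Spark produces.
# The part filename below is illustrative; Spark generates its own names.
mkdir -p data.csv
printf 'emp name,company,department,country\nJames,Educative,Engg,USA\n' \
  > data.csv/part-00000-example.csv

ls data.csv          # lists the part file(s) inside the directory
cat data.csv/*.csv   # prints the CSV header and rows
```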
Explanation
- Lines 1–2: The pyspark module and SparkSession are imported.
- Line 4: We create a SparkSession with the application name answer.
- Lines 6–11: We define the dummy data for the DataFrame.
- Line 13: We define the column names for the dummy data.
- Line 14: We create a Spark DataFrame from the dummy data defined above.
- Line 16: We define the path where the CSV file is to be generated.
- Line 17: The DataFrame is written to a CSV file by invoking the write.csv() method on the DataFrame object.
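Because write.csv() produces a directory of part files rather than a single file, a common follow-up is to combine the parts. Within Spark itself, df.coalesce(1).write.csv(...) yields a single part file. Alternatively, the parts can be merged after the fact; below is a minimal standard-library sketch of that approach (the merge_part_files helper and the mock part filenames are our own for illustration, not part of PySpark):

```python
import csv
import glob
import os

def merge_part_files(part_dir, out_path):
    """Concatenate all part-*.csv files in part_dir into one CSV,
    keeping the header from the first part only."""
    header = None
    rows = []
    for path in sorted(glob.glob(os.path.join(part_dir, "part-*.csv"))):
        with open(path, newline="") as f:
            reader = csv.reader(f)
            part_header = next(reader)
            if header is None:
                header = part_header
            rows.extend(reader)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

# Demo with mock part files (PySpark would normally create these):
os.makedirs("data.csv", exist_ok=True)
with open("data.csv/part-00000.csv", "w", newline="") as f:
    f.write("emp name,company,department,country\nJames,Educative,Engg,USA\n")
with open("data.csv/part-00001.csv", "w", newline="") as f:
    f.write("emp name,company,department,country\nMaria,Netflix,Finance,Ukraine\n")

merge_part_files("data.csv", "merged.csv")
```

Each part file repeats the header when the header option is set, so the helper keeps only the first header it sees.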
Copyright ©2026 Educative, Inc. All rights reserved