How to write a DataFrame to a JSON file in PySpark

Apache Spark is an open-source, scalable, ultra-fast, distributed processing framework for large-scale processing and machine learning. It is primarily designed for large data workloads. With Spark, fast, analytical queries can be performed against data of any size while leveraging in-memory caching and optimized query execution.

Apache Spark distributes massive computing tasks to multiple nodes, where they are decomposed into smaller tasks that can be executed simultaneously.

Apache Spark supports Java, Scala, Python, and R APIs. PySpark is a Python interface to Apache Spark. In a distributed processing environment, PySpark allows you to manipulate and analyze data using Python and SQL-like commands.

Let’s walk through the following example showing how to extract a DataFrame from a CSV file and save it to a JSON file.

Code example

The following example aims to familiarize you with the following concepts:

Read a CSV file using PySpark and store the data exported into a DataFrame.
Convert the PySpark DataFrame into a pandas DataFrame.
Save the pandas DataFrame into a JSON file.

Let’s go through the code widget above:

Lines 1–2: We import the required Python libraries: pyspark and pandas.
Lines 4–7: We create a Spark session and set the name of the application to 'demo' and the master program, "local".
Line 9: We change the log level to ERROR to disable DEBUG or INFO messages in the console.
Line 10: Using the spark.read.load() function, we read the test.csv file and load it into a DataFrame.
Line 11: We print out the DataFrame schema using the printSchema() function.
Line 12: We invoke the show() function to display the head of the DataFrame.
Line 14: We convert the PySpark DataFrame into a pandas DataFrame.
Line 15: We call the method to_json() to convert the pandas DataFrame to a JSON file.