How to write a DataFrame to a JSON file in PySpark
Apache Spark is an open-source, scalable, ultra-fast, distributed processing framework for large-scale processing and machine learning. It is primarily designed for large data workloads. With Spark, fast, analytical queries can be performed against data of any size while leveraging in-memory caching and optimized query execution.
Apache Spark distributes massive computing tasks to multiple nodes, where they are decomposed into smaller tasks that can be executed simultaneously.
Apache Spark supports Java, Scala, Python, and R APIs. PySpark is a Python interface to Apache Spark. In a distributed processing environment, PySpark allows you to manipulate and analyze data using Python and SQL-like commands.
Let’s walk through the following example showing how to extract a DataFrame from a CSV file and save it to a JSON file.
Code example
The following example aims to familiarize you with the following concepts:
Read a CSV file using PySpark and store the data exported into a DataFrame.
Convert the PySpark DataFrame into a pandas DataFrame.
Save the pandas DataFrame into a JSON file.
from pyspark.sql import SparkSessionimport pandas as pdspark = SparkSession.builder \.appName('demo') \.master("local") \.getOrCreate()spark.sparkContext.setLogLevel("ERROR")df = spark.read.load("test.csv",format="csv", sep=",",inferSchema="true",header="true")df.printSchema()df.show()pandas_df = df.toPandas()pandas_df.to_json('/usercode/output/test.json')
Let’s go through the code widget above:
Lines 1–2: We import the required Python libraries:
pysparkandpandas.Lines 4–7: We create a Spark session and set the name of the application to
'demo'and the master program,"local".Line 9: We change the log level to
ERRORto disableDEBUGorINFOmessages in the console.Line 10: Using the
spark.read.load()function, we read thetest.csvfile and load it into a DataFrame.Line 11: We print out the DataFrame schema using the
printSchema()function.Line 12: We invoke the
show()function to display the head of the DataFrame.Line 14: We convert the PySpark DataFrame into a pandas DataFrame.
Line 15: We call the method
to_json()to convert the pandas DataFrame to a JSON file.
Free Resources