Introduction to Data Input and Output

Learn about the data input and output content flow.

Overview

One of the most common tasks in a data analysis project is reading the data and writing it back in a new form, either as a final format or as an updated version of the raw data.

Data input and output flow

The flow of data input and output is as follows:

  • Read the data into pandas and PySpark DataFrames.
  • Rename the columns of the DataFrame using the withColumnRenamed method.
  • Select only the subset of columns relevant to our analysis.
  • Write the data back to disk as a distributed dataset across multiple files, using partitioning and sorting for better performance (a minimal sketch of this flow follows the list).
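
To make these steps concrete, here is a minimal sketch of the full flow in PySpark. The file paths and column names (event_time, user_id, event_type) are placeholders for illustration, not the actual schema of the course dataset.

from pyspark.sql import SparkSession

# Create a local session for this sketch (session creation is covered below).
spark = SparkSession.builder.appName("IOFlowSketch").master("local[5]").getOrCreate()

# 1. Read the raw JSON data into a PySpark DataFrame.
df = spark.read.json("data/raw_events.json")

# 2. Rename a column to something more descriptive.
df = df.withColumnRenamed("ts", "event_time")

# 3. Select only the columns relevant to the analysis.
df = df.select("event_time", "user_id", "event_type")

# 4. Write the data back to disk as a partitioned dataset, sorted within partitions.
(
    df.sortWithinPartitions("event_time")
    .write
    .partitionBy("event_type")
    .mode("overwrite")
    .parquet("data/processed_events")
)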

How we read a dataset depends on the data source provided, and we might need some preprocessing before reading the data into a PySpark DataFrame. The input source might arrive in CSV, JSON, or Parquet format. The dataset we’re using in this course is in JSON format, so we’ll focus on reading JSON data.
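
For reference, each of these formats has a dedicated reader on spark.read. The paths and options below are illustrative only, assuming an active SparkSession named spark (like the one we create later in this lesson).

# Each reader returns a PySpark DataFrame; the paths are placeholders.
csv_df = spark.read.option("header", True).csv("data/input.csv")
json_df = spark.read.json("data/input.json")          # expects one JSON object per line by default
parquet_df = spark.read.parquet("data/input.parquet")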

We can also read data from databases into a PySpark DataFrame using database-specific JDBC or ODBC drivers.
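
As a rough sketch, a JDBC read looks like the following. The URL, table name, credentials, and driver class are placeholders that depend on your database, and the matching JDBC driver JAR must be available to Spark.

# Sketch of a JDBC read; all connection details below are placeholders.
jdbc_df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/analytics")
    .option("dbtable", "public.events")
    .option("user", "reader")
    .option("password", "secret")
    .option("driver", "org.postgresql.Driver")
    .load()
)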

To run PySpark, we first need to initialize its session. Let’s take a quick look at how to do that.

Create a PySpark session

We’ll write code to initialize the environment.

Code for creating PySpark session

Let’s write the code to initialize the environment. In the code below, the create_spark_session function creates a local Spark session that runs with five worker threads, which Spark uses to execute PySpark tasks.

from pyspark.sql import SparkSession
from dotenv import load_dotenv
def create_spark_session():
    """Create a Spark Session"""
    _ = load_dotenv()
    return (
        SparkSession
        .builder
        .appName("SparkApp")
        .master("local[5]")
        .getOrCreate()
    )
spark = create_spark_session()
print('Session Started')
print('Code Executed Successfully')
Create PySpark session

Explanation

  • Lines 1–2: We import SparkSession, which we use to create a PySpark session, and load_dotenv, which we use to load environment variables.

  • Lines 3–12: We make a function to create a PySpark session, create_spark_session.

    • Line 3: We define the function.
    • Line 5: We load the environment variables from a .env file using load_dotenv.
    • Lines 6–12: We return a PySpark session.
      • Line 9: We set the application name of the session to "SparkApp".
      • Line 10: We run Spark locally with five worker threads by passing "local[5]" to the master method.
  • Line 13: We call the function to create a PySpark session.

  • Lines 14–15: We print messages confirming that the session has started and that the code executed successfully.

After a successful code execution, we’ll see the message “Code Executed Successfully” in the terminal.
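
Once the session is running, the spark object can be used right away for the reading step described in the overview. Here is a quick sketch; the path is a placeholder for the course’s JSON dataset.

# Read the JSON dataset with the session created above; the path is a placeholder.
df = spark.read.json("data/dataset.json")
df.printSchema()  # inspect the inferred schema
df.show(5)        # preview the first five rows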