Read Data into DataFrame
Explore methods to read JSON data into Pandas and PySpark DataFrames, including custom and built-in functions. Understand the memory management differences between Pandas and PySpark, and learn how PySpark handles schema inference and distributed data. This lesson prepares you to efficiently load and process datasets using both libraries.
Read data in pandas
There are many ways to read data in pandas, but in this lesson, we’ll focus on the following two ways:
- Read with custom code, then convert it to a pandas DataFrame.
- Read with a built-in pandas function.
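As a rough preview, the two approaches might look like the sketch below on a tiny sample file. It assumes the data is stored as JSON Lines (one JSON object per line), which is what a line-by-line reader expects. The sample records, the field names `item` and `rating`, and the temporary file are hypothetical and only serve the illustration; they are not taken from the lesson's dataset.

```python
import json
import os
import tempfile

import pandas as pd

# Hypothetical sample records; the field names are illustrative only.
records = [
    {"item": "robot", "rating": 5},
    {"item": "puzzle", "rating": 3},
]

# Write the records as JSON Lines: one JSON object per line.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
    sample_path = f.name

# Approach 1: parse each line with custom code, then build the DataFrame.
with open(sample_path, "r") as f:
    custom_pdf = pd.DataFrame([json.loads(line) for line in f])

# Approach 2: let pandas parse the file; lines=True selects JSON Lines mode.
builtin_pdf = pd.read_json(sample_path, lines=True)

# Clean up the temporary sample file.
os.remove(sample_path)

print(custom_pdf.shape, builtin_pdf.shape)
```

Both calls should produce a DataFrame with one row per line of the file. The custom version leaves room for per-line handling, such as a progress bar or skipping bad records, while the built-in version is shorter.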
Note: The code discussed below is executable.
Using custom code
import pandas as pd
from tqdm import tqdm
import json
PATH_BIGDATA = '/Toys_and_Games_5.json'
def read_json_to_pdf(path: str) -> pd.DataFrame:
    data = []
    with open(path, 'r') as f:
        for line in tqdm(f):
            data.append(json.loads(line))
    df = pd.DataFrame(data)
    return df
raw_pdf = read_json_to_pdf(PATH_BIGDATA)
print(raw_pdf.head())
print('Code Executed Successfully')
After a successful code execution, we’ll see the message “Code Executed Successfully” in the terminal.
Explanation
- Lines 1–3: We import the libraries required for reading the dataset.
- Line 4: We set the path of our dataset.
- Lines 5–11 ...