How to load a parquet file in pandas
What is Parquet?
Apache Parquet is an open-source, column-oriented data file format designed for efficient data storage and retrieval. It offers high-performance compression and encoding schemes for handling large amounts of complex data.
The read_parquet function is used to load a parquet file into a pandas data frame.
Note: Refer to What is pandas in Python to learn more about pandas.
Syntax
Here’s the syntax for this:
pandas.read_parquet(path, engine='auto', columns=None, storage_options=None, use_nullable_dtypes=False, **kwargs)
Parameters
- path: The path to the parquet file. It can also point to a directory containing multiple files, or be a valid file URL. Valid URL schemes are http, ftp, s3, gs, and file.
- engine: The parquet library to use. Available options are auto, pyarrow, or fastparquet.
- columns: The columns to be read into the data frame. If None, all columns are read.
- storage_options: Extra options for a certain storage connection, such as host, port, username, password, and so on.
- use_nullable_dtypes: A boolean parameter. If True, the resultant data frame uses types that have pd.NA as the missing value indicator. This parameter is deprecated since pandas 2.0 in favor of dtype_backend.
Code
Let’s see an example of the read_parquet method in Python.
import pandas as pd

df = pd.read_parquet('data.parquet', engine='pyarrow')
print(df)

cols = ["Name"]
df1 = pd.read_parquet('data.parquet', columns=cols)
print(df1)
Explanation
- Line 1: The pandas library is imported.
- Line 3: The parquet file data.parquet is loaded into a pandas data frame, df, using the read_parquet method.
- Line 4: df is printed.
- Line 6: We define cols, the list of columns to be read into the data frame.
- Line 7: The data.parquet file is read into a pandas data frame called df1 using the read_parquet method, passing cols as the columns to be read. All other columns are skipped.
- Line 8: df1 is printed.