How to write a DataFrame to a Parquet file in Python
Overview
Apache Parquet is a column-oriented, open-source data file format for data storage and retrieval. It offers high-performance data compression and encoding schemes to handle large amounts of complex data.
We use the pandas DataFrame.to_parquet() method to write a DataFrame to a Parquet file. The method requires a Parquet engine, either pyarrow or fastparquet, to be installed.
Note: Refer to What is pandas in Python? to learn more about pandas.
Syntax
DataFrame.to_parquet(path=None, engine='auto', compression='snappy', index=None, partition_cols=None, storage_options=None, **kwargs)
Parameters
- path: This is the path to the Parquet file. If None, the result is returned as bytes instead of being written to disk.
- engine: This parameter indicates which Parquet library to use. The available options are auto, pyarrow, and fastparquet.
- compression: This parameter indicates the type of compression to use. The available options are snappy, gzip, and brotli, or None for no compression. The default compression is snappy.
- index: This is a boolean parameter. If True, the DataFrame's index is written to the file. If False, the index is not written. If None (the default), the index is saved, but a RangeIndex is stored as metadata rather than as values.
- partition_cols: These are the names of the columns by which to partition the dataset. The order in which the columns are given determines the order in which they are partitioned.
- storage_options: These are the extra options for a certain storage connection, such as a host, port, username, password, and so on.
Example
import pandas as pd
import os

data = [['dom', 10], ['abhi', 15], ['celeste', 14]]

df = pd.DataFrame(data, columns=['Name', 'Age'])

df.to_parquet("dataframe.parquet")

print("Listing the contents of the current directory:")
print(os.listdir('.'))
Explanation
- Lines 1–2: We import the pandas and os packages.
- Line 4: We define the data for constructing the pandas DataFrame.
- Line 6: We convert data to a pandas DataFrame called df.
- Line 8: We write df to a Parquet file using the to_parquet() method. The resulting file is named dataframe.parquet.
- Lines 10–11: We list the items in the current directory using the os.listdir method. We observe that the dataframe.parquet file is created.