How to load an ORC file in pandas
What is ORC Format?
ORC stands for Optimized Row Columnar. It is a highly efficient columnar data format used to read, write, and process data in Hive. ORC files are made of data stripes, each of which comprises an index, row data, and a footer.
The read_orc method is used to load an ORC file into a DataFrame.
Note: Refer to What is pandas in Python to learn more about pandas.
Syntax
pandas.read_orc(path, columns=None, **kwargs)
Parameters
path: The location of the ORC file. A directory with many files can be referenced by the file path. The path can also be a valid URL; the accepted URL schemes are http, ftp, s3, gs, and file.
columns: The columns to read into the DataFrame. If None (the default), all columns are read.
Code example
Let’s look at the code below:
import pandas as pd
import pyarrow.orc
# Creating an orc file
df = pd.DataFrame(data={"Name": ["John", "Kelly"], "Age": [3, 4]})
df.to_orc("./df.orc")

# Reading an orc file
df = pd.read_orc("df.orc")
print(df)

# Selecting a column from an orc file
cols = ["Name"]
df1 = pd.read_orc("df.orc", columns=cols)
print(df1)
Code explanation
- Lines 1-2: The pandas and pyarrow packages are imported.
- Lines 4-5: A DataFrame is created and written to a file named df.orc.
- Line 8: The df.orc file is read into a pandas DataFrame called df using the read_orc method.
- Line 9: df is printed.
- Line 12: We define the columns, cols, to be read into the DataFrame.
- Line 13: The df.orc file is read into a pandas DataFrame called df1 using the read_orc method, passing cols as the columns to read and skipping the other columns.
- Line 14: df1 is printed.