Speed up File Loading

In this lesson, we show how to speed up file loading.

You may have noticed that loading a large CSV file into a DataFrame can be very slow. If the file is static, meaning it rarely changes, and loading it for data analysis is a regular part of your job, then reducing the load time is well worth the effort.

Export a DataFrame object to the HDF file format

One approach is to export the static file to a binary format, such as HDF. Hierarchical Data Format (HDF) is a set of file formats (HDF4, HDF5) designed to store and organize large amounts of data. Because a binary file does not have to be parsed as text, reading it can take considerably less time than reading the equivalent CSV file.

Let’s look at an example. Because this site does not allow files larger than 2 MB, the code below cannot be run here.

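import os
import timeit

import numpy as np
import pandas as pd

# Create a 200000 x 20 matrix of random integers and build a DataFrame from it.
d = np.random.randint(-10, 10, size=(200000, 20))
df = pd.DataFrame(d)

# Make sure the output directory exists before writing to it.
os.makedirs("output", exist_ok=True)

# Export the data to two files: one in CSV format, the other in HDF format.
# index=False keeps the row index out of the CSV, so reading it back gives the same 20 columns.
# to_hdf requires the optional PyTables package (installed as "tables").
df.to_csv("output/data.csv", index=False)
df.to_hdf("output/data.hdf", key="df")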

# timeit.default_timer() returns a clock reading in seconds;
# the difference between two readings is the elapsed time.
# First, read the CSV file and print how long it took.
start = timeit.default_timer()
df1 = pd.read_csv("output/data.csv")
stop = timeit.default_timer()
print('Loading data.csv file time: {:.4f}s'.format(stop - start))

# Next, read the HDF file and print how long it took.
start = timeit.default_timer()
df2 = pd.read_hdf("output/data.hdf", key="df")
stop = timeit.default_timer()
print('Loading data.hdf file time: {:.4f}s'.format(stop - start))

Note: File read performance depends on the environment. The following results were measured on my own PC.

Loading data.csv file time: 0.3129s
Loading data.hdf file time: 0.0652s

As you can see, loading the HDF file took roughly one-fifth of the time needed to load the CSV file.
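A single timed read can be noisy because of disk caching and other background activity. As a rough sketch, and not the measurement used above, the comparison can be repeated a few times and the best run kept for each format. The helper name best_of, the smaller 20000 x 20 matrix, and the temporary directory are assumptions made for illustration. The HDF step is skipped if the optional PyTables package is not installed.

```python
import os
import tempfile
import timeit

import numpy as np
import pandas as pd


def best_of(loader, repeat=3):
    """Return the fastest of `repeat` single runs of `loader`, in seconds."""
    return min(timeit.repeat(loader, repeat=repeat, number=1))


# Smaller matrix than in the lesson so the sketch runs quickly (assumption).
df = pd.DataFrame(np.random.randint(-10, 10, size=(20000, 20)))
timings = {}

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data.csv")
    hdf_path = os.path.join(tmp, "data.hdf")

    df.to_csv(csv_path, index=False)
    timings["csv"] = best_of(lambda: pd.read_csv(csv_path))

    try:
        df.to_hdf(hdf_path, key="df")
        timings["hdf"] = best_of(lambda: pd.read_hdf(hdf_path, key="df"))
    except ImportError:
        # pandas needs the optional PyTables package for HDF support.
        print("PyTables is not installed; skipping the HDF measurement.")

for name, seconds in timings.items():
    print("Loading {} file time: {:.4f}s".format(name, seconds))
```

Taking the minimum of several runs reduces the influence of one-off delays, but the absolute numbers will still differ from machine to machine.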

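In addition to HDF, pandas can export a DataFrame to other targets, such as pickle (to_pickle) and gbq (Google BigQuery). Note that BigQuery is a cloud data warehouse rather than a local file format.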
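To illustrate the pickle option, the sketch below caches a parsed CSV file as a pickle file and reuses that cache on later loads. The function name load_with_cache and the file names are hypothetical. Treating the cache as fresh whenever it is at least as new as the CSV file is an assumption about how the static file is managed.

```python
import os
import tempfile

import numpy as np
import pandas as pd


def load_with_cache(csv_path, cache_path):
    """Load a DataFrame from a pickle cache, rebuilding it from the CSV if needed."""
    cache_is_fresh = (
        os.path.exists(cache_path)
        and os.path.getmtime(cache_path) >= os.path.getmtime(csv_path)
    )
    if cache_is_fresh:
        return pd.read_pickle(cache_path)
    df = pd.read_csv(csv_path)  # slow path: parse the text file
    df.to_pickle(cache_path)    # store a binary copy for next time
    return df


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data.csv")
    cache_path = os.path.join(tmp, "data.pkl")
    pd.DataFrame(np.random.randint(-10, 10, size=(1000, 5))).to_csv(csv_path, index=False)

    first = load_with_cache(csv_path, cache_path)   # parses the CSV and writes the cache
    cache_created = os.path.exists(cache_path)
    second = load_with_cache(csv_path, cache_path)  # served from the pickle cache

print(first.equals(second))
```

Pickle files can execute arbitrary code when loaded, so this pattern is only appropriate for cache files you created yourself.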
