Reduce the memory usage when loading a file in pandas

When we load a file into a pandas DataFrame, it may consume more memory than we expected. There are two common reasons for this:

Fields (columns) that we do not need are loaded along with the ones we do need.
The default field types are wider than the data requires, e.g., integer fields are read as int64, which takes 8 bytes per value even when every value would fit in a single byte.
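To make the second reason concrete, here is a minimal sketch that reads a small tab-separated table and reports the inferred types and the bytes used per column. The table contents and the names `rows`, `buffer`, and `bytes_per_column` are made up for illustration; an in-memory `io.StringIO` stands in for a file on disk.

```python
import io

import pandas as pd

# Build a small tab-separated table in memory: 1,000 rows, 4 integer columns.
# (Illustrative data only; any integer-valued file would behave the same way.)
rows = ["a\tb\tc\td"]
rows += [f"{i % 20}\t{i % 7}\t{i % 3}\t{i % 11}" for i in range(1000)]
buffer = io.StringIO("\n".join(rows))

# Load it without specifying any types, so pandas infers them.
df = pd.read_csv(buffer, sep="\t")

# Bytes used by each column, excluding the index.
bytes_per_column = df.memory_usage(index=False)

print(df.dtypes)
print(bytes_per_column)
```

Each column is inferred as int64, so 1,000 rows cost 8,000 bytes per column even though none of the values exceeds 19.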

Reduce memory by loading selected columns

import os

import numpy as np
import pandas as pd

# Create a dataset with 20000 rows and 10 columns, named "a" through "j".
d = np.random.randint(0, 20, size=(20000, 10))
df = pd.DataFrame(d,
                  columns=["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"])
# Make sure the output directory exists, then export the dataset
# to a tab-separated file without the index.
os.makedirs("output", exist_ok=True)
df.to_csv("output/raw.csv", sep='\t', index=False)
# First, load all columns from this file and print a summary of the DataFrame.
# info() prints the summary itself and returns None, so it is not wrapped in print().
full_df = pd.read_csv("output/raw.csv", sep='\t')
full_df.info()
print("----------------------------------------------------------------")
# Then load the file again, but with only 3 fields, and print the summary again.
less_df = pd.read_csv("output/raw.csv", sep='\t', usecols=["a", "b", "c"])
less_df.info()
# Clean up the temporary file.
os.remove("output/raw.csv")
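The summary printed by info() is meant for reading. To compare the two loads numerically, a sketch along the following lines could be used. It assumes the same shape of data as above but keeps everything in memory through `io.StringIO` instead of writing to `output/raw.csv`; the names `source`, `text`, `full_bytes`, and `less_bytes` are illustrative.

```python
import io

import numpy as np
import pandas as pd

# Same shape as the example above: 20000 rows, 10 integer columns "a".."j".
rng = np.random.default_rng(0)
source = pd.DataFrame(rng.integers(0, 20, size=(20000, 10)),
                      columns=list("abcdefghij"))

# With no path argument, to_csv returns the file contents as a string.
text = source.to_csv(sep="\t", index=False)

# Load every column, then only three of them.
full_df = pd.read_csv(io.StringIO(text), sep="\t")
less_df = pd.read_csv(io.StringIO(text), sep="\t", usecols=["a", "b", "c"])

# Total bytes used by the column data, excluding the index.
full_bytes = int(full_df.memory_usage(index=False).sum())
less_bytes = int(less_df.memory_usage(index=False).sum())

print(f"all columns:   {full_bytes} bytes")
print(f"three columns: {less_bytes} bytes")
```

Because every column is read as int64, the memory used scales directly with the number of columns kept: three of ten columns take three tenths of the space.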

Reduce memory by specifying column types

import os

import numpy as np
import pandas as pd

# Create a dataset with 20000 rows and 10 columns, named "a" through "j".
d = np.random.randint(0, 20, size=(20000, 10))
df = pd.DataFrame(d,
                  columns=["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"])
# Make sure the output directory exists, then export the dataset
# to a tab-separated file without the index.
os.makedirs("output", exist_ok=True)
df.to_csv("output/raw.csv", sep='\t', index=False)
# First, load all columns from this file and print a summary of the DataFrame.
# info() prints the summary itself and returns None, so it is not wrapped in print().
full_df = pd.read_csv("output/raw.csv", sep='\t')
full_df.info()
print("----------------------------------------------------------------")
# Specify the data type for each column.
# The key is the column name, the value is the data type.
# All values are in the range 0..19, so uint8 (0..255, 1 byte) is wide enough.
dtype = {
    "a": 'uint8',
    "b": 'uint8',
    "c": 'uint8',
    "d": 'uint8',
    "e": 'uint8',
    "f": 'uint8',
    "g": 'uint8',
    "h": 'uint8',
    "i": 'uint8',
    "j": 'uint8'
}
# Then load the file again with the data type specified for each column,
# and print the summary again.
less_df = pd.read_csv("output/raw.csv", sep='\t', dtype=dtype)
less_df.info()
# Clean up the temporary file.
os.remove("output/raw.csv")
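Writing the dtype mapping by hand works for ten columns but gets tedious for wider files. One possible alternative, shown here only as a sketch, is to read a sample of rows and let pd.to_numeric with downcast="unsigned" suggest the narrowest unsigned type for each column. This assumes the sample covers the full range of values in the file; if later rows hold larger or negative values, the suggested types would not fit them. The names `sample`, `text`, and `small_bytes` are illustrative, and the data is kept in memory rather than written to `output/raw.csv`.

```python
import io

import numpy as np
import pandas as pd

# Same shape as the example above: 20000 rows, 10 integer columns "a".."j".
rng = np.random.default_rng(0)
source = pd.DataFrame(rng.integers(0, 20, size=(20000, 10)),
                      columns=list("abcdefghij"))
text = source.to_csv(sep="\t", index=False)

# Read only the first 1000 rows as a sample to inspect the value ranges.
sample = pd.read_csv(io.StringIO(text), sep="\t", nrows=1000)

# For each column, take the narrowest unsigned integer type that holds
# the sampled values. Assumption: the sample is representative of the file.
dtype = {
    name: pd.to_numeric(sample[name], downcast="unsigned").dtype.name
    for name in sample.columns
}

# Load the whole file with the suggested types.
small_df = pd.read_csv(io.StringIO(text), sep="\t", dtype=dtype)
small_bytes = int(small_df.memory_usage(index=False).sum())

print(dtype)
print(f"with narrowed types: {small_bytes} bytes")
```

For this dataset every column comes out as uint8, the same mapping as the hand-written one, and the column data takes one byte per value instead of eight.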
