How to reduce the size of a DataFrame in pandas

In this shot, we will discuss how to analyze the memory usage of a pandas DataFrame and how to reduce its size.

A pandas DataFrame is held entirely in memory once it is loaded. Therefore, with a large dataset, we sometimes need to reduce the DataFrame’s size just to be able to load it into memory and work with it.

Let’s create a DataFrame and see how much memory it occupies.

import pandas as pd

# Read the dataset and display its true memory footprint
# (memory_usage='deep' also counts the strings inside object columns).
drinks = pd.read_csv('http://bit.ly/drinksbycountry')
drinks.info(memory_usage='deep')

Explanation

  • First, we import the pandas package.
  • Next, we read the data into a DataFrame named drinks.
  • Finally, we call info(memory_usage='deep') to display the DataFrame’s true memory usage. We can see that drinks currently uses 30.4 KB. (A per-column breakdown is sketched right after this list.)
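If you want to see which columns account for that total, memory_usage(deep=True) gives a per-column breakdown. Here is a minimal sketch using the same dataset; the variable name simply mirrors the example above:

import pandas as pd

drinks = pd.read_csv('http://bit.ly/drinksbycountry')

# Memory used by each column, in bytes. deep=True also counts the
# Python string objects held inside object columns.
print(drinks.memory_usage(deep=True))

# Total memory usage in bytes.
print(drinks.memory_usage(deep=True).sum())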

Now, if you have a very large dataset, you could face performance problems while loading the DataFrame, or you might not be able to load it at all due to memory constraints.

We can take two easy steps while reading the file to reduce the DataFrame’s size.

The first step is to read only the columns that we actually need. Let’s look at the code snippet below:

import pandas as pd

# Read only the two columns we actually need.
cols = ['beer_servings', 'continent']
small_drinks = pd.read_csv('http://bit.ly/drinksbycountry', usecols=cols)

# Display the reduced memory footprint.
small_drinks.info(memory_usage='deep')

Explanation

  • The code is almost the same. The only differences are that we first list the column names we are actually interested in and then pass that list to read_csv through the usecols parameter.
  • When we display the memory usage of the new DataFrame, we see that it now uses only 13.6 KB. (A small sketch after this list shows how to compare the two totals yourself.)
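To check the saving yourself, you can compare the total memory of the full DataFrame and the column subset. A quick sketch, assuming the same dataset and columns as above (the variable names full and subset are just illustrative):

import pandas as pd

cols = ['beer_servings', 'continent']
full = pd.read_csv('http://bit.ly/drinksbycountry')
subset = pd.read_csv('http://bit.ly/drinksbycountry', usecols=cols)

# Total memory in bytes for each version; the subset should be far smaller.
print(full.memory_usage(deep=True).sum())
print(subset.memory_usage(deep=True).sum())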

The second step is to convert any object columns that hold categorical data (a small set of repeated values) to the category data type. This reduces the space drastically: a category column stores each unique value only once and represents every row with a small integer code, whereas an object column stores a full Python string object for every row.
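If the data is already loaded, this conversion can also be done afterwards with astype('category'). Here is a minimal sketch of that approach, reusing the continent column from our dataset:

import pandas as pd

drinks = pd.read_csv('http://bit.ly/drinksbycountry')

# Memory used by the column while it is still stored as object.
print(drinks['continent'].memory_usage(deep=True))

# Convert the repetitive string column to the category dtype.
drinks['continent'] = drinks['continent'].astype('category')

# Memory used after the conversion: the unique labels are stored once
# and each row now holds only a small integer code.
print(drinks['continent'].memory_usage(deep=True))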

Now let’s apply the same conversion while reading the file. Let’s look at the code snippet below:

import pandas as pd

# Read only the columns we need, and read 'continent' directly
# as the category dtype.
dtypes = {'continent': 'category'}
cols = ['beer_servings', 'continent']
smaller_drinks = pd.read_csv('http://bit.ly/drinksbycountry',
                             usecols=cols, dtype=dtypes)

# Display the memory footprint after both optimizations.
smaller_drinks.info(memory_usage='deep')

Explanation

  • The code is almost the same. The only difference is that we also define a dtypes dictionary that maps the continent column to the category data type and pass it to read_csv through the dtype parameter, so pandas reads that column as category instead of object.
  • When we display the memory usage of the resulting DataFrame, we see that it uses only 2.3 KB! (A short sketch after this list peeks at what the category column actually stores.)
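Finally, if you are curious what the category dtype actually stores, you can inspect the column’s categories and integer codes. A minimal sketch that rebuilds the same smaller_drinks DataFrame:

import pandas as pd

dtypes = {'continent': 'category'}
cols = ['beer_servings', 'continent']
smaller_drinks = pd.read_csv('http://bit.ly/drinksbycountry',
                             usecols=cols, dtype=dtypes)

# The unique labels are stored only once...
print(smaller_drinks['continent'].cat.categories)

# ...and each row is represented by a small integer code pointing at them.
print(smaller_drinks['continent'].cat.codes.head())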