In this shot, we will discuss how to analyze the memory usage of your pandas DataFrame and how to reduce the size of your DataFrame.
Pandas DataFrames are kept in memory once they are loaded. Therefore, we sometimes need to reduce a DataFrame's size so that it fits into memory and we can work with it.
Let’s create a DataFrame and see how much memory it occupies.
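import pandas as pd

# Read the dataset into a DataFrame
drinks = pd.read_csv('http://bit.ly/drinksbycountry')

# info() prints its report itself and returns None, so it is not wrapped in print()
drinks.info(memory_usage='deep')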
This snippet reads the dataset into a DataFrame named drinks and prints the memory usage of drinks. We can see that it currently uses 30.4 KB.
Now, if you have a very large dataset, you could face performance problems while loading the DataFrame, or you might not be able to load it at all due to memory issues.
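Before trimming anything, it can help to see which columns account for the total. The following is a minimal sketch, assuming a small hand-made DataFrame (the rows and the names df and per_column are illustrative, not the real drinks data). It uses DataFrame.memory_usage with deep=True to report bytes per column:

```python
import pandas as pd

# Hypothetical stand-in rows; the real drinks dataset is not downloaded here.
df = pd.DataFrame({
    "country": ["Albania", "Brazil", "Canada", "Denmark"],
    "beer_servings": [89, 245, 240, 224],
    "continent": ["Europe", "South America", "North America", "Europe"],
})

# deep=True measures the actual string contents of text columns
# rather than only the references to them.
per_column = df.memory_usage(deep=True)

print(per_column)                        # bytes per column, plus the index
print("total bytes:", per_column.sum())
```

The result is a Series indexed by column name, with an extra "Index" entry for the row index, so the largest columns are easy to spot.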
We can perform two easy steps during the file reading process to reduce the DataFrame’s size.
The first step is to read only the columns that we actually require. Let’s look at the code snippet below:
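import pandas as pd

# Load only the columns we need
cols = ['beer_servings', 'continent']
small_drinks = pd.read_csv('http://bit.ly/drinksbycountry', usecols=cols)

# info() prints its report itself and returns None, so it is not wrapped in print()
small_drinks.info(memory_usage='deep')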
Here, we pass the list of required columns through the usecols parameter and print the memory usage of the resulting DataFrame. Now, we can see that it uses only 13.6 KB.
The second step is to convert any object columns that contain categorical data to the category data type. This reduces the memory footprint drastically, because a category column stores each distinct value only once and refers to it through a small integer code, whereas an object column stores a separate value in memory for every row.
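To make the difference concrete, here is a small sketch that compares the two representations. It assumes a made-up column of four labels repeated many times (the names continents and as_category are illustrative), and it converts an existing Series with astype rather than reading a file:

```python
import pandas as pd

# Hypothetical column: four distinct labels repeated 2,500 times each.
continents = pd.Series(
    ["Europe", "Asia", "Africa", "Oceania"] * 2500, dtype="object"
)
as_category = continents.astype("category")

# A category column keeps each distinct label once ...
print(as_category.cat.categories)
# ... and stores one small integer code per row.
print(as_category.cat.codes.dtype)

# Compare the deep memory footprint of both representations.
object_bytes = continents.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

The saving depends on how often values repeat. A column in which almost every value is unique gains little from the category type.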
Let’s look at the code snippet below:
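import pandas as pd

# Read the continent column as category instead of object
dtypes = {'continent': 'category'}
cols = ['beer_servings', 'continent']
smaller_drinks = pd.read_csv('http://bit.ly/drinksbycountry', usecols=cols, dtype=dtypes)

# info() prints its report itself and returns None, so it is not wrapped in print()
smaller_drinks.info(memory_usage='deep')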
Here, we define a dictionary that specifies that the continent column is a category column and that, while reading the data, pandas should read this column as category instead of object. Then, we pass this dictionary as the dtype parameter while reading the data.
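Both read-time options can be tried without network access. The sketch below assumes a tiny inline CSV (csv_text is a hypothetical stand-in, not the real drinks file) read through io.StringIO, and then checks the resulting columns and dtypes:

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for the drinks file.
csv_text = """country,beer_servings,continent
Albania,89,Europe
Brazil,245,South America
Canada,240,North America
Denmark,224,Europe
"""

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["beer_servings", "continent"],   # skip the country column
    dtype={"continent": "category"},          # read continent as category
)

print(list(df.columns))
print(df.dtypes)
```

Checking df.dtypes after loading is a quick way to confirm that the dtype mapping was applied to the intended column.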