Knowing how much memory a Pandas dataframe uses can be extremely useful when working with larger dataframes. In this post we will see two ways of estimating the memory usage of a Pandas dataframe using Pandas' own functions. First we will find the total memory usage of a dataframe with Pandas' info() function, and then we will find the memory usage of each variable in the dataframe with Pandas' memory_usage() function.
Let us load Pandas first and check its version.
import pandas as pd
pd.__version__
'1.0.0'
We will use a dataset from the TidyTuesday project on college tuition costs across the USA.
data_url="https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-03-10/tuition_cost.csv"
df = pd.read_csv(data_url)
df.iloc[0:5,0:5]
We can see that our dataframe contains different data types.
                                   name       state  state_code        type  degree_length
0                Aaniiih Nakoda College     Montana          MT      Public         2 Year
1          Abilene Christian University       Texas          TX     Private         4 Year
2  Abraham Baldwin Agricultural College     Georgia          GA      Public         2 Year
3                       Academy College   Minnesota          MN  For Profit         2 Year
4             Academy of Art University  California          CA  For Profit         4 Year
Total Memory Usage of Pandas Dataframe with info()
We can use Pandas' info() function to find the total memory usage of a dataframe. info() is mainly used to get information about each column: its data type and how many of its values are not null. It also reports the memory usage at the end of its output.
To get the full memory usage, we provide the memory_usage="deep" argument to info().
df.info(memory_usage="deep")
We get all the basic information about the dataframe, and towards the end we also get “memory usage: 1.1 MB” for the dataframe.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2973 entries, 0 to 2972
Data columns (total 10 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   name                  2973 non-null   object
 1   state                 2921 non-null   object
 2   state_code            2973 non-null   object
 3   type                  2973 non-null   object
 4   degree_length         2973 non-null   object
 5   room_and_board        1879 non-null   float64
 6   in_state_tuition      2973 non-null   int64
 7   in_state_total        2973 non-null   int64
 8   out_of_state_tuition  2973 non-null   int64
 9   out_of_state_total    2973 non-null   int64
dtypes: float64(1), int64(4), object(5)
memory usage: 1.1 MB
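To see what memory_usage="deep" changes, here is a minimal sketch on a small made-up dataframe (the `toy` frame and the `memory_line` helper are hypothetical, not part of the tuition example). It captures the report that info() writes and keeps only its last line. With the default shallow estimate, Pandas is expected to add a "+" after the number when object columns are present, signalling that the figure is a lower bound.

```python
import io
import pandas as pd

# Small synthetic frame (made-up data, not the tuition dataset).
toy = pd.DataFrame({
    "name": pd.Series(["Academy College", "Abilene Christian University"] * 500,
                      dtype=object),
    "tuition": list(range(1000)),
})

def memory_line(frame, memory_usage):
    """Return the final 'memory usage: ...' line of frame.info()."""
    buf = io.StringIO()
    frame.info(buf=buf, memory_usage=memory_usage)
    return buf.getvalue().strip().splitlines()[-1]

shallow_line = memory_line(toy, True)    # shallow estimate, marked with "+"
deep_line = memory_line(toy, "deep")     # full estimate, strings included

print(shallow_line)
print(deep_line)
```

The exact numbers depend on the platform and the Pandas version, so none are shown here.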
Memory Usage of Each Column in Pandas Dataframe with memory_usage()
Pandas' info() function gave us the total memory used by the dataframe. However, sometimes you may want the memory used by each column.
We can get the memory usage of each column with Pandas' memory_usage() function.
df.memory_usage()
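We get the memory used by each column in bytes. By default, memory_usage() does not count the full memory footprint of columns with data type object: it counts only the 8-byte reference stored for each value, not the strings those references point to. That is why every column reports the same 23784 bytes, which is 2973 rows times 8 bytes.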
Index                     128
name                    23784
state                   23784
state_code              23784
type                    23784
degree_length           23784
room_and_board          23784
in_state_tuition        23784
in_state_total          23784
out_of_state_tuition    23784
out_of_state_total      23784
dtype: int64
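The identical per-column numbers above can be checked on a tiny example. The sketch below uses a made-up `toy` dataframe and assumes a typical 64-bit Python build, where one object reference takes 8 bytes. It shows that the default (shallow) estimate for an object column is just the number of rows times the reference size.

```python
import numpy as np
import pandas as pd

# Made-up values for illustration only.
toy = pd.DataFrame({
    "state": pd.Series(["Montana", "Texas", "Georgia", "Minnesota"], dtype=object),
    "in_state_tuition": np.array([1000, 2000, 3000, 4000], dtype="int64"),
})

# Shallow estimate, leaving the index out to keep the comparison simple.
shallow = toy.memory_usage(index=False)

# Size of one object reference (8 bytes on a typical 64-bit build).
pointer_size = np.dtype(object).itemsize

print(shallow)
print(len(toy) * pointer_size)
```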
We can include the full memory usage of object columns by passing the argument deep=True to memory_usage().
df.memory_usage(deep=True)
We again get the bytes used by each column, but this time the object columns include the memory of the strings they hold.
Index                      128
name                    248346
state                   193391
state_code              175407
type                    189007
degree_length           187298
room_and_board           23784
in_state_tuition         23784
in_state_total           23784
out_of_state_tuition     23784
out_of_state_total       23784
dtype: int64
Since memory_usage() returns a Pandas Series of per-column memory usage, we can sum it to get the total memory used.
df.memory_usage(deep=True).sum()
1112497
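To compare this byte count with the figure reported by info(), we can convert it to megabytes. The short sketch below assumes that info() uses 1024-based units, which is my understanding of how Pandas formats sizes; the variable names are only for illustration.

```python
# Total reported by df.memory_usage(deep=True).sum() in this post.
total_bytes = 1112497

# Convert bytes to megabytes, assuming 1 MB = 1024 * 1024 bytes.
total_mb = total_bytes / 1024 ** 2

print(f"{total_mb:.1f} MB")  # prints "1.1 MB"
```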
We can see that the memory usage estimated by Pandas' info() and by memory_usage() with deep=True match: 1112497 bytes is about 1.1 MB. Typically, object columns have a large memory footprint. By converting an object column of strings to the categorical data type, one can reduce the memory footprint, especially when the column has few distinct values relative to its length.
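As a rough illustration of the saving from categorical conversion, here is a minimal sketch on a made-up column with only three distinct strings (the `toy` frame is hypothetical and is not the tuition data). It compares the deep memory usage of the column before and after astype("category").

```python
import pandas as pd

# A made-up column with many rows but only three distinct values.
toy = pd.DataFrame({
    "type": pd.Series(["Public", "Private", "For Profit"] * 1000, dtype=object),
})

# Deep memory usage of the column stored as Python strings.
before = toy["type"].memory_usage(deep=True)

# Deep memory usage after converting to the categorical data type,
# which stores each distinct string once plus small integer codes.
after = toy["type"].astype("category").memory_usage(deep=True)

print(before, after)
```

The exact byte counts vary by platform and Pandas version, but for a column like this one the categorical version should be considerably smaller.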
This post is part of the series on Pandas 101, a tutorial covering tips and tricks on using Pandas for data munging and analysis.