How to Add Group-Level Summary Statistic as a New Column in Pandas?

In this post, we will see an example adding results from one of aggregating functions like mean/median after group_by() on a specific column as a new column. In other words, we might have group-level summary values for a column and we might to add the summary values back to the original dataframe we computed group-level summary.

import pandas as pd

Let us work with gapminder data to show how to add grouped mean as a new column.

data_url = 'http://bit.ly/2cLzoxH'
# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)
print(gapminder.head(3))

       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710

Let us say we want to add mean lifeExp values per each continent as a new column to the gapminder data frame. The values of the new column will be the same for within each continent. We know that Pandas aggregating function mean() can compute mean after group_by() on continent.

lifeExp_per_continent = gapminder.groupby('continent').lifeExp.mean()

lifeExp_per_continent
continent
Africa      48.865330
Americas    64.658737
Asia        60.064903
Europe      71.903686
Oceania     74.326208
Name: lifeExp, dtype: float64

How To Add Group Level Mean as New Column with Pandas transform() function?

Here we want to add these mean lifeExp values per continent to the gapminder dataframe. There are multiple ways to do that in Pandas. Here we will use Pandas transform() funtion to compute mean values and add it to the original dataframe.

We will first groupby() on continent and extract lifeExp values and apply transform() function to compute mean.

gapminder.groupby('continent').lifeExp.transform('mean')

0       60.064903
1       60.064903
2       60.064903
3       60.064903
4       60.064903
          ...    
1699    48.865330
1700    48.865330
1701    48.865330
1702    48.865330
1703    48.865330
Name: lifeExp, Length: 1704, dtype: float64

We can see that it creates a Series of the same length as our gaominder dataframe. We can add that as a new column to get what we wanted.

gapminder['lifeExp_mean'] = gapminder.groupby('continent').lifeExp.transform('mean')
country	year	pop	continent	lifeExp	gdpPercap	lifeExp_mean
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314	60.064903
1	Afghanistan	1957	9240934.0	Asia	30.332	820.853030	60.064903
2	Afghanistan	1962	10267083.0	Asia	31.997	853.100710	60.064903

How To Add Group Level Mean as New Column with Pandas map() function ?

Another way to add group-level mean as a new column is to use Pandas map() function and dictionary.

We first apply groupby and get group-level summary statistics, either mean or median. Then convert the summary dataframe to a dictionary. In our example, we have mean lifeExp per continent as a dataframe and we convert that into a dictionary using Pandas to_dict() function.

mean_dict = lifeExp_per_continent.to_dict()
mean_dict
{'Africa': 48.86533012820508,
 'Americas': 64.65873666666667,
 'Asia': 60.064903232323175,
 'Europe': 71.90368611111106,
 'Oceania': 74.32620833333333}

Now we can use map() function and provide the dictionary as argument to create a new column.

gapminder['mean'] = gapminder['continent'].map(mean_dict)

We can see the new column with mean lifeExp values per continent and it is the same as the previous approach.

	country	year	pop	continent	lifeExp	gdpPercap	lifeExp_mean	mean
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314	60.064903	60.064903
1	Afghanistan	1957	9240934.0	Asia	30.332	820.853030	60.064903	60.064903
2	Afghanistan	1962	10267083.0	Asia	31.997	853.100710	60.064903	60.064903