Fun with Pandas Groupby, Aggregate, Multi-Index and Unstack

Fun with Pandas Groupby, Agg,
Fun with Pandas Groupby, Agg,

Fun with Pandas Groupby, Agg,
This post is titled as “fun with Pandas Groupby, aggregate, and unstack”, but it addresses some of the pain points I face when doing mundane data-munging activities. Every time I do this I start from scratch and solved them in different ways. The purpose of this post is to record at least a couple of solutions so I don’t have to go through the pain again.

The high level problem is pretty simple and it goes something like this. You have a dataframe and want to groupby more than one variables, compute some summarized statistics using the remaining variables and use them to do some analysis. Typically plotting something really quick. You can easily imagine a number of variants of this problems. One of the pain points for me is lack of full understanding multi-indexing operations that Pandas enables. So far I have skipped dealing with multi-indexes and do not see myself confronting anytime soon :-). Along the way I have discovered the use of Pandas’ unstack() function multiple times. It is useful for pivot like operation.

Let us work through an example of this with gapminder dataset.

# load pandas
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

We will load gapminder dataset directly from github page.

p2data = "https://raw.githubusercontent.com/cmdlinetips/data/master/gapminder-FiveYearData.csv"
gapminder=pd.read_csv(p2data)
gapminder.head()
country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314
1	Afghanistan	1957	9240934.0	Asia	30.332	820.853030
2	Afghanistan	1962	10267083.0	Asia	31.997	853.100710
3	Afghanistan	1967	11537966.0	Asia	34.020	836.197138
4	Afghanistan	1972	13079460.0	Asia	36.088	739.981106

Pandas groupby() on multiple variables

Let us groupby two variables and perform computing mean values for the rest of the numerical variables.

gapminder.groupby(["continent","year"])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1a204ecf10>

One of the ways to compute mean values for remaining variables is to use mean() function directly on the grouped object.

df = gapminder.groupby(["continent","year"]).mean().head()
df.head()

When we perform groupby() operation with multiple variables, we get a dataframe with multiple indices as shown below. We have two indices followed by three columns with average values, but with the original column names.

Pandas groupby multiple variables and summarize with_mean

We can use the columns to get the column names. Note that it gives three column names, not the first two index names.

df.columns
Index(['pop', 'lifeExp', 'gdpPercap'], dtype='object')

Pandas reset_index() to convert Multi-Index to Columns

We can simplify the multi-index dataframe using reset_index() function in Pandas. By default, Pandas reset_index() converts the indices to columns.

df.reset_index()
	continent	year	pop	lifeExp	gdpPercap
0	Africa	1952	4.570010e+06	39.135500	1252.572466
1	Africa	1957	5.093033e+06	41.266346	1385.236062
2	Africa	1962	5.702247e+06	43.319442	1598.078825
3	Africa	1967	6.447875e+06	45.334538	2050.363801
4	Africa	1972	7.305376e+06	47.450942	2339.615674

Pandas agg() function to summarize grouped data

Now the simple dataframe is ready for further downstream analysis. One nagging issue is that using mean() function on grouped dataframe has the same column names. Although now we have mean values of the three columns. One can manually change the column names. Another option is to use Pandas agg() function instead of mean().

With agg() function, we need to specify the variable we need to do summary operation. In this example, we have three variables and we want to compute mean. We can specify that as a dictionary to agg() function.

df =gapminder.groupby(["continent","year"]).agg({'pop': ["mean"], 'lifeExp': ["mean"],'gdpPercap':['mean'] })
df.head()

Now we get mean population, life expectancy, gdpPercap for each year and continent. We again get a multi-indexed dataframe with continent and year as indices and three columns. And it looks like this.

Pandas grouplby multiple variables: mean with agg

Accessing Column Names and Index names from Multi-Index Dataframe

Let us check the column names of the resulting dataframe. Now we get a MultiIndex names as a list of tuples. Each tuple gives us the original column name and the name of aggregation operation we did. In this example, we used mean. It can be other summary operations as well.

df.columns
MultiIndex([(      'pop', 'mean'),
            (  'lifeExp', 'mean'),
            ('gdpPercap', 'mean')],
           )

The column names/information are in two levels. We can access the values in each level using Pandas’ get_level_values() function.

With columns.get_level_values(0), we get the column names.

df.columns.get_level_values(0)
Index(['pop', 'lifeExp', 'gdpPercap'], dtype='object')

With get_level_values(1), we get the second level of column names, which is the aggregation function we used.

df.columns.get_level_values(1)
Index(['mean', 'mean', 'mean'], dtype='object')

Similarly, we can also get the index values using index.get_level_values() function. Here we get the values of the first index.

df.index.get_level_values(0)
Index(['Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Africa',
       'Africa', 'Africa', 'Africa', 'Africa', 'Africa', 'Americas',
       'Americas', 'Americas', 'Americas', 'Americas', 'Americas', 'Americas',
       'Americas', 'Americas', 'Americas', 'Americas', 'Americas', 'Asia',
       'Asia', 'Asia', 'Asia', 'Asia', 'Asia', 'Asia', 'Asia', 'Asia', 'Asia',
       'Asia', 'Asia', 'Europe', 'Europe', 'Europe', 'Europe', 'Europe',
       'Europe', 'Europe', 'Europe', 'Europe', 'Europe', 'Europe', 'Europe',
       'Oceania', 'Oceania', 'Oceania', 'Oceania', 'Oceania', 'Oceania',
       'Oceania', 'Oceania', 'Oceania', 'Oceania', 'Oceania', 'Oceania'],
      dtype='object', name='continent')

similarly, we can get the values of second index using index.get_level_values(1).

df.index.get_level_values(1)
Int64Index([1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997, 2002,
            2007, 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992, 1997,
            2002, 2007, 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987, 1992,
            1997, 2002, 2007, 1952, 1957, 1962, 1967, 1972, 1977, 1982, 1987,
            1992, 1997, 2002, 2007, 1952, 1957, 1962, 1967, 1972, 1977, 1982,
            1987, 1992, 1997, 2002, 2007],
           dtype='int64', name='year')

Fixing Column names after Pandas agg() function to summarize grouped data

Since we have both the variable name and the operation performed in two rows in the Multi-Index dataframe, we can use that and name our new columns correctly.

Here we combine them to create new column names using Pandas map() function.

df.columns.map('_'.join)
Index(['pop_mean', 'lifeExp_mean', 'gdpPercap_mean'], dtype='object')

We can change the column names of the dataframe.

df.columns=df.columns.map('_'.join)
df.head()

And now we have summarized dataframe with correct names. Using agg() function to summarize takes few more lines, but with right column names, when compared to Pandas’ mean() function.

Pandas groupby multiple variables: column names

The resulting dataframe is still Multi-Indexed and we can use reset_index() function to convert the row index or rownames as columns as before.

And we get a simple dataframe with right column names.

df=df.reset_index()
df.head()

continent	year	pop_mean	lifeExp_mean	gdpPercap_mean
0	Africa	1952	4.570010e+06	39.135500	1252.572466
1	Africa	1957	5.093033e+06	41.266346	1385.236062
2	Africa	1962	5.702247e+06	43.319442	1598.078825
3	Africa	1967	6.447875e+06	45.334538	2050.363801
4	Africa	1972	7.305376e+06	47.450942	2339.615674

Grouped Line Plots with Seaborn’s lineplot

In the above example, we computed summarized values for multiple columns. Typically, one might be interested in summary value of a single column, and making some visualization using the index variables. Let us take the approach that is similar to above example using agg() function.

In this example, we use single variable for computing summarized/aggregated values. Here we compute median life expectancy for each year and continent. We also create new appropriate column name as above.

df =gapminder.groupby(["continent","year"]).
              agg({'lifeExp': ["median"] })
df.columns=df.columns.map('_'.join)
df=df.reset_index()
df.head()
	continent	year	lifeExp_median
0	Africa	1952	38.8330
1	Africa	1957	40.5925
2	Africa	1962	42.6305
3	Africa	1967	44.6985
4	Africa	1972	47.0315

Note that, our resulting data is in tidy form and we can use Seaborn’s lineplot to make grouped line plots of median life expectancy over time for 5 continents.

plt.figure(figsize=(8,6))
sns.lineplot(x='year', y='lifeExp_median', hue="continent", data=df)
plt.xlabel("Year", size=14)
plt.ylabel("Median Life Expectancy", size=14)
plt.savefig("Multi_group_line_plot_Seaborn.png",
                    format='png',
                    dpi=150)

We get nice multiple lineplots with Seaborn.

Multi-group Line Plot with Seaborn

Pandas unstack function to get data in wide form

For some reason, if you don’t want the resulting data to be in tidy form, we can use unstack() function after computing the summarized values.

Here we use Pandas’ unstack() function after computing median lifeExp for each group. And we get our data in wide form. When you groupby multiple variables, by default the last level will be on the rows in the wide form.

gapminder.groupby(["year","continent"])['lifeExp'].median().unstack().head()
continent	Africa	Americas	Asia	Europe	Oceania
year					
1952	38.8330	54.745	44.869	65.900	69.255
1957	40.5925	56.074	48.284	67.650	70.295
1962	42.6305	58.299	49.325	69.525	71.085
1967	44.6985	60.523	53.655	70.610	71.310
1972	47.0315	63.441	56.950	70.885	71.910

If we want wide form data, but with different variable on column, we can specify the level or variable name to unstack() function. For example, to get year on columns, we would use unstack(“year”) as shown below.

gapminder.groupby(["year","continent"])['lifeExp'].median().unstack("year").head()

year	1952	1957	1962	1967	1972	1977	1982	1987	1992	1997	2002	2007
continent												
Africa	38.833	40.5925	42.6305	44.6985	47.0315	49.2725	50.756	51.6395	52.429	52.759	51.2355	52.9265
Americas	54.745	56.0740	58.2990	60.5230	63.4410	66.3530	67.405	69.4980	69.862	72.146	72.0470	72.8990
Asia	44.869	48.2840	49.3250	53.6550	56.9500	60.7650	63.739	66.2950	68.690	70.265	71.0280	72.3960
Europe	65.900	67.6500	69.5250	70.6100	70.8850	72.3350	73.490	74.8150	75.451	76.116	77.5365	78.6085
Oceania	69.255	70.2950	71.0850	71.3100	71.9100	72.8550	74.290	75.3200	76.945	78.190	79.7400	80.7195

One of the advantages with using unstack() is that we have sidestepped the multi-index to simple index and we can quickly make exploratory data visualization with different variables. In this example below, we make a line plot again between year and median lifeExp for each continent. However this time we simply use Pandas’ plot function by chaining the plot() function to the results from unstack().

gapminder.groupby(["year","continent"])['lifeExp'].median().unstack().plot()

And we get almost similar plot as before, since Pandas’ plot function calls Matplotlib under the hood.

Multi-group line plot from wide data: Pandas plot