One of the most common uses of Pandas’ groupby function is to compute summary statistics on one or more variables in a dataframe. In this post we will see an example of how to compute the mean of all numerical variables, and of a single selected variable, after a groupby operation.
Let us first load the Pandas package.
import pandas as pd
We will use the gapminder data set and load it directly from its GitHub page.
p2data = "https://raw.githubusercontent.com/cmdlinetips/data/master/gapminder-FiveYearData.csv"
gapminder = pd.read_csv(p2data)
gapminder.head()
       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710
3  Afghanistan  1967  11537966.0      Asia   34.020  836.197138
4  Afghanistan  1972  13079460.0      Asia   36.088  739.981106
We learned earlier how to use the Pandas groupby function on a variable to get multiple smaller dataframes, or groups.
To split the dataframe by a variable into multiple smaller dataframes, we use groupby on a categorical variable in the dataframe. In this example, we group by the “continent” variable in the gapminder dataset.
gapminder.groupby(["continent"])
This gives us a Pandas groupby object, which holds a smaller dataframe for each continent. To compute the mean of all the numerical variables in the dataframe, we simply chain the mean function to the groupby object as shown below.
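To get a feel for what a groupby object holds before aggregating it, one option is to iterate over it. The sketch below uses a small made-up dataframe named toy (a hypothetical stand-in for gapminder, with invented values) so that it runs without a network connection.

```python
import pandas as pd

# Hypothetical toy data standing in for gapminder (values are made up)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "lifeExp": [60.0, 70.0, 75.0, 85.0],
})

# Group by a single column, passed here as a plain string
grouped = toy.groupby("continent")

# Number of rows that fall into each group
group_sizes = grouped.size()
print(group_sizes)

# Each iteration yields the group key and the sub-dataframe for that key
for name, frame in grouped:
    print(name, frame.shape)
```

Nothing is computed on the groups until an aggregation such as mean() is chained on.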
Pandas Groupby and Mean on Multiple Variables
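# numeric_only=True skips the non-numeric country column; pandas 2.0 and later raise an error without it
gapminder.groupby(["continent"]).mean(numeric_only=True)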
This computes the mean of year, pop, lifeExp, and gdpPercap for each continent in the gapminder dataset. Note that the result does not contain the country variable: it is not numeric, and each row is now an average over all countries in a continent.
                year           pop    lifeExp     gdpPercap
continent
Africa        1979.5  9.916003e+06  48.865330   2193.754578
Americas      1979.5  2.450479e+07  64.658737   7136.110356
Asia          1979.5  7.703872e+07  60.064903   7902.150428
Europe        1979.5  1.716976e+07  71.903686  14469.475533
Oceania       1979.5  8.874672e+06  74.326208  18621.609223
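The same idea can be sketched on a small made-up dataframe (toy is a hypothetical stand-in for gapminder, with invented values). It shows two ways to restrict the mean to numeric variables: passing numeric_only=True, or selecting the columns of interest with a list before aggregating.

```python
import pandas as pd

# Hypothetical toy data standing in for gapminder (values are made up)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "year": [1952, 1952, 1952, 1952],
    "lifeExp": [60.0, 70.0, 75.0, 85.0],
    "gdpPercap": [1000.0, 3000.0, 10000.0, 20000.0],
})

# Mean of every numeric column; the string column "country" is skipped
all_means = toy.groupby("continent").mean(numeric_only=True)
print(all_means)

# Mean of an explicit list of columns; the result is still a dataframe
some_means = toy.groupby("continent")[["lifeExp", "gdpPercap"]].mean()
print(some_means)
```

Selecting columns explicitly also avoids averaging variables such as year, where a mean is rarely meaningful.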
Pandas Groupby and Mean on Single Variable
Sometimes you don’t want the mean of every numerical variable, only of one selected variable. In the example below, we will see how to group by a variable and compute the mean of a single numerical variable.
We can select a single variable from the groupby object using the variable name. This gives us a Pandas SeriesGroupBy object.
gapminder.groupby(["continent"])['lifeExp']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x1a1e685190>
And as before, we can chain the mean() function to get the mean lifeExp for each continent.
gapminder.groupby(["continent"])['lifeExp'].mean()
Note that when we compute the mean of a single variable, we get a Series object in return.
continent
Africa      48.865330
Americas    64.658737
Asia        60.064903
Europe      71.903686
Oceania     74.326208
Name: lifeExp, dtype: float64
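If a dataframe is more convenient than a Series, two common options are to select the variable with a list (double brackets) or to call reset_index() on the result. The sketch below assumes a small made-up dataframe named toy (a hypothetical stand-in for gapminder, with invented values).

```python
import pandas as pd

# Hypothetical toy data standing in for gapminder (values are made up)
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "lifeExp": [60.0, 70.0, 75.0, 85.0],
})

# Single brackets: the result is a Series indexed by continent
life_series = toy.groupby("continent")["lifeExp"].mean()

# Double brackets: the result is a one-column dataframe indexed by continent
life_frame = toy.groupby("continent")[["lifeExp"]].mean()

# reset_index() turns the continent index back into a regular column
life_flat = life_series.reset_index()

print(type(life_series).__name__, type(life_frame).__name__)
print(life_flat)
```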
This post is part of the series on Pandas 101, a tutorial covering tips and tricks on using Pandas for data munging and analysis.