Pandas groupby function enables us to do “Split-Apply-Combine” data analysis paradigm easily. Basically, with Pandas groupby, we can split Pandas data frame into smaller groups using one or more variables. Pandas has a number of aggregating functions that reduce the dimension of the grouped object. In this post will examples of using 13 aggregating function after performing Pandas groupby operation.
Here are the 13 aggregating functions available in Pandas and quick summary of what it does.
- mean(): Compute mean of groups
- sum(): Compute sum of group values
- size(): Compute group sizes
- count(): Compute count of group
- std(): Standard deviation of groups
- var(): Compute variance of groups
- sem(): Standard error of the mean of groups
- describe(): Generates descriptive statistics
- first(): Compute first of group values
- last(): Compute last of group values
- nth() : Take nth value, or a subset if n is a list
- min(): Compute min of group values
- max(): Compute max of group values
Let us use gapminder data set and see examples of using each of the aggregating functions associated with Pandas groupby function.
import pandas as pd
Let us first load the gapminder data and filter the data for the sake of simplicity.
data_url = 'http://bit.ly/2cLzoxH' # read data from url as pandas dataframe gapminder = pd.read_csv(data_url) print(gapminder.head(3)) # select two columns gapminder_pop = gapminder[['continent','pop']] gapminder_pop.head() continent pop 0 Asia 8425333.0 1 Asia 9240934.0 2 Asia 10267083.0 3 Asia 11537966.0 4 Asia 13079460.0
After filtering, our dataframe has just two columns one for continent and the other for population.
In these examples, we will groupby “continent” and compute aggregating operation within each group/continent on the population values.
1. Pandas groupby: mean()
The aggregate function mean() computes mean values for each group. Here, pandas groupby followed by mean will compute mean population for each continent.
.
gapminder_pop.groupby("continent").mean()
The result is another Pandas dataframe with just single row for each continent with its mean population.
pop continent Africa 9.916003e+06 Americas 2.450479e+07 Asia 7.703872e+07 Europe 1.716976e+07 Oceania 8.874672e+06
2. Pandas groupby: sum
The aggregating function sum() simply adds of values within each group. In this example, the sum() computes total population in each continent.
gapminder_pop.groupby("continent").sum()
Here is the resulting dataframe with total population for each group.
pop continent Africa 6.187586e+09 Americas 7.351438e+09 Asia 3.050733e+10 Europe 6.181115e+09 Oceania 2.129921e+08
3. Pandas groupby: size()
The aggregating function size() computes the size per each group. In this example, the function size() computes the number of rows per each continent.
gapminder_pop.groupby("continent").size()
Here is the resulting dataframe after applying Pandas groupby operation on continent followed by the aggregating function size().
continent Africa 624 Americas 300 Asia 396 Europe 360 Oceania 24 dtype: int64
4. Pandas groupby: count()
The aggregating function count() computes the number of values with in each group.
gapminder_pop.groupby("continent").count()
It is essentially the same the aggregating function as size, but ignores any missing values. The gapminder dataframe does not have any missing values, so the results from both the functions are the same.
pop continent Africa 624 Americas 300 Asia 396 Europe 360 Oceania 24
5. Pandas groupby: std()
The aggregating function std() computes standard deviation of the values within each group.
gapminder_pop.groupby("continent").std()
In our example, std() function computes standard deviation on population values per continent.
pop continent Africa 1.549092e+07 Americas 5.097943e+07 Asia 2.068852e+08 Europe 2.051944e+07 Oceania 6.506342e+06
6. Pandas grouby: var()
The aggregating function var() computes variance, an estimate of variability, for each column per group.
gapminder_pop.groupby("continent").var()
In this example, var() function computes variance in population values for each continent.
pop continent Africa 2.399687e+14 Americas 2.598902e+15 Asia 4.280149e+16 Europe 4.210473e+14 Oceania 4.233249e+13
7. Pandas grouby: sem()
The aggregating function sem() computes standard error of the mean values for each group.
gapminder_pop.groupby("continent").sem()
In this example, sem() computes standard error of the mean values of population for each continent.
pop continent Africa 6.201332e+05 Americas 2.943299e+06 Asia 1.039637e+07 Europe 1.081469e+06 Oceania 1.328102e+06
8. Pandas describe():
The aggregating function describe() computes a quick summary of values per group. It computes the number of values, mean, std, the minimum value, maximum value and value at multiple percentiles.
gapminder_pop.groupby("continent").describe()
We get a detailed summary statistics for population in each continent.
pop \ count mean std min 25% 50% continent Africa 624.0 9.916003e+06 1.549092e+07 60011.0 1342075.00 4579311.0 Americas 300.0 2.450479e+07 5.097943e+07 662850.0 2962358.75 6227510.0 75% max continent Africa 10801489.75 135031164.0 Americas 18340309.00 301139947.0
9. Pandas groupby: first()
The aggregating function first() gets the first row value within each group.
gapminder_pop.groupby("continent").first()
In our example, we get a data frame with first population value for each continent. Since the data is sorted alphabetically, we will get the alphabetically first population value in each continent.
pop continent Africa 9279525.0 Americas 17876956.0 Asia 8425333.0 Europe 1282697.0 Oceania 8691212.0
10. Pandas groupby: last()
The aggregating function last() gets the last row value within each group.
gapminder_pop.groupby("continent").last()
In our example, we get a data frame with last population value for each continent.
pop continent Africa 12311143.0 Americas 26084662.0 Asia 22211743.0 Europe 60776238.0 Oceania 4115771.0
11. Pandas groupby: n()
The aggregating function nth(), gives nth value, in each group. For example, if we want 10th value within each group, we specify 10 as argument to the function n(). The aggregating function n() can also take a list as argument and give us a subset of rows within each group.
gapminder_pop.groupby("continent").nth(10)
In our example, we get a data frame with a row for each continent and its 10th population value.
pop continent Africa 31287142.0 Americas 38331121.0 Asia 25268405.0 Europe 3508512.0 Oceania 19546792.0
12. Pandas groupby: max
The aggregating function max() computes maximum value for each group.
gapminder_pop.groupby("continent").max()
Here we get dataframe with maximum population for each continent.
pop continent Africa 1.350312e+08 Americas 3.011399e+08 Asia 1.318683e+09 Europe 8.240100e+07 Oceania 2.043418e+07
13. Pandas groupby: min()
Similar to the max() function, Pandas also has min() function to get minimum value per group.
gapminder_pop.groupby("continent").min()
pop continent Africa 60011.0 Americas 662850.0 Asia 120447.0 Europe 147962.0 Oceania 1994794.0
Okay, all of the examples above we had just two columns in our dataframe. And we used one column for groupby() and the other for computing some function. What about if you have multiple columns and you want to do different things on each of them. That sounds interesting right? Tune in for more aggregating followed by groupby() soon.