Pandas groupby: 13 Functions To Aggregate

Pandas Groupby: Aggregating Function

Pandas groupby function enables us to do “Split-Apply-Combine” data analysis paradigm easily. Basically, with Pandas groupby, we can split Pandas data frame into smaller groups using one or more variables. Pandas has a number of aggregating functions that reduce the dimension of the grouped object. In this post will examples of using 13 aggregating function after performing Pandas groupby operation.

Here are the 13 aggregating functions available in Pandas and quick summary of what it does.

  1. mean(): Compute mean of groups
  2. sum(): Compute sum of group values
  3. size(): Compute group sizes
  4. count(): Compute count of group
  5. std(): Standard deviation of groups
  6. var(): Compute variance of groups
  7. sem(): Standard error of the mean of groups
  8. describe(): Generates descriptive statistics
  9. first(): Compute first of group values
  10. last(): Compute last of group values
  11. nth() : Take nth value, or a subset if n is a list
  12. min(): Compute min of group values
  13. max(): Compute max of group values

Let us use gapminder data set and see examples of using each of the aggregating functions associated with Pandas groupby function.

import pandas as pd

Let us first load the gapminder data and filter the data for the sake of simplicity.

data_url = 'http://bit.ly/2cLzoxH'
# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)
print(gapminder.head(3))
# select two columns
gapminder_pop = gapminder[['continent','pop']]
gapminder_pop.head()
	continent	pop
0	Asia	8425333.0
1	Asia	9240934.0
2	Asia	10267083.0
3	Asia	11537966.0
4	Asia	13079460.0

After filtering, our dataframe has just two columns one for continent and the other for population.
In these examples, we will groupby “continent” and compute aggregating operation within each group/continent on the population values.

1. Pandas groupby: mean()

The aggregate function mean() computes mean values for each group. Here, pandas groupby followed by mean will compute mean population for each continent.
.

gapminder_pop.groupby("continent").mean()

The result is another Pandas dataframe with just single row for each continent with its mean population.

                    pop
continent              
Africa     9.916003e+06
Americas   2.450479e+07
Asia       7.703872e+07
Europe     1.716976e+07
Oceania    8.874672e+06

2. Pandas groupby: sum

The aggregating function sum() simply adds of values within each group. In this example, the sum() computes total population in each continent.

gapminder_pop.groupby("continent").sum()

Here is the resulting dataframe with total population for each group.

                    pop
continent              
Africa     6.187586e+09
Americas   7.351438e+09
Asia       3.050733e+10
Europe     6.181115e+09
Oceania    2.129921e+08

3. Pandas groupby: size()

The aggregating function size() computes the size per each group. In this example, the function size() computes the number of rows per each continent.

gapminder_pop.groupby("continent").size()

Here is the resulting dataframe after applying Pandas groupby operation on continent followed by the aggregating function size().

continent
Africa      624
Americas    300
Asia        396
Europe      360
Oceania      24
dtype: int64

4. Pandas groupby: count()

The aggregating function count() computes the number of values with in each group.

gapminder_pop.groupby("continent").count()

It is essentially the same the aggregating function as size, but ignores any missing values. The gapminder dataframe does not have any missing values, so the results from both the functions are the same.

           pop
continent     
Africa     624
Americas   300
Asia       396
Europe     360
Oceania     24

5. Pandas groupby: std()

The aggregating function std() computes standard deviation of the values within each group.

gapminder_pop.groupby("continent").std()

In our example, std() function computes standard deviation on population values per continent.

             pop
continent              
Africa     1.549092e+07
Americas   5.097943e+07
Asia       2.068852e+08
Europe     2.051944e+07
Oceania    6.506342e+06

6. Pandas grouby: var()

The aggregating function var() computes variance, an estimate of variability, for each column per group.

gapminder_pop.groupby("continent").var()

In this example, var() function computes variance in population values for each continent.

                    pop
continent              
Africa     2.399687e+14
Americas   2.598902e+15
Asia       4.280149e+16
Europe     4.210473e+14
Oceania    4.233249e+13

7. Pandas grouby: sem()

The aggregating function sem() computes standard error of the mean values for each group.

gapminder_pop.groupby("continent").sem()

In this example, sem() computes standard error of the mean values of population for each continent.

       pop
continent              
Africa     6.201332e+05
Americas   2.943299e+06
Asia       1.039637e+07
Europe     1.081469e+06
Oceania    1.328102e+06

8. Pandas describe():

The aggregating function describe() computes a quick summary of values per group. It computes the number of values, mean, std, the minimum value, maximum value and value at multiple percentiles.

gapminder_pop.groupby("continent").describe()

We get a detailed summary statistics for population in each continent.

             pop                                                               \
           count          mean           std       min         25%        50%   
continent                                                                       
Africa     624.0  9.916003e+06  1.549092e+07   60011.0  1342075.00  4579311.0   
Americas   300.0  2.450479e+07  5.097943e+07  662850.0  2962358.75  6227510.0   

                                     
                   75%          max  
continent                            
Africa     10801489.75  135031164.0  
Americas   18340309.00  301139947.0  

9. Pandas groupby: first()

The aggregating function first() gets the first row value within each group.

gapminder_pop.groupby("continent").first()

In our example, we get a data frame with first population value for each continent. Since the data is sorted alphabetically, we will get the alphabetically first population value in each continent.

                 pop
continent            
Africa      9279525.0
Americas   17876956.0
Asia        8425333.0
Europe      1282697.0
Oceania     8691212.0

10. Pandas groupby: last()

The aggregating function last() gets the last row value within each group.

gapminder_pop.groupby("continent").last()

In our example, we get a data frame with last population value for each continent.

                  pop
continent            
Africa     12311143.0
Americas   26084662.0
Asia       22211743.0
Europe     60776238.0
Oceania     4115771.0

11. Pandas groupby: n()

The aggregating function nth(), gives nth value, in each group. For example, if we want 10th value within each group, we specify 10 as argument to the function n(). The aggregating function n() can also take a list as argument and give us a subset of rows within each group.

gapminder_pop.groupby("continent").nth(10)

In our example, we get a data frame with a row for each continent and its 10th population value.

                  pop
continent            
Africa     31287142.0
Americas   38331121.0
Asia       25268405.0
Europe      3508512.0
Oceania    19546792.0

12. Pandas groupby: max

The aggregating function max() computes maximum value for each group.

gapminder_pop.groupby("continent").max()

Here we get dataframe with maximum population for each continent.

              pop
continent              
Africa     1.350312e+08
Americas   3.011399e+08
Asia       1.318683e+09
Europe     8.240100e+07
Oceania    2.043418e+07

13. Pandas groupby: min()

Similar to the max() function, Pandas also has min() function to get minimum value per group.

gapminder_pop.groupby("continent").min()
                 pop
continent           
Africa       60011.0
Americas    662850.0
Asia        120447.0
Europe      147962.0
Oceania    1994794.0

Okay, all of the examples above we had just two columns in our dataframe. And we used one column for groupby() and the other for computing some function. What about if you have multiple columns and you want to do different things on each of them. That sounds interesting right? Tune in for more aggregating followed by groupby() soon.