Getting Started with Pandas Groupby


The Pandas groupby() function is one of the most useful tools for data munging. In its simplest use, groupby() splits a larger dataframe into multiple smaller dataframes according to the values of a single variable. Typically, after grouping by a variable, we perform some computation on each of the smaller dataframes.

In this post we will see examples of how to use the Pandas groupby() function. We will group by a single variable in the dataframe, examine the resulting grouped object, extract other variables from it, and perform simple summary computations like mean and median for each group.

Let us load Pandas to learn more about the groupby() function.

# import pandas
import pandas as pd
# import numpy
import numpy as np

We will use the gapminder data to play with the groupby() function. Here we load the data directly from a GitHub page with Pandas’ read_csv() function.

# URL of the gapminder CSV file on GitHub
p2data = "https://raw.githubusercontent.com/cmdlinetips/data/master/gapminder-FiveYearData.csv"
gapminder = pd.read_csv(p2data)
# show the first five rows
gapminder.head()

Our data contains lifeExp (life expectancy), pop (population) and gdpPercap (GDP per capita) over the years for countries of the world. The gapminder data also records the continent each country belongs to.

	country	year	pop	continent	lifeExp	gdpPercap
0	Afghanistan	1952	8425333.0	Asia	28.801	779.445314
1	Afghanistan	1957	9240934.0	Asia	30.332	820.853030
2	Afghanistan	1962	10267083.0	Asia	31.997	853.100710
3	Afghanistan	1967	11537966.0	Asia	34.020	836.197138
4	Afghanistan	1972	13079460.0	Asia	36.088	739.981106

Let us use the groupby() function to group by the “continent” variable in the gapminder data. We provide the variable that we want to group by as a list to groupby().

gapminder.groupby(["continent"])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1a199f5690>

The Pandas groupby() function splits the gapminder dataframe into multiple groups, where each group corresponds to one continent in the data. In the grouped object, each continent is a smaller dataframe.
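To see that each group really is a smaller dataframe, we can loop over the grouped object. The sketch below uses a small made-up dataframe named toy (its rows are illustrative and are not taken from the gapminder file) so that it runs without downloading anything. It groups by the column name given as a plain string, which makes each group key a simple string.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the gapminder file.
toy = pd.DataFrame({
    "country": ["Kenya", "Ghana", "Japan", "India", "Chile"],
    "continent": ["Africa", "Africa", "Asia", "Asia", "Americas"],
    "lifeExp": [60.0, 58.0, 80.0, 65.0, 75.0],
})

# Iterating over a grouped object yields (group key, smaller dataframe) pairs.
group_sizes = {}
for continent, sub_df in toy.groupby("continent"):
    group_sizes[continent] = len(sub_df)
    print(continent, sub_df.shape)
```

The same loop should work on the gapminder dataframe, where it would visit one smaller dataframe per continent.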

Getting Groups from Pandas Groupby Object

To check the groups in the grouped object, we can use the “groups” attribute as shown below. It returns a dictionary with each value of the grouping variable as key and the index labels of the rows belonging to that group as value.

gapminder.groupby(["continent"]).groups
{'Africa': Int64Index([  24,   25,   26,   27,   28,   29,   30,   31,   32,   33,
             ...
             1694, 1695, 1696, 1697, 1698, 1699, 1700, 1701, 1702, 1703],
            dtype='int64', length=624),
 'Americas': Int64Index([  48,   49,   50,   51,   52,   53,   54,   55,   56,   57,
             ...
             1634, 1635, 1636, 1637, 1638, 1639, 1640, 1641, 1642, 1643],
            dtype='int64', length=300),
 'Asia': Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
             ...
             1670, 1671, 1672, 1673, 1674, 1675, 1676, 1677, 1678, 1679],
            dtype='int64', length=396),
 'Europe': Int64Index([  12,   13,   14,   15,   16,   17,   18,   19,   20,   21,
             ...
             1598, 1599, 1600, 1601, 1602, 1603, 1604, 1605, 1606, 1607],
            dtype='int64', length=360),
 'Oceania': Int64Index([  60,   61,   62,   63,   64,   65,   66,   67,   68,   69,   70,
               71, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101,
             1102, 1103],
            dtype='int64')}
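The values stored in groups are row index labels, so they can be handed to .loc to pull out the matching rows. Here is a minimal sketch on a made-up dataframe named toy (illustrative rows, not the gapminder data), grouped by the column name as a string.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the gapminder file.
toy = pd.DataFrame({
    "country": ["Kenya", "Ghana", "Japan", "India", "Chile"],
    "continent": ["Africa", "Africa", "Asia", "Asia", "Americas"],
    "lifeExp": [60.0, 58.0, 80.0, 65.0, 75.0],
})

# groups maps each continent to the index labels of its rows.
groups = toy.groupby("continent").groups
africa_rows = list(groups["Africa"])

# The index labels can be passed to .loc to recover the rows themselves.
africa_df = toy.loc[groups["Africa"]]
print(africa_rows)
print(africa_df)
```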

Getting a Specific Group as Dataframe from Pandas Groupby Object

We can also access the smaller dataframe corresponding to a single group with the get_group() function. For example, we used groupby() on the continent variable, and groupby() created a smaller dataframe for each continent. We can access the dataframe for a specific continent by passing the continent value to get_group(). Here we extract the dataframe corresponding to Africa.

gapminder.groupby(["continent"]).get_group('Africa').head()
	country	year	pop	continent	lifeExp	gdpPercap
24	Algeria	1952	9279525.0	Africa	43.077	2449.008185
25	Algeria	1957	10270856.0	Africa	45.685	3013.976023
26	Algeria	1962	11000948.0	Africa	48.303	2550.816880
27	Algeria	1967	12760499.0	Africa	51.407	3246.991771
28	Algeria	1972	14760787.0	Africa	54.518	4182.663766
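As a quick check of what get_group() returns, we can compare it with an ordinary boolean filter on the same column. This is only a sketch on a made-up dataframe named toy (illustrative rows, not the gapminder data). It groups by the column name as a string rather than a one-element list, an assumption made here because recent Pandas versions may expect a tuple key in get_group() when the grouping was given as a list.

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the gapminder file.
toy = pd.DataFrame({
    "country": ["Kenya", "Ghana", "Japan", "India", "Chile"],
    "continent": ["Africa", "Africa", "Asia", "Asia", "Americas"],
    "lifeExp": [60.0, 58.0, 80.0, 65.0, 75.0],
})

grouped = toy.groupby("continent")

# Smaller dataframe holding only the rows of the "Asia" group.
asia = grouped.get_group("Asia")

# The same rows selected with a boolean mask, for comparison.
asia_filtered = toy[toy["continent"] == "Asia"]

print(asia)
print(list(asia.index) == list(asia_filtered.index))
```

get_group() is convenient when the grouped object already exists; for a one-off selection, the boolean filter does the same job without grouping.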

Let us subset a specific variable from each of the smaller dataframes in the grouped object. In the example below we extract the lifeExp values for each continent from the grouped object. This slicing functionality is extremely useful in downstream analysis.

gapminder.groupby(["continent"])['lifeExp']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x1a199f53d0>

Subsetting a column of the grouped object gives us a SeriesGroupBy object that can be used for additional analysis.
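For instance, the simple summary computations mentioned at the start of the post, mean and median, can be called directly on the SeriesGroupBy object. The sketch below uses a made-up dataframe named toy (illustrative values, not the gapminder data).

```python
import pandas as pd

# Made-up rows for illustration only; not taken from the gapminder file.
toy = pd.DataFrame({
    "continent": ["Africa", "Africa", "Africa", "Asia", "Asia"],
    "lifeExp": [50.0, 58.0, 60.0, 65.0, 80.0],
})

# Select one column from the grouped object to get a SeriesGroupBy.
life_by_continent = toy.groupby("continent")["lifeExp"]

# Each call returns a Series indexed by continent.
mean_life = life_by_continent.mean()
median_life = life_by_continent.median()

print(mean_life)
print(median_life)
```

On the gapminder dataframe the same two calls would give one mean and one median lifeExp value per continent.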

This post is part of the series on Pandas 101, a tutorial covering tips and tricks on using Pandas for data munging and analysis.
