The Pandas groupby() function is one of the most useful tools for data munging. In its simplest use, it splits a larger dataframe into multiple smaller dataframes based on the values of a single variable. Typically, after grouping by a variable, we perform some computation on each of the smaller dataframes.
In this post we will see examples of how to use the Pandas groupby() function. We will group by a single variable in the dataframe, examine the resulting grouped object, extract other variables from it, and perform simple summary computations like mean and median for each group.
Let us load Pandas and NumPy to learn more about the groupby() function.
# import pandas
import pandas as pd
# import numpy
import numpy as np
We will use the gapminder data to explore the groupby() function. Here we load the data directly from a GitHub page with Pandas’ read_csv() function.
p2data = "https://raw.githubusercontent.com/cmdlinetips/data/master/gapminder-FiveYearData.csv"
gapminder = pd.read_csv(p2data)
gapminder.head()
Our data contains life expectancy (lifeExp), population (pop), and GDP per capita (gdpPercap) over the years for countries of the world. The gapminder data also records the continent each country belongs to.
       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710
3  Afghanistan  1967  11537966.0      Asia   34.020  836.197138
4  Afghanistan  1972  13079460.0      Asia   36.088  739.981106
Let us use the groupby() function to group the gapminder data by the “continent” variable. We provide the variable that we want to group by as a list to groupby().
gapminder.groupby(["continent"])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1a199f5690>
The groupby() function splits the gapminder dataframe into multiple groups, where each group corresponds to one continent in the data. In the grouped object, each continent is a smaller dataframe.
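One way to see these smaller dataframes is to loop over the grouped object. The sketch below is only an illustration: it uses a tiny hand-built dataframe (the name `toy` is our own, and it holds just a few of the rows shown above rather than the full gapminder data), and it groups by the plain string "continent" so that each group key comes back as a simple string.

```python
import pandas as pd

# Toy subset of the gapminder rows shown in this post (not the full dataset)
toy = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Algeria", "Algeria"],
    "year": [1952, 1957, 1952, 1957],
    "continent": ["Asia", "Asia", "Africa", "Africa"],
    "lifeExp": [28.801, 30.332, 43.077, 45.685],
})

# Iterating over a grouped object yields (group key, smaller dataframe) pairs
group_shapes = {}
for continent, sub_df in toy.groupby("continent"):
    group_shapes[continent] = sub_df.shape
    print(continent, sub_df.shape)
```

Each `sub_df` here is an ordinary dataframe holding only the rows of one continent.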
Getting Groups from Pandas Groupby Object
To check the groups in the grouped object, we can use the “groups” attribute as shown below. It returns a dictionary with each value of the grouping variable as a key and the index labels of the rows belonging to that group as the value.
gapminder.groupby(["continent"]).groups
{'Africa': Int64Index([  24,   25,   26,   27,   28,   29,   30,   31,   32,   33,
             ...
             1694, 1695, 1696, 1697, 1698, 1699, 1700, 1701, 1702, 1703],
            dtype='int64', length=624),
 'Americas': Int64Index([  48,   49,   50,   51,   52,   53,   54,   55,   56,   57,
             ...
             1634, 1635, 1636, 1637, 1638, 1639, 1640, 1641, 1642, 1643],
            dtype='int64', length=300),
 'Asia': Int64Index([   0,    1,    2,    3,    4,    5,    6,    7,    8,    9,
             ...
             1670, 1671, 1672, 1673, 1674, 1675, 1676, 1677, 1678, 1679],
            dtype='int64', length=396),
 'Europe': Int64Index([  12,   13,   14,   15,   16,   17,   18,   19,   20,   21,
             ...
             1598, 1599, 1600, 1601, 1602, 1603, 1604, 1605, 1606, 1607],
            dtype='int64', length=360),
 'Oceania': Int64Index([  60,   61,   62,   63,   64,   65,   66,   67,   68,   69,   70,
               71, 1092, 1093, 1094, 1095, 1096, 1097, 1098, 1099, 1100, 1101,
             1102, 1103],
            dtype='int64')}
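To make the structure of this dictionary concrete, here is a minimal sketch on a toy dataframe. The name `toy` and its four rows are our own illustration, not the full gapminder data. It converts each group's index into a plain list and also counts the rows per group with size().

```python
import pandas as pd

# Toy subset of the gapminder rows shown in this post (not the full dataset)
toy = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Algeria", "Algeria"],
    "year": [1952, 1957, 1952, 1957],
    "continent": ["Asia", "Asia", "Africa", "Africa"],
    "lifeExp": [28.801, 30.332, 43.077, 45.685],
})

grouped = toy.groupby("continent")

# .groups maps each group key to the index labels of the rows in that group
row_labels = {name: list(idx) for name, idx in grouped.groups.items()}
print(row_labels)

# size() gives the number of rows in each group
group_sizes = grouped.size()
print(group_sizes)
```

The values in `row_labels` are row labels of the original dataframe, so they can be passed to `toy.loc[...]` to pull out the rows of a group.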
Getting a Specific Group as Dataframe from Pandas Groupby Object
We can also access the smaller dataframe for a particular group using the get_group() function. For example, after grouping by the continent variable, we can extract the dataframe for a specific continent by passing the continent value to get_group(). Here we extract the dataframe corresponding to Africa.
gapminder.groupby(["continent"]).get_group('Africa').head()
    country  year         pop continent  lifeExp    gdpPercap
24  Algeria  1952   9279525.0    Africa   43.077  2449.008185
25  Algeria  1957  10270856.0    Africa   45.685  3013.976023
26  Algeria  1962  11000948.0    Africa   48.303  2550.816880
27  Algeria  1967  12760499.0    Africa   51.407  3246.991771
28  Algeria  1972  14760787.0    Africa   54.518  4182.663766
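As a sanity check, get_group() should return the same rows as filtering the original dataframe with a boolean mask. The sketch below compares the two on a toy dataframe (`toy` is our own hypothetical name and holds only a few rows from the outputs above). It groups by the plain string "continent" rather than a one-element list, because to our knowledge newer pandas releases expect a tuple key such as ('Africa',) in get_group() when the grouping was given as a list.

```python
import pandas as pd

# Toy subset of the gapminder rows shown in this post (not the full dataset)
toy = pd.DataFrame({
    "country": ["Afghanistan", "Afghanistan", "Algeria", "Algeria"],
    "year": [1952, 1957, 1952, 1957],
    "continent": ["Asia", "Asia", "Africa", "Africa"],
    "lifeExp": [28.801, 30.332, 43.077, 45.685],
})

# Rows for Africa via get_group() on the grouped object
africa = toy.groupby("continent").get_group("Africa")

# The same rows via a boolean mask on the original dataframe
africa_mask = toy[toy["continent"] == "Africa"]

# True when both approaches select identical rows and values
same_rows = africa.equals(africa_mask)
print(same_rows)
```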
Let us subset a specific variable from each of the smaller dataframes in the grouped object. In the example below, we extract lifeExp values for each continent from the grouped object. This slicing functionality is extremely useful in downstream analysis.
gapminder.groupby(["continent"])['lifeExp']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x1a199f53d0>
Subsetting a column of the grouped object gives us a SeriesGroupBy object that can be used for additional analysis.
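For instance, the summary computations mentioned at the start of this post, mean and median, can be called directly on the SeriesGroupBy object. The following is a minimal sketch on a toy dataframe (`toy` is our own name, with six rows copied from the outputs above), so the numbers it produces describe only those rows and not the full gapminder data.

```python
import pandas as pd

# Toy subset of the gapminder rows shown in this post (not the full dataset)
toy = pd.DataFrame({
    "country": ["Afghanistan"] * 3 + ["Algeria"] * 3,
    "year": [1952, 1957, 1962, 1952, 1957, 1962],
    "continent": ["Asia"] * 3 + ["Africa"] * 3,
    "lifeExp": [28.801, 30.332, 31.997, 43.077, 45.685, 48.303],
})

# SeriesGroupBy object: lifeExp values split by continent
life_by_continent = toy.groupby("continent")["lifeExp"]

# One summary value per continent
mean_life = life_by_continent.mean()
median_life = life_by_continent.median()

# Several summaries at once, one column per statistic
summary = life_by_continent.agg(["mean", "median"])
print(summary)
```

Each result is indexed by continent, so a single value can be looked up with, for example, `mean_life["Africa"]`.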
This post is part of the series on Pandas 101, a tutorial covering tips and tricks on using Pandas for data munging and analysis.