One of the most common operations in data analysis is to group the data by a variable and compute summary statistics on each sub-group. In this post, we will see an example of how to use the groupby() function in Pandas to split a dataframe into multiple smaller dataframes and compute the median of another variable in each smaller dataframe.
Pandas has multiple summary functions that can be applied to a groupby() object, and we will use the median() function to compute the median.
First, let us load Pandas and NumPy libraries.
import pandas as pd
import numpy as np
We will use the gapminder data to perform the groupby and compute the median. Let us load the gapminder data directly from the web, from cmdlinetips.com's GitHub page.
p2data = "https://raw.githubusercontent.com/cmdlinetips/data/master/gapminder-FiveYearData.csv"
gapminder = pd.read_csv(p2data)

gapminder.head()
       country  year         pop continent  lifeExp   gdpPercap
0  Afghanistan  1952   8425333.0      Asia   28.801  779.445314
1  Afghanistan  1957   9240934.0      Asia   30.332  820.853030
2  Afghanistan  1962  10267083.0      Asia   31.997  853.100710
3  Afghanistan  1967  11537966.0      Asia   34.020  836.197138
4  Afghanistan  1972  13079460.0      Asia   36.088  739.981106
Let us perform a groupby() operation on the continent variable in the gapminder data. Under the hood, Pandas splits the dataframe into multiple smaller dataframes, one for each value of continent, and gives us a groupby object.
gapminder.groupby(["continent"])
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x1c1d9f6f50>
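To make the splitting step concrete, here is a minimal sketch that iterates over a groupby object and looks at each smaller dataframe. It uses a small hypothetical dataframe named toy (not the gapminder file) so that it runs without a network connection, and it groups by a plain column name rather than a list so that each group key is a simple string.

```python
import pandas as pd

# Hypothetical toy data standing in for gapminder (assumption, not from the post)
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe", "Europe"],
    "lifeExp": [60.0, 70.0, 65.0, 75.0, 80.0],
})

grouped = toy.groupby("continent")

# Iterating over a groupby object yields (group key, sub-dataframe) pairs
group_sizes = {name: len(sub_df) for name, sub_df in grouped}
for name, sub_df in grouped:
    print(name, sub_df.shape)

# A single sub-dataframe can also be pulled out by its key
europe = grouped.get_group("Europe")
print(europe)
```

Each sub-dataframe keeps all of the original columns, and only the rows differ from group to group.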
From the groupby object we can select one or more variables of the dataframe. The result is another groupby object, restricted to the selected variables.
gapminder.groupby(["continent"])['lifeExp']
<pandas.core.groupby.generic.SeriesGroupBy object at 0x1c1d9f6110>
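The difference between selecting one variable and selecting several can be sketched as follows. The dataframe toy and its values are hypothetical and only serve as an illustration. Selecting with a single column name gives a SeriesGroupBy object, while selecting with a list of column names gives a DataFrameGroupBy object.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns (assumption, not from the post)
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Europe", "Europe"],
    "lifeExp": [60.0, 70.0, 65.0, 75.0],
    "gdpPercap": [1000.0, 2000.0, 3000.0, 5000.0],
})

grouped = toy.groupby("continent")

# One column name -> SeriesGroupBy
one_col = grouped["lifeExp"]

# A list of column names -> DataFrameGroupBy
two_cols = grouped[["lifeExp", "gdpPercap"]]

print(type(one_col).__name__)
print(type(two_cols).__name__)
```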
Now we can apply a summary function like median() to the selected variable to compute a summary statistic for each value of the groupby variable.
In this example, we compute the median value of lifeExp for each continent. The result is a Pandas Series indexed by continent.
gapminder.groupby(["continent"])['lifeExp'].median()
continent
Africa      48.865330
Americas    64.658737
Asia        60.064903
Europe      71.903686
Oceania     74.326208
Name: lifeExp, dtype: float64
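As a small check on what median() does per group, the sketch below uses a hypothetical dataframe toy whose medians are easy to verify by hand. It also shows two optional variations that are not part of the example above: agg() to compute several summary statistics at once, and as_index=False to get a dataframe back instead of a Series.

```python
import pandas as pd

# Hypothetical toy data (assumption, not from the post)
toy = pd.DataFrame({
    "continent": ["Asia", "Asia", "Asia", "Europe", "Europe"],
    "lifeExp": [50.0, 60.0, 90.0, 70.0, 80.0],
})

# Median per group: Asia -> middle of 50, 60, 90; Europe -> midpoint of 70 and 80
medians = toy.groupby("continent")["lifeExp"].median()
print(medians)

# Several summary statistics in one call
summary = toy.groupby("continent")["lifeExp"].agg(["median", "mean"])
print(summary)

# Keep continent as a regular column instead of the index
medians_df = toy.groupby("continent", as_index=False)["lifeExp"].median()
print(medians_df)
```

Putting the median next to the mean for the same groups is a quick way to confirm which statistic a given output actually contains, since the two differ whenever a group is skewed, as the Asia group is here.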
This post is part of the series on Pandas 101, a tutorial covering tips and tricks on using Pandas for data munging and analysis.