Pandas’ GroupBy function is the bread and butter of many data munging activities. Groupby enables one of the most widely used paradigms for data analysis, “Split-Apply-Combine”. Sometimes you will be working with NumPy arrays and may still want to perform groupby operations on them.
I recently wrote a blog post, inspired by Jake’s post, on groupby from scratch using a sparse matrix. A few weeks ago I ran into a situation where I needed to implement a groupby function with NumPy.
Here is one way to implement Pandas’ groupby operation using NumPy.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Let us use Pandas to load the gapminder data as a dataframe.
# link to gapminder data from Carpentries
data_url = 'http://bit.ly/2cLzoxH'
gapminder = pd.read_csv(data_url)
gapminder.head()
Let us say we want to compute the mean life expectancy for each continent. First, let us do it with Pandas’ groupby function. We can use method chaining to select the two columns of interest, group the bigger dataframe into smaller continent-specific dataframes, and compute the mean for each continent.
gapminder[['continent','lifeExp']].groupby('continent').mean()
Here we have the mean life expectancy computed using Pandas’ groupby function.
           lifeExp
continent
Africa     48.865330
Americas   64.658737
Asia       60.064903
Europe     71.903686
Oceania    74.326208
Now let us use NumPy to perform the groupby operation. First, let us extract the columns of interest from the dataframe into NumPy arrays.
# NumPy array for lifeExp
life_exp = gapminder[['lifeExp']].values
# NumPy array for continent
conts = gapminder[['continent']].values
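One detail worth noting about the extraction step: selecting with double brackets returns a one-column dataframe, so `.values` gives a two-dimensional array of shape (n, 1) rather than a flat one. A minimal sketch on a small made-up dataframe (the name `toy` and its values are illustrative assumptions, not the gapminder data) shows the difference between the two selection styles:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the gapminder dataframe
toy = pd.DataFrame({
    "continent": ["Asia", "Europe", "Asia", "Africa"],
    "lifeExp": [60.0, 70.0, 62.0, 50.0],
})

# Double brackets select a one-column dataframe -> 2D array of shape (n, 1)
life_exp_2d = toy[["lifeExp"]].values
# Single brackets select a Series -> 1D array of shape (n,)
life_exp_1d = toy["lifeExp"].values

print(life_exp_2d.shape)  # (4, 1)
print(life_exp_1d.shape)  # (4,)
```

Either form can be used for a groupby; keeping the value array and the group-label array in the same shape is the simplest way to make sure a boolean mask built from one can index the other.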
Let us also get the groups, in this case the five continents, as an array.
>all_continents = gapminder['continent'].unique()
>all_continents
array(['Asia', 'Europe', 'Africa', 'Americas', 'Oceania'], dtype=object)
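If the group labels are already in a NumPy array rather than a dataframe column, `np.unique` can produce the list of groups instead. A minimal sketch, using a made-up `labels` array as an assumption: note that `np.unique` returns the distinct values sorted, whereas Pandas’ `unique()` keeps them in order of first appearance.

```python
import numpy as np

# Hypothetical label array (not the gapminder data)
labels = np.array(["Asia", "Europe", "Africa", "Asia", "Americas", "Europe"])

# np.unique returns the distinct labels in sorted order
groups = np.unique(labels)
print(groups)  # ['Africa' 'Americas' 'Asia' 'Europe']
```

The ordering only affects the order of the results, not the values computed for each group.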
We can use a list comprehension to go through each continent and compute its mean life expectancy, using a NumPy boolean mask to select the matching rows and the mean function to average them.
[(i, life_exp[conts==i].mean()) for i in all_continents]
Voila, we have our results, which are the same as those obtained with Pandas’ groupby function.
[('Asia', 60.064903232323225),
 ('Europe', 71.9036861111111),
 ('Africa', 48.86533012820513),
 ('Americas', 64.65873666666667),
 ('Oceania', 74.32620833333333)]
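To see what the comparison inside the list comprehension is doing, it may help to look at the boolean mask on its own. The sketch below uses small made-up arrays, and the helper name `groupby_mean` is hypothetical, introduced here only for illustration. It wraps the same mask-and-mean idea in a reusable function and assumes one-dimensional arrays of equal length.

```python
import numpy as np

def groupby_mean(keys, values):
    """Mean of `values` for each distinct entry in `keys` (1D arrays, equal length)."""
    keys = np.asarray(keys)
    values = np.asarray(values, dtype=float)
    # Split: one boolean mask per group; Apply: mean; Combine: collect in a dict
    return {str(k): float(values[keys == k].mean()) for k in np.unique(keys)}

# Hypothetical toy data (not the gapminder data)
keys = np.array(["a", "b", "a", "b", "c"])
values = np.array([1.0, 2.0, 3.0, 6.0, 5.0])

# Comparing the array with a single label gives one True/False per row
mask = keys == "a"
print(mask)          # [ True False  True False False]
# Indexing with the mask keeps only the rows belonging to that group
print(values[mask])  # [1. 3.]

print(groupby_mean(keys, values))  # {'a': 2.0, 'b': 4.0, 'c': 5.0}
```

This version scans the full array once per group, which is fine for a handful of groups such as the five continents but may become slow when the number of groups is large.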
In summary, we implemented Pandas’ groupby function from scratch using NumPy. In this example we grouped by a single variable and computed the mean of just one other variable. Tune in for a bit more advanced groupby operations with NumPy.