How to Implement Pandas Groupby operation with NumPy?

Pandas’ groupby function is the bread and butter of many data munging activities. Groupby enables one of the most widely used paradigms in data analysis, “Split-Apply-Combine”. Sometimes you will be working with NumPy arrays and may still want to perform groupby operations on them.

I recently wrote a blog post inspired by Jake’s post on groupby from scratch using a sparse matrix. A few weeks ago I ran into a situation where I needed to implement a groupby function with NumPy.

Here is one way to implement Pandas’ groupby operation using NumPy.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Let us use Pandas to load the gapminder data as a dataframe.

# link to gapminder data from Carpentries
data_url = 'http://bit.ly/2cLzoxH'
gapminder = pd.read_csv(data_url)
gapminder.head()

Let us say we want to compute the mean life expectancy for each continent. First, let us do it with Pandas’ groupby function. We can use method chaining to select the two columns of interest, group the rows by continent, and compute the mean for each group.

gapminder[['continent','lifeExp']].groupby('continent').mean()

Here we have the mean life expectancy computed using Pandas groupby function.

             lifeExp
continent
Africa     48.865330
Americas   64.658737
Asia       60.064903
Europe     71.903686
Oceania    74.326208

Now let us use NumPy to perform the groupby operation. First, let us extract the columns of interest from the dataframe into NumPy arrays.

# NumPy array for lifeExp
life_exp = gapminder[['lifeExp']].values
# NumPy array for continent
conts = gapminder[['continent']].values
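One detail worth knowing about the extraction above: selecting with a list of column names returns a two-dimensional array with a single column, while selecting a single column name returns a one-dimensional array. The sketch below illustrates the difference on a tiny made-up dataframe (the values are hypothetical, not the gapminder data).

```python
import numpy as np
import pandas as pd

# Toy dataframe (hypothetical values, not the gapminder data)
df = pd.DataFrame({'continent': ['Asia', 'Europe', 'Asia'],
                   'lifeExp': [60.0, 70.0, 62.0]})

col_2d = df[['lifeExp']].values                # list of columns -> shape (3, 1)
col_1d = df['lifeExp'].to_numpy()              # single column   -> shape (3,)
mask_2d = df[['continent']].values == 'Asia'   # boolean array, shape (3, 1)

print(col_2d.shape, col_1d.shape, mask_2d.shape)

# Boolean indexing with a mask of the same shape returns a flat 1-D selection
selected = col_2d[mask_2d]
print(selected)
```

Either shape works for the approach in this post, as long as the values array and the label array have matching shapes.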

Let us also get the groups, in this case the five continents, as an array.

>>> all_continents = gapminder['continent'].unique()
>>> all_continents
array(['Asia', 'Europe', 'Africa', 'Americas', 'Oceania'], dtype=object)
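If you only have NumPy arrays at hand, the groups can also be obtained without Pandas. Here is a minimal sketch using np.unique on a small made-up label array (the name conts_toy and its values are hypothetical). Note that np.unique returns the labels sorted, whereas Pandas’ unique() keeps the order of first appearance.

```python
import numpy as np

# Toy stand-in for the continent column (hypothetical values, not the gapminder data)
conts_toy = np.array(['Asia', 'Europe', 'Africa', 'Asia', 'Europe', 'Oceania'])

# Distinct labels in sorted order, plus the number of rows carrying each label
groups, counts = np.unique(conts_toy, return_counts=True)

print(groups)
print(counts)
```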

We can use a list comprehension to go through each continent and compute the mean life expectancy using NumPy’s boolean indexing and mean function.

[(i, life_exp[conts==i].mean()) for i in all_continents]

Voila, we have our results, which are the same as those obtained with the Pandas groupby function.

[('Asia', 60.064903232323225),
 ('Europe', 71.9036861111111),
 ('Africa', 48.86533012820513),
 ('Americas', 64.65873666666667),
 ('Oceania', 74.32620833333333)]
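The list comprehension scans the whole array once per group. As an alternative, and not the approach used above, here is a sketch that computes all group means in one pass by combining np.unique with return_inverse=True and np.bincount. The arrays labels and values are made-up toy data, not the gapminder data.

```python
import numpy as np

# Toy data (hypothetical values, not the gapminder data)
labels = np.array(['Asia', 'Europe', 'Asia', 'Africa', 'Europe', 'Asia'])
values = np.array([60.0, 70.0, 62.0, 50.0, 74.0, 61.0])

# groups: sorted unique labels; codes: for each row, the index of its label in groups
groups, codes = np.unique(labels, return_inverse=True)

# Per-group sum of values divided by per-group row count gives the group mean
sums = np.bincount(codes, weights=values)
counts = np.bincount(codes)
means = sums / counts

result = dict(zip(groups.tolist(), means.tolist()))
print(result)
```

This variant assumes one-dimensional inputs, so arrays extracted with double brackets would need to be flattened first, for example with ravel().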

In summary, we implemented Pandas’ groupby function from scratch using NumPy. In this example we grouped by a single variable and computed the mean of just one other variable. Tune in for slightly more advanced groupby operations with NumPy.