How To Write Pandas GroupBy Function using Sparse Matrix?

The pandas group-by function, which performs the split-apply-combine pattern on data frames, is bread and butter for data wrangling in Python. I just came across a really cool blog post titled “Group-by from scratch” by Jake VanderPlas, the author of the Python Data Science Handbook. In it, Jake shows multiple ways to implement group-by from scratch.

It is a must-read post. One approach that I found really interesting was the implementation of group-by using a sparse matrix in SciPy. Here is my attempt to understand that function.

Before that, let us load the packages needed.

import numpy as np
from scipy import sparse
import pandas as pd

We will use the same example as Jake did. Let us make up two lists: one named “keys” containing letters and the other named “vals” containing numbers.

keys = ['A', 'B', 'C', 'A', 'B', 'C']
vals = [1, 2, 3, 4, 5, 6]

Let us use pandas’ groupby function first. We create a dataframe from these two lists.

>df = pd.DataFrame({'keys': keys, 'vals': vals})
>df
  keys  vals
0    A     1
1    B     2
2    C     3
3    A     4
4    B     5
5    C     6

Let us group by the variable keys and summarize the values of the variable vals using the sum function. The group-by function splits the data frame into multiple chunks, one for each unique value of “keys”, and applies the “sum” function to vals in each chunk. We get back a smaller dataframe with the unique values of keys and their totals.

>df.groupby('keys').sum()
      vals
keys
A        5
B        7
C        9
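To make the split-apply-combine steps concrete, here is a minimal sketch of the same computation in plain Python, without pandas. The names chunks and totals are made up for illustration and are not part of pandas or of Jake’s post.

```python
keys = ['A', 'B', 'C', 'A', 'B', 'C']
vals = [1, 2, 3, 4, 5, 6]

# Split: collect the values that belong to each unique key.
chunks = {}
for k, v in zip(keys, vals):
    chunks.setdefault(k, []).append(v)

# Apply and combine: sum each chunk and gather one total per key.
totals = {k: sum(chunk) for k, chunk in chunks.items()}
print(totals)
```

This gives the same totals per key as the pandas call above, just as a plain dictionary.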

Using the same idea, we can use groupby on a Series. Here is a function to do it; its final output is a dictionary instead of a dataframe.

# pandas groupby function with Series
def pandas_groupby(keys, vals):
    return pd.Series(vals).groupby(keys).sum().to_dict()
>pandas_groupby(keys, vals)
{'A': 5, 'B': 7, 'C': 9}

Writing Groupby from Scratch Using Sparse Matrix

Here is the cool little function that Jake wrote to implement groupby using a sparse matrix.

def sparse_groupby(keys, vals):
    unique_keys, row = np.unique(keys, return_inverse=True)
    col = np.arange(len(keys))
    mat = sparse.coo_matrix((vals, (row, col)))
    return dict(zip(unique_keys, mat.sum(1).flat))

Let us unpack the function a little. Our first goal is to convert the data in the two lists into a sparse matrix. For that, we need the data as (row, column, value) triplets.

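The first line uses NumPy’s unique function to get the unique values of keys. With the return_inverse=True argument, it also returns, for each element of keys, the index of that element in the array of unique values. The result is a tuple, which the function unpacks into unique_keys and row.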

>np.unique(keys, return_inverse=True)
(array(['A', 'B', 'C'], dtype='<U1'), array([0, 1, 2, 0, 1, 2]))
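One way to see what the inverse indices mean is to use them to rebuild the original list. The short sketch below does that; the name rebuilt is just for illustration.

```python
import numpy as np

keys = ['A', 'B', 'C', 'A', 'B', 'C']

# unique_keys holds the sorted distinct keys; row holds, for every element
# of keys, the position of that element inside unique_keys.
unique_keys, row = np.unique(keys, return_inverse=True)

# Indexing unique_keys with row reconstructs the original sequence of keys.
rebuilt = unique_keys[row]
print(rebuilt.tolist())
```

Because row tells us which unique key each value belongs to, it can serve directly as the row index of the sparse matrix.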

Then we create the array of column indices, called col in the function, with one entry per element of keys, using np.arange.

>np.arange(len(keys))
array([0, 1, 2, 3, 4, 5])

Let us now create the sparse matrix from the row, col and vals we have so far. Basically, we create a 3 x 6 sparse COO matrix using SciPy’s sparse module, where the rows correspond to the unique keys and the columns correspond to the positions of our data.

# create sparse matrix
>mat = sparse.coo_matrix((vals, (row, col)))
>print(mat.todense())
[[1 0 0 4 0 0]
 [0 2 0 0 5 0]
 [0 0 3 0 0 6]]
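To check what the COO matrix actually stores, we can look at its shape and its (row, column, value) triplets. This is a small sketch that repeats the setup so it runs on its own; the loop variable names are only for illustration.

```python
import numpy as np
from scipy import sparse

keys = ['A', 'B', 'C', 'A', 'B', 'C']
vals = [1, 2, 3, 4, 5, 6]

unique_keys, row = np.unique(keys, return_inverse=True)
col = np.arange(len(keys))
mat = sparse.coo_matrix((vals, (row, col)))

# The shape is inferred from the largest row and column index.
print(mat.shape)
# Only the non-zero entries are stored, one per original value.
print(mat.nnz)

# Each stored entry is a (row, column, value) triplet.
for r, c, v in zip(mat.row, mat.col, mat.data):
    print(unique_keys[r], c, v)
```

Each value sits in the row of its key and in a column of its own, so summing along a row adds up exactly the values that share a key.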

The final statement collapses the sparse matrix by summing across each row (axis 1), pairs each row sum with its key, and converts the result to a dictionary.

>dict(zip(unique_keys, mat.sum(1).flat))
{'A': 5, 'B': 7, 'C': 9}

Voila, we have our own groupby function using a sparse matrix ready!
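As a quick sanity check, we can compare the sparse version with the pandas version on some random data. This is only a sketch: the random keys, the sizes and the seed are assumptions made up for the check, and the two functions are repeated so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from scipy import sparse

def pandas_groupby(keys, vals):
    return pd.Series(vals).groupby(keys).sum().to_dict()

def sparse_groupby(keys, vals):
    unique_keys, row = np.unique(keys, return_inverse=True)
    col = np.arange(len(keys))
    mat = sparse.coo_matrix((vals, (row, col)))
    return dict(zip(unique_keys, mat.sum(1).flat))

# Made-up test data: 1000 random letters with random integer values.
rng = np.random.default_rng(0)
rand_keys = rng.choice(list('ABCDE'), size=1000)
rand_vals = rng.integers(0, 100, size=1000)

result_sparse = sparse_groupby(rand_keys, rand_vals)
result_pandas = pandas_groupby(rand_keys, rand_vals)

# Both approaches should agree on every key.
print(result_sparse == result_pandas)
```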

If you are curious about how fast this sparse matrix groupby is when compared to Pandas groupby, check out Jake’s blogpost.