Python and R Tips

Learn Data Science with Python and R

How To Write Pandas GroupBy Function using Sparse Matrix?

March 16, 2019 by cmdlinetips

The Pandas group-by function, which performs the split-apply-combine pattern on data frames, is bread and butter for data wrangling in Python. I just came across a really cool blog post titled “Group-by from scratch” by Jake VanderPlas, the author of the Python Data Science Handbook. Jake shows multiple ways to implement group-by from scratch.

It is a must-read post. One approach that was really interesting was the implementation of group-by functionality using a sparse matrix in SciPy. Here is my attempt to understand that function.

Before that, let us load the packages needed.

import numpy as np
from scipy import sparse
import pandas as pd

We will use the same example as Jake did. Let us make up two lists: one named “keys” containing letters and the other named “vals” containing numbers.

keys = ['A', 'B', 'C', 'A', 'B', 'C']
vals = [ 1,   2,   3,   4,   5,   6 ]

Let us use Pandas’ groupby function first. We create a Pandas dataframe from these two lists.

>df = pd.DataFrame({'keys':keys,'vals':vals})
>df
  keys  vals
0    A     1
1    B     2
2    C     3
3    A     4
4    B     5
5    C     6

Let us group by the variable keys and summarize the values of the variable vals using the sum function. The group-by function splits the data frame into multiple chunks, one for each unique value of “keys”, and applies the “sum” function to vals in each chunk. The result is a smaller dataframe with the unique values of keys and their totals.

>df.groupby('keys').sum()
      vals
keys
A        5
B        7
C        9
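To make the split-apply-combine steps concrete, here is a minimal pure-Python sketch of the same computation. It is only an illustration of the pattern, not how Pandas implements groupby internally; the names groups and totals are made up for this example.

```python
from collections import defaultdict

keys = ['A', 'B', 'C', 'A', 'B', 'C']
vals = [1, 2, 3, 4, 5, 6]

# Split: collect the values that belong to each key.
groups = defaultdict(list)
for key, val in zip(keys, vals):
    groups[key].append(val)

# Apply and combine: reduce each chunk with sum and gather the results.
totals = {key: sum(chunk) for key, chunk in groups.items()}
print(totals)  # {'A': 5, 'B': 7, 'C': 9}
```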

Using the same idea, we can use groupby on the Series data structure. Here is a function that does it. Its final output is a dictionary instead of a dataframe.

# pandas groupby function with Series
def pandas_groupby(keys, vals):
    return pd.Series(vals).groupby(keys).sum().to_dict()

>pandas_groupby(keys, vals)
{'A': 5, 'B': 7, 'C': 9}

Writing Groupby from Scratch Using Sparse Matrix

Here is the cool little function that Jake implemented for groupby using a sparse matrix.

def sparse_groupby(keys, vals):
    # unique keys, plus the index of each element's key (used as its row)
    unique_keys, row = np.unique(keys, return_inverse=True)
    # every element gets its own column
    col = np.arange(len(keys))
    mat = sparse.coo_matrix((vals, (row, col)))
    # sum each row and pair the totals with their keys
    return dict(zip(unique_keys, mat.sum(1).flat))

Let us unpack the function a little. Our first goal is to convert the data in the two lists into a sparse matrix. For that we need the data as a (data, (row, column)) tuple.

The first line uses NumPy’s unique function with the return_inverse=True argument to get the unique values of keys together with, for each element of keys, the index of its value in the array of unique values. It returns a tuple, which the function stores as unique_keys and row.

>np.unique(keys, return_inverse=True)
(array(['A', 'B', 'C'], dtype='<U1'), array([0, 1, 2, 0, 1, 2]))
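As a small sketch of what the inverse indices mean: indexing the unique keys with them rebuilds the original list, which is why they can serve as row numbers. The variable name rebuilt is only for illustration.

```python
import numpy as np

keys = ['A', 'B', 'C', 'A', 'B', 'C']
unique_keys, row = np.unique(keys, return_inverse=True)

# Indexing the unique keys with the inverse array rebuilds the original keys.
rebuilt = unique_keys[row]
print(rebuilt.tolist())  # ['A', 'B', 'C', 'A', 'B', 'C']
```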

Then we use np.arange to create an array of column indices, one for each element of keys. The function stores it as col.

>np.arange(len(keys))
array([0, 1, 2, 3, 4, 5])

Now let us create the sparse matrix from the row, col, and values we have so far. Basically, we create a 3 x 6 sparse COO matrix using SciPy’s sparse module, where the rows correspond to the unique keys and the columns correspond to the indexes of our data.

# create sparse matrix
>mat = sparse.coo_matrix((vals, (row, col)))
>print(mat.todense())
[[1 0 0 4 0 0]
 [0 2 0 0 5 0]
 [0 0 3 0 0 6]]
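To see that each value lands in the row of its key and the column of its position, here is a sketch with a smaller, made-up input whose keys are not in sorted order. The data is hypothetical and chosen only for illustration.

```python
import numpy as np
from scipy import sparse

# Hypothetical input: 'B' appears before 'A'.
keys = ['B', 'A', 'B']
vals = [10, 1, 5]

unique_keys, row = np.unique(keys, return_inverse=True)
col = np.arange(len(keys))
mat = sparse.coo_matrix((vals, (row, col)))

print(unique_keys.tolist())  # ['A', 'B'] (np.unique returns sorted keys)
print(row.tolist())          # [1, 0, 1]
print(mat.shape, mat.nnz)    # (2, 3) 3
print(mat.toarray())
# [[ 0  1  0]
#  [10  0  5]]
```

Only the three given values are stored; the shape is inferred from the largest row and column index.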

The final statement collapses the sparse matrix by summing across each row, pairs each row total with the right key, and converts the result to a dictionary.

>dict(zip(unique_keys, mat.sum(1).flat))
{'A': 5, 'B': 7, 'C': 9}

Voila, we have our own groupby function using sparse matrix ready!
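As a quick sanity check, the sketch below compares the two functions from this post on randomly generated data. The random keys and values, the seed, and the helper named plain are assumptions made for this example; plain only converts NumPy scalars to built-in types so the two dictionaries compare cleanly.

```python
import numpy as np
import pandas as pd
from scipy import sparse

def pandas_groupby(keys, vals):
    return pd.Series(vals).groupby(keys).sum().to_dict()

def sparse_groupby(keys, vals):
    unique_keys, row = np.unique(keys, return_inverse=True)
    col = np.arange(len(keys))
    mat = sparse.coo_matrix((vals, (row, col)))
    return dict(zip(unique_keys, mat.sum(1).flat))

def plain(d):
    # Hypothetical helper: normalize keys and values to built-in str and int.
    return {str(k): int(v) for k, v in d.items()}

# Made-up test data: 1000 random keys and integer values.
rng = np.random.default_rng(0)
keys = rng.choice(['A', 'B', 'C', 'D'], size=1000)
vals = rng.integers(0, 100, size=1000)

same = plain(pandas_groupby(keys, vals)) == plain(sparse_groupby(keys, vals))
print(same)  # True
```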

If you are curious about how fast this sparse matrix groupby is compared to Pandas groupby, check out Jake’s blog post.
