Pandas Cumulative Sum by Group

In this tutorial as part of our Pandas 101 series, we will learn how to compute cumulative sum of a column based on values from a grouping column in Pandas dataframe. Pandas cumsum() function can compute cumulative sum over a DataFrame, In this example we are interested getting cumulative sum of just one column by a group.

The basic strategy to get cumulative sum by group is illustrated in the figure below and it is in the same flavour as the classic “split-apply-combine” principle for doing data analysis.

Pandas cumsum() to compute cumulative sum by group

Let us get started by loading modules needed to create a toy dataframe with groups and compute cumulative sum.

import pandas as pd
import string
import random

We will sample from letters to create group values using random module’s choices() function.

random.seed(2022)
groups = random.choices(string.ascii_letters[0:2],k=6)
values = range(6)

We can create our toy dataframe with the lists created in the above step. Our dataframe has two columns, one is grouping variable with two values and the second is the column to use for computing cumulative sum.

df = pd.DataFrame({'C1': groups,
                  'C2': values})
df


      C1	C2
0	b	0
1	a	1
2	a	2
3	a	3
4	b	4
5	b	5

Pandas cumsum() can compute cumulative sum for the whole dataframe. Here is an example of applying cumsum() to our toy dataframe. cumsum() function has computed cumulative sum of both the string and numerical columns. cumulative sum of string variable is simple concatenation.

df.cumsum()

	C1	C2
0	b	0
1	ba	1
2	baa	3
3	baaa	6
4	baaab	10
5	baaabb	15

However, what we need is to compute cumulative sum within each grouping values. We can see that our grouping variable has two values.

df.groupby("C1").groups

{'a': [1, 2, 3], 'b': [0, 4, 5]}

Since we are interested in computing cumulative sum of a column, we select the column of interest after groupby and apply cumsum().

Here we have also added the cumulative sum values as an additional column to the dataframe.

df["cum_sum"] = (df.
                 groupby("C1")['C2'].
                 cumsum())
df

      C1	C2	g_cumsum
0	b	0	0
1	a	1	1
2	a	2	3
3	a	3	6
4	b	4	4
5	b	5	9

Share this:

Related posts: