How to lump factors in Pandas

Sometimes you would like to collapse the least frequent values of a factor or character variable into a new category “Other”. In R, the forcats library has a suite of functions for lumping such variables. This post shows Pandas solutions for lumping factors or values in three common ways.

First, we will see how to keep the n most frequent values and lump the rest. Next, we will keep the most frequent values by proportion instead of counts. And finally, we will see how to lump factors/levels/values below a certain minimum count. These three ways of lumping mirror rudimentary versions of forcats‘s fct_lump() family of functions:

  • fct_lump_n(): lumps all levels except for the n most frequent (or least frequent if n < 0)
  • fct_lump_prop(): lumps levels that appear fewer than prop * n times.
  • fct_lump_min(): lumps levels that appear fewer than min times.

A disclaimer: although I use the word “factor”, the column we work with in Pandas is a string/character column.

The basic idea is simple: first we find the values/levels that we would like to keep, and then rename the rest to “Other”. We create a new variable/column to store the resulting lumped values.
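
As a rough sketch of that idea (the helper name lump_other() is just illustrative and not part of the original workflow), the whole recipe fits in a small function:

import pandas as pd

def lump_other(series, keep, other="Other"):
    # keep the values listed in `keep`; rename everything else to `other`
    keep_set = set(keep)
    return series.apply(lambda x: x if x in keep_set else other)

# toy example: keep "a" and "c", lump "b" and "d" into "Other"
s = pd.Series(["a", "a", "b", "c", "c", "c", "d"])
lump_other(s, keep=["a", "c"])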

In Pandas, we can use the value_counts() function to count the number of observations for each unique value of a variable/column.
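
For instance, on the toy series from the sketch above, value_counts() returns each unique value with its count, sorted in descending order:

s.value_counts()   # c: 3, a: 2, b: 1, d: 1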

Let us load the packages needed.

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

We will use the NYC taxi dataset available as a built-in dataset with Seaborn. The dataset has over 6,000 taxi rides with a number of columns.

taxis = sns.load_dataset("taxis")

One of the variables of interest is “dropoff_zone”. Apart from a few very popular drop-off zones, it contains a large number of drop-off places with small frequencies.

How to keep the top n most frequent levels and lump the rest in Pandas

Using Pandas’ value_counts() to get the counts for each unique value of dropoff_zone, we see that the column has 203 unique values, with the most frequent (“Upper East Side North”) appearing 245 times and a bunch of unique values with just one observation each.

taxis.value_counts("dropoff_zone")

dropoff_zone
Upper East Side North    245
Murray Hill              220
Midtown Center           215
Upper East Side South    177
Midtown East             176
                        ... 
Jamaica Estates            1
Inwood Hill Park           1
Auburndale                 1
Homecrest                  1
Douglaston                 1
Length: 203, dtype: int64

We can get the most popular drop-off zones using Pandas’ value_counts() function, which gives the unique values of a variable and their counts in descending order. By accessing the index of the resulting series, we get the unique values ordered from most to least frequent.

taxis.value_counts("dropoff_zone").index

Index(['Upper East Side North', 'Murray Hill', 'Midtown Center',
       'Upper East Side South', 'Midtown East', 'Times Sq/Theatre District',
       'Lincoln Square East', 'Clinton East', 'East Village',
       'Penn Station/Madison Sq West',
       ...
       'Bath Beach', 'Bensonhurst East', 'West Brighton', 'East Flushing',
       'Madison', 'Jamaica Estates', 'Inwood Hill Park', 'Auburndale',
       'Homecrest', 'Douglaston'],
      dtype='object', name='dropoff_zone', length=203)

Let us keep the top 10 drop-off places and lump, or rename, the remaining ones into “Other”. Here are our top 10:

top10 = taxis.value_counts("dropoff_zone").index[:10]
top10.tolist()

['Upper East Side North',
 'Murray Hill',
 'Midtown Center',
 'Upper East Side South',
 'Midtown East',
 'Times Sq/Theatre District',
 'Lincoln Square East',
 'Clinton East',
 'East Village',
 'Penn Station/Madison Sq West']

We can create a new variable (or overwrite the existing one) by keeping the top 10 places and renaming the others using the apply() function. The lambda passed to apply() checks whether each element of the variable of interest is present in our top 10 list. If it is present, we keep the element as it is; otherwise we return “Other”.

taxis['dropoff_top10'] = (taxis.
                          dropoff_zone.
                          apply(lambda x: x 
                                if x in top10 
                                else "Other"))

Now we can see that the “Other” category is the most frequent, and the remaining values are from the top 10.

taxis.value_counts("dropoff_top10")


dropoff_top10
Other                           4635
Upper East Side North            245
Murray Hill                      220
Midtown Center                   215
Upper East Side South            177
Midtown East                     176
Times Sq/Theatre District        160
Lincoln Square East              156
Clinton East                     151
East Village                     150
Penn Station/Madison Sq West     148
dtype: int64
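
Since we already imported seaborn and matplotlib, a quick way to look at the lumped column (this plot is just an illustration, not part of the original analysis) is a count plot:

# bar plot of ride counts per lumped drop-off zone
sns.countplot(data=taxis, y="dropoff_top10",
              order=taxis["dropoff_top10"].value_counts().index)
plt.tight_layout()
plt.show()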

How to lump levels that appear in lower proportions in Pandas

To work with proportions instead of counts, we can use Pandas’ value_counts() with the normalize=True argument. This gives us a series of proportions, with the most frequent value on top.

top10_pct = (taxis.
             value_counts("dropoff_zone",
                          normalize=True).
             index[:10])

We can use the same apply() strategy to keep the top 10 by proportion and lump the rest.

taxis["dropoff_top10_pct"] = (taxis.
                              dropoff_zone.
                              apply(lambda x: x 
                                    if x in top10_pct
                                    else "Other"))
taxis.value_counts("dropoff_top10_pct")

dropoff_top10_pct
Other                           4635
Upper East Side North            245
Murray Hill                      220
Midtown Center                   215
Upper East Side South            177
Midtown East                     176
Times Sq/Theatre District        160
Lincoln Square East              156
Clinton East                     151
East Village                     150
Penn Station/Madison Sq West     148
dtype: int64
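
Note that taking index[:10] of the normalized counts gives us the same top 10 as before. If we want something closer to fct_lump_prop(), which lumps levels whose proportion falls below a threshold, we can filter the proportions directly. Here is a sketch with an arbitrary 2% cutoff (the column name dropoff_prop2 is just illustrative):

# keep drop-off zones that account for at least 2% of rides, lump the rest
props = taxis.value_counts("dropoff_zone", normalize=True)
keep_prop = props.loc[lambda x: x >= 0.02].index

taxis["dropoff_prop2"] = (taxis.
                          dropoff_zone.
                          apply(lambda x: x
                                if x in keep_prop
                                else "Other"))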

How to lump values that appear fewer than n times in Pandas?

Another useful variation of lumping is to use a minimum threshold for the number of times a value occurs. In this example, we lump values that appear fewer than n=10 times and keep the values that occur at least 10 times.

We use a slightly different way, compared to the previous approaches, to get the values that occur at least as often as the minimum threshold. We use the loc method with a lambda condition on the counts to keep only the values that appear at least 10 times.

min10 = (taxis.
         dropoff_zone.
         value_counts().
         loc[lambda x: x >= 10])
min10

Upper East Side North    245
Murray Hill              220
Midtown Center           215
Upper East Side South    177
Midtown East             176
                        ... 
Chinatown                 11
East New York             11
South Ozone Park          11
West Concourse            10
Melrose South             10
Name: dropoff_zone, Length: 96, dtype: int64

And we use the apply() function as before to create the new lumped variable. Note that min10 is a series whose index holds the zone names, so the check x in min10 tests membership against that index.

taxis["dropoff_min10"] = (taxis.
                          dropoff_zone.
                          apply(lambda x: x
                                if x in min10
                                else "Other"))
taxis["dropoff_min10"].value_counts()

Other                    441
Upper East Side North    245
Murray Hill              220
Midtown Center           215
Upper East Side South    177
                        ... 
Crown Heights South       11
East New York             11
Chinatown                 11
Melrose South             10
West Concourse            10
Name: dropoff_min10, Length: 97, dtype: int64

Although we have used the apply() function to lump, one of the challenges I see is that it may be slow on large data. There may be faster options for creating the new variable after lumping. That is for another day to explore.
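
As a quick, unbenchmarked sketch of one such option, we can avoid apply() altogether by combining isin() with where(), which operate on the whole column at once (the column name dropoff_top10_fast is just illustrative):

# vectorized alternative: keep the top 10 zones, lump everything else
taxis["dropoff_top10_fast"] = (taxis["dropoff_zone"].
                               where(taxis["dropoff_zone"].isin(top10), "Other"))

Converting top10 into a plain Python set before the membership check may also help the apply() version.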