Sometimes you would like to collapse least frequent values of a factor or character variable in to a new category “Other”. In R forcats library has a suit of functions for lumping the variables. This post contains a Pandas solution that can lump factors or values in three common ways.
First, we will see how to keep the most the frequent values in and lump the rest . Next we will lump the most frequent values by proportion instead of counts. And finally, we will see how to lump factors/levels/values below certain minimum counts. These three ways of lumping mirrors the rudimentary versions of forcats‘s fct_lump() functions,
- fct_lump_n(): lumps all levels except for the n most frequent (or least frequent if n < 0)
- fct_lump_prop(): lumps levels that appear in fewer prop * n times.
- fct_lump_min(): lumps levels that appear fewer than min times.
A disclaimer, although I use the word “factor”, the column variable in Pandas is string/character.
The basic idea is simple, first we need find the values/levels that we would like to keep and then rename the rest to “Other”. We create a new variable/column to store the resulting lumped values.
In Pandas, we can use value_counts() function to count the number of observation for each unique value of a variable/column.
Let us load the packages needed.
import seaborn as sns import pandas as pd import matplotlib.pyplot as plt
We will use the NYC taxi dataset available as built-in dataset with Seaborn. The dataset has over 600 taxi rides with a number of columns.
taxis = sns.load_dataset("taxis")
One of the variables of interest is “dropoff_zone”. Apart from fewer most popular drop off zone, it contains a large number of drop off places with small frequency.
How to keep the top n most frequent levels and lump the rest in Pandas
Using Pandas’ value_count() to get the counts for each unique value on dropoff_zone, we see that the column has 203 unique values with most frequent being “245” and bunch of unique values with just 1 observation.
taxis.value_counts("dropoff_zone") dropoff_zone Upper East Side North 245 Murray Hill 220 Midtown Center 215 Upper East Side South 177 Midtown East 176 ... Jamaica Estates 1 Inwood Hill Park 1 Auburndale 1 Homecrest 1 Douglaston 1 Length: 203, dtype: int64
We can get the most popular drop off zones by using Pandas value_counts() function and it gives unique values on a variable and their counts in descending order. By accessing the index from the resulting series, we get the top unique values of the variable.
taxis.value_counts("dropoff_zone").index Index(['Upper East Side North', 'Murray Hill', 'Midtown Center', 'Upper East Side South', 'Midtown East', 'Times Sq/Theatre District', 'Lincoln Square East', 'Clinton East', 'East Village', 'Penn Station/Madison Sq West', ... 'Bath Beach', 'Bensonhurst East', 'West Brighton', 'East Flushing', 'Madison', 'Jamaica Estates', 'Inwood Hill Park', 'Auburndale', 'Homecrest', 'Douglaston'], dtype='object', name='dropoff_zone', length=203)
Let us keep the top 10 drop-off places and lump or rename the remaining into “Other”. Here is our top 10
top10 = taxis.value_counts("dropoff_zone").index[:10] top10.tolist() ['Upper East Side North', 'Murray Hill', 'Midtown Center', 'Upper East Side South', 'Midtown East', 'Times Sq/Theatre District', 'Lincoln Square East', 'Clinton East', 'East Village', 'Penn Station/Madison Sq West']
We can create a new variable (or change the existing variable) by keeping the top 10 places and renaming the other using apply() function. Pandas apply() function check if each element in the variable of interest is present in our top 10 list. If it is present, we keep the element as it is. Otherwise we return “Other”.
taxis['dropoff_top10'] = (taxis. dropoff_zone. apply(lambda x: x if x in top10 else "Other"))
Now we can see that “Other” category is the most frequent and the remaining values are from top 10.
taxis.value_counts("dropoff_top10") dropoff_top10 Other 4635 Upper East Side North 245 Murray Hill 220 Midtown Center 215 Upper East Side South 177 Midtown East 176 Times Sq/Theatre District 160 Lincoln Square East 156 Clinton East 151 East Village 150 Penn Station/Madison Sq West 148 dtype: int64
How to lumps levels that appear in fewer proportion in Pandas
To work with proportion, instead of counts, we can use Pandas value_counts() with normalize=TRUE argument. This would give us a series with most frequent on top.
top10_pct= (taxis. value_counts("dropoff_zone", normalize=True). index[:10])
We can use same apply() strategy to keep the top 10 by proportion and lump the rest..
taxis["dropoff_top10_pct"] = (taxis. dropoff_zone. apply(lambda x: x if x in top10_pct else "Other"))
taxis.value_counts("dropoff_top10_pct") dropoff_top10_pct Other 4635 Upper East Side North 245 Murray Hill 220 Midtown Center 215 Upper East Side South 177 Midtown East 176 Times Sq/Theatre District 160 Lincoln Square East 156 Clinton East 151 East Village 150 Penn Station/Madison Sq West 148 dtype: int64
How to lump values that appear fewer than n times in Pandas?
Another useful variation in lumping is to use some minimum threshold for the number of time a value occurs. In this example we lump values that appear fewer than n=10 times and keep the values that occurs more than 10 times.
We use slightly a different way compared to the previous approaches to get the values that occurred more than some minimum threshold. We use loc method to lump values that appear fewer than 10 times.
[/sourcecode]
min10 = (taxis. dropoff_zone. value_counts(). loc[lambda x: x >= 10])
min10 Upper East Side North 245 Murray Hill 220 Midtown Center 215 Upper East Side South 177 Midtown East 176 ... Chinatown 11 East New York 11 South Ozone Park 11 West Concourse 10 Melrose South 10 Name: dropoff_zone, Length: 96, dtype: int64
And we use use apply() function as before to create new lumped variable.
taxis["dropoff_min10"] = (taxis. dropoff_zone. apply(lambda x: x if x in min10 else "Other"))
taxis["dropoff_min10"].value_counts() Other 441 Upper East Side North 245 Murray Hill 220 Midtown Center 215 Upper East Side South 177 ... Crown Heights South 11 East New York 11 Chinatown 11 Melrose South 10 West Concourse 10 Name: dropoff_min10, Length: 97, dtype: int64
Although we have use apply() function to lump, one of the challenges that I see that it may be slower. There may be faster option to create new variable after lumping. That is for another day to explore.