How to lump factors in Pandas

January 26, 2022 by cmdlinetips

Sometimes you would like to collapse the least frequent values of a factor or character variable into a new category, “Other”. In R, the forcats library has a suite of functions for lumping such variables. This post shows a Pandas solution that can lump factors or values in three common ways.

First, we will see how to keep the n most frequent values and lump the rest. Next, we will lump the most frequent values by proportion instead of counts. And finally, we will see how to lump factors/levels/values below a certain minimum count. These three ways of lumping mirror rudimentary versions of forcats’ fct_lump() family of functions:

  • fct_lump_n(): lumps all levels except for the n most frequent (or least frequent if n < 0)
  • fct_lump_prop(): lumps levels that appear fewer than prop * n times.
  • fct_lump_min(): lumps levels that appear fewer than min times.

A disclaimer: although I use the word “factor”, the column variable in Pandas is a string/character (object) column.

The basic idea is simple: first we find the values/levels that we would like to keep and then rename the rest to “Other”. We create a new variable/column to store the resulting lumped values.
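
As a minimal sketch of that pattern (the helper name lump_values is just for illustration, not a built-in Pandas function), the whole recipe fits in a few lines:

def lump_values(s, keep):
    # Keep the values present in `keep` as they are; rename everything else to "Other"
    keep = set(keep)
    return s.apply(lambda x: x if x in keep else "Other")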

In Pandas, we can use the value_counts() function to count the number of observations for each unique value of a variable/column.

Let us load the packages needed.

import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt

We will use the NYC taxis dataset available as a built-in dataset in Seaborn. The dataset has over 6,000 taxi rides and a number of columns.

taxis = sns.load_dataset("taxis")
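
If you want a quick sanity check on the size of the data, shape gives the number of rows and columns.

# number of (rows, columns) in the taxis data
taxis.shape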

One of the variables of interest is “dropoff_zone”. Apart from a few popular drop-off zones, it contains a large number of drop-off places that occur with small frequency.

How to keep the top n most frequent levels and lump the rest in Pandas

Using Pandas’ value_counts() to get the counts for each unique value of dropoff_zone, we see that the column has 203 unique values, with the most frequent, “Upper East Side North”, appearing 245 times, and a bunch of unique values with just one observation.

taxis.value_counts("dropoff_zone")

dropoff_zone
Upper East Side North    245
Murray Hill              220
Midtown Center           215
Upper East Side South    177
Midtown East             176
                        ... 
Jamaica Estates            1
Inwood Hill Park           1
Auburndale                 1
Homecrest                  1
Douglaston                 1
Length: 203, dtype: int64

We can get the most popular drop-off zones using Pandas’ value_counts() function, which gives the unique values of a variable and their counts in descending order. By accessing the index of the resulting Series, we get the unique values of the variable ordered by frequency.

taxis.value_counts("dropoff_zone").index

Index(['Upper East Side North', 'Murray Hill', 'Midtown Center',
       'Upper East Side South', 'Midtown East', 'Times Sq/Theatre District',
       'Lincoln Square East', 'Clinton East', 'East Village',
       'Penn Station/Madison Sq West',
       ...
       'Bath Beach', 'Bensonhurst East', 'West Brighton', 'East Flushing',
       'Madison', 'Jamaica Estates', 'Inwood Hill Park', 'Auburndale',
       'Homecrest', 'Douglaston'],
      dtype='object', name='dropoff_zone', length=203)

Let us keep the top 10 drop-off places and lump, i.e. rename, the remaining ones into “Other”. Here is our top 10:

top10 = taxis.value_counts("dropoff_zone").index[:10]
top10.tolist()

['Upper East Side North',
 'Murray Hill',
 'Midtown Center',
 'Upper East Side South',
 'Midtown East',
 'Times Sq/Theatre District',
 'Lincoln Square East',
 'Clinton East',
 'East Village',
 'Penn Station/Madison Sq West']

We can create a new variable (or change the existing variable) by keeping the top 10 places and renaming the others using the apply() function. The apply() call checks whether each element of the variable of interest is present in our top-10 list. If it is present, we keep the element as it is. Otherwise we return “Other”.

taxis['dropoff_top10'] = (taxis.
                          dropoff_zone.
                          apply(lambda x: x 
                                if x in top10 
                                else "Other"))

Now we can see that the “Other” category is the most frequent and the remaining values are from the top 10.

taxis.value_counts("dropoff_top10")


dropoff_top10
Other                           4635
Upper East Side North            245
Murray Hill                      220
Midtown Center                   215
Upper East Side South            177
Midtown East                     176
Times Sq/Theatre District        160
Lincoln Square East              156
Clinton East                     151
East Village                     150
Penn Station/Madison Sq West     148
dtype: int64
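
As an aside, the forcats note above about n < 0 (keeping the least frequent levels instead) has a simple analogue: slice from the end of the sorted index. A minimal sketch, with bottom10 and dropoff_bottom10 as illustrative names:

bottom10 = taxis.value_counts("dropoff_zone").index[-10:]
taxis["dropoff_bottom10"] = (taxis.
                             dropoff_zone.
                             apply(lambda x: x
                                   if x in bottom10
                                   else "Other"))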

How to lump levels that appear in smaller proportions in Pandas

To work with proportions instead of counts, we can use Pandas’ value_counts() with the normalize=True argument. This gives us a Series of proportions with the most frequent values on top.

top10_pct= (taxis.
            value_counts("dropoff_zone", 
                         normalize=True).
            index[:10])

We can use the same apply() strategy to keep the top 10 by proportion and lump the rest.

taxis["dropoff_top10_pct"] = (taxis.
                              dropoff_zone.
                              apply(lambda x: x 
                                    if x in top10_pct
                                    else "Other"))
taxis.value_counts("dropoff_top10_pct")

dropoff_top10_pct
Other                           4635
Upper East Side North            245
Murray Hill                      220
Midtown Center                   215
Upper East Side South            177
Midtown East                     176
Times Sq/Theatre District        160
Lincoln Square East              156
Clinton East                     151
East Village                     150
Penn Station/Madison Sq West     148
dtype: int64
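
Note that taking the first 10 entries of the normalized counts gives exactly the same result as the count-based top 10 above. A sketch closer in spirit to fct_lump_prop() would keep every level whose proportion clears a threshold; the 1% cutoff and the column name dropoff_prop below are only for illustration.

prop = 0.01
zone_props = taxis.value_counts("dropoff_zone", normalize=True)
keep_prop = zone_props[zone_props >= prop].index

taxis["dropoff_prop"] = (taxis.
                         dropoff_zone.
                         apply(lambda x: x
                               if x in keep_prop
                               else "Other"))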

How to lump values that appear fewer than n times in Pandas?

Another useful variation in lumping is to use a minimum threshold for the number of times a value occurs. In this example we lump the values that appear fewer than n=10 times and keep the values that occur at least 10 times.

We take a slightly different route compared to the previous approaches to get the values that clear the minimum threshold. We chain value_counts() with the loc[] accessor and a boolean condition to keep the values that appear 10 or more times, leaving the values that appear fewer than 10 times to be lumped.

min10 = (taxis.
         dropoff_zone.
         value_counts().
         loc[lambda x: x >= 10])
min10

Upper East Side North    245
Murray Hill              220
Midtown Center           215
Upper East Side South    177
Midtown East             176
                        ... 
Chinatown                 11
East New York             11
South Ozone Park          11
West Concourse            10
Melrose South             10
Name: dropoff_zone, Length: 96, dtype: int64

And we use the apply() function as before to create the new lumped variable. Since min10 is a Series indexed by the drop-off zone names, checking x in min10 tests membership in its index, which is exactly what we want.

taxis["dropoff_min10"] = (taxis.
                              dropoff_zone.
                              apply(lambda x: x 
                                    if x in min10
                                    else "Other"))
taxis["dropoff_min10"].value_counts()

Other                    441
Upper East Side North    245
Murray Hill              220
Midtown Center           215
Upper East Side South    177
                        ... 
Crown Heights South       11
East New York             11
Chinatown                 11
Melrose South             10
West Concourse            10
Name: dropoff_min10, Length: 97, dtype: int64

Although we have used the apply() function to lump, one challenge I see is that it may be slow on larger data. There may be faster options for creating the new variable after lumping. That is for another day to explore.
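
As a quick teaser of one such option (a sketch, not benchmarked here, and dropoff_top10_fast is just an illustrative column name): isin() builds a boolean mask of rows whose drop-off zone is in the keep list, and where() keeps those values while replacing everything else with “Other”.

# keep the 10 most frequent drop-off zones, vectorized instead of apply()
keep = taxis.value_counts("dropoff_zone").index[:10]
taxis["dropoff_top10_fast"] = taxis["dropoff_zone"].where(
    taxis["dropoff_zone"].isin(keep), "Other"
)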
