Python and R Tips

Learn Data Science with Python and R

Introduction to the new lumping functions in forcats version 0.5.0

March 4, 2020 by cmdlinetips

forcats, one of the core tidyverse R packages for working with factors, has a new version, 0.5.0, on CRAN with many changes.

If you have not heard of forcats before, it is an R package in the tidyverse that provides “a suite of tools that solve common problems with factors, including changing the order of levels or the values”. For example, fct_reorder() is one of its most commonly used functions for reordering factor levels. Another is fct_lump(), which lumps factor levels together in a variety of ways. We can use fct_lump() to collapse the least frequent values of a factor into a new category called “Other”.

Starting with the new version, there are four new functions in the fct_lump() family, each of which does one specific type of collapsing. The four new functions, fct_lump_min(), fct_lump_prop(), fct_lump_n() and fct_lump_lowfreq(), split up the capabilities of the original fct_lump() function. The original fct_lump() will continue to exist for historical reasons, but the RStudio team “no longer recommend that you use it”.

  1. fct_lump_min(): lumps levels that appear fewer than min times.
  2. fct_lump_prop(): lumps levels that appear fewer than prop * n times, where n is the total number of observations.
  3. fct_lump_n(): lumps all levels except for the n most frequent (or least frequent, if n < 0).
  4. fct_lump_lowfreq(): lumps together the least frequent levels, ensuring that “Other” is still the smallest level.

Let us work through examples of the new fct_lump family of functions using the starwars data, as the tidyverse vignette for forcats does. We can install the new version of forcats from CRAN using

# install forcats
install.packages("forcats")

And let us verify we have the new forcats version installed with packageVersion.

# check we have the version 0.5.0
packageVersion("forcats")

## [1] '0.5.0'

Let us load the packages needed.

library(tidyverse)

We will use the starwars data to illustrate the use cases of the new fct_lump_*() functions.

starwars %>% head()

## # A tibble: 6 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender homeworld
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>
## 1 Luke…    172    77 blond      fair       blue            19   male   Tatooine
## 2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>   Tatooine
## 3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>   Naboo
## 4 Dart…    202   136 none       white      yellow          41.9 male   Tatooine
## 5 Leia…    150    49 brown      light      brown           19   female Alderaan
## 6 Owen…    178   120 brown, gr… light      blue            52   male   Tatooine
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships &lt;list&gt;

The factor we will focus on and lump is skin_color of the starwars characters. We can see that there are 31 different skin colors with different frequencies.

starwars %>%
  count(skin_color, sort = TRUE)

## # A tibble: 31 x 2
##    skin_color     n
##    <chr>      <int>
##  1 fair          17
##  2 light         11
##  3 dark           6
##  4 green          6
##  5 grey           6
##  6 pale           5
##  7 brown          4
##  8 blue           2
##  9 blue, grey     2
## 10 orange         2
## # … with 21 more rows

fct_lump_min()

fct_lump_min() lumps factor levels that appear fewer than min times. In this example, we lump the levels that appear fewer than 5 times in the starwars data and assign them to a new category, “Other”.

starwars %>%
  mutate(skin_color = fct_lump_min(skin_color, 5)) %>%
  count(skin_color, sort = TRUE)

After lumping all the levels that appear fewer than 5 times, our factor level counts look like the tibble shown below.

## # A tibble: 7 x 2
##   skin_color     n
##   <fct>      <int>
## 1 Other         36
## 2 fair          17
## 3 light         11
## 4 dark           6
## 5 green          6
## 6 grey           6
## 7 pale           5
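
To check the rule on data small enough to count by hand, here is a minimal sketch on a made-up factor. The vector x and the label "rare" are illustrative assumptions, not part of the starwars example. It also uses the other_level argument of fct_lump_min() to give the lumped category a different name.

```r
library(forcats)

# Hypothetical toy factor: "a" appears 12 times, "b" 5, "c" 2 and "d" once
x <- factor(rep(c("a", "b", "c", "d"), times = c(12, 5, 2, 1)))

# Keep levels with at least 5 observations; lump the rest into "rare"
lumped_min <- fct_lump_min(x, min = 5, other_level = "rare")

# Tabulate the new levels
table(lumped_min)
```

Note that "b", with exactly 5 observations, is kept: only levels that appear fewer than min times are lumped.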

fct_lump_n()

fct_lump_n() lumps all levels except for the n most frequent into the “Other” category. Let us use fct_lump_n() on skin_color and keep the top 5 levels; we get the new factor level counts shown below.

starwars %>%
  mutate(skin_color = fct_lump_n(skin_color, 5)) %>%
  count(skin_color, sort = TRUE)

The “Other” category is now the most frequent, and the rest are the 5 most frequent factor levels in the starwars data.

## # A tibble: 6 x 2
##   skin_color     n
##   <fct>      <int>
## 1 Other         41
## 2 fair          17
## 3 light         11
## 4 dark           6
## 5 green          6
## 6 grey           6
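
As noted in the function list earlier, a negative n keeps the least frequent levels instead. A minimal sketch of both directions, on a made-up factor (the vector x is an illustrative assumption, chosen without ties so the result is easy to follow):

```r
library(forcats)

# Hypothetical toy factor: "a" appears 8 times, "b" 4, "c" 2 and "d" once
x <- factor(rep(c("a", "b", "c", "d"), times = c(8, 4, 2, 1)))

# Positive n: keep the 2 most frequent levels ("a" and "b")
top2 <- fct_lump_n(x, n = 2)
table(top2)

# Negative n: keep the 2 least frequent levels ("c" and "d")
bottom2 <- fct_lump_n(x, n = -2)
table(bottom2)
```

In both cases the lumped observations end up in “Other”, which is placed as the last level.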

fct_lump_prop()

Sometimes you might want to lump factor levels based on their proportions rather than their counts. fct_lump_prop() lumps levels that appear less than a proportion prop of the time. For example, to keep the skin colors that appear in at least 10% of the characters, we would use fct_lump_prop() on skin_color with 0.1 as shown below.

starwars %>%
  mutate(skin_color = fct_lump_prop(skin_color, 0.1)) %>%
  count(skin_color, sort = TRUE)

In this case, all levels except two are lumped into “Other”.

## # A tibble: 3 x 2
##   skin_color     n
##   <fct>      <int>
## 1 Other         59
## 2 fair          17
## 3 light         11
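
To make the proportion threshold concrete, here is a minimal sketch on a made-up factor (the vector x and the 0.2 cutoff are illustrative assumptions). It first prints each level's share of the observations, then lumps the levels whose share falls below the cutoff.

```r
library(forcats)

# Hypothetical toy factor with 20 observations:
# "a" 12 times (60%), "b" 5 (25%), "c" 2 (10%), "d" 1 (5%)
x <- factor(rep(c("a", "b", "c", "d"), times = c(12, 5, 2, 1)))

# Share of each level, to compare against the threshold
prop.table(table(x))

# Lump the levels that make up less than 20% of the observations
lumped_prop <- fct_lump_prop(x, prop = 0.2)
table(lumped_prop)
```

Here "c" and "d" fall below the 20% cutoff, so they are combined into “Other”, while "a" and "b" are kept.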

fct_lump_lowfreq()

fct_lump_lowfreq() lumps together the least frequent levels, ensuring that “Other” is still the smallest level. If we try fct_lump_lowfreq() on the starwars data, nothing really changes. This is mainly because no skin color is frequent enough to outnumber all of the less frequent ones combined, so we cannot lump any levels while keeping “Other” the smallest level.

starwars %>%
  mutate(skin_color = fct_lump_lowfreq(skin_color)) %>%
  count(skin_color, sort = TRUE)

What else is new in forcats 0.5.0

In addition to the new fct_lump_*() functions, some existing forcats functions have gained new arguments or helpers. For example,

  • “fct_collapse() now has an argument, other_level, which allows a user-specified Other level”.
  • “Factors are now correctly collapsed when other_level is not NULL, and makes Other the last level”.
  • “fct_reorder2() now has a helper function, first2(), which sorts .y by the first value of .x”.
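
As an illustration of the first two items, here is a minimal sketch of fct_collapse() with other_level. The factor x and the names "ab" and "rest" are illustrative assumptions, not taken from the change log.

```r
library(forcats)

# Hypothetical toy factor: "a" appears 12 times, "b" 5, "c" 2 and "d" once
x <- factor(rep(c("a", "b", "c", "d"), times = c(12, 5, 2, 1)))

# Collapse "a" and "b" into "ab"; every level not named goes into "rest"
collapsed <- fct_collapse(x, ab = c("a", "b"), other_level = "rest")
table(collapsed)
```

The user-specified other level, "rest" here, is placed last among the levels.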

Check out the change log for all the changes in forcats 0.5.0.
