Introduction to the new lumping functions in forcats version 0.5.0

forcats, one of the key tidyverse R packages for dealing with factors, has a new version, 0.5.0, on CRAN with a number of changes.

If you have not heard of forcats before, it is an R package, part of the tidyverse, that provides “a suite of tools that solve common problems with factors, including changing the order of levels or the values”. For example, fct_reorder() is one of its most commonly used functions, for reordering factor levels. fct_lump() is another forcats function, which lumps factor levels together in a variety of ways. We can use fct_lump() to collapse the least frequent values of a factor into a new category called “Other”.

Starting with the new version, there are four new fct_lump() family functions that each do a specific type of collapsing. The four new functions, fct_lump_min(), fct_lump_prop(), fct_lump_n() and fct_lump_lowfreq(), divide up the capabilities of the original fct_lump() function. The original fct_lump() will continue to exist for historical reasons, but the RStudio team “no longer recommend that you use it”.

  1. fct_lump_min(): lumps levels that appear fewer than min times.
  2. fct_lump_prop(): lumps levels that make up less than a proportion prop of the observations.
  3. fct_lump_n(): lumps all levels except for the n most frequent (or least frequent, if n < 0).
  4. fct_lump_lowfreq(): lumps together the least frequent levels, ensuring that “Other” is still the smallest level.
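
To make the differences concrete, here is a minimal sketch on a small made-up factor. The vector x and its counts are illustrative and are not taken from the starwars data. Each call is wrapped in table() so the resulting level counts are easy to compare.

```r
library(forcats)

# Toy factor (hypothetical data): nine levels with very uneven counts
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Keep levels seen at least 5 times; E to I (one each) go to "Other"
by_min <- table(fct_lump_min(x, min = 5))
# A 40, B 10, C 5, D 27, Other 5

# Keep levels that make up more than 10% of the 87 values
by_prop <- table(fct_lump_prop(x, prop = 0.10))
# A 40, B 10, D 27, Other 10

# Keep the 3 most frequent levels
by_n <- table(fct_lump_n(x, n = 3))
# A 40, B 10, D 27, Other 10

# Lump the rarest levels while "Other" stays the smallest level
by_lowfreq <- table(fct_lump_lowfreq(x))
# A 40, D 27, Other 20
```

Note that the same factor ends up with a different “Other” group under each rule, which is why the single fct_lump() was split up.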

Let us work through examples of the new fct_lump family of functions using the starwars data, as in the tidyverse vignette for forcats. We can install the new version of forcats from CRAN using

# install forcats
install.packages("forcats")

And let us verify we have the new forcats version installed with packageVersion.

# check we have the version 0.5.0
packageVersion("forcats")

## [1] '0.5.0'

Let us load the packages needed.

library(tidyverse)

We will use the starwars data to illustrate the use cases of the new fct_lump_*() functions.

starwars %>% head()

## # A tibble: 6 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender homeworld
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>    
## 1 Luke…    172    77 blond      fair       blue            19   male   Tatooine 
## 2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>   Tatooine 
## 3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>   Naboo    
## 4 Dart…    202   136 none       white      yellow          41.9 male   Tatooine 
## 5 Leia…    150    49 brown      light      brown           19   female Alderaan 
## 6 Owen…    178   120 brown, gr… light      blue            52   male   Tatooine 
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>

The variable whose levels we will lump is skin_color of the starwars characters. We can see that there are 31 different skin colors, with different frequencies.

starwars %>%
  count(skin_color, sort = TRUE)

## # A tibble: 31 x 2
##    skin_color     n
##    <chr>      <int>
##  1 fair          17
##  2 light         11
##  3 dark           6
##  4 green          6
##  5 grey           6
##  6 pale           5
##  7 brown          4
##  8 blue           2
##  9 blue, grey     2
## 10 orange         2
## # … with 21 more rows

fct_lump_min()

The fct_lump_min() function lumps factor levels that appear fewer than min times. In this example, we lump the levels that appear fewer than 5 times in the starwars data and assign them to a new category, “Other”.

starwars %>%
  mutate(skin_color = fct_lump_min(skin_color, 5)) %>%
  count(skin_color, sort = TRUE)

After lumping all the levels that appear fewer than 5 times, our factor level counts look like the tibble shown below.

## # A tibble: 7 x 2
##   skin_color     n
##   <fct>      <int>
## 1 Other         36
## 2 fair          17
## 3 light         11
## 4 dark           6
## 5 green          6
## 6 grey           6
## 7 pale           5
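
The lumped category does not have to be called “Other”. As a small sketch, the other_level argument of fct_lump_min() sets a different label. The colours vector below is made-up data for illustration only.

```r
library(forcats)

# Toy vector (hypothetical data): two common values and two rare ones
colours <- factor(c(rep("fair", 5), rep("light", 3), "blue", "red"))

# Lump levels seen fewer than 3 times and label the result "rare"
lumped <- fct_lump_min(colours, min = 3, other_level = "rare")
counts <- table(lumped)
# fair 5, light 3, rare 2
```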

fct_lump_n()

fct_lump_n() lumps all levels except for the n most frequent into the “Other” category. Let us use fct_lump_n() on skin_color and keep the top 5 levels; we get new factor level counts as shown below.

starwars %>%
  mutate(skin_color = fct_lump_n(skin_color, 5)) %>%
  count(skin_color, sort = TRUE)

The “Other” category is now the most frequent, and the rest are the 5 most frequent factor levels in the starwars data.

## # A tibble: 6 x 2
##   skin_color     n
##   <fct>      <int>
## 1 Other         41
## 2 fair          17
## 3 light         11
## 4 dark           6
## 5 green          6
## 6 grey           6
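
As noted in the list of functions earlier, a negative n flips the behaviour and keeps the least frequent levels instead. Here is a minimal sketch on a made-up factor without ties (f and its counts are illustrative assumptions, not starwars data).

```r
library(forcats)

# Toy factor (hypothetical data) with counts a = 6, b = 4, c = 3, d = 2, e = 1
f <- factor(rep(c("a", "b", "c", "d", "e"), times = c(6, 4, 3, 2, 1)))

# Keep the 2 most frequent levels
top2 <- table(fct_lump_n(f, n = 2))
# a 6, b 4, Other 6

# Keep the 2 least frequent levels instead
bottom2 <- table(fct_lump_n(f, n = -2))
# d 2, e 1, Other 13
```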

fct_lump_prop()

Sometimes you might want to lump factor levels based on their proportions rather than their counts. fct_lump_prop() lumps levels that make up less than a proportion prop of the observations. For example, to keep only the skin colors that account for more than 10% of the characters, we would use fct_lump_prop() on skin_color with 0.1 as shown below.

starwars %>%
  mutate(skin_color = fct_lump_prop(skin_color, 0.1)) %>%
  count(skin_color, sort = TRUE)

In this case, all levels except two are lumped into “Other”.

## # A tibble: 3 x 2
##   skin_color     n
##   <fct>      <int>
## 1 Other         59
## 2 fair          17
## 3 light         11
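
To see where the cut falls, it can help to compute the shares by hand. The sketch below does this on a made-up factor (f and the 0.15 threshold are illustrative assumptions) and then applies fct_lump_prop() with the same threshold.

```r
library(forcats)

# Toy factor (hypothetical data): 20 values across four levels
f <- factor(rep(c("a", "b", "c", "d"), times = c(12, 5, 2, 1)))

# Share of each level: a 0.60, b 0.25, c 0.10, d 0.05
shares <- prop.table(table(f))

# Levels whose share is above 0.15 are kept; c and d are lumped
lumped <- table(fct_lump_prop(f, prop = 0.15))
# a 12, b 5, Other 3
```

In the starwars example above, the same arithmetic gives 17/87 and 11/87 for fair and light, the only two shares above 0.1.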

fct_lump_lowfreq()

fct_lump_lowfreq() lumps together the least frequent levels, ensuring that “Other” is still the smallest level. If we try fct_lump_lowfreq() on the starwars data, nothing really changes. This is mainly because the least frequent levels each appear only once, so lumping any of them together would make “Other” larger than the levels left behind.

starwars %>%
  mutate(skin_color = fct_lump_lowfreq(skin_color)) %>%
  count(skin_color, sort = TRUE)
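
To see both behaviours side by side, here is a minimal sketch with two made-up factors (f and g are illustrative assumptions): one where the rare levels can be lumped, and one where every level appears once so nothing can be.

```r
library(forcats)

# Toy factor (hypothetical data) with counts a = 10, b = 6, c = 2, d = 1, e = 1
f <- factor(rep(c("a", "b", "c", "d", "e"), times = c(10, 6, 2, 1, 1)))

# c, d and e together (4) are still smaller than b (6), so they are lumped
lumped <- table(fct_lump_lowfreq(f))
# a 10, b 6, Other 4

# Every level appears once: merging any of them would make "Other"
# larger than the singletons left behind, so the factor is returned as is
g <- factor(c("p", "q", "r", "s"))
unchanged <- fct_lump_lowfreq(g)
```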

What else is new in forcats 0.5.0

In addition to the new fct_lump_* functions, some existing forcats functions now have new arguments and helpers. For example,

  • “fct_collapse() now has an argument, other_level, which allows a user-specified Other level”.
  • “Factors are now correctly collapsed when other_level is not NULL, and makes Other the last level”.
  • “fct_reorder2() now has a helper function, first2(), which sorts .y by the first value of .x”.
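
As a small sketch of the first two items, the example below collapses a made-up factor with fct_collapse() and sends every unmentioned level to a custom label via other_level. The factor f and the labels "ab" and "rest" are illustrative assumptions.

```r
library(forcats)

# Toy factor (hypothetical data)
f <- factor(c("a", "b", "c", "d", "e", "a"))

# Merge a and b into "ab"; every level not mentioned goes to "rest"
collapsed <- fct_collapse(f, ab = c("a", "b"), other_level = "rest")
counts <- table(collapsed)
# ab 3, rest 3
```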

Check out the change log for all the changes in forcats 0.5.0.