forcats, one of the key tidyverse packages for working with factors in R, has a new version, 0.5.0, on CRAN with a lot of changes.
If you have not heard of forcats before, it is an R package, part of the tidyverse, that provides “a suite of tools that solve common problems with factors, including changing the order of levels or the values”. For example, fct_reorder() is one of the most commonly used forcats functions for reordering factor levels. fct_lump() is another forcats function, which lumps the factor levels in your data together in a variety of ways. We can use fct_lump() to collapse the least frequent values of a factor into a new category called “Other”.
Starting with the new version, there are four new functions in the fct_lump() family, each doing one specific type of collapsing. The four new functions, fct_lump_min(), fct_lump_prop(), fct_lump_n() and fct_lump_lowfreq(), divide up the capabilities of the original fct_lump(). The original fct_lump() function will remain for historical reasons, but the RStudio team “no longer recommend that you use it”.
- fct_lump_min(): lumps levels that appear fewer than min times.
- fct_lump_prop(): lumps levels that appear in fewer than a proportion prop of the observations.
- fct_lump_n(): lumps all levels except for the n most frequent (or least frequent, if n < 0).
- fct_lump_lowfreq(): lumps together the least frequent levels, ensuring that “Other” is still the smallest level.
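To make the four descriptions above concrete, here is a minimal sketch on a small toy factor. The factor f and its counts are made up for illustration and are not part of the starwars data.

```r
library(forcats)

# Toy factor (hypothetical data): counts are a = 5, b = 3, c = 1, d = 1
f <- factor(c(rep("a", 5), rep("b", 3), "c", "d"))

# Keep levels that appear at least 3 times; lump the rest
table(fct_lump_min(f, min = 3))

# Keep levels that make up more than 20% of the values; lump the rest
table(fct_lump_prop(f, prop = 0.2))

# Keep only the single most frequent level; lump the rest
table(fct_lump_n(f, n = 1))

# Lump the rarest levels while "Other" stays the smallest level
table(fct_lump_lowfreq(f))
```

In each call the lumped values end up in a level called "Other", which is placed last among the levels.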
Let us work through examples of the new fct_lump family of functions using the starwars data, as in the tidyverse vignette for forcats. We can install the new version of forcats from CRAN using
# install forcats
install.packages("forcats")
And let us verify we have the new forcats version installed with packageVersion.
# check we have the version 0.5.0
packageVersion("forcats")
## [1] '0.5.0'
Let us load the packages needed.
library(tidyverse)
We will use the starwars data to illustrate the use cases of the new fct_lump_*() functions.
starwars %>% head()
## # A tibble: 6 x 13
##   name  height  mass hair_color skin_color eye_color birth_year gender homeworld
##   <chr>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>  <chr>
## 1 Luke…    172    77 blond      fair       blue            19   male   Tatooine
## 2 C-3PO    167    75 <NA>       gold       yellow         112   <NA>   Tatooine
## 3 R2-D2     96    32 <NA>       white, bl… red             33   <NA>   Naboo
## 4 Dart…    202   136 none       white      yellow          41.9 male   Tatooine
## 5 Leia…    150    49 brown      light      brown           19   female Alderaan
## 6 Owen…    178   120 brown, gr… light      blue            52   male   Tatooine
## # … with 4 more variables: species <chr>, films <list>, vehicles <list>,
## #   starships <list>
The variable whose levels we will focus on and lump is skin_color of the starwars characters. We can see that there are 31 different skin colors with different frequencies.
starwars %>%
  count(skin_color, sort = TRUE)
## # A tibble: 31 x 2
##    skin_color     n
##    <chr>      <int>
##  1 fair          17
##  2 light         11
##  3 dark           6
##  4 green          6
##  5 grey           6
##  6 pale           5
##  7 brown          4
##  8 blue           2
##  9 blue, grey     2
## 10 orange         2
## # … with 21 more rows
fct_lump_min()
fct_lump_min() lumps factor levels that appear fewer than min times. In this example, we lump the levels that appear fewer than 5 times in the starwars data and assign them to a new category, “Other”.
starwars %>%
  mutate(skin_color = fct_lump_min(skin_color, 5)) %>%
  count(skin_color, sort = TRUE)
We can see that after lumping all the levels that appear fewer than 5 times, our factor level counts look like the tibble shown below.
## # A tibble: 7 x 2
##   skin_color     n
##   <fct>      <int>
## 1 Other         36
## 2 fair          17
## 3 light         11
## 4 dark           6
## 5 green          6
## 6 grey           6
## 7 pale           5
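If “Other” is not the label you want, the fct_lump_*() functions take an other_level argument. The sketch below uses a small made-up factor (colors is hypothetical, not the starwars column) and names the lumped level "rare" instead.

```r
library(forcats)

# Hypothetical toy data: fair = 4, light = 3, and three colors seen once
colors <- factor(c(rep("fair", 4), rep("light", 3), "gold", "red", "white"))

# Lump levels seen fewer than 3 times and call the new level "rare"
lumped <- fct_lump_min(colors, min = 3, other_level = "rare")

levels(lumped)
table(lumped)
```

The kept levels stay in their original order and the lumped level is moved to the end.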
fct_lump_n()
fct_lump_n() lumps all levels except for the n most frequent into an “Other” category. Let us use fct_lump_n() on skin_color to keep the top 5 levels; the new factor level counts are shown below.
starwars %>%
  mutate(skin_color = fct_lump_n(skin_color, 5)) %>%
  count(skin_color, sort = TRUE)
The “Other” category is now the most frequent, and the rest are the 5 most frequent factor levels in the starwars data.
## # A tibble: 6 x 2
##   skin_color     n
##   <fct>      <int>
## 1 Other         41
## 2 fair          17
## 3 light         11
## 4 dark           6
## 5 green          6
## 6 grey           6
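As noted in the function list earlier, a negative n flips the behavior and keeps the least frequent levels instead. A minimal sketch on a made-up factor (f is hypothetical):

```r
library(forcats)

# Toy factor (hypothetical data): counts are a = 5, b = 3, c = 2, d = 1
f <- factor(c(rep("a", 5), rep("b", 3), rep("c", 2), "d"))

# Positive n: keep the 2 most frequent levels (a and b)
top2 <- fct_lump_n(f, n = 2)
table(top2)

# Negative n: keep the 2 least frequent levels (c and d)
bottom2 <- fct_lump_n(f, n = -2)
table(bottom2)
```

Ties between levels are handled by the ties.method argument; this sketch avoids ties at the cutoff so the default is enough.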
fct_lump_prop()
Sometimes you might want to lump factor levels based on their proportions rather than their counts. fct_lump_prop() lumps levels that appear less than a proportion prop of the time. For example, to keep only the skin colors that appear in at least 10% of the characters, we would use fct_lump_prop() on skin_color with 0.1 as shown below.
starwars %>%
  mutate(skin_color = fct_lump_prop(skin_color, 0.1)) %>%
  count(skin_color, sort = TRUE)
In this case we lump all levels except two into “Other”.
## # A tibble: 3 x 2
##   skin_color     n
##   <fct>      <int>
## 1 Other         59
## 2 fair          17
## 3 light         11
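One way to sanity-check the result is to compute the proportions by hand and compare. The sketch below assumes the starwars data that ships with dplyr; the object names kept_by_hand and lumped are just for illustration.

```r
library(dplyr)
library(forcats)

# Share of each skin color, keeping only those above the 10% cutoff
kept_by_hand <- starwars %>%
  count(skin_color) %>%
  mutate(prop = n / sum(n)) %>%
  filter(prop > 0.1) %>%
  pull(skin_color)

# The same cutoff applied with fct_lump_prop()
lumped <- fct_lump_prop(starwars$skin_color, prop = 0.1)

# Levels that survived, apart from the lumped "Other" level
setdiff(levels(lumped), "Other")
kept_by_hand
```

With 87 characters, a cutoff of 0.1 corresponds to 8.7 appearances, which is why only the colors with counts of 17 and 11 in the output above survive.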
fct_lump_lowfreq()
fct_lump_lowfreq() lumps together the least frequent levels, ensuring that “Other” is still the smallest level. If we try fct_lump_lowfreq() on the starwars data, nothing really changes. This is because many skin colors appear only once: lumping even two of them would give “Other” a count of 2, which is more than the remaining levels that appear once, so “Other” would no longer be the smallest level.
starwars %>%
  mutate(skin_color = fct_lump_lowfreq(skin_color)) %>%
  count(skin_color, sort = TRUE)
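To see fct_lump_lowfreq() actually lump something, we need data where a few levels dominate. The sketch below uses a made-up factor x (hypothetical, not from starwars) with two large levels and several small ones.

```r
library(forcats)

# Toy factor (hypothetical data) with counts
# A = 40, B = 10, C = 5, D = 27, and E to I = 1 each
x <- factor(rep(LETTERS[1:9], times = c(40, 10, 5, 27, 1, 1, 1, 1, 1)))

# Counts before lumping
table(x)

# Lump the rarest levels as long as "Other" stays the smallest level
lumped <- fct_lump_lowfreq(x)
table(lumped)
```

Here the small levels together add up to less than the second largest level, so they can all be folded into “Other” without “Other” overtaking any level that is kept.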
What else is new in forcats 0.5.0
In addition to the new fct_lump_*() functions, some existing forcats functions have gained new arguments and helpers. For example,
- “fct_collapse() now has an argument, other_level, which allows a user-specified Other level”.
- “Factors are now correctly collapsed when other_level is not NULL, and makes Other the last level”.
- “fct_reorder2() now has a helper function, first2(), which sorts .y by the first value of .x”.
Check out the change log for all the changes in forcats 0.5.0.