Here are four tidyverse tips for my future self. These are simple things from the tidyverse suite that I need often, yet I always have to google them and often struggle to come up with the right search phrase.
The first tip is the simple and extremely useful case_when() function from the dplyr package. case_when() replaces a chain of if_else() calls when you need to check multiple conditions.
The second and third tips use functions from the forcats package for dealing with factors. When you make a barplot with multiple groups, ggplot2 places the groups/categories of the barplot in alphabetical order by default. Often, however, you are working with categories that are ordinal, i.e. they have a natural order and a meaning associated with that order. The second tip shows how to use the forcats fct_relevel() function to put the bars of a barplot in the order that you specify. The third tip shows how to change the levels manually using the fct_recode() function from forcats.
The fourth tip is about coloring barplots with a color range that reflects the order of the bars in the barplot instead of the height of the bars.
Let us start with loading tidyverse.
library(tidyverse)
One of the common tasks while doing data munging is to create a new “simplified” variable based on an existing variable. We might have to check multiple if conditions on the existing variable and create a value for the new variable. In the simple example below, we have scores out of 100 for a bunch of students, and we want to assign descriptive grade categories based on the score ranges.
We want to create a new variable with a grade for each student based on their score. For example, students with scores greater than or equal to 90 get “Very High”, students with scores from 80 up to (but not including) 90 get “High”, and so on.
set.seed(40)
scores <- sample(seq(100), replace = TRUE)
df <- tibble(scores)
df %>% head()
## # A tibble: 6 x 1
##   scores
##    <int>
## 1     44
## 2     46
## 3     18
## 4     27
## 5     66
## 6     56
One can immediately see that multiple if conditions would be useful here. However, dplyr’s case_when() is fantastic for such situations. As dplyr’s documentation page says:
This function allows you to vectorise multiple if_else() statements. It is an R equivalent of the SQL CASE WHEN statement. If no cases match, NA is returned.
With case_when(), we can check multiple conditions. For each conditional statement, we put the condition we would like to check on the left hand side of ~ and the value to return when the condition is TRUE on the right hand side. The conditions are evaluated in order, so we start with the specific conditions we would like to check, and the last statement uses TRUE as its condition to catch everything we have not specified.
In the example below we check scores and assign a descriptive value for the score ranges 90 and above, 80 to 90, 60 to 80, and 40 to 60. The last statement is for any score less than 40.
df <- df %>%
  mutate(grade = case_when(
    scores >= 90 ~ "Very High",
    scores >= 80 & scores < 90 ~ "High",
    scores >= 60 & scores < 80 ~ "Medium",
    scores >= 40 & scores < 60 ~ "Low",
    TRUE ~ "Very Low"
  ))
And we get
df %>% head()
## # A tibble: 6 x 2
##   scores grade
##    <int> <chr>
## 1     44 Low
## 2     46 Low
## 3     18 Very Low
## 4     27 Very Low
## 5     66 Medium
## 6     56 Low
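Because case_when() evaluates its conditions in order and uses the first one that is TRUE, the upper bounds in the conditions above are not strictly needed. Here is a minimal sketch of that idea on a small hypothetical vector x (made up for illustration, not the scores above). It also shows the NA that the documentation mentions when no case matches.

```r
library(dplyr)

# Hypothetical scores, chosen so that every branch is hit once
x <- c(95, 85, 70, 45, 10)

# Conditions are checked top to bottom and the first TRUE one wins,
# so each later branch only sees values the earlier ones rejected.
grade_short <- case_when(
  x >= 90 ~ "Very High",
  x >= 80 ~ "High",
  x >= 60 ~ "Medium",
  x >= 40 ~ "Low",
  TRUE    ~ "Very Low"
)

# Without the catch-all TRUE branch, unmatched values become NA
grade_no_default <- case_when(
  x >= 90 ~ "Very High",
  x >= 80 ~ "High"
)
```

Keeping the explicit upper bounds, as in the example above, is still a reasonable choice when you want each line to be readable on its own.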
Another useful tip is that we can wrap a complex case_when() into a function and call that function inside mutate() to get the same result. Here is an example, where we write the case_when() statements inside a function and then use the function.
# a function for complex case_when statement
my_case_when <- function(scores) {
  case_when(
    scores >= 90 ~ "Very High",
    scores >= 80 & scores < 90 ~ "High",
    scores >= 60 & scores < 80 ~ "Medium",
    scores >= 40 & scores < 60 ~ "Low",
    TRUE ~ "Very Low"
  )
}
# example using case_when in a function
df %>%
  mutate(grade = my_case_when(scores)) %>%
  head()
And we get the same result as above.
## # A tibble: 6 x 2
##   scores grade
##    <int> <chr>
## 1     44 Low
## 2     46 Low
## 3     18 Very Low
## 4     27 Very Low
## 5     66 Medium
## 6     56 Low
How to Manually Order X-axis in a Barplot with ggplot2?
The second tip is how to manually specify the order of the bars in a barplot. When the variable on the x-axis is a character vector, ggplot2 orders its values alphabetically by default. Let us make a barplot to see the order of the x-axis.
df %>%
  count(grade) %>%
  ggplot(aes(x = grade, y = n)) +
  geom_col()
Let us say we would like our barplot to have counts on the y-axis and descriptive grades on the x-axis, but in the order we specify.
In this case, we would like the bars to go from “Very Low” scores to “Very High” scores. Let us first store the order we would like to have in a variable.
ordered_grades <- c("Very Low", "Low", "Medium","High","Very High")
Then we can use the fct_relevel() function from the forcats package to change the levels used in the barplot, by specifying the variable we would like to re-level and the order we would like.
df %>%
  count(grade) %>%
  mutate(grade = forcats::fct_relevel(grade, ordered_grades)) %>%
  ggplot(aes(x = grade, y = n)) +
  geom_col()
Now our barplot’s x-axis has the levels in the order that we want.
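A quick way to confirm the re-leveling without drawing a plot is to inspect the factor levels directly. This is a small sketch on a hypothetical character vector g (made up for illustration, not the data above).

```r
library(forcats)

# Hypothetical character vector of grades
g <- c("Low", "Very High", "Medium", "Very Low", "High", "Low")

# Converting to a factor sorts the levels alphabetically by default
default_levels <- levels(factor(g))

# fct_relevel() moves the listed levels to the front, in the given order
ordered_grades <- c("Very Low", "Low", "Medium", "High", "Very High")
g_ordered <- fct_relevel(g, ordered_grades)

# The levels now follow ordered_grades; the values are untouched
levels(g_ordered)
```

ggplot2 lays out a discrete axis in the order of the factor levels, which is why changing the levels changes the order of the bars.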
How to Manually Change the levels of a categorical variable with fct_recode() in forcats?
Sometimes we would like to change the levels of a categorical variable. In our example, we created descriptive levels for grades. Let us say we want to change the descriptive grade levels to commonly used letter grades, with A+ to F representing the very high scores to the very low scores.
We can use the fct_recode() function in the forcats package to manually change the levels to what we want. We provide the variable that we would like to recode first, and then, for each level, a pair with the new level name on the left and the old level name on the right.
df %>%
  count(grade) %>%
  mutate(grade = forcats::fct_relevel(grade, ordered_grades)) %>%
  mutate(grade = forcats::fct_recode(grade,
                                     "A+" = "Very High",
                                     "A" = "High",
                                     "B" = "Medium",
                                     "C" = "Low",
                                     "F" = "Very Low"))
And we get a tibble with new levels.
## # A tibble: 5 x 2
##   grade     n
##   <fct> <int>
## 1 A         9
## 2 C        21
## 3 B        16
## 4 A+       11
## 5 F        43
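Note that the rows of the tibble keep the order produced by count(), while the order used for plotting lives in the factor levels. The sketch below, on a hypothetical factor g (made up for illustration), suggests that fct_recode() only renames the levels and leaves their order alone.

```r
library(forcats)

# Hypothetical factor with levels already in the desired order
g <- factor(
  c("Low", "Very High", "Medium", "Very Low", "High"),
  levels = c("Very Low", "Low", "Medium", "High", "Very High")
)

# New names go on the left, existing level names on the right
g_letters <- fct_recode(g,
  "A+" = "Very High",
  "A"  = "High",
  "B"  = "Medium",
  "C"  = "Low",
  "F"  = "Very Low"
)

# The level order is unchanged; only the labels differ
levels(g_letters)
```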
Let us make the same barplot, but this time with the new levels that we manually changed.
df %>%
  count(grade) %>%
  mutate(grade = forcats::fct_relevel(grade, ordered_grades)) %>%
  mutate(grade = forcats::fct_recode(grade,
                                     "A+" = "Very High",
                                     "A" = "High",
                                     "B" = "Medium",
                                     "C" = "Low",
                                     "F" = "Very Low")) %>%
  ggplot(aes(x = grade, y = n)) +
  geom_col()
The fourth tip is about coloring barplots. Often we want to color bars with sequential or diverging colors based on the order of the x-axis levels, not the bar heights. In our example, we would like to color F with the darkest red and A+ with the lightest red, or use red for “F” and green for “A+”. I always struggle here: I think of continuous colors and try out all the wrong options, like scale_fill_continuous() and scale_fill_gradient(), before trying the correct one.
The right function to color the bars here is scale_fill_brewer(), with a palette that is sequential or diverging, like “Reds” or “RdYlGn”.
df %>%
  count(grade) %>%
  mutate(grade = forcats::fct_relevel(grade, ordered_grades)) %>%
  mutate(grade = forcats::fct_recode(grade,
                                     "A+" = "Very High",
                                     "A" = "High",
                                     "B" = "Medium",
                                     "C" = "Low",
                                     "F" = "Very Low")) %>%
  ggplot(aes(x = grade, y = n, fill = grade)) +
  geom_col() +
  scale_fill_brewer(palette = "Reds", direction = -1)
Here is an example of coloring the bars using scale_fill_brewer() with the diverging color palette “RdYlGn”.
df %>%
  count(grade) %>%
  mutate(grade = forcats::fct_relevel(grade, ordered_grades)) %>%
  mutate(grade = forcats::fct_recode(grade,
                                     "A+" = "Very High",
                                     "A" = "High",
                                     "B" = "Medium",
                                     "C" = "Low",
                                     "F" = "Very Low")) %>%
  ggplot(aes(x = grade, y = n, fill = grade)) +
  geom_col() +
  scale_fill_brewer(palette = "RdYlGn")
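The reason scale_fill_gradient() is the wrong tool here is that it is a continuous scale, while the fill in our plots is mapped to a factor. The sketch below contrasts the two mappings on a small hypothetical data frame, counts (the numbers are made up for illustration). The colors "mistyrose" and "darkred" are arbitrary choices.

```r
library(ggplot2)

# Hypothetical counts per grade, with levels in the desired order
counts <- data.frame(
  grade = factor(c("F", "C", "B", "A", "A+"),
                 levels = c("F", "C", "B", "A", "A+")),
  n = c(5, 8, 12, 7, 3)
)

# fill is mapped to a factor, so a discrete scale such as
# scale_fill_brewer() applies; colors follow the level order
p_discrete <- ggplot(counts, aes(x = grade, y = n, fill = grade)) +
  geom_col() +
  scale_fill_brewer(palette = "RdYlGn")

# fill is mapped to the numeric count, so a continuous scale such as
# scale_fill_gradient() applies; colors follow the bar heights
p_continuous <- ggplot(counts, aes(x = grade, y = n, fill = n)) +
  geom_col() +
  scale_fill_gradient(low = "mistyrose", high = "darkred")
```

Printing p_discrete and p_continuous side by side makes the difference visible: in the first the colors run from F to A+, in the second they track the bar heights.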