10 Tricks for tidyverse in R - Python and R Tips

I just happened to come across this tweet about David Robinson’s talk, “Ten Tremendous Tricks for Tidyverse”. It looked like a fantastic and useful talk. The ten tricks involve tidyverse functions you may not have heard of, or may not have thought of using in a given scenario. The first four tips are about counting and summarizing, the next three are about visualization with ggplot2, and the last three are all about tidyr functions.

.@drob’s 10 tricks in the Tidyverse!
1. count()
2. count() with its three extra arguments
3. add_count()
4. summarize() with list columns
5. fct_reorder() + geom_col() + coord_flip()
6. fct_lump()
7. scale_x/y_log10()
8. crossing()
9. separate()
10. extract()
#rstatsdc #rstats pic.twitter.com/d4vF1IsLtE

— Emily Robinson (@robinson_es) November 9, 2019

All ten tricks are extremely handy, and we have already seen many of these tips on our blog before. Still, it is fun to put them to use.

In this post, we will see examples of all ten tricks for tidyverse, with a self-contained code chunk for each trick. This is an attempt to show examples of the ten tricks without having attended the talk. If you really want to learn more tips, make sure to check out David Robinson’s Tidy Tuesday screencasts exploring TidyTuesday datasets.

Let us first load tidyverse and set the ggplot2 theme to theme_bw().

library(tidyverse)
library(broom)
theme_set(theme_bw())

Let us load gapminder data with CO2 emission directly from the web.

co2_data <- "https://raw.githubusercontent.com/cmdlinetips/data/master/gapminder_data_with_co2_emission.tsv"
gapminder_co2 <- read_tsv(co2_data)

Quick summary from read_tsv() function.

## Parsed with column specification:
## cols(
##   country = col_character(),
##   year = col_double(),
##   co2 = col_double(),
##   continent = col_character(),
##   lifeExp = col_double(),
##   pop = col_double(),
##   gdpPercap = col_double()
## )

Tidyverse Trick 1: count()

The count() function in dplyr counts the number of occurrences of each value of a variable. For example, to get the number of observations (rows) per continent in the gapminder data, we simply use count(continent).

gapminder_co2 %>% 
  count(continent)

## # A tibble: 5 x 2
##   continent     n
##   <chr>     <int>
## 1 Africa      543
## 2 Americas    288
## 3 Asia        312
## 4 Europe      320
## 5 Oceania      24

Without count(), you would have to group_by() continent and then summarise() to get the results shown above.
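For comparison, here is a minimal sketch of the longer group_by() + summarise() route next to count(). It uses a small made-up tibble (toy is a hypothetical stand-in, not the gapminder data) so that it runs without a download.

```r
library(dplyr)

# Hypothetical toy data standing in for gapminder_co2
toy <- tibble(
  continent = c("Asia", "Asia", "Europe", "Africa", "Africa", "Africa"),
  country   = c("India", "Japan", "France", "Kenya", "Ghana", "Mali")
)

# count() in one step
by_count <- toy %>%
  count(continent)

# The same counts with group_by() + summarise()
by_summarise <- toy %>%
  group_by(continent) %>%
  summarise(n = n())

by_count
by_summarise
```

Both results have one row per continent, with the number of rows in a column named n.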

Tidyverse Trick 2: count() with three arguments

count() can take additional arguments. For example, we can specify the name of the new count column with the name argument. We can sort the results in descending order with the sort argument. We can also use the wt argument to get the sum of a variable instead of a simple count.

For example, if we wanted total population for each continent and each year, we can use count with the three arguments as follows.

gapminder_co2 %>% 
  count(continent, year, wt=pop,
        name="total_pop", sort=TRUE)

## # A tibble: 60 x 3
##    continent  year  total_pop
##    <chr>     <dbl>      <dbl>
##  1 Asia       2007 3683222531
##  2 Asia       2002 3480310138
##  3 Asia       1997 3268749513
##  4 Asia       1992 3026785976
##  5 Asia       1987 2772278349
##  6 Asia       1982 2518312680
##  7 Asia       1977 2300718259
##  8 Asia       1972 2074847621
##  9 Asia       1967 1827505100
## 10 Asia       1962 1627007479
## # … with 50 more rows

Without count(), we would need three steps to get the same results: group_by(), summarize(), and arrange(), as follows.

gapminder_co2 %>% 
  group_by(continent, year) %>%
  summarize(total_pop=sum(pop)) %>%
  arrange(desc(total_pop))

Tidyverse Trick 3: add_count()

Another useful function in the same flavor as count() is add_count(). Given a variable as its argument, add_count() keeps all the rows and adds a new column containing the number of rows in each group.

For example, with gapminder data, when we use add_count(continent), the function will create a new column named n containing the number of rows for each continent.

gapminder_co2 %>% 
  select(continent, lifeExp) %>% 
  add_count(continent)

Here we have selected just two columns from gapminder data for clarity. Note that all rows corresponding to the continent Asia will have the number of rows of Asia in the data.

## # A tibble: 1,487 x 3
##    continent lifeExp     n
##    <chr>       <dbl> <int>
##  1 Asia         28.8   312
##  2 Asia         30.3   312
##  3 Asia         32.0   312
##  4 Asia         34.0   312
##  5 Asia         36.1   312
##  6 Asia         38.4   312
##  7 Asia         39.9   312
##  8 Asia         40.8   312
##  9 Asia         41.7   312
## 10 Asia         41.8   312
## # … with 1,477 more rows
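One common use of add_count() is to filter on group size without collapsing the data. The sketch below assumes a small made-up tibble (toy) and an arbitrary threshold of two rows, both chosen only for illustration.

```r
library(dplyr)

# Hypothetical toy data, not the gapminder data
toy <- tibble(
  continent = c("Asia", "Asia", "Asia", "Europe", "Europe", "Oceania"),
  lifeExp   = c(60.1, 72.4, 68.0, 78.2, 80.5, 81.0)
)

# add_count() keeps every row and appends the group size as column n
with_n <- toy %>%
  add_count(continent)

# Keep only continents with at least two rows (illustrative threshold)
large_groups <- with_n %>%
  filter(n >= 2)

large_groups
```

Because the original rows are preserved, the filtered result still has one row per observation rather than one row per continent.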

Tidyverse Trick 4: summarize() with list columns

summarize() is one of the core dplyr verbs in tidyverse. It is mainly used to compute summary statistics. However, it is also handy for creating list columns, which are useful when building many statistical models.

Let us use our gapminder data to model how each continent’s total population grew over years by building a linear model using lm.

Let us first compute total population per year for each continent using the count() function above.

df <- gapminder_co2 %>% 
  count(continent, year, wt=pop,
        name="total_pop", sort=TRUE)

Let us build this example step by step to see how we can use the summarize() function to create list columns and how to use the list columns to perform analysis with many models. In this example, we will make a coefficient plot showing the slope estimate for each continent with its confidence interval.

First, we use summarize() to create a list column containing a linear model for each continent.

df %>%
  group_by(continent) %>%
  summarise(lm_mod= list(lm(total_pop ~year)))

We can see that we now have a new tibble with a list column holding the linear model for each continent.

## # A tibble: 5 x 2
##   continent lm_mod
##   <chr>     <list>
## 1 Africa    <lm>  
## 2 Americas  <lm>  
## 3 Asia      <lm>  
## 4 Europe    <lm>  
## 5 Oceania   <lm>

Let us use the tidy() function from broom to get the results of the linear model for each continent in tidy form, with confidence intervals for the estimates.

df %>%
  group_by(continent) %>%
  summarise(lm_mod= list(lm(total_pop ~year))) %>%
  mutate(tidied = map(lm_mod,tidy,conf.int = TRUE))

This creates another list column containing a small tibble for each continent with the linear model results in tidy form.

## # A tibble: 5 x 3
##   continent lm_mod tidied          
##   <chr>     <list> <list>          
## 1 Africa    <lm>   <tibble [2 × 7]>
## 2 Americas  <lm>   <tibble [2 × 7]>
## 3 Asia      <lm>   <tibble [2 × 7]>
## 4 Europe    <lm>   <tibble [2 × 7]>
## 5 Oceania   <lm>   <tibble [2 × 7]>

Let us unnest() the list column “tidied”. This will give us two estimates from the linear model for each continent: one for the intercept and one for the slope.

df %>%
  group_by(continent) %>%
  summarise(lm_mod= list(lm(total_pop ~year))) %>%
  mutate(tidied = map(lm_mod,tidy,conf.int = TRUE)) %>%
  unnest(tidied)

Now we have our estimates from multiple linear models.

## # A tibble: 10 x 9
##    continent lm_mod term  estimate std.error statistic  p.value conf.low
##    <chr>     <list> <chr>    <dbl>     <dbl>     <dbl>    <dbl>    <dbl>
##  1 Africa    <lm>   (Int… -2.60e10    1.22e9     -21.3 1.15e- 9 -2.87e10
##  2 Africa    <lm>   year   1.34e 7    6.17e5      21.7 9.56e-10  1.20e 7
##  3 Americas  <lm>   (Int… -1.95e10    2.48e8     -78.8 2.66e-15 -2.01e10
##  4 Americas  <lm>   year   1.02e 7    1.25e5      81.2 1.96e-15  9.90e 6
##  5 Asia      <lm>   (Int… -8.58e10    1.61e9     -53.3 1.30e-13 -8.94e10
##  6 Asia      <lm>   year   4.46e 7    8.13e5      54.9 9.81e-14  4.28e 7
##  7 Europe    <lm>   (Int… -6.30e 9    2.23e8     -28.3 7.01e-11 -6.80e 9
##  8 Europe    <lm>   year   3.44e 6    1.12e5      30.6 3.28e-11  3.19e 6
##  9 Oceania   <lm>   (Int… -4.83e 8    3.92e6    -123.  3.00e-17 -4.92e 8
## 10 Oceania   <lm>   year   2.53e 5    1.98e3     128.  2.09e-17  2.49e 5
## # … with 1 more variable: conf.high <dbl>

And we are ready to make the coefficient plot with the slope estimates and their confidence intervals.

df %>%
  group_by(continent) %>%
  summarise(lm_mod= list(lm(total_pop ~year))) %>%
  mutate(tidied = map(lm_mod,tidy,conf.int = TRUE)) %>%
  unnest(tidied) %>%
  filter(term!="(Intercept)") %>%
  ggplot(aes(estimate,continent)) +
  geom_point()+
  geom_errorbarh(aes(xmin=conf.low, xmax=conf.high), height = .3) +
  labs(title="Total population per continent over years") +
  theme_bw(base_size=16)

Tidyverse Trick 5: fct_reorder() + geom_col() + coord_flip()

Bar plots are a great way to quickly visualize counts or a specific quantity across multiple categories. However, a barplot can be hard to interpret if its bars are not ordered properly. Check out the earlier post for tips on making better plots. One of the tricks is to combine barplots made using geom_col() with fct_reorder() and coord_flip().

Let us see an example using gapminder data and make a barplot of the amount of CO2 emission per continent for the year 2007.

Within ggplot()’s aes() function, we order the continents by their CO2 emission values with fct_reorder(). We use geom_col() to make the barplot and coord_flip() to flip the axes, which keeps the category labels legible.

gapminder_co2 %>% 
  filter(year==2007) %>%
  count(continent, wt=co2, name="total_co2") %>%
  ggplot(aes(x=fct_reorder(continent,total_co2),y=total_co2))+
  geom_col()+
  labs(x="Continent", title="Total CO2 emission for year 2007")+
  coord_flip()

Figure: barplot made with geom_col() + fct_reorder() + coord_flip()
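To see what fct_reorder() does on its own, here is a small sketch that compares factor levels before and after reordering. The vectors and their values are made up for illustration and are not taken from the CO2 data.

```r
library(forcats)

# Hypothetical values, not taken from the CO2 data
continent <- c("Africa", "Americas", "Asia")
total_co2 <- c(10, 30, 20)

# Default factor levels are alphabetical
default_levels <- levels(factor(continent))

# fct_reorder() sorts the levels by the second argument (ascending by default)
reordered <- fct_reorder(continent, total_co2)

default_levels
levels(reordered)
```

ggplot2 draws categories in level order, which is why reordering the levels changes the order of the bars. Passing .desc = TRUE to fct_reorder() reverses the order.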

Tidyverse Trick 6: fct_lump()

Often, when working with a dataset that has a categorical variable with numerous levels, a quick standard barchart or boxplot to understand the trend can get really cumbersome. One solution is to focus only on the top levels. The forcats package in tidyverse has a nice function, fct_lump(), that lumps the uninteresting levels into a single level.

Let us see an example of how to use fct_lump() to lump factor levels into a new “Other” category and visualize the result using a barplot like the example above.

Let us use the “big” car fuel economy dataset from the Tidy Tuesday project.

big_epa_cars <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-10-15/big_epa_cars.csv")

This dataset contains a large number of car makers with their fuel economy for each of their models. Let us make a quick exploratory barplot with each car maker and the number of models they have.

big_epa_cars %>% 
  select(make) %>%
  ggplot(aes(x=fct_rev(fct_infreq(make))))+
  geom_bar() +
  labs(x="Car Make")+
  coord_flip()+
  theme_bw(base_size=16)

We can immediately see that the barplot is hard to read, because there are a lot of car makers with just one or two models.

Typically one might be interested in the car makers with a lot of car models. A better way to visualize this is to use fct_lump() to group all car makers with a low number of models into a single category.

For example, if we want to keep only the top 5 makers and lump the remaining car makers into one big group, we would use fct_lump() as shown below.

big_epa_cars %>% 
  select(make) %>%
  mutate(make_lumped = fct_lump(make,5))

Basically, we used fct_lump() to create a new variable that keeps the top levels and lumps the rest into a level called “Other”.

## # A tibble: 41,804 x 2
##    make       make_lumped
##    <chr>      <fct>      
##  1 Alfa Romeo Other      
##  2 Ferrari    Other      
##  3 Dodge      Dodge      
##  4 Dodge      Dodge      
##  5 Subaru     Other      
##  6 Subaru     Other      
##  7 Subaru     Other      
##  8 Toyota     Toyota     
##  9 Toyota     Toyota     
## 10 Toyota     Toyota     
## # … with 41,794 more rows
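The lumping is easier to follow on a tiny vector. The sketch below assumes a made-up set of car makes (not the EPA data) and keeps the two most frequent ones.

```r
library(forcats)

# Hypothetical car makes: two frequent ones and three rare ones
make <- c(rep("Ford", 4), rep("Toyota", 3), "Ferrari", "Lotus", "Saab")

# Keep the 2 most frequent levels and lump everything else into "Other"
make_lumped <- fct_lump(make, n = 2)

table(make_lumped)
```

The three rare makes end up in a single “Other” level. Recent forcats versions also provide fct_lump_n() for this top-n case.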

Now we can make the barplot that we wanted. This time we would use the newly created variable from fct_lump().

big_epa_cars %>% 
  select(make) %>%
  mutate(make_lumped = fct_lump(make,35)) %>%
  ggplot(aes(x=fct_rev(fct_infreq(make_lumped))))+
  geom_bar() +
  labs(x="Car Make")+
  coord_flip()

In this fct_lump() example, we keep the top 35 levels, and we can see the top car makers much more clearly.

Figure: barplot with fct_lump()

Tidyverse Trick 7: scale_x/y_log10()

Often when making plots, visualizing them on a log scale can easily reveal the trend in the data. For example, scatter plots are a great way to visualize the relationship between two quantitative variables. However, sometimes the real relationship between two variables may not be easily visible.

Let us use gapminder data to make a scatter plot of gdpPercap against lifeExp.

gapminder_co2 %>% 
  ggplot(aes(x=gdpPercap,y=lifeExp))+
  geom_point()

We can see that our scatter plot is kind of squished along x-axis due to big gdpPercap outliers.

We can make the x-axis log scale in ggplot2 with scale_x_log10().

gapminder_co2 %>% 
  ggplot(aes(x=gdpPercap,y=lifeExp))+
  geom_point() +
  scale_x_log10()

Thanks to the log scale on the x-axis, we can now see the roughly linear trend clearly.

When needed, we can also scale the y axis to log scale with scale_y_log10() function in ggplot2.
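As a minimal sketch of using both functions together, the example below puts both axes on a log scale. The data frame toy is made up for illustration, with values spanning several orders of magnitude.

```r
library(ggplot2)

# Hypothetical data spanning several orders of magnitude on both axes
toy <- data.frame(
  x = c(1, 10, 100, 1000),
  y = c(2, 20, 200, 2000)
)

# Log scale on both the x and y axes
p <- ggplot(toy, aes(x = x, y = y)) +
  geom_point() +
  scale_x_log10() +
  scale_y_log10()

p
```

Note that log scales only make sense for strictly positive values.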

Tidyverse Trick 9: separate()

The last two tips are two related functions. The first, separate() from tidyr, is a great help when you want to split a single character column in a data frame into multiple columns. It is a great way to parse a complicated text column into simpler ones.

Let us consider a simple example where we first create a data frame with a character column.

df <- data.frame(period=c("Q1_y2019","Q2_y2019", "Q3_y2019","Q4_y2019"),
                 revenue=c(23,24,27,29))

Here we have a character column, “period”, that has two tokens separated by an underscore.

##     period revenue
## 1 Q1_y2019      23
## 2 Q2_y2019      24
## 3 Q3_y2019      27
## 4 Q4_y2019      29

We can split the single character column into two columns using the separate() function. All you need to do is specify the names of the columns you want and, optionally, specify how to split the original character column using the “sep” argument.

df %>% 
   separate(period,c("Quarter","Year"))

In this example, we rely on the default separator, which splits on any run of non-alphanumeric characters (here the underscore), to get two columns from a single column. The sep argument can also take a regular expression to control the split. Now we have two new columns made from the original column.

##   Quarter  Year revenue
## 1      Q1 y2019      23
## 2      Q2 y2019      24
## 3      Q3 y2019      27
## 4      Q4 y2019      29
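If you prefer not to rely on the default, the separator can be given explicitly. The sketch below assumes a shortened version of the same data frame and splits only on the underscore.

```r
library(tidyr)

# Shortened version of the example data frame
df <- data.frame(period = c("Q1_y2019", "Q2_y2019"),
                 revenue = c(23, 24))

# Split only on the underscore instead of relying on the default separator
res <- separate(df, period, into = c("Quarter", "Year"), sep = "_")

res
```

An explicit sep is safer when the text may contain other punctuation that should stay inside a token.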

Tidyverse Trick 10: extract()

extract() is another tidyr function in the same flavor as separate(). It is a bit more powerful, since it uses a regular expression with capture groups to pull patterns of interest out of a text column and create one or more new columns.

For example, using the same sample data frame, we can extract just the quarter and year information using the extract() function with a regular expression for the pattern we want.

df %>% 
   extract(period,c("Quarter","Year"),"Q(.*)_y(.*)")

Here we extracted the quarter information following Q and the year information following y, and the resulting data frame looks like this.

##   Quarter Year revenue
## 1       1 2019      23
## 2       2 2019      24
## 3       3 2019      27
## 4       4 2019      29
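By default the extracted columns are character. As a hedged variation on the example above, the sketch below uses a stricter digits-only pattern (my own choice, not from the talk) and the convert argument of extract() to get integer columns.

```r
library(tidyr)

# Shortened version of the example data frame
df <- data.frame(period = c("Q1_y2019", "Q2_y2019"),
                 revenue = c(23, 24))

# Capture digits only; convert = TRUE turns the captured text into integers
res <- extract(df, period, into = c("Quarter", "Year"),
               regex = "Q(\\d)_y(\\d{4})", convert = TRUE)

str(res)
```

With integer columns, Quarter and Year can be used directly for sorting, filtering, or arithmetic.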