A new version of dplyr, version 1.0.0, is here. It was originally supposed to be available in early May and is finally out on CRAN now. One of the cool things about dplyr 1.0.0 is its new logo.
Jokes apart, dplyr 1.0.0 is loaded with new features, and Hadley Wickham has been teasing the important ones in a series of blogposts since March. We have already seen three blogposts on dplyr 1.0.0 covering the big changes.
Here is a quick post trying out a couple of cool new features of dplyr 1.0.0. This is just a start, and we look forward to exploring and playing with the new features more.
In this blogpost, we will go over two new functionalities, summarise() and rowwise() in dplyr 1.0.0. summarize() is an old function, but is loaded with new features that weren’t possible prior to dplyr 1.0.0. And rowwise() has gotten a new life and kind of risen from the ashes since it was discouraged a few years ago.
dplyr 1.0.0 is available from CRAN and can be installed using
install.packages("dplyr")
Let us load dplyr and make sure we have the dplyr version 1.0.0+.
library(dplyr)
packageVersion("dplyr")
[1] ‘1.0.0’
Let us also load readr for reading files and gapminder for the data we will use to try out dplyr 1.0.0.
library(readr)
library(gapminder)
summarise() function: dplyr 1.0.0
Let us start with the existing summarize() function, which now has cool new features. dplyr's summarize() is one of the key functions: it lets you collapse existing data into a single summary value.
One of the simplest uses of summarize() is to group_by() one variable and compute a summary statistic of another variable for each value of the grouping variable.
For example, with the gapminder dataset we can compute the mean lifeExp for each continent by using group_by() on the continent variable and summarize() to compute the mean of lifeExp.
gapminder %>%
  group_by(continent) %>%
  summarize(avg_life = mean(lifeExp))
We get one summary value for each continent.
## # A tibble: 5 x 2
##   continent avg_life
##   <fct>        <dbl>
## 1 Africa        48.9
## 2 Americas      64.7
## 3 Asia          60.1
## 4 Europe        71.9
## 5 Oceania       74.3
Starting from dplyr 1.0.0, instead of a single value, the summarize() function can generate a rectangle of arbitrary size. The blogpost introducing the new summarize() says “This is a big change to summarise() but it should have minimal impact on existing code because it broadens the interface: all existing code will continue to work, and a number of inputs that would have previously errored now work.”
Let us say you don’t just want the mean lifeExp value for each continent. Instead, you want multiple quantiles of lifeExp for each continent.
We can use the new summarise() function to compute specific quantiles of interest. In the example below, we ask for the 25th, 50th (median), and 75th percentiles by passing them as a numeric vector of probabilities to the quantile() function.
gapminder %>%
  group_by(continent) %>%
  summarise(lifeExp_q = quantile(lifeExp, c(0.25, 0.5, 0.75)))
Now, instead of a single-row summary for each continent, we get three rows of summary statistics per continent.
## `summarise()` regrouping output by 'continent' (override with `.groups` argument)
## # A tibble: 15 x 2
## # Groups:   continent [5]
##    continent lifeExp_q
##    <fct>         <dbl>
##  1 Africa         42.4
##  2 Africa         47.8
##  3 Africa         54.4
##  4 Americas       58.4
##  5 Americas       67.0
##  6 Americas       71.7
##  7 Asia           51.4
##  8 Asia           61.8
##  9 Asia           69.5
## 10 Europe         69.6
## 11 Europe         72.2
## 12 Europe         75.5
## 13 Oceania        71.2
## 14 Oceania        73.7
## 15 Oceania        77.6
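The message at the top of this output points to the `.groups` argument. Here is a minimal sketch of what it does, using a small made-up tibble (`toy`, with hypothetical values, not gapminder) so that the result is easy to check by hand. With `.groups = "drop"` the message goes away and the result comes back ungrouped.

```r
library(dplyr)

# Toy data: hypothetical values, used only for illustration
toy <- tibble(
  grp = c("a", "a", "a", "b", "b", "b"),
  x   = c(1, 2, 3, 10, 20, 30)
)

res <- toy %>%
  group_by(grp) %>%
  # Three quantiles per group; drop the grouping from the result
  summarise(x_q = quantile(x, c(0.25, 0.5, 0.75)), .groups = "drop")

is_grouped_df(res)  # FALSE: the output is a plain, ungrouped tibble
nrow(res)           # 6: three rows for each of the two groups
```

One caveat if you run this on a newer installation: from dplyr 1.1.0 onward, returning more than one row per group from summarise() is deprecated in favour of reframe(), so you should expect a deprecation warning there.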
Not just that, we can also create a new variable that records which quantile each row holds.
Here we create a new variable q that labels each computed value with its quantile.
gapminder %>%
  group_by(continent) %>%
  summarise(lifeExp_q = quantile(lifeExp, c(0.25, 0.5, 0.75)),
            q = c(0.25, 0.5, 0.75))
And the new summarize() adds the column q, which repeats the three probabilities within each continent.
## # A tibble: 15 x 3
## # Groups:   continent [5]
##    continent lifeExp_q     q
##    <fct>         <dbl> <dbl>
##  1 Africa         42.4  0.25
##  2 Africa         47.8  0.5
##  3 Africa         54.4  0.75
##  4 Americas       58.4  0.25
##  5 Americas       67.0  0.5
##  6 Americas       71.7  0.75
##  7 Asia           51.4  0.25
##  8 Asia           61.8  0.5
##  9 Asia           69.5  0.75
## 10 Europe         69.6  0.25
## 11 Europe         72.2  0.5
## 12 Europe         75.5  0.75
## 13 Oceania        71.2  0.25
## 14 Oceania        73.7  0.5
## 15 Oceania        77.6  0.75
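Typing the probabilities twice is a little error-prone. One possible way around it, sketched here under the assumption that an unnamed data frame returned inside summarise() is unpacked into columns (which dplyr 1.0.0 supports), is to wrap both columns in a small helper. The helper name `quantile_df` and the toy data are made up for this illustration.

```r
library(dplyr)

# Hypothetical helper: one row per requested probability, holding the
# quantile value and the probability it belongs to
quantile_df <- function(x, probs = c(0.25, 0.5, 0.75)) {
  tibble(value = quantile(x, probs, names = FALSE), q = probs)
}

# Toy data: hypothetical values, used only for illustration
toy <- tibble(
  grp = c("a", "a", "a", "b", "b", "b"),
  x   = c(1, 2, 3, 10, 20, 30)
)

res <- toy %>%
  group_by(grp) %>%
  # The unnamed tibble is spread into the columns `value` and `q`
  summarise(quantile_df(x), .groups = "drop")

names(res)  # "grp" "value" "q"
```

The probabilities now live in one place, so the values and their labels cannot drift apart.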
This example illustrates just a simple use of summarize(), and we look forward to learning its more powerful uses soon.
rowwise() function: dplyr 1.0.0
rowwise() is the second function we will focus on. Let us see a simple example of using rowwise() to perform a row-wise operation.
We will again turn to the gapminder data, but this time in wide form.
data_url <- "https://raw.githubusercontent.com/cmdlinetips/data/master/gapminder/gapminder_all.csv"
df <- read_csv(data_url)
We can see that the gapminder data is in wide form, with gdpPercap, lifeExp, and pop for each year as columns/variables.
head(df)
## # A tibble: 6 x 38
##   continent country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967
##   <chr>     <chr>            <dbl>          <dbl>          <dbl>          <dbl>
## 1 Africa    Algeria          2449.          3014.          2551.          3247.
## 2 Africa    Angola           3521.          3828.          4269.          5523.
## 3 Africa    Benin            1063.           960.           949.          1036.
## 4 Africa    Botswa…           851.           918.           984.          1215.
## 5 Africa    Burkin…           543.           617.           723.           795.
## 6 Africa    Burundi           339.           380.           355.           413.
## # … with 32 more variables: gdpPercap_1972 <dbl>, gdpPercap_1977 <dbl>,
## #   gdpPercap_1982 <dbl>, gdpPercap_1987 <dbl>, gdpPercap_1992 <dbl>,
## #   gdpPercap_1997 <dbl>, gdpPercap_2002 <dbl>, gdpPercap_2007 <dbl>,
## #   lifeExp_1952 <dbl>, lifeExp_1957 <dbl>, lifeExp_1962 <dbl>,
## #   lifeExp_1967 <dbl>, lifeExp_1972 <dbl>, lifeExp_1977 <dbl>,
## #   lifeExp_1982 <dbl>, lifeExp_1987 <dbl>, lifeExp_1992 <dbl>,
## #   lifeExp_1997 <dbl>, lifeExp_2002 <dbl>, lifeExp_2007 <dbl>, pop_1952 <dbl>,
## #   pop_1957 <dbl>, pop_1962 <dbl>, pop_1967 <dbl>, pop_1972 <dbl>,
## #   pop_1977 <dbl>, pop_1982 <dbl>, pop_1987 <dbl>, pop_1992 <dbl>,
## #   pop_1997 <dbl>, pop_2002 <dbl>, pop_2007 <dbl>
Let us say we want to perform a row-wise operation and compute the mean of multiple columns. Before dplyr 1.0.0, the usual route was to reshape the data into long form with pivot_longer() and then compute the summary statistics. With dplyr 1.0.0, however, we can use the rowwise() function and compute things for each row directly.
Let us first try rowwise() to compute the mean of three lifeExp columns and three gdpPercap columns. We pass continent and country to rowwise() so that these identifier columns are kept in the output, and then use summarise() to compute the mean of the three gdpPercap columns and the three lifeExp columns.
df %>%
  rowwise(continent, country) %>%
  summarise(mean_gdp = mean(c(gdpPercap_1952, gdpPercap_1957, gdpPercap_1962)),
            mean_life = mean(c(lifeExp_1952, lifeExp_1957, lifeExp_1962)))
And we nicely get back mean values of three columns in each row.
## # A tibble: 142 x 4
## # Groups:   continent, country [142]
##    continent country                  mean_gdp mean_life
##    <chr>     <chr>                       <dbl>     <dbl>
##  1 Africa    Algeria                     2671.      45.7
##  2 Africa    Angola                      3873.      32.0
##  3 Africa    Benin                        991.      40.4
##  4 Africa    Botswana                     918.      49.6
##  5 Africa    Burkina Faso                 628.      34.9
##  6 Africa    Burundi                      358.      40.5
##  7 Africa    Cameroon                    1295.      40.5
##  8 Africa    Central African Republic    1152.      37.5
##  9 Africa    Chad                        1292.      39.9
## 10 Africa    Comoros                     1240.      42.5
## # … with 132 more rows
We can immediately see that our gapminder data has many more columns, and that we computed the mean of just three of them by spelling out their names in a vector.
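For comparison, here is a rough sketch of the reshape-based route mentioned earlier, set side by side with the rowwise() route. It uses a tiny made-up wide table (`wide`, with hypothetical values) instead of the full gapminder file, and assumes tidyr is installed for pivot_longer().

```r
library(dplyr)
library(tidyr)

# Toy wide data: hypothetical values, used only for illustration
wide <- tibble(
  country      = c("A", "B"),
  lifeExp_1952 = c(40, 60),
  lifeExp_1957 = c(42, 62),
  lifeExp_1962 = c(44, 64)
)

# Reshape route: go long, then summarise per country
long_means <- wide %>%
  pivot_longer(starts_with("lifeExp"),
               names_to = "year", values_to = "lifeExp") %>%
  group_by(country) %>%
  summarise(mean_life = mean(lifeExp), .groups = "drop")

# rowwise route: summarise each row of the wide table directly
row_means <- wide %>%
  rowwise(country) %>%
  summarise(mean_life = mean(c(lifeExp_1952, lifeExp_1957, lifeExp_1962)),
            .groups = "drop")

long_means$mean_life  # 42 62
row_means$mean_life   # 42 62
```

Both routes give the same means on this toy table. The rowwise() version simply skips the reshaping step.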
With rowwise() and the new c_across() function, we can perform the row-wise summary operation more generally.
c_across() is designed to work with rowwise() to make it easy to perform row-wise aggregations. It has two differences from c():
It uses tidy select semantics so you can easily select multiple variables. See vignette("rowwise") for more details.
It uses vctrs::vec_c() in order to give safer outputs.
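To make the second point a bit more concrete, here is a small sketch of what "safer" can mean in practice. It rests on the assumption that c_across() refuses to combine a numeric and a character column (because vctrs::vec_c() does not coerce between them), whereas base c() silently turns the number into text. The tibble `mixed` is made up for this illustration.

```r
library(dplyr)

# Toy data with one numeric and one character column (hypothetical values)
mixed <- tibble(id = 1:2, a = c(1, 2), b = c("x", "y"))

# Base c() silently coerces the number to character
loose <- mixed %>%
  rowwise(id) %>%
  summarise(v = paste(c(a, b), collapse = "|"), .groups = "drop")

loose$v  # "1|x" "2|y"

# c_across() goes through vctrs::vec_c(); here we catch the error it is
# expected to raise and keep a short marker string instead
strict <- tryCatch(
  mixed %>%
    rowwise(id) %>%
    summarise(v = paste(c_across(c(a, b)), collapse = "|"), .groups = "drop"),
  error = function(e) "incompatible types"
)
```

If the assumption holds, `strict` ends up as the marker string rather than a tibble, which shows that the type mismatch was caught instead of being papered over.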
Here c_across() helps us select all the columns that start with each of the two prefixes and compute their mean.
df %>%
  rowwise(continent, country) %>%
  summarise(mean_gdp = mean(c_across(starts_with("gdp"))),
            mean_life = mean(c_across(starts_with("life"))))
And we get mean lifeExp and gdpPercap for each row, but using all corresponding data.
## # A tibble: 142 x 4
## # Groups:   continent, country [142]
##    continent country                  mean_gdp mean_life
##    <chr>     <chr>                       <dbl>     <dbl>
##  1 Africa    Algeria                     4426.      59.0
##  2 Africa    Angola                      3607.      37.9
##  3 Africa    Benin                       1155.      48.8
##  4 Africa    Botswana                    5032.      54.6
##  5 Africa    Burkina Faso                 844.      44.7
##  6 Africa    Burundi                      472.      44.8
##  7 Africa    Cameroon                    1775.      48.1
##  8 Africa    Central African Republic     959.      43.9
##  9 Africa    Chad                        1165.      46.8
## 10 Africa    Comoros                     1314.      52.4
## # … with 132 more rows
We can combine rowwise() with summarize() function to compute multiple quantiles, not just mean as before. Isn’t that cool?
# Columns are named gdp_q and life_q because they hold quantiles, not means
df %>%
  rowwise(continent, country) %>%
  summarise(gdp_q = quantile(c_across(starts_with("gdp")), c(0.25, 0.5, 0.75)),
            life_q = quantile(c_across(starts_with("life")), c(0.25, 0.5, 0.75)),
            q = c(0.25, 0.5, 0.75))
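To get a feel for the shape of that result without downloading the file, here is a small sketch of the same pattern on a made-up wide table (`wide`, with hypothetical values). Each input row should expand into three output rows, one per quantile.

```r
library(dplyr)

# Toy wide data: hypothetical values, used only for illustration
wide <- tibble(
  country      = c("A", "B"),
  lifeExp_1952 = c(40, 60),
  lifeExp_1957 = c(42, 62),
  lifeExp_1962 = c(44, 64)
)

res <- wide %>%
  rowwise(country) %>%
  summarise(
    # Quantiles across the lifeExp columns of the current row
    life_q = quantile(c_across(starts_with("life")), c(0.25, 0.5, 0.75)),
    q      = c(0.25, 0.5, 0.75),
    .groups = "drop"
  )

nrow(res)  # 6: three quantile rows for each of the two countries
```

For country A, the three values 40, 42 and 44 give quartiles of 41, 42 and 43 under quantile()'s default method.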
That is it for now. Look out for more exploration with dplyr 1.0.0 in the coming posts.