dplyr 1.0.0 is here: Quick fun with summarise() and rowwise()

A new version of dplyr, version 1.0.0, is here. It was originally supposed to be available in early May and is finally out on CRAN now. One of the cool things about dplyr 1.0.0 is its new logo.

Jokes apart, dplyr 1.0.0 is loaded with new features, and Hadley Wickham has been teasing the important ones in a series of blog posts since March. We have already seen three blog posts on dplyr 1.0.0 covering the big changes.

Here is a quick post trying out a couple of cool new features of dplyr 1.0.0. This is just a start, and I look forward to exploring and playing with the new features of dplyr 1.0.0 more.

In this blog post, we will go over two functions, summarise() and rowwise(), in dplyr 1.0.0. summarise() (also spelled summarize(); the two are synonyms) is an old function, but it gains new capabilities that weren’t possible prior to dplyr 1.0.0. And rowwise() has gotten a new life, rising from the ashes after being discouraged a few years ago.

dplyr 1.0.0 is available from CRAN and can be installed using

install.packages("dplyr")

Let us load dplyr and make sure we have dplyr version 1.0.0 or later.

library(dplyr)
packageVersion("dplyr")
[1] ‘1.0.0’

Let us also load readr, for reading files, and gapminder, for the data we will use to play with dplyr 1.0.0.

library(readr)
library(gapminder)

summarise() function: dplyr 1.0.0

Let us start with the existing summarize() function, which now has cool new features. summarize() is one of dplyr’s key functions: it collapses existing data into summary values, traditionally a single value per group.

One of the simplest uses of summarize() is to group the data by one variable with group_by() and compute a summary statistic of another variable for each group.
For example, with the gapminder dataset we can compute the mean lifeExp for each continent by calling group_by() on the continent variable and using summarize() to compute the mean of lifeExp.

gapminder %>% 
  group_by(continent) %>%
  summarize(avg_life=mean(lifeExp))

We get one summary value for each continent.

## # A tibble: 5 x 2
##   continent avg_life
##   <fct>        <dbl>
## 1 Africa        48.9
## 2 Americas      64.7
## 3 Asia          60.1
## 4 Europe        71.9
## 5 Oceania       74.3

Starting from dplyr 1.0.0, instead of a single value, the summarize() function can generate a rectangle of arbitrary size. The blog post introducing the new summarize() says: “This is a big change to summarise() but it should have minimal impact on existing code because it broadens the interface: all existing code will continue to work, and a number of inputs that would have previously errored now work.”
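A "rectangle" can also be wider than one column. Here is a minimal sketch, assuming dplyr 1.0.0 or later and a small made-up tibble (toy and its values are hypothetical, not gapminder): passing an unnamed tibble to summarise() produces one output column per tibble column.

```r
library(dplyr)

# Hypothetical toy data standing in for gapminder
toy <- tibble(
  continent = c("A", "A", "A", "B", "B", "B"),
  lifeExp   = c(40, 50, 60, 70, 75, 80)
)

# An unnamed data frame inside summarise() is unpacked into
# several summary columns, one row per group
life_range <- toy %>%
  group_by(continent) %>%
  summarise(tibble(min_life  = min(lifeExp),
                   mean_life = mean(lifeExp),
                   max_life  = max(lifeExp)))

life_range
```

Each continent still gets a single row here, but with three summary columns instead of one.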

Let us say you don’t just want the mean lifeExp value for each continent. Instead, you want multiple quantiles of lifeExp for each continent.

We can use the new summarise() function to compute specific quantiles of interest. In the example below, we ask for the 25th percentile, the median, and the 75th percentile by passing the probabilities as a numeric vector to the quantile() function.

gapminder %>% 
  group_by(continent) %>% 
  summarise(lifeExp_q = quantile(lifeExp, c(0.25, 0.5, 0.75)))

Now, instead of a single-row summary for each continent, we get three rows of summary statistics per continent.

## `summarise()` regrouping output by 'continent' (override with `.groups` argument)
## # A tibble: 15 x 2
## # Groups:   continent [5]
##    continent lifeExp_q
##    <fct>         <dbl>
##  1 Africa         42.4
##  2 Africa         47.8
##  3 Africa         54.4
##  4 Americas       58.4
##  5 Americas       67.0
##  6 Americas       71.7
##  7 Asia           51.4
##  8 Asia           61.8
##  9 Asia           69.5
## 10 Europe         69.6
## 11 Europe         72.2
## 12 Europe         75.5
## 13 Oceania        71.2
## 14 Oceania        73.7
## 15 Oceania        77.6
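The first line of this output is a message about grouping, and it points at the .groups argument of summarise(). As a rough sketch of what that argument does, assuming dplyr 1.0.0 or later and a hypothetical toy tibble with two grouping variables:

```r
library(dplyr)

# Hypothetical toy data with two grouping variables
toy <- tibble(
  continent = c("A", "A", "A", "A", "B", "B"),
  year      = c(1952, 1952, 1957, 1957, 1952, 1957),
  lifeExp   = c(40, 42, 44, 46, 60, 62)
)

# Default behaviour: the last grouping variable (year) is peeled off,
# so the result is still grouped by continent
by_default <- toy %>%
  group_by(continent, year) %>%
  summarise(avg_life = mean(lifeExp))

# .groups = "drop" returns an ungrouped tibble and silences the message
ungrouped <- toy %>%
  group_by(continent, year) %>%
  summarise(avg_life = mean(lifeExp), .groups = "drop")

group_vars(by_default)
group_vars(ungrouped)
```

Setting .groups explicitly is a reasonable habit when the grouping of the result matters for later steps in a pipeline.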

Not just that, we can also create a new variable that records which quantile each row holds.

Here we create a new variable q containing the probability that corresponds to each quantile we computed.


gapminder %>% 
  group_by(continent) %>% 
  summarise(lifeExp_q = quantile(lifeExp, c(0.25, 0.5, 0.75)),
            q=c(0.25,0.5,0.75))

And the new summarize() adds a column q that repeats the three quantile probabilities within each continent.

## # A tibble: 15 x 3
## # Groups:   continent [5]
##    continent lifeExp_q     q
##    <fct>         <dbl> <dbl>
##  1 Africa         42.4  0.25
##  2 Africa         47.8  0.5 
##  3 Africa         54.4  0.75
##  4 Americas       58.4  0.25
##  5 Americas       67.0  0.5 
##  6 Americas       71.7  0.75
##  7 Asia           51.4  0.25
##  8 Asia           61.8  0.5 
##  9 Asia           69.5  0.75
## 10 Europe         69.6  0.25
## 11 Europe         72.2  0.5 
## 12 Europe         75.5  0.75
## 13 Oceania        71.2  0.25
## 14 Oceania        73.7  0.5 
## 15 Oceania        77.6  0.75

This example illustrates just a simple use of summarize(), and I look forward to learning its more powerful uses soon.

rowwise() function: dplyr 1.0.0

rowwise() is the second function we will focus on. Let us see a simple example of using rowwise() to perform a row-wise operation.

We will again turn to the gapminder data, but this time in wide form.

data_url <- "https://raw.githubusercontent.com/cmdlinetips/data/master/gapminder/gapminder_all.csv"
df <- read_csv(data_url)

We can see that the gapminder data is in wide form, with gdpPercap, lifeExp, and pop for each year as columns/variables.

head(df)

## # A tibble: 6 x 38
##   continent country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962 gdpPercap_1967
##   <chr>     <chr>            <dbl>          <dbl>          <dbl>          <dbl>
## 1 Africa    Algeria          2449.          3014.          2551.          3247.
## 2 Africa    Angola           3521.          3828.          4269.          5523.
## 3 Africa    Benin            1063.           960.           949.          1036.
## 4 Africa    Botswa…           851.           918.           984.          1215.
## 5 Africa    Burkin…           543.           617.           723.           795.
## 6 Africa    Burundi           339.           380.           355.           413.
## # … with 32 more variables: gdpPercap_1972 <dbl>, gdpPercap_1977 <dbl>,
## #   gdpPercap_1982 <dbl>, gdpPercap_1987 <dbl>, gdpPercap_1992 <dbl>,
## #   gdpPercap_1997 <dbl>, gdpPercap_2002 <dbl>, gdpPercap_2007 <dbl>,
## #   lifeExp_1952 <dbl>, lifeExp_1957 <dbl>, lifeExp_1962 <dbl>,
## #   lifeExp_1967 <dbl>, lifeExp_1972 <dbl>, lifeExp_1977 <dbl>,
## #   lifeExp_1982 <dbl>, lifeExp_1987 <dbl>, lifeExp_1992 <dbl>,
## #   lifeExp_1997 <dbl>, lifeExp_2002 <dbl>, lifeExp_2007 <dbl>, pop_1952 <dbl>,
## #   pop_1957 <dbl>, pop_1962 <dbl>, pop_1967 <dbl>, pop_1972 <dbl>,
## #   pop_1977 <dbl>, pop_1982 <dbl>, pop_1987 <dbl>, pop_1992 <dbl>,
## #   pop_1997 <dbl>, pop_2002 <dbl>, pop_2007 <dbl>

Let us say we want to perform a row-wise operation and compute the mean of multiple columns. Before dplyr 1.0.0, we would typically reshape the data with pivot_longer() and then compute the summary statistics. With dplyr 1.0.0, we can use the rowwise() function and compute things for each row directly.
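For comparison, the reshape-first approach might look roughly like this. It is only a sketch: it assumes the tidyr package is installed and uses a hypothetical two-country table (wide) rather than the real data.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table with one column per year
wide <- tibble(
  country      = c("A", "B"),
  lifeExp_1952 = c(40, 60),
  lifeExp_1957 = c(42, 62),
  lifeExp_1962 = c(44, 64)
)

# Reshape to one row per country-year, then summarise per country
long_means <- wide %>%
  pivot_longer(starts_with("lifeExp"),
               names_to = "variable",
               values_to = "value") %>%
  group_by(country) %>%
  summarise(mean_life = mean(value), .groups = "drop")

long_means
```

This works, but it takes a detour through a long table just to average values that already sit side by side in each row.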

Let us first try rowwise() to compute the mean of three lifeExp columns and three gdpPercap columns. We pass continent and country to rowwise() so that these identifier columns are preserved in the output, and then use summarise() to compute the mean of the three gdpPercap columns and the three lifeExp columns.

df %>%
  rowwise(continent,country) %>%
  summarise(mean_gdp = mean(c(gdpPercap_1952, gdpPercap_1957, gdpPercap_1962)),
            mean_life = mean(c(lifeExp_1952, lifeExp_1957, lifeExp_1962)))

And we nicely get back mean values of three columns in each row.

## # A tibble: 142 x 4
## # Groups:   continent, country [142]
##    continent country                  mean_gdp mean_life
##    <chr>     <chr>                       <dbl>     <dbl>
##  1 Africa    Algeria                     2671.      45.7
##  2 Africa    Angola                      3873.      32.0
##  3 Africa    Benin                        991.      40.4
##  4 Africa    Botswana                     918.      49.6
##  5 Africa    Burkina Faso                 628.      34.9
##  6 Africa    Burundi                      358.      40.5
##  7 Africa    Cameroon                    1295.      40.5
##  8 Africa    Central African Republic    1152.      37.5
##  9 Africa    Chad                        1292.      39.9
## 10 Africa    Comoros                     1240.      42.5
## # … with 132 more rows
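To see what rowwise() changes, it can help to run the same expression with and without it. A minimal sketch on a hypothetical toy tibble (the names scores, id, x, y, and z are made up for illustration):

```r
library(dplyr)

# Hypothetical toy data: one identifier and three numeric columns
scores <- tibble(
  id = c("a", "b"),
  x  = c(1, 4),
  y  = c(2, 5),
  z  = c(3, 6)
)

# Without rowwise(): c(x, y, z) concatenates the whole columns,
# so every row receives the same overall mean
no_rowwise <- scores %>%
  mutate(m = mean(c(x, y, z)))

# With rowwise(): each row is treated as its own group,
# so the mean is computed from that row's values only
with_rowwise <- scores %>%
  rowwise(id) %>%
  mutate(m = mean(c(x, y, z))) %>%
  ungroup()

no_rowwise$m    # 3.5 3.5
with_rowwise$m  # 2 5
```

The ungroup() at the end is there so that later steps do not keep operating row by row by accident.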

Our gapminder data has many more year columns, though, and here we computed the mean of just three of them by listing their names in a vector.

With rowwise() and the new c_across() function, we can perform the row-wise summary operation more generally.

c_across() is designed to work with rowwise() to make it easy to perform row-wise aggregations. It has two differences from c():

It uses tidy select semantics so you can easily select multiple variables. See vignette("rowwise") for more details.

It uses vctrs::vec_c() in order to give safer outputs.

Here, c_across() helps us select all the columns that start with each of the two prefixes and compute their mean.

df %>%
  rowwise(continent,country) %>%
  summarise(mean_gdp = mean(c_across(starts_with("gdp"))),
            mean_life = mean(c_across(starts_with("life"))))

And we get the mean lifeExp and gdpPercap for each row, this time computed over all the matching year columns.

## # A tibble: 142 x 4
## # Groups:   continent, country [142]
##    continent country                  mean_gdp mean_life
##    <chr>     <chr>                       <dbl>     <dbl>
##  1 Africa    Algeria                     4426.      59.0
##  2 Africa    Angola                      3607.      37.9
##  3 Africa    Benin                       1155.      48.8
##  4 Africa    Botswana                    5032.      54.6
##  5 Africa    Burkina Faso                 844.      44.7
##  6 Africa    Burundi                      472.      44.8
##  7 Africa    Cameroon                    1775.      48.1
##  8 Africa    Central African Republic     959.      43.9
##  9 Africa    Chad                        1165.      46.8
## 10 Africa    Comoros                     1314.      52.4
## # … with 132 more rows
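c_across() is not tied to summarise() or to starts_with(). As a small sketch on a hypothetical toy tibble (scores and its columns are made up), it can also take a column range and be used inside mutate(), which keeps the original columns alongside the new row-wise results:

```r
library(dplyr)

# Hypothetical toy data: one identifier and three numeric columns
scores <- tibble(
  id = c("a", "b"),
  x  = c(1, 4),
  y  = c(2, 5),
  z  = c(3, 6)
)

# x:z is a tidy-select range covering x, y and z;
# mutate() adds the row-wise results next to the existing columns
row_stats <- scores %>%
  rowwise(id) %>%
  mutate(total   = sum(c_across(x:z)),
         largest = max(c_across(x:z))) %>%
  ungroup()

row_stats
```

Whether to use summarise() or mutate() after rowwise() mostly depends on whether the original columns are still needed downstream.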

We can also combine rowwise() with summarize() to compute multiple quantiles for each row, not just the mean as before. Isn’t that cool?

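# Three quantiles per country, computed across all the year columns;
# the columns are named *_q because they hold quantiles, not means
df %>%
  rowwise(continent,country) %>%
  summarise(gdp_q = quantile(c_across(starts_with("gdp")),
                             c(0.25, 0.5, 0.75)),
            life_q = quantile(c_across(starts_with("life")),
                              c(0.25, 0.5, 0.75)),
            q = c(0.25,0.5,0.75))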

That is it for now. Look out for more exploration with dplyr 1.0.0 in the coming posts.