R For Data Science Book Gets tidyr 1.0.0 Friendly

R for Data Science book by Garrett Grolemund and Hadley Wickham is the best book for doing data science with tidyverse. tidyverse, the meta-package, has loads of useful packages like tidyr, dplyr, and ggplot2 to make your life as data scientist easy. Last fall, tidyr package got a big update with version 1.0.0.

Until now, before tidyr 1.0.0, gather() and. spread() are two key functions of tidyr package that help reshape data; from long form to wide form and from wide form to long form. Although these functions were amazing, it was difficult to use them without looking at the docs everytime (even for Hadley Wickham).

While introducing tidyr, Hadely Wickham wrote

For some time, it’s been obvious that there is something fundamentally wrong with the design of spread() and gather(). Many people don’t find the names intuitive and find it hard to remember which direction corresponds to spreading and which to gathering. It also seems surprisingly hard to remember the arguments to these functions, meaning that many people (including me!) have to consult the documentation every time.

In the new version of tidyr 1.0.0, tidyr team has simplified these reshaping functions and made them much more intuitive. tidyr 1.0.0 introduces two new functions like pivot_longer() and pivot_wider() to make it easy to reshape data. This brings in the familiar name – pivoting – that is prevalent in other programming languages. And they kind of supercede/replace older tidyr functions spread() and gather() (Both spread and gather will be around).

Until now R 4 for Data Science book is not tidyr 1.0.0 compliant and carried spread() and gather() functions from older tidyr 1.0.0. Not anymore, Garrett Grolemund, one of the authors of the book has updated the Chapter 12 on tidy data with pivot_longer() and pivot_wider() functions.

Thanks to @StatGarrett, the tidy data chapter of R4DS now uses pivot_longer() and pivot_wider(): https://t.co/7CsxP3db1l! #rstats

— Hadley Wickham (@hadleywickham) January 15, 2020

This post is just an effort to work through the newer version of tidy data chapter that uses pivot_longer() and pivot_wider() and learn the basics of tidy data.

First, let us make sure you have tidyr 1.0.0 and load the tidyverse.

library(tidyverse)

We will use number of small datasets that are readily available with tidyverse package.

How to Reshape a Wide Data to Long Data with pivot_longer()?

table4a is a small dataframe in tidyverse package in wide form. We can see that the column names are year variable and the values in the data frame are our data, number of cases.

table4a 

## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

Let us use pivot_longer() to make it tidy in long form. We would like to keep country as a separate variable and use the next two columns as a year variable.

To reshape with pivot_longer(), we first specify the columns we want to convert to a variable. Then specify the name of the new variable that we create out the columns using “names_to” argument. And then we specify a new variable name for the values in the data frame with the argument “values_to”.

The arguments names_to and values_to of the new function pivot_longer() is more intuitive to help to choose the right variable for reshaping the wide data to longer tidy data form.

table4a %>%
  pivot_longer(c(`1999`,`2000`), names_to="year", values_to = "cases")

We will get data frame in long form or tidy data.

## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766

In the example above, we selected the columns we want to use by specifying their names. We could also specify the columns in other ways. For example, we can select the columns to reshape as shown below.

table4a %>%
  pivot_longer(-country, names_to="year", values_to = "cases")

We will have the same results

## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766

Here is another data set in wide form.


table4b
## # A tibble: 3 x 3
##   country         `1999`     `2000`
## * <chr>            <int>      <int>
## 1 Afghanistan   19987071   20595360
## 2 Brazil       172006362  174504898
## 3 China       1272915272 1280428583

And we can use pivot_longer() as before to re-shape to tidy form.

table4b %>%
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to="population")

And we get data frame in long form.

## # A tibble: 6 x 3
##   country     year  population
##   <chr>       <chr>      <int>
## 1 Afghanistan 1999    19987071
## 2 Afghanistan 2000    20595360
## 3 Brazil      1999   172006362
## 4 Brazil      2000   174504898
## 5 China       1999  1272915272
## 6 China       2000  1280428583

pivot_wider()

pivot_wider() is another key function for reshaping data frames. It is the replacement for spread() function in older version of tidyr. pivot_wider() takes long/tidy data and widens the data, i.e. increases the number of columns in the data.

table2 data in tidyr is a tidy data in long form.

table2
## # A tibble: 12 x 4
##    country      year type            count
##    <chr>       <int> <chr>           <int>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583

Let us use pivot_wider() to reshape the data into wide form. We should specify which column should go to columns in the wide data using names_from argument. Similarly to fill the new columns in the wide data, we should specify the column/variable name with values_from argument.

table2 %>% 
  pivot_wider(names_from=type, values_from=count)

We have reshaped the tidy data to wide form. Note that we did not touch the first two columns of the tidy data, so they are left as it is in the wide data frame.

## # A tibble: 6 x 4
##   country      year  cases population
##   <chr>       <int>  <int>      <int>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583

Tidy Data: Exercises

With that introduction to simple reshaping with pivot_longer() and pivot_wider(), let us take stab at the exercises. Spoiler Alert: the following section might contain correct/incorrect solutions to exercises in Chapter 12 of R for Data Science.

Exercise 1: Why are pivot_longer() and pivot_wider() not perfectly symmetrical?

Let us say you have a wide data and reshape it to tidy form with pivot_longer() and then if you use pivot_wider(), you would expect that you would get the original data frame. The first exercise is about dataframe where you don’t get the same dataframe you started with after applying pivot_longer() and pivot_wider() functions.

Here we create a tibble with three variables with returns of some investment for two years.

stocks <- tibble(
  year   = c(2015, 2015, 2016, 2016),
  half  = c(   1,    2,     1,    2),
  return = c(1.88, 0.59, 0.92, 0.17)
)
stocks
## # A tibble: 4 x 3
##    year  half return
##   <dbl> <dbl>  <dbl>
## 1  2015     1   1.88
## 2  2015     2   0.59
## 3  2016     1   0.92
## 4  2016     2   0.17

We first use pivot_wider() to see the output

stocks %>% 
  pivot_wider(names_from = year, values_from = return)
## # A tibble: 2 x 3
##    half `2015` `2016`
##   <dbl>  <dbl>  <dbl>
## 1     1   1.88   0.92
## 2     2   0.59   0.17

And then we use pivot_wider() followed by pivot_longer() to recreate the original tibble.

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return")
## # A tibble: 4 x 3
##    half year  return
##   <dbl> <chr>  <dbl>
## 1     1 2015    1.88
## 2     1 2016    0.92
## 3     2 2015    0.59
## 4     2 2016    0.17

We see that we the tibble in tidy form. However, it is not the same as original tibble. The reason is year variable/column in the new tibble is a character variable. However, in the original tibble the year variable was of type double.

The chapter gives a hint about names_ptype argument to pivot_longer() function and asks what will happen if we set names_ptype = list(year = double()).

stocks %>% 
  pivot_wider(names_from = year, values_from = return) %>%
  pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return",
               names_ptype = list(year = double()))

Hurray, now get our original tibble back.

## # A tibble: 4 x 3
##    half  year return
##   <dbl> <dbl>  <dbl>
## 1     1  2015   1.88
## 2     1  2016   0.92
## 3     2  2015   0.59
## 4     2  2016   0.17

Exercise 2: Why does this code fail?

Second exercise is about using table4a, a tibble in wide form and reshaping with pivot_longer() function to long tidy form.

table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

Here is the code that fails to reshape

table4a %>% 
  pivot_longer(c(1999, 2000), names_to = "year", values_to = "cases")

It throws the following error

Error: Can't subset columns that don't exist. x The locations 1999 and 2000 don't exist. ? There are only 3 columns.

The reason above code chunk fails is because we tried to select the columns using their names as integers. Even though, they represent year coded as integers, column names are not integers.

Let us correct that and use pivot_longer() on table4a. Now it works as expected.

table4a %>% 
  pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")

## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766

3. What would happen if you widen this table? Why? How could you add a new column to uniquely identify each value?

people <- tribble(
  ~name,             ~names,  ~values,
  #-----------------|--------|------
  "Phillip Woods",   "age",       45,
  "Phillip Woods",   "height",   186,
  "Phillip Woods",   "age",       50,
  "Jessica Cordero", "age",       37,
  "Jessica Cordero", "height",   156
)

people %>%
  pivot_wider(names_from = names, values_from = values)

When we try pivot_wider() on the tibble we get the following error. The reason is that there are duplicate entries in the dataframe. For example, Phillips Woods has two rows with two different age values. pivot_wider() throws the following warning and gives dataframe with list column for the duplicate variables.

## Warning: Values in `values` are not uniquely identified; output will contain list-cols.
## * Use `values_fn = list(values = list)` to suppress this warning.
## * Use `values_fn = list(values = length)` to identify where the duplicates arise
## * Use `values_fn = list(values = summary_fun)` to summarise duplicates
## # A tibble: 2 x 3
##   name                    age      height
##   <chr>           <list<dbl>> <list<dbl>>
## 1 Phillip Woods           [2]         [1]
## 2 Jessica Cordero         [1]         [1]

The warning also gives multiple options to deal with, we can ignore the warning, identify where the duplicates arise or a way to summarize the duplicate values.

Exercise 4. Do you need to make it wider or longer? What are the variables?

preg <- tribble(
  ~pregnant, ~male, ~female,
  "yes",     NA,    10,
  "no",      20,    12
)

We need to make the above tibble longer using pivot_longer() with variables sex and count as shown below.

preg %>% 
  pivot_longer(c(`male`,`female`),names_to = "sex", values_to = "count")
## # A tibble: 4 x 3
##   pregnant sex    count
##   <chr>    <chr>  <dbl>
## 1 yes      male      NA
## 2 yes      female    10
## 3 no       male      20
## 4 no       female    12