R for Data Science book by Garrett Grolemund and Hadley Wickham is the best book for doing data science with tidyverse. tidyverse, the meta-package, has loads of useful packages like tidyr, dplyr, and ggplot2 to make your life as data scientist easy. Last fall, tidyr package got a big update with version 1.0.0.
Until now, before tidyr 1.0.0, gather() and. spread() are two key functions of tidyr package that help reshape data; from long form to wide form and from wide form to long form. Although these functions were amazing, it was difficult to use them without looking at the docs everytime (even for Hadley Wickham).
While introducing tidyr, Hadely Wickham wrote
For some time, it’s been obvious that there is something fundamentally wrong with the design of spread() and gather(). Many people don’t find the names intuitive and find it hard to remember which direction corresponds to spreading and which to gathering. It also seems surprisingly hard to remember the arguments to these functions, meaning that many people (including me!) have to consult the documentation every time.
In the new version of tidyr 1.0.0, tidyr team has simplified these reshaping functions and made them much more intuitive. tidyr 1.0.0 introduces two new functions like pivot_longer() and pivot_wider() to make it easy to reshape data. This brings in the familiar name – pivoting – that is prevalent in other programming languages. And they kind of supercede/replace older tidyr functions spread() and gather() (Both spread and gather will be around).
Until now R 4 for Data Science book is not tidyr 1.0.0 compliant and carried spread() and gather() functions from older tidyr 1.0.0. Not anymore, Garrett Grolemund, one of the authors of the book has updated the Chapter 12 on tidy data with pivot_longer() and pivot_wider() functions.
Thanks to @StatGarrett, the tidy data chapter of R4DS now uses pivot_longer() and pivot_wider(): https://t.co/7CsxP3db1l! #rstats
— Hadley Wickham (@hadleywickham) January 15, 2020
This post is just an effort to work through the newer version of tidy data chapter that uses pivot_longer() and pivot_wider() and learn the basics of tidy data.
First, let us make sure you have tidyr 1.0.0 and load the tidyverse.
library(tidyverse)
We will use number of small datasets that are readily available with tidyverse package.
How to Reshape a Wide Data to Long Data with pivot_longer()?
table4a is a small dataframe in tidyverse package in wide form. We can see that the column names are year variable and the values in the data frame are our data, number of cases.
table4a ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766
Let us use pivot_longer() to make it tidy in long form. We would like to keep country as a separate variable and use the next two columns as a year variable.
To reshape with pivot_longer(), we first specify the columns we want to convert to a variable. Then specify the name of the new variable that we create out the columns using “names_to” argument. And then we specify a new variable name for the values in the data frame with the argument “values_to”.
The arguments names_to and values_to of the new function pivot_longer() is more intuitive to help to choose the right variable for reshaping the wide data to longer tidy data form.
table4a %>% pivot_longer(c(`1999`,`2000`), names_to="year", values_to = "cases")
We will get data frame in long form or tidy data.
## # A tibble: 6 x 3 ## country year cases ## <chr> <chr> <int> ## 1 Afghanistan 1999 745 ## 2 Afghanistan 2000 2666 ## 3 Brazil 1999 37737 ## 4 Brazil 2000 80488 ## 5 China 1999 212258 ## 6 China 2000 213766
In the example above, we selected the columns we want to use by specifying their names. We could also specify the columns in other ways. For example, we can select the columns to reshape as shown below.
table4a %>% pivot_longer(-country, names_to="year", values_to = "cases")
We will have the same results
## # A tibble: 6 x 3 ## country year cases ## <chr> <chr> <int> ## 1 Afghanistan 1999 745 ## 2 Afghanistan 2000 2666 ## 3 Brazil 1999 37737 ## 4 Brazil 2000 80488 ## 5 China 1999 212258 ## 6 China 2000 213766
Here is another data set in wide form.
table4b ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 19987071 20595360 ## 2 Brazil 172006362 174504898 ## 3 China 1272915272 1280428583
And we can use pivot_longer() as before to re-shape to tidy form.
table4b %>% pivot_longer(c(`1999`, `2000`), names_to = "year", values_to="population")
And we get data frame in long form.
## # A tibble: 6 x 3 ## country year population ## <chr> <chr> <int> ## 1 Afghanistan 1999 19987071 ## 2 Afghanistan 2000 20595360 ## 3 Brazil 1999 172006362 ## 4 Brazil 2000 174504898 ## 5 China 1999 1272915272 ## 6 China 2000 1280428583
pivot_wider()
pivot_wider() is another key function for reshaping data frames. It is the replacement for spread() function in older version of tidyr. pivot_wider() takes long/tidy data and widens the data, i.e. increases the number of columns in the data.
table2 data in tidyr is a tidy data in long form.
table2 ## # A tibble: 12 x 4 ## country year type count ## <chr> <int> <chr> <int> ## 1 Afghanistan 1999 cases 745 ## 2 Afghanistan 1999 population 19987071 ## 3 Afghanistan 2000 cases 2666 ## 4 Afghanistan 2000 population 20595360 ## 5 Brazil 1999 cases 37737 ## 6 Brazil 1999 population 172006362 ## 7 Brazil 2000 cases 80488 ## 8 Brazil 2000 population 174504898 ## 9 China 1999 cases 212258 ## 10 China 1999 population 1272915272 ## 11 China 2000 cases 213766 ## 12 China 2000 population 1280428583
Let us use pivot_wider() to reshape the data into wide form. We should specify which column should go to columns in the wide data using names_from argument. Similarly to fill the new columns in the wide data, we should specify the column/variable name with values_from argument.
table2 %>% pivot_wider(names_from=type, values_from=count)
We have reshaped the tidy data to wide form. Note that we did not touch the first two columns of the tidy data, so they are left as it is in the wide data frame.
## # A tibble: 6 x 4 ## country year cases population ## <chr> <int> <int> <int> ## 1 Afghanistan 1999 745 19987071 ## 2 Afghanistan 2000 2666 20595360 ## 3 Brazil 1999 37737 172006362 ## 4 Brazil 2000 80488 174504898 ## 5 China 1999 212258 1272915272 ## 6 China 2000 213766 1280428583
Tidy Data: Exercises
With that introduction to simple reshaping with pivot_longer() and pivot_wider(), let us take stab at the exercises. Spoiler Alert: the following section might contain correct/incorrect solutions to exercises in Chapter 12 of R for Data Science.
Exercise 1: Why are pivot_longer() and pivot_wider() not perfectly symmetrical?
Let us say you have a wide data and reshape it to tidy form with pivot_longer() and then if you use pivot_wider(), you would expect that you would get the original data frame. The first exercise is about dataframe where you don’t get the same dataframe you started with after applying pivot_longer() and pivot_wider() functions.
Here we create a tibble with three variables with returns of some investment for two years.
stocks <- tibble( year = c(2015, 2015, 2016, 2016), half = c( 1, 2, 1, 2), return = c(1.88, 0.59, 0.92, 0.17) ) stocks ## # A tibble: 4 x 3 ## year half return ## <dbl> <dbl> <dbl> ## 1 2015 1 1.88 ## 2 2015 2 0.59 ## 3 2016 1 0.92 ## 4 2016 2 0.17
We first use pivot_wider() to see the output
stocks %>% pivot_wider(names_from = year, values_from = return) ## # A tibble: 2 x 3 ## half `2015` `2016` ## <dbl> <dbl> <dbl> ## 1 1 1.88 0.92 ## 2 2 0.59 0.17
And then we use pivot_wider() followed by pivot_longer() to recreate the original tibble.
stocks %>% pivot_wider(names_from = year, values_from = return) %>% pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return") ## # A tibble: 4 x 3 ## half year return ## <dbl> <chr> <dbl> ## 1 1 2015 1.88 ## 2 1 2016 0.92 ## 3 2 2015 0.59 ## 4 2 2016 0.17
We see that we the tibble in tidy form. However, it is not the same as original tibble. The reason is year variable/column in the new tibble is a character variable. However, in the original tibble the year variable was of type double.
The chapter gives a hint about names_ptype argument to pivot_longer() function and asks what will happen if we set names_ptype = list(year = double()).
stocks %>% pivot_wider(names_from = year, values_from = return) %>% pivot_longer(`2015`:`2016`, names_to = "year", values_to = "return", names_ptype = list(year = double()))
Hurray, now get our original tibble back.
## # A tibble: 4 x 3 ## half year return ## <dbl> <dbl> <dbl> ## 1 1 2015 1.88 ## 2 1 2016 0.92 ## 3 2 2015 0.59 ## 4 2 2016 0.17
Exercise 2: Why does this code fail?
Second exercise is about using table4a, a tibble in wide form and reshaping with pivot_longer() function to long tidy form.
table4a ## # A tibble: 3 x 3 ## country `1999` `2000` ## * <chr> <int> <int> ## 1 Afghanistan 745 2666 ## 2 Brazil 37737 80488 ## 3 China 212258 213766
Here is the code that fails to reshape
table4a %>% pivot_longer(c(1999, 2000), names_to = "year", values_to = "cases")
It throws the following error
Error: Can't subset columns that don't exist. x The locations 1999 and 2000 don't exist. ? There are only 3 columns.
The reason above code chunk fails is because we tried to select the columns using their names as integers. Even though, they represent year coded as integers, column names are not integers.
Let us correct that and use pivot_longer() on table4a. Now it works as expected.
table4a %>% pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases") ## # A tibble: 6 x 3 ## country year cases ## <chr> <chr> <int> ## 1 Afghanistan 1999 745 ## 2 Afghanistan 2000 2666 ## 3 Brazil 1999 37737 ## 4 Brazil 2000 80488 ## 5 China 1999 212258 ## 6 China 2000 213766
3. What would happen if you widen this table? Why? How could you add a new column to uniquely identify each value?
people <- tribble( ~name, ~names, ~values, #-----------------|--------|------ "Phillip Woods", "age", 45, "Phillip Woods", "height", 186, "Phillip Woods", "age", 50, "Jessica Cordero", "age", 37, "Jessica Cordero", "height", 156 )
people %>% pivot_wider(names_from = names, values_from = values)
When we try pivot_wider() on the tibble we get the following error. The reason is that there are duplicate entries in the dataframe. For example, Phillips Woods has two rows with two different age values. pivot_wider() throws the following warning and gives dataframe with list column for the duplicate variables.
## Warning: Values in `values` are not uniquely identified; output will contain list-cols. ## * Use `values_fn = list(values = list)` to suppress this warning. ## * Use `values_fn = list(values = length)` to identify where the duplicates arise ## * Use `values_fn = list(values = summary_fun)` to summarise duplicates ## # A tibble: 2 x 3 ## name age height ## <chr> <list<dbl>> <list<dbl>> ## 1 Phillip Woods [2] [1] ## 2 Jessica Cordero [1] [1]
The warning also gives multiple options to deal with, we can ignore the warning, identify where the duplicates arise or a way to summarize the duplicate values.
Exercise 4. Do you need to make it wider or longer? What are the variables?
preg <- tribble( ~pregnant, ~male, ~female, "yes", NA, 10, "no", 20, 12 )
We need to make the above tibble longer using pivot_longer() with variables sex and count as shown below.
preg %>% pivot_longer(c(`male`,`female`),names_to = "sex", values_to = "count") ## # A tibble: 4 x 3 ## pregnant sex count ## <chr> <chr> <dbl> ## 1 yes male NA ## 2 yes female 10 ## 3 no male 20 ## 4 no female 12