tidyr version 1.0.0 is here, with a lot of changes. tidyr has been around for about five years, and it has finally reached version 1.0.0. There are four big changes in the new version. One of the biggest is the pair of new functions pivot_longer() and pivot_wider() for reshaping tabular datasets. These functions supersede the existing gather() and spread(). This change can be a bit disruptive to your workflow, but it was somewhat expected.
tidyr 1.0.0 now on CRAN: https://t.co/sLnI2SlUsp — new pivot_longer() and pivot_wider() functions, better rectangling tools (unnest_longer(), unnest_wider(), hoist()), improved expand_grid(), new (un)nesting interface, and much much more! #rstats
— Hadley Wickham (@hadleywickham) September 13, 2019
A lot of data wrangling is about reshaping data from one form to another. In R, the tidyverse’s tidyr package is at the core of reshaping tabular datasets. If you have used tidyr’s key functions spread() and gather() and found that you could not remember how you did the same operation last time, you are not alone. Even Hadley Wickham had to look up the documentation to use these functions. Earlier this year, Hadley Wickham announced that the next version of tidyr would have simpler and “easier to understand” functions for reshaping data.
You may have heard a rumour that gather/spread are going away. This is simply not true (they’ll stay around forever) but I am working on better replacements which you can learn about at https://t.co/sU2GzWeBaf. Now is a great time for feedback! #rstats
— Hadley Wickham (@hadleywickham) March 19, 2019
The new functions for reshaping data, pivot_longer() and pivot_wider(), are here now, and they are substantially more powerful. These functions borrow ideas from existing packages like data.table and cdata.
The other big changes include a new set of functions for rectangling, that is, converting nested lists into tidy data frames:
- unnest_auto()
- unnest_longer()
- unnest_wider()
- hoist()
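The rectangling functions listed above can be tried on a small nested list. The following is only a minimal sketch: the `users` tibble and its `name` and `langs` fields are made up for illustration and do not come from the gapminder data used later.

```r
library(tidyr)
library(tibble)

# Hypothetical nested data: one list element per user
users <- tibble(
  user = list(
    list(name = "Ann", langs = list("R", "Python")),
    list(name = "Bob", langs = list("R"))
  )
)

# unnest_wider(): each named element of the list becomes its own column
wide <- users %>% unnest_wider(user)

# unnest_longer(): each element of the list-column becomes its own row
long <- wide %>% unnest_longer(langs)

# hoist(): pull selected components out of the list-column,
# by name ("name") or by a path of name and position ("langs", 1L)
hoisted <- users %>%
  hoist(user, name = "name", first_lang = list("langs", 1L))
```

Here unnest_wider() followed by unnest_longer() gives one row per user and language, while hoist() extracts only the pieces asked for and leaves the rest in the list-column.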
tidyr 1.0.0 also has four new functions, in two complementary pairs, to make nesting easier.
- pack()/unpack()
- chop()/unchop()
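As a rough illustration of what these two pairs do, here is a small sketch on a toy data frame. The tibble `df` and its columns `g`, `x` and `y` are hypothetical names chosen for this example.

```r
library(tidyr)
library(tibble)

# Toy data (hypothetical): a grouping column and two value columns
df <- tibble(g = c("a", "a", "b"), x = 1:3, y = c(10, 20, 30))

# chop(): collapse rows, turning x and y into list-columns (one row per g)
chopped <- df %>% chop(c(x, y))

# unchop(): the inverse, expanding the list-columns back into rows
unchopped <- chopped %>% unchop(c(x, y))

# pack(): collapse the columns x and y into one data-frame column named xy
packed <- df %>% pack(xy = c(x, y))

# unpack(): the inverse, spreading the data-frame column back out
unpacked <- packed %>% unpack(xy)
```

In short, chop() and unchop() change the number of rows, while pack() and unpack() change the number of columns.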
In addition to these updates, the new tidyr version also has a new expand_grid() function, a variant of base::expand.grid() that creates all possible combinations of its input variables.
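A quick sketch of expand_grid() on two made-up vectors (the names `x` and `y` are only for illustration):

```r
library(tidyr)

# All combinations of x and y, returned as a tibble
grid <- expand_grid(x = 1:3, y = c("a", "b"))
grid
```

Compared with base::expand.grid(), the result is a tibble, the first variable varies slowest rather than fastest, and character vectors are not converted to factors.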
Getting started with pivot_longer and pivot_wider
I can’t wait to learn the new functions and their uses, so here is a first exploration of tidyr 1.0.0. In this part, we will go step by step through simple uses of pivot_longer() and pivot_wider() with the gapminder data set. We will start with untidy, wide data, use pivot_longer() to tidy it, and then use pivot_wider() to turn the longer, tidy data back into a wider data frame.
pivot_longer() is the replacement for gather() and pivot_wider() is the replacement for spread(). Both are designed to be simpler and to handle more cases than gather() and spread(). RStudio highly recommends using the new functions; gather() and spread() are not going away, but they will no longer be actively developed.
Let us first install tidyr 1.0.0 and verify that we have the new version.
> install.packages("tidyr")
> # check the installed package version
> packageVersion("tidyr")
[1] ‘1.0.0’
Let us load the new version of tidyr package and other packages needed.
library(tidyr)
library(readr)
library(dplyr)
Let us use the gapminder dataset in wide form from the Carpentries website.
data_url <- "https://goo.gl/ioc2Td"
gapminder <- read_csv(data_url)
head(gapminder)
We can see that the gapminder data frame is in wide form and is not tidy. For example, each column name such as gdpPercap_1952 actually encodes two variables: the type of measurement and the year.
## # A tibble: 6 x 38
##   continent country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962
##   <chr>     <chr>            <dbl>          <dbl>          <dbl>
## 1 Africa    Algeria          2449.          3014.          2551.
## 2 Africa    Angola           3521.          3828.          4269.
## 3 Africa    Benin            1063.           960.           949.
## 4 Africa    Botswa…           851.           918.           984.
## 5 Africa    Burkin…           543.           617.           723.
## 6 Africa    Burundi           339.           380.           355.
## # … with 33 more variables: gdpPercap_1967 <dbl>, gdpPercap_1972 <dbl>,
## #   gdpPercap_1977 <dbl>, gdpPercap_1982 <dbl>, gdpPercap_1987 <dbl>,
## #   gdpPercap_1992 <dbl>, gdpPercap_1997 <dbl>, gdpPercap_2002 <dbl>,
## #   gdpPercap_2007 <dbl>, lifeExp_1952 <dbl>, lifeExp_1957 <dbl>,
## #   lifeExp_1962 <dbl>, lifeExp_1967 <dbl>, lifeExp_1972 <dbl>,
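As an aside, when column names combine a variable name and a year like this, pivot_longer() can split them in one step. The sketch below is not part of the walkthrough that follows: it uses a tiny hand-typed stand-in (`wide`, with rounded values copied from the printed output) rather than the downloaded file, and it assumes that "_" appears only between the variable name and the year.

```r
library(tidyr)
library(tibble)

# Hand-typed stand-in for a few columns of the wide gapminder data
wide <- tribble(
  ~country,  ~gdpPercap_1952, ~gdpPercap_1957, ~lifeExp_1952, ~lifeExp_1957,
  "Algeria", 2449,            3014,            43.1,          45.7,
  "Angola",  3521,            3828,            30.0,          32.0
)

# Split each column name at "_": the first piece (".value") names the
# output column, the second piece goes into a new "year" column
long <- wide %>%
  pivot_longer(
    -country,
    names_to = c(".value", "year"),
    names_sep = "_"
  )
long
```

The special name ".value" tells pivot_longer() that this part of the column name should become the name of a value column, so gdpPercap and lifeExp end up as separate columns with one row per country and year.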
For our illustration here, let us simplify the wide gapminder data frame by keeping only continent, country, and the columns starting with “life”, so that we are left with the lifeExp variable for each year.
gapminder_life <- gapminder %>%
  select(continent, country, starts_with("life"))
head(gapminder_life)
Now we can see that column names specify lifeExp for each year in our data set.
## # A tibble: 6 x 14
##   continent country lifeExp_1952 lifeExp_1957 lifeExp_1962 lifeExp_1967
##   <chr>     <chr>          <dbl>        <dbl>        <dbl>        <dbl>
## 1 Africa    Algeria         43.1         45.7         48.3         51.4
## 2 Africa    Angola          30.0         32.0         34           36.0
## 3 Africa    Benin           38.2         40.4         42.6         44.9
## 4 Africa    Botswa…         47.6         49.6         51.5         53.3
## 5 Africa    Burkin…         32.0         34.9         37.8         40.7
## 6 Africa    Burundi         39.0         40.5         42.0         43.5
## # … with 8 more variables: lifeExp_1972 <dbl>, lifeExp_1977 <dbl>,
To convert the wide data frame to tidy form, where each column is a variable and each row is an observation, we can use the pivot_longer() function from the new version of tidyr.
pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns.
In the simplest use case here, we first specify which columns need to be reshaped. Since the first two columns, continent and country, are already variables in tidy form, we specify that we need to reshape all columns except these two. Then we give the name of the new variable that will hold the old column names using the “names_to” argument, and the name of the new variable that will hold the values using the “values_to” argument. Note that these arguments take the new variable names in quotes, as the variables are not yet present in the data frame.
gapminder_life %>%
  pivot_longer(-c(continent, country),
               names_to = "year",
               values_to = "lifeExp")
The result of pivot_longer() is a tibble with four columns: the first two are the original columns and the remaining two are the new ones that we created. We can see that the variable year contains the column names of the wide data frame (for example, lifeExp_1952) and that lifeExp contains the actual values.
## # A tibble: 1,704 x 4
##   continent country year         lifeExp
##   <chr>     <chr>   <chr>          <dbl>
## 1 Africa    Algeria lifeExp_1952    43.1
## 2 Africa    Algeria lifeExp_1957    45.7
## 3 Africa    Algeria lifeExp_1962    48.3
## 4 Africa    Algeria lifeExp_1967    51.4
## 5 Africa    Algeria lifeExp_1972    54.5
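Note that the year column above still carries the “lifeExp_” prefix. One possible way to clean that up during the reshape is the names_prefix argument of pivot_longer(). The sketch below uses a small hand-typed stand-in for gapminder_life (`life_wide`, with values copied from the output shown earlier) so that it runs without downloading the data.

```r
library(tidyr)
library(tibble)

# Small stand-in for gapminder_life (two countries, two years)
life_wide <- tribble(
  ~continent, ~country,  ~lifeExp_1952, ~lifeExp_1957,
  "Africa",   "Algeria", 43.1,          45.7,
  "Africa",   "Angola",  30.0,          32.0
)

life_long <- life_wide %>%
  pivot_longer(
    -c(continent, country),
    names_to = "year",
    names_prefix = "lifeExp_",  # strip this prefix from the column names
    values_to = "lifeExp"
  )

# The names arrive as character strings, so convert them explicitly
life_long$year <- as.integer(life_long$year)
life_long
```

With the prefix removed and the column converted, year can be used directly for filtering or plotting.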
Let us see an example of how to use pivot_wider() to convert a data frame in tidy form to a data frame in non-tidy/wider form. Let us use the tidy data frame from the above example.
gapminder_tidy <- gapminder_life %>%
  pivot_longer(-c(continent, country),
               names_to = "year",
               values_to = "lifeExp")
Now we have the tidy, tall data frame. Let us use pivot_wider() to reshape the tidy data into a wide data frame. First, we need to specify which column of the tidy data frame should supply the column names in the wide form. In our example, the values of year should become the column names of the wide/non-tidy data, so we provide year to the argument “names_from”.
And then we specify which column/variable should be values in non-tidy data frame as argument to “values_from”.
gapminder_tidy %>% pivot_wider(names_from = year, values_from = lifeExp)
The result of pivot_wider() is our gapminder life expectancy data frame, back in wide form.
## # A tibble: 142 x 14
##   continent country lifeExp_1952 lifeExp_1957 lifeExp_1962 lifeExp_1967
##   <chr>     <chr>          <dbl>        <dbl>        <dbl>        <dbl>
## 1 Africa    Algeria         43.1         45.7         48.3         51.4
## 2 Africa    Angola          30.0         32.0         34           36.0
## 3 Africa    Benin           38.2         40.4         42.6         44.9
## 4 Africa    Botswa…         47.6         49.6         51.5         53.3
## 5 Africa    Burkin…         32.0         34.9         37.8         40.7
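To convince ourselves that the two functions undo each other, we can run a round trip on a small table and compare the result with the starting point. This is only a sketch on a hand-typed stand-in (`life_wide`), not on the full gapminder data.

```r
library(tidyr)
library(tibble)

# Small stand-in for gapminder_life (two countries, two years)
life_wide <- tribble(
  ~continent, ~country,  ~lifeExp_1952, ~lifeExp_1957,
  "Africa",   "Algeria", 43.1,          45.7,
  "Africa",   "Angola",  30.0,          32.0
)

# Wide -> long -> wide again
round_trip <- life_wide %>%
  pivot_longer(-c(continent, country),
               names_to = "year",
               values_to = "lifeExp") %>%
  pivot_wider(names_from = year, values_from = lifeExp)

# Compare as plain data frames; this should hold if nothing was lost
all.equal(as.data.frame(life_wide), as.data.frame(round_trip))
```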
Even in these simple examples, the argument names of pivot_longer() and pivot_wider() (names_to and values_to, names_from and values_from) say what they do more directly than the key and value arguments of gather() and spread().
Look forward to examples of the other new functions in tidyr 1.0.0 soon.
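For readers coming from the older functions, here is a side-by-side sketch of the same reshape written both ways, again on a small hand-typed stand-in (`life_wide`) rather than the full data.

```r
library(tidyr)
library(tibble)

# Small stand-in for gapminder_life (two countries, two years)
life_wide <- tribble(
  ~continent, ~country,  ~lifeExp_1952, ~lifeExp_1957,
  "Africa",   "Algeria", 43.1,          45.7,
  "Africa",   "Angola",  30.0,          32.0
)

# Old interface: key/value name the new columns
old_long <- life_wide %>%
  gather(key = "year", value = "lifeExp", -continent, -country)

# New interface: names_to/values_to play the same role
new_long <- life_wide %>%
  pivot_longer(-c(continent, country),
               names_to = "year",
               values_to = "lifeExp")

# And back to wide form, old and new
old_wide <- old_long %>% spread(key = year, value = lifeExp)
new_wide <- new_long %>% pivot_wider(names_from = year, values_from = lifeExp)
```

Both long results contain the same rows, although the row order differs: gather() stacks the data one column at a time, while pivot_longer() works through it one row at a time.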