tidyr’s pivot_longer(): Reshape Wide Data to Long/Tidy Data

pivot_longer(): tidyr

One of the most common activities while doing data analysis is to reshape data from one form to another. For human eyes and data collection, often it is easier to work with data in wider form. However, for analyzing data it is more convenient to have the data in tidy/long form in most circumstances. tidyr, R package part of tidyverse, offer a number of functions to manipulate datasets in wide or long form.

pivot_longer() and pivot_wider() are two core functions of tidyr version 1.0.0 and above for reshaping data. In this post, we will see examples of using pivot_longer() to convert data in wide form to data in long or tidy form.

As tidyr defines, Tidy data is data where:

  • Every column is variable.
  • Every row is an observation.
  • Every cell is a single value.
  • Want to learn more about tidy data? Check out the original paper from Hadley Wickham.

    Let us get started by loading tidyverse, suite of R packages from RStudio.

    library(tidyverse)
    

    We will use historical mobile phone data from TidyTuesday project. For this post, we need data in wide form. We used pivot_wider() from tidyr as shown in this post to get mobile data in wide form and saved the results in cmdlinetips’s github page.

    Reshape a simple wide data to tidy/long data

    To start with, we will use a data set in simple wide form and load it.

    wide_df  <- readr::read_tsv("https://raw.githubusercontent.com/cmdlinetips/data/master/mobile_subscription_wide_simple_tidytuesday.tsv")
    

    We can see that the data is in wide form and the column names are years and the values in the dataframe is mobile subscribers.

    head(wide_df)
    ## # A tibble: 6 x 30
    ##   code  continent  `1990` `1991` `1992`  `1993`  `1994` `1995` `1996`  `1997`
    ##   <chr> <chr>       <dbl>  <dbl>  <dbl>   <dbl>   <dbl>  <dbl>  <dbl>   <dbl>
    ## 1 AFG   Asia      0       0      0      0       0       0      0       0     
    ## 2 ALB   Europe    0       0      0      0       0       0      0.0744  0.107 
    ## 3 DZA   Africa    0.00181 0.0180 0.0176 0.0172  0.00475 0.0162 0.0398  0.0582
    ## 4 ASM   Oceania   0       0      1.41   1.77    2.32    2.36   2.41    2.55  
    ## 5 AND   Europe    0       0      1.31   1.28    1.25    4.42   8.53   13.4   
    ## 6 AGO   Africa    0       0      0      0.00821 0.0132  0.0140 0.0225  0.0467
    ## # … with 20 more variables: `1998` <dbl>, `1999` <dbl>, `2000` <dbl>,
    ## #   `2001` <dbl>, `2002` <dbl>, `2003` <dbl>, `2004` <dbl>, `2005` <dbl>,
    ## #   `2006` <dbl>, `2007` <dbl>, `2008` <dbl>, `2009` <dbl>, `2010` <dbl>,
    ## #   `2011` <dbl>, `2012` <dbl>, `2013` <dbl>, `2014` <dbl>, `2015` <dbl>,
    ## #   `2016` <dbl>, `2017` <dbl>
    

    Let us start with a simple example, where we keep one of the columns as the same in tidy form. To do that we will remove one column and use the remaining to reshape to tidy form

    pivot_longer:  Reshape wide data tidy/long data
    pivot_longer: Reshape wide data tidy/long data

    We use pivot_longer() and specify which column we want to keep in the tidy form first. And then specify a variable for the column names using “names_to” argument and a variable for the values in the wide form using “values_to” argument.

    In our example, we would like to keep the continent column in the wide dataframe and specify names_to=”year” and values_to = “mobile_subs”.

    wide_df %>%
      select(-code) %>%
      pivot_longer(-continent,
                   names_to="year",
                   values_to = "mobile_subs")
    

    And we get a dataframe in tidy or long form with three columns.

    ## # A tibble: 6,944 x 3
    ##    continent year  mobile_subs
    ##    <chr>     <chr>       <dbl>
    ##  1 Asia      1990            0
    ##  2 Asia      1991            0
    ##  3 Asia      1992            0
    ##  4 Asia      1993            0
    ##  5 Asia      1994            0
    ##  6 Asia      1995            0
    ##  7 Asia      1996            0
    ##  8 Asia      1997            0
    ##  9 Asia      1998            0
    ## 10 Asia      1999            0
    ## # … with 6,934 more rows
    

    Sometimes, you might want to keep more than one variables as its in the reshaped data. The way to do it is to specify the names of variables to keep as vector with negative sign or negate operator.

    wide_df %>%
      pivot_longer(-c(continent,code),
                   names_to="year",
                   values_to = "mobile_subs")
    

    This will keep the two columns we specified as it is and reshape the rest of the data. Now we get tidy dataframe with four columns.

    ## # A tibble: 6,944 x 4
    ##    code  continent year  mobile_subs
    ##    <chr> <chr>     <chr>       <dbl>
    ##  1 AFG   Asia      1990            0
    ##  2 AFG   Asia      1991            0
    ##  3 AFG   Asia      1992            0
    ##  4 AFG   Asia      1993            0
    ##  5 AFG   Asia      1994            0
    ##  6 AFG   Asia      1995            0
    ##  7 AFG   Asia      1996            0
    ##  8 AFG   Asia      1997            0
    ##  9 AFG   Asia      1998            0
    ## 10 AFG   Asia      1999            0
    ## # … with 6,934 more rows
    

    pivot_longer() with multiple variables

    Most often our wide dataframe is more complex and might have multiple variables that need to be reshaped. To illustrate how to use pivot_longer to reshape wide dataframe with multiple variable, we will be using historical mobile data with multiple variables.

    Here is a cartoon of wide dataframe with three variables and the resulting tidy data we would like to have.

    pivot_longer with multiple variables

    We have wide dataframe with mobile subscribers, gdp and total population in wide form at cmdlinetips.com’s github page. Let us load that data directly from github.

    wide_multi_df  <- readr::read_tsv('https://raw.githubusercontent.com/cmdlinetips/data/master/mobile_subscription_wide_multiple_variables_tidytuesday.tsv')
    

    And our data in wide form looks like this. Note that column names has variable name and year information and the delimiter separating them is dot “.”.

    wide_multi_df
    ## # A tibble: 248 x 86
    ##    code  continent mobile_subs.1990 mobile_subs.1991 mobile_subs.1992
    ##    <chr> <chr>                <dbl>            <dbl>            <dbl>
    ##  1 AFG   Asia               0                 0                0     
    ##  2 ALB   Europe             0                 0                0     
    ##  3 DZA   Africa             0.00181           0.0180           0.0176
    ##  4 ASM   Oceania            0                 0                1.41  
    ##  5 AND   Europe             0                 0                1.31  
    ##  6 AGO   Africa             0                 0                0     
    ##  7 AIA   Americas          NA                NA               NA     
    ##  8 ATG   Americas           0                NA               NA     
    ##  9 ARG   Americas           0.0367            0.0753           0.138 
    ## 10 ARM   Asia               0                 0                0     
    ## # … with 238 more rows, and 81 more variables: mobile_subs.1993 <dbl>,
    ## #   mobile_subs.1994 <dbl>, mobile_subs.1995 <dbl>, mobile_subs.1996 <dbl>,
    

    To reshape wide data with multiple variables using pivot_longer, we need first select columns of interest corresponding to multiple variables. In the example below, we suse dplyr’s starts_with() function select columns starting with strings corresponding to three variables. Then we use names_to and values_to arguments of pivot_longer() to create new variables in tidy/long form.

    tidy_df <- wide_multi_df %>%
      pivot_longer( #cols = mobile_subs.1990:total_pop.2017,
                    cols = c(starts_with("mobile"),
                             starts_with("gdp"),
                             starts_with("total")),
                    names_to = "var_name",
                    values_to = "val"
    )
    

    And we get data in tidy form with four columns of interest.

    tidy_df
    
    ## # A tibble: 20,832 x 4
    ##    code  continent var_name           val
    ##    <chr> <chr>     <chr>            <dbl>
    ##  1 AFG   Asia      mobile_subs.1990     0
    ##  2 AFG   Asia      mobile_subs.1991     0
    ##  3 AFG   Asia      mobile_subs.1992     0
    ##  4 AFG   Asia      mobile_subs.1993     0
    ##  5 AFG   Asia      mobile_subs.1994     0
    ##  6 AFG   Asia      mobile_subs.1995     0
    ##  7 AFG   Asia      mobile_subs.1996     0
    ##  8 AFG   Asia      mobile_subs.1997     0
    ##  9 AFG   Asia      mobile_subs.1998     0
    ## 10 AFG   Asia      mobile_subs.1999     0
    ## # … with 20,822 more rows
    

    Often we like the data in some kind of intermediate wide form. Here is a cartoon illustration of the reshape we are interested in.

    tidy data to intermediate wide

    For example in this case, we would like to have three variables to be on separate columns. We can use pivot_wider() to create the dataframe with three variables on separate column. At first, we have to separate the variable name and year variables. One of the ways to do that is to use separate() function tidyr.

    tidy_df %>%
      separate(var_name,c("var_n","year"), sep="\\.") %>%
      pivot_wider(names_from = var_n,
                  values_from=val)
    

    And here is the user friendly dataframe reshaped from tidy dataframe.

    ## # A tibble: 6,944 x 6
    ##    code  continent year  mobile_subs gdp_per_cap total_pop
    ##    <chr> <chr>     <chr>       <dbl>       <dbl>     <dbl>
    ##  1 AFG   Asia      1990            0          NA  13032161
    ##  2 AFG   Asia      1991            0          NA  14069854
    ##  3 AFG   Asia      1992            0          NA  15472076
    ##  4 AFG   Asia      1993            0          NA  17053213
    ##  5 AFG   Asia      1994            0          NA  18553819
    ##  6 AFG   Asia      1995            0          NA  19789880
    ##  7 AFG   Asia      1996            0          NA  20684982
    ##  8 AFG   Asia      1997            0          NA  21299350
    ##  9 AFG   Asia      1998            0          NA  21752257
    ## 10 AFG   Asia      1999            0          NA  22227543
    ## # … with 6,934 more rows