Introduction to nest() in tidyr

Grouping data in specific ways and analyzing each group is one of the most common ways to make interesting observations about a dataset. The R tidyverse offers a fantastic tool set for analyzing data by grouping it in different ways. dplyr’s group_by() is one of the basic verbs and is extremely useful in most common data analysis scenarios.
nest() creates a list of data frames containing all the nested variables: this seems to be the most useful form in practice.

Tidyr’s nest() offers help in more general group-wise operations.
Before explaining what nest() actually does, let us think of a scenario where you have a dataframe with multiple columns, like the gapminder data, which has 6 columns including one for continent.

When we use group_by() on continent, we can think of group_by() as splitting the data into one smaller data frame for each value of the continent variable (strictly speaking, it only records which rows belong to which group). For example, the gapminder data has 5 continents, so group_by(continent) gives us five smaller dataframes, one for each continent.
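A quick way to check the grouping is to count the groups and the rows in each one. This is a minimal sketch that assumes the dplyr and gapminder packages are installed; the name by_continent is only for illustration.

```r
library(dplyr)
library(gapminder)

# Group the data by continent; no rows are changed or removed
by_continent <- gapminder %>%
  group_by(continent)

# Number of groups that group_by() keeps track of
n_groups(by_continent)

# Number of rows in each continent
rows_per_continent <- count(gapminder, continent)
rows_per_continent
```

The row counts printed here are the sizes of the smaller dataframes that nest() stores for each continent.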

The nest() function in tidyr, in combination with group_by(continent), makes the smaller dataframes available to us as a list within a dataframe. This can be extremely handy for any downstream analysis. Let us see an example of using nest() with the gapminder data set.

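Let us load the packages we need. The code below also uses %>% and group_by() from dplyr, map() from purrr and ggplot() from ggplot2, so we load the whole tidyverse.

library(tidyverse)  # attaches tidyr, dplyr, purrr and ggplot2
library(gapminder)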

For example, if we use nest() after using group_by() on continent from gapminder data, we get a tibble.

nested_by_continent <- gapminder %>% 
  group_by(continent) %>%
  nest()

The tibble (a kind of dataframe object) has one row for each continent and, in a list column, a smaller dataframe holding the other variables for that continent.

## # A tibble: 5 x 2
##   continent data              
##   <fct>     <list>            
## 1 Asia      <tibble [396 × 5]>
## 2 Europe    <tibble [360 × 5]>
## 3 Africa    <tibble [624 × 5]>
## 4 Americas  <tibble [300 × 5]>
## 5 Oceania   <tibble [24 × 5]>

Note that in the tibble, the first column is named continent, the variable we grouped by before calling nest(), and the second column is named “data” by default. We can access the first column using [[]] notation.

nested_by_continent[['continent']]

The data column is of list datatype and contains the small tibble/dataframe corresponding to each continent value. In our example, the first element in data is the tibble for the continent Asia and it is of size 396 x 5. We can access the first element in data by using [[]][[]] notation.
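The [[]][[]] access described above can be tried directly. This is a small sketch that rebuilds the nested tibble so it runs on its own; first_df and first_continent are illustrative names.

```r
library(dplyr)
library(tidyr)
library(gapminder)

nested_by_continent <- gapminder %>%
  group_by(continent) %>%
  nest()

# First [[ ]] picks the list column, second [[ ]] picks its first element
first_df <- nested_by_continent[["data"]][[1]]

# The continent that this smaller tibble belongs to
first_continent <- as.character(nested_by_continent[["continent"]][[1]])
first_continent

# Size and column names of the smaller tibble
dim(first_df)
names(first_df)
```

The continent column itself is not part of the inner tibble, because it has become the key that identifies each row of the nested tibble.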

Adding linear model objects to tibble

The biggest use of nesting lies in the downstream computing it makes easy. For example, let us say we are interested in fitting a linear regression model between lifeExp and gdpPercap for each continent and saving the fitted model for later use.

Let us work towards doing this in a tidy way. Basically, we want to take each smaller dataframe in the nested object and build a linear regression model on it. Let us start with one dataframe first.

We can access the first continent’s data frame using

nested_by_continent[['data']][[1]]

We can build a linear model between lifeExp and gdpPercap

fit <- lm(lifeExp ~ gdpPercap, 
          data=nested_by_continent[['data']][[1]])

This gives us a linear model for the dataframe we feed in. Let us turn that into a function that takes a data frame as input and applies lm() to it.

le_vs_gdpPercap <- function(df) {
  lm(lifeExp ~ gdpPercap, data = df)
}

The above function helps us generalize building lm on lifeExp and gdpPercap. We can use it on each of the data frames in the list in the nested object.

Instead of writing a for loop to go through each smaller dataframe, we can use the map() function from the purrr package to do the work for us.

For example, we can map over the first two continents’ dataframes using

map(nested_by_continent$data[1:2], le_vs_gdpPercap)

We get the following result: a list of two fitted linear models, each printed with its call and coefficients.

 
## [[1]]
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap, data = df)
## 
## Coefficients:
## (Intercept)    gdpPercap  
##   5.751e+01    3.227e-04  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap, data = df)
## 
## Coefficients:
## (Intercept)    gdpPercap  
##   6.534e+01    4.535e-04
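For comparison, here is a sketch of the same two fits written as an explicit for loop next to the map() version. It rebuilds the nested tibble and the model function so it runs on its own; fits_loop and fits_map are illustrative names.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(gapminder)

nested_by_continent <- gapminder %>%
  group_by(continent) %>%
  nest()

le_vs_gdpPercap <- function(df) {
  lm(lifeExp ~ gdpPercap, data = df)
}

# for loop version: pre-allocate a list and fill it one element at a time
fits_loop <- vector("list", 2)
for (i in 1:2) {
  fits_loop[[i]] <- le_vs_gdpPercap(nested_by_continent$data[[i]])
}

# map() version: same work, no bookkeeping
fits_map <- map(nested_by_continent$data[1:2], le_vs_gdpPercap)

# Both approaches give the same coefficients
all.equal(coef(fits_loop[[1]]), coef(fits_map[[1]]))
all.equal(coef(fits_loop[[2]]), coef(fits_map[[2]]))
```

The map() call returns a list of the same length as its input, which is what makes it fit naturally inside mutate() on a list column.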

Now we can combine map() with nest() and mutate() to add the fitted linear model for each continent to a tibble, just like we added the smaller dataframes to the tibble.

We already have the tibble from nest(), and we can feed it to mutate(), where we map the data column to the linear model function we wrote.

 
nested_by_continent %>%
 mutate(fit = map(data, le_vs_gdpPercap))

The result is a tibble with an extra column containing the linear model objects as a list.

 
## # A tibble: 5 x 3
##   continent data               fit     
##   <fct>     <list>             <list>  
## 1 Asia      <tibble [396 × 5]> <S3: lm>
## 2 Europe    <tibble [360 × 5]> <S3: lm>
## 3 Africa    <tibble [624 × 5]> <S3: lm>
## 4 Americas  <tibble [300 × 5]> <S3: lm>
## 5 Oceania   <tibble [24 × 5]>  <S3: lm>
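Once the models sit in a list column, they can be used later like any other column. As one possible follow-up, the sketch below pulls the intercept and slope out of each model with map_dbl() from purrr; the column names intercept and slope and the object name coefs_by_continent are assumptions made for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(gapminder)

le_vs_gdpPercap <- function(df) {
  lm(lifeExp ~ gdpPercap, data = df)
}

coefs_by_continent <- gapminder %>%
  group_by(continent) %>%
  nest() %>%
  mutate(
    fit = map(data, le_vs_gdpPercap),
    # coef() returns a named numeric vector; pick one value per model
    intercept = map_dbl(fit, ~ coef(.x)[["(Intercept)"]]),
    slope = map_dbl(fit, ~ coef(.x)[["gdpPercap"]])
  ) %>%
  ungroup() %>%
  select(continent, intercept, slope)

coefs_by_continent
```

map_dbl() is used instead of map() here because each model contributes exactly one number, so the result is an ordinary numeric column rather than a list.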

Adding ggplot objects to tibble

Nesting is such a useful intermediate step that one can store any object for further use downstream. For example, instead of storing a linear model for each continent, we can store a ggplot object for a scatter plot between lifeExp and gdpPercap.

Let us write a small function to make the scatter plot.

 
# scatter plot
plot_le_gdp <- function(df){
  p <- df %>% ggplot(aes(x=lifeExp, y=gdpPercap)) +
    geom_point(alpha=0.5) + 
    scale_y_log10()
  return(p)
}    

We can use the function in combination with nest(), mutate() and map() to add the plot object to the resulting tibble.

nested_by_continent %>% 
   mutate(plot=map(data, plot_le_gdp))

We get a tibble where each row holds a continent, its tibble and the ggplot object for its scatter plot.

## # A tibble: 5 x 3
##   continent data               plot    
##   <fct>     <list>             <list>  
## 1 Asia      <tibble [396 × 5]> <S3: gg>
## 2 Europe    <tibble [360 × 5]> <S3: gg>
## 3 Africa    <tibble [624 × 5]> <S3: gg>
## 4 Americas  <tibble [300 × 5]> <S3: gg>
## 5 Oceania   <tibble [24 × 5]>  <S3: gg>
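A stored plot can be pulled out of the list column and drawn like any other ggplot object. The sketch below rebuilds everything so it runs on its own; plots_by_continent and first_plot are illustrative names, and the scatter plot function is a compact version of the one defined earlier.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(gapminder)

# Scatter plot of gdpPercap (log scale) against lifeExp
plot_le_gdp <- function(df) {
  ggplot(df, aes(x = lifeExp, y = gdpPercap)) +
    geom_point(alpha = 0.5) +
    scale_y_log10()
}

plots_by_continent <- gapminder %>%
  group_by(continent) %>%
  nest() %>%
  mutate(plot = map(data, plot_le_gdp))

# Take the first stored plot (the first row's continent) and draw it
first_plot <- plots_by_continent$plot[[1]]
print(first_plot)
```

Nothing is drawn when the plots are created inside mutate(); a plot is rendered only when it is printed.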

Storing ggplot objects may not be that useful, but isn’t nest() cool?