Introduction to nest() in tidyr

April 16, 2019 by cmdlinetips

Grouping data in specific ways and analyzing each group is one of the most common ways to make interesting observations about data. The R tidyverse offers a fantastic tool set for analyzing data by group. dplyr’s group_by() is one of the basic verbs and is extremely useful in most common data analysis scenarios.
nest() creates a list of data frames containing all the nested variables: this seems to be the most useful form in practice.

tidyr’s nest() helps with more general group-wise operations.
Before explaining what nest() actually does, let us think of a scenario where you have a data frame with multiple columns, like the gapminder data, which has 6 columns including one for continent.

When we use group_by() on continent, we can think of group_by() as splitting the data into a smaller data frame for each value of the continent variable. For example, the gapminder data has 5 continents, so group_by(continent) defines five smaller data frames, one for each continent.

The nest() function in tidyr, in combination with group_by(continent), makes these smaller data frames available to us as a list within a data frame. This can be extremely handy for any downstream analysis. Let us see an example of using nest() with the gapminder data set.

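Let us load the packages we need. Loading tidyverse attaches tidyr along with dplyr, purrr and ggplot2, all of which we use below.

library(tidyverse)
library(gapminder)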

For example, if we use nest() after using group_by() on continent from gapminder data, we get a tibble.

nested_by_continent <- gapminder %>% 
  group_by(continent) %>%
  nest()

The tibble (a kind of data frame object) has one row for each continent, along with a smaller data frame holding the other variables for that continent, stored in a list column.

## # A tibble: 5 x 2
##   continent data              
##   <fct>     <list>            
## 1 Asia      <tibble [396 × 5]>
## 2 Europe    <tibble [360 × 5]>
## 3 Africa    <tibble [624 × 5]>
## 4 Americas  <tibble [300 × 5]>
## 5 Oceania   <tibble [24 × 5]>
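Nesting does not lose any rows. As a minimal sketch, assuming tidyr 1.0 or later (where unnest() takes a cols argument), we can expand the list column again and check that we get back as many rows as we started with. The name restored is only for illustration.

```r
library(dplyr)
library(tidyr)
library(gapminder)

nested_by_continent <- gapminder %>%
  group_by(continent) %>%
  nest()

# One row per continent; the other five columns live inside `data`
nrow(nested_by_continent)

# unnest() expands the list column back into ordinary rows
restored <- nested_by_continent %>%
  unnest(cols = data) %>%
  ungroup()

# Same number of rows and columns as the original gapminder data
nrow(restored) == nrow(gapminder)
ncol(restored) == ncol(gapminder)
```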

Note that in the tibble the first column is named continent, the variable we grouped by before calling nest(), and the second column is named “data” by default. We can access the first column using [[]] notation.

nested_by_continent[['continent']]

The data column is a list, with each element holding the small tibble/data frame for the corresponding continent. In our example, the first element in data is the tibble for the continent Asia, and it is of size 396 x 5. We can access the first element in data by using [[]][[]] notation.
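Access by position depends on which continent happens to land in the first row. A small sketch of looking a nested data frame up by continent name instead is shown here; the names first_tbl, asia_row and asia_tbl are just for illustration.

```r
library(dplyr)
library(tidyr)
library(gapminder)

nested_by_continent <- gapminder %>%
  group_by(continent) %>%
  nest()

# Position-based access: whichever continent is in row 1
first_tbl <- nested_by_continent[["data"]][[1]]

# Name-based access: keep the row we want, then take its single list element
asia_row <- nested_by_continent %>%
  filter(continent == "Asia")
asia_tbl <- asia_row[["data"]][[1]]

# Rows and columns of the nested Asia tibble
dim(asia_tbl)
```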

Adding linear model objects to tibble

The biggest benefit of nesting lies in the downstream computing it makes easy. For example, let us say we are interested in fitting a linear regression model between lifeExp and gdpPercap for each continent and saving the fitted model for later use.

Let us work towards doing this in a tidy way. Basically, we want to take each smaller data frame in the nested object and build a linear regression model on it. Let us start with one data frame first.

We can access the first continent’s data frame using

nested_by_continent[['data']][[1]]

We can build a linear model between lifeExp and gdpPercap

fit <- lm(lifeExp ~ gdpPercap, 
          data=nested_by_continent[['data']][[1]])

This gives us a linear model for the data frame we feed in. Let us turn that into a function that takes a data frame as input and applies lm().

le_vs_gdpPercap <- function(df) {
  lm(lifeExp ~ gdpPercap, data = df)
}

The above function generalizes fitting lm() on lifeExp and gdpPercap. We can use it on each of the data frames in the list column of the nested object.
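Before mapping over every continent, it can help to try the function on a single data frame. The sketch below builds one continent's data by hand with filter() (an assumption made here for illustration, not a step from the walkthrough) and checks what the function returns.

```r
library(dplyr)
library(gapminder)

le_vs_gdpPercap <- function(df) {
  lm(lifeExp ~ gdpPercap, data = df)
}

# Build one continent's data frame by hand, without nest()
europe <- gapminder %>%
  filter(continent == "Europe")

fit_europe <- le_vs_gdpPercap(europe)

class(fit_europe)              # the function returns an "lm" object
coef(fit_europe)               # named vector: intercept and gdpPercap slope
summary(fit_europe)$r.squared  # one number summarising the fit
```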

Instead of writing a for loop to go through each smaller data frame, we can use the map() function from the purrr package to do the work for us.

For example, we can map over the first two continents’ data frames using

map(nested_by_continent$data[1:2], le_vs_gdpPercap)

We get the following results: a list of two fitted linear model objects, each printed with its coefficients.

 
## [[1]]
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap, data = df)
## 
## Coefficients:
## (Intercept)    gdpPercap  
##   5.751e+01    3.227e-04  
## 
## 
## [[2]]
## 
## Call:
## lm(formula = lifeExp ~ gdpPercap, data = df)
## 
## Coefficients:
## (Intercept)    gdpPercap  
##   6.534e+01    4.535e-04
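The printed list is labelled only by position ([[1]], [[2]]). One possible way to make the results easier to read, sketched here with purrr's set_names(), is to name the list of data frames by continent before mapping. The name fits is just for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(gapminder)

le_vs_gdpPercap <- function(df) {
  lm(lifeExp ~ gdpPercap, data = df)
}

nested_by_continent <- gapminder %>%
  group_by(continent) %>%
  nest()

# Name each data frame by its continent, then fit one model per element;
# map() keeps the names on the list it returns
fits <- nested_by_continent$data %>%
  set_names(as.character(nested_by_continent$continent)) %>%
  map(le_vs_gdpPercap)

names(fits)
coef(fits[["Oceania"]])
```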

Now we can combine the map() function with nest() and mutate() to add the linear model for each continent to a tibble, just like we added the smaller data frames to the tibble.

We already have the tibble from nest(), and we can feed it to the mutate() function, where we map the data column to the linear model function we wrote.

 
nested_by_continent %>%
 mutate(fit = map(data, le_vs_gdpPercap))

The result is a tibble with an extra column containing the linear model objects as a list.

 
## # A tibble: 5 x 3
##   continent data               fit     
##   <fct>     <list>             <list>  
## 1 Asia      <tibble [396 × 5]> <S3: lm>
## 2 Europe    <tibble [360 × 5]> <S3: lm>
## 3 Africa    <tibble [624 × 5]> <S3: lm>
## 4 Americas  <tibble [300 × 5]> <S3: lm>
## 5 Oceania   <tibble [24 × 5]>  <S3: lm>
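Once the models sit in a list column, we can pull plain numbers out of them with the typed map variants. The following is a sketch, not the only way to do it: it uses map_dbl() to extract the slope and R-squared of each model, and the column names slope and r_squared are made up for this example.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(gapminder)

le_vs_gdpPercap <- function(df) {
  lm(lifeExp ~ gdpPercap, data = df)
}

model_stats <- gapminder %>%
  group_by(continent) %>%
  nest() %>%
  ungroup() %>%   # drop the grouping so mutate() sees whole columns
  mutate(
    fit       = map(data, le_vs_gdpPercap),
    # map_dbl() returns a plain numeric vector instead of a list
    slope     = map_dbl(fit, ~ coef(.x)[["gdpPercap"]]),
    r_squared = map_dbl(fit, ~ summary(.x)$r.squared)
  ) %>%
  select(continent, slope, r_squared)

model_stats
```

The result is an ordinary tibble with one row per continent, which is easier to sort, filter or plot than a column of lm objects.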

Adding ggplot objects to tibble

Nesting is such a useful intermediate step that one can store any kind of object for further use downstream. For example, instead of storing a linear model for each continent, we can store a ggplot object for a scatter plot between lifeExp and gdpPercap.

Let us write a small function to make the scatter plot.

 
# scatter plot
plot_le_gdp <- function(df){
  p <- df %>% ggplot(aes(x=lifeExp, y=gdpPercap)) +
    geom_point(alpha=0.5) + 
    scale_y_log10()
  return(p)
}    

We can use the function in combination with nest(), mutate() and map() to add the plot objects to the resulting tibble.

nested_by_continent %>% 
   mutate(plot = map(data, plot_le_gdp))

We get a tibble where each row holds a continent, its tibble and the ggplot object for its scatter plot.

## # A tibble: 5 x 3
##   continent data               plot    
##   <fct>     <list>             <list>  
## 1 Asia      <tibble [396 × 5]> <S3: gg>
## 2 Europe    <tibble [360 × 5]> <S3: gg>
## 3 Africa    <tibble [624 × 5]> <S3: gg>
## 4 Americas  <tibble [300 × 5]> <S3: gg>
## 5 Oceania   <tibble [24 × 5]>  <S3: gg>
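Stored plot objects can be pulled back out of the list column like any other element. As a rough sketch, the code below uses map2() to pass the continent name along with each data frame so that every plot gets a title. The column name plot and the name plots_by_continent are assumptions made for this example.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(ggplot2)
library(gapminder)

# scatter plot of lifeExp against gdpPercap (log scale)
plot_le_gdp <- function(df) {
  df %>%
    ggplot(aes(x = lifeExp, y = gdpPercap)) +
    geom_point(alpha = 0.5) +
    scale_y_log10()
}

plots_by_continent <- gapminder %>%
  group_by(continent) %>%
  nest() %>%
  ungroup() %>%
  # map2() walks over two inputs in parallel: the data and its continent name
  mutate(plot = map2(data, as.character(continent),
                     ~ plot_le_gdp(.x) + ggtitle(.y)))

# Pull one stored plot back out of the list column
first_plot <- plots_by_continent$plot[[1]]
```

Typing first_plot at the console (or calling print(first_plot)) draws the scatter plot for the continent in the first row.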

Storing ggplot object may not be that useful, but isn’t nest() cool?
