How to Randomly Select Groups in R with dplyr?

Sampling, that is, randomly subsetting your data, is useful in many situations. If you want to sample rows at random without regard to groups, you can use the sample_n() function from dplyr.
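As a quick illustration of plain row sampling, here is a minimal sketch on a small made-up tibble (the name toy and its columns are assumptions for this example, not part of the gapminder data used later).

```r
library(dplyr)

# Toy data (hypothetical): 3 groups with 4 rows each
toy <- tibble(
  group = rep(c("a", "b", "c"), each = 4),
  value = 1:12
)

set.seed(1)
# sample_n() picks 5 rows at random and pays no attention to `group`.
# In dplyr 1.0.0 and later, slice_sample(n = 5) is the recommended equivalent.
rows <- toy %>% sample_n(5)
rows
```

The sampled rows can come from any mix of groups, which is why this approach alone cannot return whole groups.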

Sometimes you might want to sample one or multiple groups with all elements/rows within the selected group(s).

However, sampling one or more groups with all elements is slightly tricky.

Thankfully, someone else has already faced the same problem and has a solution. Here is an illustration of randomly sampling groups in R with dplyr.

Let us first load the packages needed. We will use the dataset from the gapminder package.

library(tidyverse)
library(gapminder)

The gapminder dataset has life expectancy (lifeExp), population (pop) and GDP per capita (gdpPercap) for multiple continents, countries and years.

head(gapminder, n=3)

## # A tibble: 3 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.

Let us say we want to randomly select 2 countries, from 142 countries, with all the data corresponding to these two countries.

Here are step-by-step instructions to sample groups from a dataframe in R.

Step 1: Select grouping variable

First, we need to choose the grouping variable of interest. In this example, our grouping variable is country.

group_var <- gapminder %>% 
  group_by(country) %>%
  groups %>%
  unlist %>% 
  as.character

group_var

## [1] "country"
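dplyr also exports group_vars(), which returns the grouping variable names as a character vector directly. The sketch below compares it with the pipeline above on a made-up tibble (toy is an assumption for this example).

```r
library(dplyr)

# Toy data (hypothetical): 2 groups with 2 rows each
toy <- tibble(
  group = rep(c("a", "b"), each = 2),
  value = 1:4
)
grouped <- toy %>% group_by(group)

# The pipeline used above: list of symbols, converted to character
via_groups <- grouped %>% groups() %>% unlist() %>% as.character()

# group_vars() gives the grouping column names in one call
via_helper <- group_vars(grouped)

identical(via_groups, via_helper)
```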

Step 2: Select groups randomly

Next we need to select two countries randomly, so that we can use them to select all the data corresponding to these countries. Here we randomly select two countries and also assign a unique ID to each country.

set.seed(42)
random_country <- gapminder %>% 
  group_by(country) %>% 
  summarise() %>% 
  sample_n(2) %>% 
  mutate(unique_id=1:NROW(.))

random_country
## # A tibble: 2 x 2
##   country unique_id
##   <fct>       <int>
## 1 Ghana           1
## 2 Italy           2
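To see why this works, note that summarise() with no arguments on a grouped dataframe returns one row per group, so sampling rows of that result is the same as sampling groups. A minimal sketch on a made-up tibble (toy is an assumption for this example):

```r
library(dplyr)

# Toy data (hypothetical): 4 groups with 3 rows each
toy <- tibble(
  group = rep(c("a", "b", "c", "d"), each = 3),
  value = 1:12
)

# summarise() with no arguments keeps only the grouping column,
# collapsed to one row per group
one_row_per_group <- toy %>%
  group_by(group) %>%
  summarise()

set.seed(42)
# Sampling 2 rows here means sampling 2 groups
picked <- one_row_per_group %>%
  sample_n(2) %>%
  mutate(unique_id = row_number())
picked
```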

Step 3: Select rows corresponding to the groups

Now that we have the two random countries we need, we use them to select all rows corresponding to those countries. We do that with a right join of the grouped dataframe and the random country tibble.

gapminder %>% 
  group_by(country)  %>% 
  right_join(random_country, by=group_var) %>%
  group_by_(group_var) 

Note that this is a temporary solution, as group_by_() is deprecated. You will see the following warning.

## Warning: group_by_() is deprecated. 
## Please use group_by() instead
## 
## The 'programming' vignette or the tidyeval book can help you
## to program with group_by() : https://tidyeval.tidyverse.org
## This warning is displayed once per session.

In the long term we need a better solution that uses tidy evaluation to pass the grouping variable as a variable. Not for now 🙂. For now, we get the result we wanted. Here we have two randomly selected countries and all their rows.

## # A tibble: 24 x 7
## # Groups:   country [2]
##    country continent  year lifeExp      pop gdpPercap unique_id
##    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>     <int>
##  1 Ghana   Africa     1952    43.1  5581001      911.         1
##  2 Ghana   Africa     1957    44.8  6391288     1044.         1
##  3 Ghana   Africa     1962    46.5  7355248     1190.         1
##  4 Ghana   Africa     1967    48.1  8490213     1126.         1
##  5 Ghana   Africa     1972    49.9  9354120     1178.         1
##  6 Ghana   Africa     1977    51.8 10538093      993.         1
##  7 Ghana   Africa     1982    53.7 11400338      876.         1
##  8 Ghana   Africa     1987    55.7 14168101      847.         1
##  9 Ghana   Africa     1992    57.5 16278738      925.         1
## 10 Ghana   Africa     1997    58.6 18418288     1005.         1
## # … with 14 more rows
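One possible tidy-evaluation replacement for the deprecated group_by_() call is to turn the column name into a symbol and unquote it inside group_by(). This is only a sketch on made-up data: toy and picked are hypothetical stand-ins for gapminder and random_country, and it assumes the rlang package is installed (it is a dependency of dplyr).

```r
library(dplyr)

# Toy data (hypothetical): 4 groups with 3 rows each
toy <- tibble(
  group = rep(c("a", "b", "c", "d"), each = 3),
  value = 1:12
)

# Stand-in for the randomly selected groups from Step 2
picked <- tibble(group = c("b", "d"), unique_id = 1:2)

# Grouping variable name stored as a string, as in Step 1
group_var <- "group"

result <- toy %>%
  right_join(picked, by = group_var) %>%
  # syms() converts the string(s) to symbols; !!! splices them into group_by().
  # In dplyr 1.0.0 and later, group_by(across(all_of(group_var))) is another option.
  group_by(!!!rlang::syms(group_var))
result
```

Because syms() accepts a character vector of any length, the same call also handles more than one grouping column.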

One can also wrap this into a small function, as shown below.

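sample_n_groups <- function(grouped_df, size, replace = FALSE, weight = NULL) {
  # Names of the grouping variable(s) as a character vector
  grp_var <- grouped_df %>% 
    groups %>%
    unlist %>% 
    as.character
  # One row per group, then sample `size` groups and give each draw an ID
  random_grp <- grouped_df %>% 
    summarise() %>% 
    sample_n(size, replace, weight) %>% 
    mutate(unique_id = 1:NROW(.))
  # Keep every row belonging to a sampled group and restore the grouping
  grouped_df %>% 
    right_join(random_grp, by = grp_var) %>% 
    group_by_(grp_var) 
}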

And use it with the pipe operator.

set.seed(42)
gapminder %>% group_by(country) %>% sample_n_groups(2)

We will get the same results as before.

## # A tibble: 24 x 7
## # Groups:   country [2]
##    country continent  year lifeExp      pop gdpPercap unique_id
##    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>     <int>
##  1 Ghana   Africa     1952    43.1  5581001      911.         1
##  2 Ghana   Africa     1957    44.8  6391288     1044.         1
##  3 Ghana   Africa     1962    46.5  7355248     1190.         1
##  4 Ghana   Africa     1967    48.1  8490213     1126.         1
##  5 Ghana   Africa     1972    49.9  9354120     1178.         1
##  6 Ghana   Africa     1977    51.8 10538093      993.         1
##  7 Ghana   Africa     1982    53.7 11400338      876.         1
##  8 Ghana   Africa     1987    55.7 14168101      847.         1
##  9 Ghana   Africa     1992    57.5 16278738      925.         1
## 10 Ghana   Africa     1997    58.6 18418288     1005.         1
## # … with 14 more rows
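If the unique_id column is not needed, a filtering join is another way to keep all rows of the sampled groups. The function below is a hypothetical variant (the name sample_groups_semi and the toy data are assumptions for this sketch, not from the original solution), and it only covers sampling without replacement.

```r
library(dplyr)

# Hypothetical variant of sample_n_groups() based on semi_join()
sample_groups_semi <- function(grouped_df, size) {
  grp_var <- group_vars(grouped_df)
  # One row per group; ungroup first so sampling happens across groups
  picked <- grouped_df %>%
    ungroup() %>%
    distinct(!!!rlang::syms(grp_var)) %>%
    sample_n(size)
  # semi_join() keeps the rows of grouped_df whose keys appear in `picked`
  # and adds no new columns
  grouped_df %>% semi_join(picked, by = grp_var)
}

# Toy data (hypothetical): 4 groups with 3 rows each
toy <- tibble(
  group = rep(c("a", "b", "c", "d"), each = 3),
  value = 1:12
)

set.seed(1)
res <- toy %>% group_by(group) %>% sample_groups_semi(2)
res
```

The right join in sample_n_groups() remains the better fit when sampling with replacement, since a group drawn twice then appears twice and unique_id tells the copies apart.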