Sampling, or randomly subsetting, your data is useful in many situations. If you want to sample rows at random without regard to any groups, you can use the sample_n() function from dplyr.
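As a minimal sketch of that plain row-level sampling, the snippet below draws five rows from the gapminder data used later in this post. The sample size of 5 and the name random_rows are illustrative choices only. In dplyr 1.0.0 and later, sample_n() is superseded by slice_sample(n = ...), which can be swapped in if you prefer.

```r
library(dplyr)
library(gapminder)

set.seed(42)

# Draw 5 rows at random, ignoring which country or continent they belong to
random_rows <- gapminder %>%
  sample_n(5)

random_rows
```

The sampled rows will usually come from five different countries, which is why this approach does not help when you need whole groups.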
Sometimes, however, you might want to sample one or more groups and keep all the elements/rows within the selected group(s).
Sampling whole groups like this is slightly tricky.
Thankfully, someone else has already faced the same problem and come up with a solution. Here is an illustration of randomly sampling groups in R with dplyr.
Let us first load the packages needed. We will use the dataset from the gapminder package.
library(tidyverse)
library(gapminder)
The gapminder dataset has life expectancy (lifeExp), population (pop) and GDP per capita (gdpPercap) for multiple continents, countries and years.
head(gapminder, n=3)
## # A tibble: 3 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.
Let us say we want to randomly select 2 countries, from 142 countries, with all the data corresponding to these two countries.
Here are step-by-step instructions for sampling groups from a data frame in R.
Step 1: Select Grouping variable
First, we need to choose the grouping variable of interest. In this example, our grouping variable is country.
group_var <- gapminder %>%
  group_by(country) %>%
  groups %>%
  unlist %>%
  as.character
group_var
## [1] "country"
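The same character vector can also be obtained more directly. The sketch below uses dplyr's group_vars() helper, which is not part of the original recipe but returns the names of the grouping variables without the groups/unlist/as.character chain.

```r
library(dplyr)
library(gapminder)

# group_vars() returns the grouping columns as a character vector
group_var <- gapminder %>%
  group_by(country) %>%
  group_vars()

group_var
## [1] "country"
```

With several grouping variables, for example group_by(continent, country), the result is a character vector with one element per variable.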
Step 2: Select groups randomly
Next, we need to select two countries randomly, so that we can use them to select all the data corresponding to these countries. Here we randomly select two countries and also assign a unique ID to each country.
set.seed(42)
random_country <- gapminder %>%
  group_by(country) %>%
  summarise() %>%
  sample_n(2) %>%
  mutate(unique_id=1:NROW(.))
random_country
## # A tibble: 2 x 2
##   country unique_id
##   <fct>       <int>
## 1 Ghana           1
## 2 Italy           2
Step 3: Select rows corresponding to the groups
Now that we have the two random countries we need, we use them to select all the rows corresponding to these countries. We do that with a right join of the grouped dataframe and the random country tibble.
gapminder %>%
  group_by(country) %>%
  right_join(random_country, by=group_var) %>%
  group_by_(group_var)
Note that this is a temporary solution, as group_by_() is deprecated, and you will see the following warning.
## Warning: group_by_() is deprecated.
## Please use group_by() instead
##
## The 'programming' vignette or the tidyeval book can help you
## to program with group_by() : https://tidyeval.tidyverse.org
## This warning is displayed once per session.
In the long term, we need a better solution that uses tidy evaluation to pass the grouping variable name, stored as a string, to group_by(). Not for now 🙂 . For now, we get the solution we wanted. Here we have two randomly selected countries and all their rows.
## # A tibble: 24 x 7
## # Groups:   country [2]
##    country continent  year lifeExp      pop gdpPercap unique_id
##    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>     <int>
##  1 Ghana   Africa     1952    43.1  5581001      911.         1
##  2 Ghana   Africa     1957    44.8  6391288     1044.         1
##  3 Ghana   Africa     1962    46.5  7355248     1190.         1
##  4 Ghana   Africa     1967    48.1  8490213     1126.         1
##  5 Ghana   Africa     1972    49.9  9354120     1178.         1
##  6 Ghana   Africa     1977    51.8 10538093      993.         1
##  7 Ghana   Africa     1982    53.7 11400338      876.         1
##  8 Ghana   Africa     1987    55.7 14168101      847.         1
##  9 Ghana   Africa     1992    57.5 16278738      925.         1
## 10 Ghana   Africa     1997    58.6 18418288     1005.         1
## # … with 14 more rows
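For readers who want to avoid the deprecated group_by_() today, one possible tidy-evaluation variant of the regrouping step is sketched below. It assumes rlang is available (it is installed as a dplyr dependency) and uses rlang::syms() with the !!! operator to turn the character vector of names into grouping expressions. This is one option among several, not the approach taken in this post, and the use of distinct() and row_number() to build the random country table is a simplification made for the sketch.

```r
library(dplyr)
library(gapminder)

set.seed(42)
group_var <- "country"

# Two random countries, each with an ID (built with distinct() for brevity)
random_country <- gapminder %>%
  distinct(country) %>%
  sample_n(2) %>%
  mutate(unique_id = row_number())

# Keep all rows of the sampled countries, then regroup by the name(s)
# stored in group_var without calling the deprecated group_by_()
sampled <- gapminder %>%
  right_join(random_country, by = group_var) %>%
  group_by(!!!rlang::syms(group_var))

sampled
```

In dplyr 1.0.0 and later, group_by(across(all_of(group_var))) should give the same grouping. Which countries are drawn for a given seed can differ between R versions, so the output may not match the tables shown in this post.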
One can also wrap this into a small function, as shown below:
sample_n_groups = function(grouped_df, size, replace = FALSE, weight=NULL) {
  # names of the grouping variables, as a character vector
  grp_var <- grouped_df %>%
    groups %>%
    unlist %>%
    as.character
  # one row per group, then sample `size` groups and give each an ID
  random_grp <- grouped_df %>%
    summarise() %>%
    sample_n(size, replace, weight) %>%
    mutate(unique_id = 1:NROW(.))
  # keep only the rows of the sampled groups and restore the grouping
  grouped_df %>%
    right_join(random_grp, by=grp_var) %>%
    group_by_(grp_var)
}
And use it with the pipe operator.
set.seed(42)
gapminder %>%
  group_by(country) %>%
  sample_n_groups(2)
We will get the same results as before.
## # A tibble: 24 x 7
## # Groups:   country [2]
##    country continent  year lifeExp      pop gdpPercap unique_id
##    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>     <int>
##  1 Ghana   Africa     1952    43.1  5581001      911.         1
##  2 Ghana   Africa     1957    44.8  6391288     1044.         1
##  3 Ghana   Africa     1962    46.5  7355248     1190.         1
##  4 Ghana   Africa     1967    48.1  8490213     1126.         1
##  5 Ghana   Africa     1972    49.9  9354120     1178.         1
##  6 Ghana   Africa     1977    51.8 10538093      993.         1
##  7 Ghana   Africa     1982    53.7 11400338      876.         1
##  8 Ghana   Africa     1987    55.7 14168101      847.         1
##  9 Ghana   Africa     1992    57.5 16278738      925.         1
## 10 Ghana   Africa     1997    58.6 18418288     1005.         1
## # … with 14 more rows
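When there is only a single grouping variable and no unique ID is needed, a shorter route is possible. The sketch below is an alternative to the join-based function, not a replacement for it: it draws two country values with base R's sample() and keeps the matching rows with filter(). The names picked and sampled are illustrative.

```r
library(dplyr)
library(gapminder)

set.seed(42)

# Pick 2 of the distinct country values at random
picked <- sample(unique(gapminder$country), 2)

# Keep every row that belongs to one of the picked countries
sampled <- gapminder %>%
  filter(country %in% picked) %>%
  group_by(country)

# Each country in gapminder has 12 yearly records, so 2 countries give 24 rows
nrow(sampled)
## [1] 24
```

Unlike sample_n_groups(), this shortcut does not support sampling groups with replacement or with weights, and it does not generalise directly to several grouping variables.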