Python and R Tips

Learn Data Science with Python and R

How to Randomly Select Groups in R with dplyr?

July 24, 2019 by cmdlinetips

Sampling, that is, randomly subsetting your data, is useful in many situations. If you want to sample rows at random without regard to groups, you can use the sample_n() function from dplyr.
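For contrast with the group-level sampling discussed in the rest of this post, here is a minimal sketch of row-level sampling with sample_n(). It uses the gapminder data introduced further down; the seed value and the name random_rows are arbitrary choices made for this illustration.

```r
library(dplyr)
library(gapminder)

# Fix the seed so the same rows are drawn on every run (the value is arbitrary)
set.seed(123)

# Draw 3 rows at random, ignoring which country or continent they belong to
random_rows <- gapminder %>%
  sample_n(3)

random_rows
```

The three rows will typically come from three different countries, which is exactly why this approach does not help when we want whole groups.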

Sometimes, however, you might want to sample one or more groups and keep all the rows within each selected group.

Sampling whole groups like this is slightly trickier.

Thankfully, someone else has already faced the same problem and has a solution. Here is an illustration of randomly sampling groups in R with dplyr.

Let us first load the packages needed. We will use the dataset from the gapminder package.

library(tidyverse)
library(gapminder)

The gapminder dataset has life expectancy (lifeExp), population (pop) and GDP per capita (gdpPercap) for multiple countries, continents and years.

head(gapminder, n=3)

## # A tibble: 3 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333      779.
## 2 Afghanistan Asia       1957    30.3  9240934      821.
## 3 Afghanistan Asia       1962    32.0 10267083      853.

Let us say we want to randomly select 2 of the 142 countries, together with all the data corresponding to these two countries.

Here are step-by-step instructions to sample groups from a dataframe in R.
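Before sampling, it can help to confirm how many groups there are and how many rows each group has. The sketch below is one way to check this; the names n_countries and rows_per_country are made up for this example.

```r
library(dplyr)
library(gapminder)

# Number of distinct countries, i.e. the number of groups we sample from
n_countries <- n_distinct(gapminder$country)
n_countries

# Number of rows (one per recorded year) for each country
rows_per_country <- gapminder %>%
  count(country)

# Smallest and largest group size; equal values mean all groups have the same size
range(rows_per_country$n)
```

If every country has the same number of rows, the sampled result should have exactly twice that many rows when we pick 2 countries.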

Step 1: Select Grouping variable

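First, we need to choose the grouping variable of interest. In this example, our grouping variable is country. The code below groups the data by country and extracts the name of the grouping variable as a character vector.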

group_var <- gapminder %>% 
  group_by(country) %>%
  groups %>%
  unlist %>% 
  as.character

group_var

## [1] "country"
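As a possibly simpler alternative, dplyr also provides group_vars(), which returns the grouping variable names directly as a character vector. This is a sketch rather than the approach used in this post; the name group_var_alt is only for illustration.

```r
library(dplyr)
library(gapminder)

# group_vars() returns the names of the grouping variables as a character vector
group_var_alt <- gapminder %>%
  group_by(country) %>%
  group_vars()

group_var_alt
## [1] "country"
```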

Step 2: Select groups randomly

Next, we need to select two countries randomly, so that we can use them to pull out all the data corresponding to these countries. Here we reduce the data to one row per country with summarise(), randomly select two of those rows, and assign a unique ID to each selected country.

set.seed(42)
random_country <- gapminder %>% 
  group_by(country) %>% 
  summarise() %>% 
  sample_n(2) %>% 
  mutate(unique_id=1:NROW(.))

random_country
## # A tibble: 2 x 2
##   country unique_id
##   <fct>       <int>
## 1 Ghana           1
## 2 Italy           2
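The group_by() plus empty summarise() combination above is used only to get one row per country. A hedged alternative sketch is to use distinct() for that step; the name random_country_alt is hypothetical, and which countries are drawn for a given seed can depend on your R version.

```r
library(dplyr)
library(gapminder)

set.seed(42)
random_country_alt <- gapminder %>%
  distinct(country) %>%                 # one row per country
  sample_n(2) %>%                       # pick 2 of those rows at random
  mutate(unique_id = row_number())      # number the sampled countries 1, 2

random_country_alt
```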

Step 3: Select rows corresponding to the groups

Now that we have the two random countries we need, we use them to select all the rows corresponding to those countries. We do that with a right join between the grouped dataframe and the random country tibble, which keeps only the rows whose country appears in the random country tibble.

gapminder %>% 
  group_by(country)  %>% 
  right_join(random_country, by=group_var) %>%
  group_by_(group_var) 

Note that this is a temporary solution, as group_by_() is deprecated. You will see the following warning.

## Warning: group_by_() is deprecated. 
## Please use group_by() instead
## 
## The 'programming' vignette or the tidyeval book can help you
## to program with group_by() : https://tidyeval.tidyverse.org
## This warning is displayed once per session.

In the long term, we need a better solution that uses tidy evaluation to pass the grouping variable by name. Not for now 🙂. For now, we get the result we wanted: two randomly selected countries and all their rows.

## # A tibble: 24 x 7
## # Groups:   country [2]
##    country continent  year lifeExp      pop gdpPercap unique_id
##    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>     <int>
##  1 Ghana   Africa     1952    43.1  5581001      911.         1
##  2 Ghana   Africa     1957    44.8  6391288     1044.         1
##  3 Ghana   Africa     1962    46.5  7355248     1190.         1
##  4 Ghana   Africa     1967    48.1  8490213     1126.         1
##  5 Ghana   Africa     1972    49.9  9354120     1178.         1
##  6 Ghana   Africa     1977    51.8 10538093      993.         1
##  7 Ghana   Africa     1982    53.7 11400338      876.         1
##  8 Ghana   Africa     1987    55.7 14168101      847.         1
##  9 Ghana   Africa     1992    57.5 16278738      925.         1
## 10 Ghana   Africa     1997    58.6 18418288     1005.         1
## # … with 14 more rows
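As one possible way to avoid the deprecated group_by_(), the final step can select the grouping column by name with across() and all_of(). This is a sketch that assumes dplyr 1.0.0 or later, which is newer than the version this post was written with. It repeats Steps 1 and 2 so that it runs on its own, and the name sampled is made up for this example.

```r
library(dplyr)
library(gapminder)

# Step 1: name of the grouping variable, as a character vector
group_var <- "country"

# Step 2: pick two countries at random, as in the post
set.seed(42)
random_country <- gapminder %>%
  group_by(country) %>%
  summarise() %>%
  sample_n(2) %>%
  mutate(unique_id = 1:NROW(.))

# Step 3: keep all rows of the sampled countries, then regroup by name.
# all_of() turns the character vector into a column selection,
# so no underscore verb is needed.
sampled <- gapminder %>%
  group_by(country) %>%
  right_join(random_country, by = group_var) %>%
  group_by(across(all_of(group_var)))

sampled
```

The countries drawn may differ from the ones shown above depending on your R version, but the result should still hold two countries with all of their rows.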

We can also wrap these steps into a small function, as shown below.

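sample_n_groups <- function(grouped_df, size, replace = FALSE, weight = NULL) {
  # Names of the grouping variables as a character vector
  grp_var <- grouped_df %>%
    groups %>%
    unlist %>%
    as.character
  # One row per group, then draw `size` groups at random and number them
  random_grp <- grouped_df %>%
    summarise() %>%
    sample_n(size, replace, weight) %>%
    mutate(unique_id = 1:NROW(.))
  # Keep every row of the sampled groups and restore the grouping
  grouped_df %>%
    right_join(random_grp, by = grp_var) %>%
    group_by_(grp_var)
}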

We can then use it with the pipe operator.

set.seed(42)
gapminder %>% group_by(country) %>% sample_n_groups(2)

We will get the same results as before.

## # A tibble: 24 x 7
## # Groups:   country [2]
##    country continent  year lifeExp      pop gdpPercap unique_id
##    <fct>   <fct>     <int>   <dbl>    <int>     <dbl>     <int>
##  1 Ghana   Africa     1952    43.1  5581001      911.         1
##  2 Ghana   Africa     1957    44.8  6391288     1044.         1
##  3 Ghana   Africa     1962    46.5  7355248     1190.         1
##  4 Ghana   Africa     1967    48.1  8490213     1126.         1
##  5 Ghana   Africa     1972    49.9  9354120     1178.         1
##  6 Ghana   Africa     1977    51.8 10538093      993.         1
##  7 Ghana   Africa     1982    53.7 11400338      876.         1
##  8 Ghana   Africa     1987    55.7 14168101      847.         1
##  9 Ghana   Africa     1992    57.5 16278738      925.         1
## 10 Ghana   Africa     1997    58.6 18418288     1005.         1
## # … with 14 more rows
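For readers on a recent dplyr, here is a hedged sketch of a variant of the helper that does not rely on group_by_(). It assumes dplyr 1.0.0 or later for slice_sample(), across() and the .groups argument. The name sample_n_groups2 is hypothetical, and the weight argument is left out to keep the sketch short.

```r
library(dplyr)
library(gapminder)

# Hypothetical variant of sample_n_groups() without deprecated verbs
sample_n_groups2 <- function(grouped_df, size, replace = FALSE) {
  # Names of the grouping variables as a character vector
  grp_var <- group_vars(grouped_df)

  random_grp <- grouped_df %>%
    summarise(.groups = "drop") %>%                # one row per group, ungrouped
    slice_sample(n = size, replace = replace) %>%  # draw `size` groups at random
    mutate(unique_id = row_number())               # number the sampled groups

  grouped_df %>%
    right_join(random_grp, by = grp_var) %>%       # keep all rows of sampled groups
    group_by(across(all_of(grp_var)))              # regroup by the same variables
}

set.seed(42)
res <- gapminder %>%
  group_by(country) %>%
  sample_n_groups2(2)

res
```

Dropping the grouping before sampling means the groups are drawn from the full list of groups, which may matter if the input is grouped by more than one variable.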
