13 Tips to Randomly Select Rows with tidyverse

In this post, we will learn how to randomly sample rows from a data frame, covering the most common scenarios. The tidyverse has a few options for randomly sampling rows from a dataframe; slice_sample() in dplyr is the currently recommended function for randomly selecting rows.

The older dplyr function for random sampling, sample_n(), is superseded by slice_sample(). Note that superseded does not mean the function will go away. As per the lifecycle stage definition,

A superseded function has a known better alternative, but the function itself is not going away

Here we will focus mainly on using the slice_sample() function to randomly select rows from a data frame. The basic syntax of slice_sample() is as follows:

slice_sample(.data, 
             ..., 
             n,
             prop,
             weight_by = NULL,
             replace = FALSE)
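
To see how the n and prop arguments interact, here is a minimal sketch on a made-up tibble (`toy` is a hypothetical stand-in, not the penguin data). As far as I understand the dplyr documentation, leaving out both n and prop falls back to n = 1, while supplying both at once is an error.

```r
library(dplyr)

# Hypothetical toy data, used only for illustration
toy <- tibble(id = 1:10)

# Neither n nor prop supplied: slice_sample() falls back to n = 1
one_row <- slice_sample(toy)
nrow(one_row)

# Supplying both n and prop is rejected; we catch the error
# and return a marker string instead of stopping the script
both <- tryCatch(
  slice_sample(toy, n = 2, prop = 0.5),
  error = function(e) "error"
)
both
```

In practice this means we pick one of the two arguments per call: n for a fixed number of rows, prop for a fraction.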

Using slice_sample(), we will learn how to
* select n rows randomly without and with replacement
* select a proportion of rows randomly without and with replacement
* select n rows per group defined by a variable without and with replacement
* select a proportion of rows per group defined by a variable without and with replacement

Let us get started by loading the packages needed to learn how to use the slice_sample() function in dplyr.

library(tidyverse)
library(palmerpenguins)
theme_set(theme_bw(16))

We will use a toy dataframe with just two columns chosen from the Palmer Penguins dataset to illustrate the use of slice_sample(). Bear with me: we already use slice_sample() here to randomly select 12 rows from the full dataframe.

set.seed(2025)
df <- penguins %>% 
  select(species, body_mass_g) %>%
  slice_sample(n=12)

Now we have a toy dataframe with two columns and 12 rows.

df 

## # A tibble: 12 × 2
##    species   body_mass_g
##    <fct>           <int>
##  1 Chinstrap        3650
##  2 Gentoo           5150
##  3 Gentoo           5500
##  4 Adelie           3550
##  5 Adelie           2850
##  6 Gentoo           4800
##  7 Adelie           3475
##  8 Adelie           3950
##  9 Chinstrap        3775
## 10 Adelie           3600
## 11 Chinstrap        4150
## 12 Chinstrap        3700
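
The set.seed() call above is what makes the random selection repeatable. The sketch below illustrates this on a made-up tibble (`toy` and its `id` column are assumptions for illustration): resetting the seed to the same value before each call gives the same rows.

```r
library(dplyr)

# Hypothetical toy data with 100 rows
toy <- tibble(id = 1:100)

set.seed(2025)
first_draw <- slice_sample(toy, n = 5)

set.seed(2025)  # reset the random number generator to the same state
second_draw <- slice_sample(toy, n = 5)

# Same seed, same call: the two samples contain the same rows
identical(first_draw, second_draw)

# Without resetting the seed, the next draw will usually differ
third_draw <- slice_sample(toy, n = 5)
```

If you run the examples in this post yourself without setting a seed, expect to see different rows each time.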

1. Sample n rows without replacement

To randomly select n rows from a dataframe without replacement, we use slice_sample() with n as argument. In the example below we randomly select 5 rows.

df %>% 
  slice_sample(n=5)

## # A tibble: 5 × 2
##   species   body_mass
##   <chr>         <dbl>
## 1 Gentoo         4400
## 2 Adelie         5000
## 3 Adelie         2200
## 4 Adelie         3400
## 5 Chinstrap      4200

2. Sample n rows with replacement

To randomly select n rows from a dataframe with replacement, we use slice_sample() with n and replace=TRUE as arguments. In the example below we randomly select 5 rows with replacement. Note that sampling with replacement can give us the same row more than once. For example, the 3rd and 4th rows below are duplicates because we sampled with replacement.

df %>% 
  slice_sample(n=5, replace=TRUE)

## # A tibble: 5 × 2
##   species   body_mass
##   <chr>         <dbl>
## 1 Gentoo         4000
## 2 Gentoo         2400
## 3 Chinstrap      2000
## 4 Chinstrap      2000
## 5 Chinstrap      2600
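
One practical consequence of replace=TRUE is that we can ask for more rows than the dataframe has. The sketch below rebuilds the 12-row toy data by hand (named `toy` here so it does not depend on palmerpenguins) and compares the two modes. My understanding of dplyr's behavior is that, without replacement, a request larger than the data is silently truncated to the number of available rows.

```r
library(dplyr)

# The 12 toy rows shown earlier, typed in by hand
toy <- tibble(
  species = c("Chinstrap", "Gentoo", "Gentoo", "Adelie", "Adelie", "Gentoo",
              "Adelie", "Adelie", "Chinstrap", "Adelie", "Chinstrap", "Chinstrap"),
  body_mass_g = c(3650, 5150, 5500, 3550, 2850, 4800,
                  3475, 3950, 3775, 3600, 4150, 3700)
)

# Without replacement we cannot get more rows than exist
no_repl <- slice_sample(toy, n = 20)
nrow(no_repl)    # 12

# With replacement the same row may be drawn repeatedly
with_repl <- slice_sample(toy, n = 20, replace = TRUE)
nrow(with_repl)  # 20

# How many times each original row was drawn
with_repl %>% count(species, body_mass_g, sort = TRUE)
```

Since 20 draws come from only 12 distinct rows, at least one row has to appear more than once in `with_repl`.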

3. Sample a proportion without replacement

To select a proportion of rows instead of a fixed number of rows, we use the prop argument of slice_sample(). For example, to randomly select 50% of the rows from the dataframe without replacement, we use prop=0.5.

df %>% 
  slice_sample(prop=0.5)

## # A tibble: 6 × 2
##   species   body_mass
##   <chr>         <dbl>
## 1 Adelie         3400
## 2 Gentoo         4000
## 3 Chinstrap      2600
## 4 Gentoo         2400
## 5 Chinstrap      4200
## 6 Gentoo         4600

4. Sample a proportion of rows with replacement

To randomly select a proportion of rows with replacement, we use the replace=TRUE argument in addition to the prop argument of slice_sample(). To randomly select 50% of the rows from the dataframe with replacement, we use prop=0.5 and replace=TRUE as arguments to slice_sample().

df %>% 
  slice_sample(prop=0.5, replace=TRUE)

## # A tibble: 6 × 2
##   species   body_mass
##   <chr>         <dbl>
## 1 Chinstrap      2600
## 2 Chinstrap      2000
## 3 Adelie         3600
## 4 Gentoo         2400
## 5 Adelie         3600
## 6 Gentoo         4000

5. Sample n rows weighted by a column (without replacement)

To randomly select n rows weighted by one of the variables in the dataframe, we use the weight_by argument of slice_sample() from dplyr.

In this example, we randomly select 5 rows without replacement, weighted by “body_mass_g”, one of the columns in the dataframe. With the weight_by argument, rows with larger body mass are more likely to be selected, although lighter rows can still show up.

df %>% 
  slice_sample(n=5, weight_by=body_mass_g)

## # A tibble: 5 × 2
##   species   body_mass
##   <chr>         <dbl>
## 1 Gentoo         4600
## 2 Adelie         5000
## 3 Adelie         3400
## 4 Adelie         2200
## 5 Chinstrap      2000

6. Sample n rows weighted by a column (with replacement)

To randomly select n rows with replacement and weighted by a variable, we need to provide three arguments, n, weight_by, and replace=TRUE, to slice_sample() from dplyr.

In this example, we randomly select 5 rows with replacement. Since the sampling is weighted by “body_mass_g”, rows with larger body mass are more likely to be selected.

df %>% 
  slice_sample(n=5, weight_by=body_mass_g, replace=TRUE)

## # A tibble: 5 × 2
##   species   body_mass
##   <chr>         <dbl>
## 1 Gentoo         4400
## 2 Adelie         2200
## 3 Adelie         3600
## 4 Adelie         3400
## 5 Chinstrap      2600
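
With only 5 rows it is hard to see what the weights do. As a rough sketch, we can draw many rows with replacement from a tiny made-up tibble (the names `toy`, `id` and `w` are assumptions for illustration) and look at how often each row comes up. The weights are relative, so a row with weight 8 out of a total of 10 should be picked in roughly 80% of the draws.

```r
library(dplyr)

set.seed(1)

# Hypothetical toy data: two light rows and one heavy row
toy <- tibble(
  id = c("light_a", "light_b", "heavy"),
  w  = c(1, 1, 8)
)

# Many weighted draws with replacement
draws <- slice_sample(toy, n = 2000, weight_by = w, replace = TRUE)

# Share of draws that landed on each row
freq <- draws %>%
  count(id) %>%
  mutate(share = n / sum(n))
freq

heavy_share <- freq$share[freq$id == "heavy"]
```

The heavy row dominates the sample, but the light rows still appear, which is why weight_by makes some rows more likely rather than guaranteeing them.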

7. Sample n rows in each group without replacement

Another common use case of random sampling is to randomly select rows within each group. To randomly select n rows per group, where a group is defined by a variable in the data, we first group the data with group_by() using that variable as the argument. Then we apply slice_sample() to randomly select rows. In the example below, we randomly select 2 rows per “species” group.

df %>% 
  group_by(species) %>%
  slice_sample(n=2)
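
As an alternative sketch, recent versions of dplyr (1.1.0 and later, as far as I know) also accept a by argument in slice_sample(), which groups for this one call only and returns an ungrouped result. The example below assumes such a version is installed and rebuilds the 12 toy rows by hand under the name `toy`.

```r
library(dplyr)

# The 12 toy rows shown earlier, typed in by hand
toy <- tibble(
  species = c("Chinstrap", "Gentoo", "Gentoo", "Adelie", "Adelie", "Gentoo",
              "Adelie", "Adelie", "Chinstrap", "Adelie", "Chinstrap", "Chinstrap"),
  body_mass_g = c(3650, 5150, 5500, 3550, 2850, 4800,
                  3475, 3950, 3775, 3600, 4150, 3700)
)

# Per-call grouping: 2 random rows from each species
# (assumes dplyr >= 1.1.0, where the by argument is available)
per_group <- toy %>%
  slice_sample(n = 2, by = species)
per_group

# Unlike the group_by() version, the result is not a grouped tibble
is_grouped_df(per_group)
```

With group_by() the result stays grouped, so follow it with ungroup() if later steps should treat the data as a whole.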

8. Sample n rows in each group with replacement

To randomly select n rows per group with replacement, we use replace=TRUE and n as arguments to slice_sample() after the group_by() statement. In the example below, we randomly select 2 rows per group with replacement.

df %>% 
  group_by(species) %>%
  slice_sample(n=2, replace=TRUE)

9. Sample a proportion within each group without replacement

Instead of a specific number per group, we can randomly sample a proportion per group using the prop argument of slice_sample() after grouping by a variable.

In this scenario, we may get a different number of rows per group, as each group can have a different number of rows.

df %>% 
  group_by(species) %>%
  slice_sample(prop=.5, replace=FALSE)

## # A tibble: 5 × 2
## # Groups:   species [3]
##   species   body_mass
##   <chr>         <dbl>
## 1 Adelie         3400
## 2 Adelie         3600
## 3 Chinstrap      4200
## 4 Gentoo         4400
## 5 Gentoo         2800
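
To make the per-group sizes explicit, the sketch below uses a made-up tibble with the same group sizes as the output above (4, 3 and 5 rows; `toy` and `row_id` are names assumed for illustration) and counts how many rows each group contributes. As far as I can tell, dplyr rounds prop times the group size down, so a group of 3 rows contributes 1 row at prop = 0.5.

```r
library(dplyr)

# Hypothetical toy data with group sizes 4, 3 and 5
toy <- tibble(
  species = rep(c("Adelie", "Chinstrap", "Gentoo"), times = c(4, 3, 5)),
  row_id  = 1:12
)

# Sample half of each group, then drop the grouping
sampled <- toy %>%
  group_by(species) %>%
  slice_sample(prop = 0.5) %>%
  ungroup()

# Compare original and sampled group sizes side by side
sizes <- toy %>%
  count(species, name = "n_rows") %>%
  left_join(count(sampled, species, name = "n_sampled"), by = "species")
sizes
```

Here the groups contribute 2, 1 and 2 rows, which adds up to the 5 rows seen in the output above.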

10. Sample a proportion within each group with replacement

To randomly select a proportion within each group with replacement, we use the replace=TRUE and prop arguments of slice_sample() after group_by().

df %>% 
  group_by(species) %>%
  slice_sample(prop=.5, replace=TRUE)

## # A tibble: 5 × 2
## # Groups:   species [3]
##   species   body_mass
##   <chr>         <dbl>
## 1 Adelie         3400
## 2 Adelie         5000
## 3 Chinstrap      2000
## 4 Gentoo         4400
## 5 Gentoo         2800

11. Sample random groups without replacement

Sometimes we might be interested in randomly sampling groups defined by one or more variables. The basic idea is to first get the list of groups with the count() function.

# list the groups and their sizes
df %>% 
  count(species)

## # A tibble: 3 × 2
##   species       n
##   <chr>     <int>
## 1 Adelie        4
## 2 Chinstrap     3
## 3 Gentoo        5

We then use slice_sample() to select a fixed number of groups or a proportion of groups. In the example below, we select two groups randomly.

df %>% 
  count(species) %>%
  select(-n)%>%
  slice_sample(n=2)

## # A tibble: 2 × 1
##   species  
##   <chr>    
## 1 Adelie   
## 2 Chinstrap
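
A slightly shorter way to get one row per group is distinct(), which skips the count-then-drop step. This is only an alternative sketch on a made-up tibble (`toy` and `row_id` are assumed names), not a replacement for the approach above.

```r
library(dplyr)

# Hypothetical toy data with three species
toy <- tibble(
  species = rep(c("Adelie", "Chinstrap", "Gentoo"), times = c(4, 3, 5)),
  row_id  = 1:12
)

# One row per species, then pick 2 of the 3 species at random
picked <- toy %>%
  distinct(species) %>%
  slice_sample(n = 2)
picked
```

The result has a single species column with two different groups, just like the count() version.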

12. Sample a proportion of rows from random groups without replacement

Another related scenario is to randomly select a proportion of rows from randomly selected groups. To do this we will use slice_sample() function twice. First to randomly select the groups and next to select rows randomly from the chosen groups.
In the example below, we randomly select two groups first and then join with original data to get all the data from those two groups. After that we use slice_sample() to select 50% of the rows randomly.

set.seed(123)
df %>% 
  count(species) %>%
  select(-n)%>%
  slice_sample(n=2) %>%
  left_join(df, by="species") %>%
  slice_sample(prop=0.5)

## # A tibble: 4 × 2
##   species body_mass
##   <chr>       <dbl>
## 1 Gentoo       4000
## 2 Adelie       2200
## 3 Adelie       3400
## 4 Gentoo       4600
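
Note that the pipeline above takes 50% of the pooled rows from the two chosen groups. If we instead want 50% within each chosen group, one possible variation is to keep the rows of the chosen groups with semi_join() and then group before sampling. The sketch below uses a made-up tibble (`toy`, `row_id` and `picked` are names assumed for illustration).

```r
library(dplyr)

set.seed(123)

# Hypothetical toy data with group sizes 4, 3 and 5
toy <- tibble(
  species = rep(c("Adelie", "Chinstrap", "Gentoo"), times = c(4, 3, 5)),
  row_id  = 1:12
)

# Step 1: pick 2 groups at random
picked <- toy %>%
  distinct(species) %>%
  slice_sample(n = 2)

# Step 2: keep only rows from the picked groups,
# then sample 50% of the rows within each of them
result <- toy %>%
  semi_join(picked, by = "species") %>%
  group_by(species) %>%
  slice_sample(prop = 0.5) %>%
  ungroup()
result
```

Which version to use depends on whether the proportion should apply to the combined rows or to each group separately.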

13. Sample n rows from random groups without replacement

Similarly, to randomly select n rows from randomly chosen groups, we need to use slice_sample() twice. In the example below, we randomly select two groups and then randomly select 5 rows in total. As before, with the first slice_sample() we select random groups and then join them with the original data to get all the rows corresponding to the two groups. After that, we use the second slice_sample() to randomly select 5 rows.

set.seed(123)
df %>% 
  count(species) %>%
  select(-n)%>%
  slice_sample(n=2) %>%
  left_join(df, by="species") %>%
  slice_sample(n=5)

## # A tibble: 5 × 2
##   species body_mass
##   <chr>       <dbl>
## 1 Gentoo       4000
## 2 Adelie       2200
## 3 Adelie       3400
## 4 Gentoo       4600
## 5 Adelie       5000