In this post, we will learn how to randomly sample rows from a data frame in the most common scenarios. The tidyverse offers a few ways to do this, and slice_sample() in dplyr is the currently recommended function for randomly selecting rows.
The older dplyr function for random sampling, sample_n(), has been superseded by slice_sample(). Note that superseded does not mean the function will go away. As per the lifecycle stage definition,
A superseded function has a known better alternative, but the function itself is not going away
Here we will focus mainly on using the slice_sample() function to randomly select rows from a data frame. The basic syntax for slice_sample() is as follows:
slice_sample(.data, ..., n, prop, weight_by = NULL, replace = FALSE)
Using slice_sample(), we will learn how to
* select n rows randomly without and with replacement
* select a proportion of rows randomly without and with replacement
* select n rows per group defined by a variable without and with replacement
* select a proportion of rows per group defined by a variable without and with replacement
Let us get started by loading the packages needed to learn how to use the slice_sample() function in dplyr.
library(tidyverse)
library(palmerpenguins)
theme_set(theme_bw(16))
We will use a toy dataframe with just two columns chosen from the Palmer penguins dataset to illustrate the use of slice_sample(). Bear with me on using slice_sample() here, before we have covered it, to randomly select 12 rows from the dataframe.
set.seed(2025)
df <- penguins %>%
  select(species, body_mass_g) %>%
  slice_sample(n=12)
Now we have a toy dataframe with two columns and 12 rows.
df
## # A tibble: 12 × 2
##    species   body_mass_g
##    <fct>           <int>
##  1 Chinstrap        3650
##  2 Gentoo           5150
##  3 Gentoo           5500
##  4 Adelie           3550
##  5 Adelie           2850
##  6 Gentoo           4800
##  7 Adelie           3475
##  8 Adelie           3950
##  9 Chinstrap        3775
## 10 Adelie           3600
## 11 Chinstrap        4150
## 12 Chinstrap        3700
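Because slice_sample() draws rows at random, the toy dataframe above is only reproducible when the seed is fixed first. The short sketch below illustrates this with a hypothetical one-column tibble (the names toy and id are made up for illustration and are not part of the original example): resetting the same seed before each call gives the same rows.

```r
library(dplyr)

# Hypothetical stand-in data: 12 rows identified by an id column
toy <- tibble(id = 1:12)

# Same seed before each draw, so both draws pick the same rows
set.seed(2025)
first_draw <- toy %>% slice_sample(n = 5)

set.seed(2025)
second_draw <- toy %>% slice_sample(n = 5)

identical(first_draw, second_draw)
```

Without the second set.seed() call, the two draws would generally differ.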
1. Sample n rows without replacement
To randomly select n rows from a dataframe without replacement, we use slice_sample() with n as argument. In the example below we randomly select 5 rows.
df %>%
  slice_sample(n=5)
## # A tibble: 5 × 2
##   species   body_mass
##   <chr>         <dbl>
## 1 Gentoo         4400
## 2 Adelie         5000
## 3 Adelie         2200
## 4 Adelie         3400
## 5 Chinstrap      4200
2. Sample n rows with replacement
To randomly select n rows from a dataframe with replacement, we use slice_sample() with n and replace=TRUE as arguments. In the example below we randomly select 5 rows with replacement. Note that sampling with replacement can give us the same row more than once. For example, the 3rd and 4th rows below are duplicates because we sampled with replacement.
df %>%
  slice_sample(n=5, replace=TRUE)
## # A tibble: 5 × 2
##   species   body_mass
##   <chr>         <dbl>
## 1 Gentoo         4000
## 2 Gentoo         2400
## 3 Chinstrap      2000
## 4 Chinstrap      2000
## 5 Chinstrap      2600
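A related difference shows up when n is larger than the number of rows. As a rough sketch (again using a hypothetical 12-row tibble named toy), sampling without replacement cannot return more rows than exist, so the result is capped at the data size, while replace=TRUE can return as many rows as requested.

```r
library(dplyr)

# Hypothetical stand-in data with 12 rows
toy <- tibble(id = 1:12)

set.seed(2025)

# Without replacement: asking for 20 rows returns at most the 12 available
n_without <- toy %>% slice_sample(n = 20) %>% nrow()

# With replacement: rows can repeat, so all 20 requested rows are returned
n_with <- toy %>% slice_sample(n = 20, replace = TRUE) %>% nrow()

c(without = n_without, with = n_with)
```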
3. Sample a proportion without replacement
To select a proportion of rows instead of a fixed number of rows, we use the prop argument of slice_sample(). To randomly select 50% of the rows from the dataframe without replacement, we pass prop=0.5 to slice_sample().
df %>%
  slice_sample(prop=0.5)
## # A tibble: 6 × 2
##   species   body_mass
##   <chr>         <dbl>
## 1 Adelie         3400
## 2 Gentoo         4000
## 3 Chinstrap      2600
## 4 Gentoo         2400
## 5 Chinstrap      4200
## 6 Gentoo         4600
4. Sample a proportion of rows with replacement
To randomly select a proportion of rows with replacement, we use the replace=TRUE argument in addition to the prop argument of slice_sample(). To randomly select 50% of the rows from the dataframe with replacement, we pass prop=0.5 and replace=TRUE to slice_sample().
df %>%
  slice_sample(prop=0.5, replace=TRUE)
## # A tibble: 6 × 2
##   species   body_mass
##   <chr>         <dbl>
## 1 Chinstrap      2600
## 2 Chinstrap      2000
## 3 Adelie         3600
## 4 Gentoo         2400
## 5 Adelie         3600
## 6 Gentoo         4000
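One place where a proportion with replacement comes in handy is a bootstrap-style resample, where we draw as many rows as the data has but allow repeats. The sketch below assumes a hypothetical 12-row tibble named toy; it is only an illustration of prop=1 combined with replace=TRUE, not something from the examples above.

```r
library(dplyr)

# Hypothetical stand-in data with 12 rows
toy <- tibble(id = 1:12)

set.seed(2025)

# prop = 1 with replacement: same number of rows as the data,
# but some rows may appear several times and others not at all
boot <- toy %>% slice_sample(prop = 1, replace = TRUE)

nrow(boot)            # same size as toy
n_distinct(boot$id)   # typically fewer than 12 distinct rows
```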
5. Sample n rows weighted by a column (without replacement)
To randomly select n rows weighted by one of the variables in the dataframe, we use the weight_by argument of slice_sample() from dplyr.
In this example, we randomly select 5 rows without replacement, weighted by body_mass_g, one of the columns in the dataframe. With weight_by, rows with larger body mass are more likely to be selected.
df %>%
  slice_sample(n=5, weight_by=body_mass_g)
## # A tibble: 5 × 2
##   species   body_mass
##   <chr>         <dbl>
## 1 Gentoo         4600
## 2 Adelie         5000
## 3 Adelie         3400
## 4 Adelie         2200
## 5 Chinstrap      2000
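To see that weight_by changes the chance of a row being picked rather than simply sorting by the column, here is a small sketch with made-up weights (the tibble toy and its columns id and w are hypothetical). Rows with a weight of zero can never be drawn, so only the two rows with positive weight can appear in the result.

```r
library(dplyr)

# Hypothetical data: rows 1 and 4 have zero weight
toy <- tibble(
  id = 1:4,
  w  = c(0, 1, 1, 0)
)

set.seed(2025)

# Only rows with positive weight are eligible, so ids 2 and 3 are returned
picked <- toy %>% slice_sample(n = 2, weight_by = w)

sort(picked$id)
```

Note that n here is kept no larger than the number of rows with positive weight; asking for more without replacement would leave nothing eligible to draw.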
6. Sample n rows weighted by a column (with replacement)
To randomly select n rows with replacement and weighted by a variable, we need to provide three arguments, n, weight_by, and replace=TRUE, to slice_sample() from dplyr.
In this example, we randomly select 5 rows with replacement. Since the sampling is weighted by body_mass_g, rows with larger body mass are more likely to be drawn, possibly more than once.
df %>%
  slice_sample(n=5, weight_by=body_mass_g, replace=TRUE)
## # A tibble: 5 × 2
##   species   body_mass
##   <chr>         <dbl>
## 1 Gentoo         4400
## 2 Adelie         2200
## 3 Adelie         3600
## 4 Adelie         3400
## 5 Chinstrap      2600
7. Sample n rows in each group without replacement
Another common use case of random sampling is to randomly select rows within each group. To randomly select n rows per group, where a group is defined by a variable in the data, we first group the data with group_by() using that variable as the argument. Then we apply slice_sample() to randomly select rows. In the example below, we randomly select 2 rows per “species” group.
df %>%
  group_by(species) %>%
  slice_sample(n=2)
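As an aside, recent versions of dplyr (1.1.0 and later, as far as I know) also accept a by argument in slice_sample(), which groups only for that one call and returns an ungrouped result. The sketch below assumes such a dplyr version is installed and rebuilds the toy data by hand so it runs on its own; the name toy is just for illustration.

```r
library(dplyr)

# Toy data rebuilt from the 12-row dataframe shown earlier
toy <- tibble(
  species = c(rep("Adelie", 5), rep("Chinstrap", 4), rep("Gentoo", 3)),
  body_mass_g = c(3550, 2850, 3475, 3950, 3600,
                  3650, 3775, 4150, 3700,
                  5150, 5500, 4800)
)

set.seed(2025)

# Assumes dplyr >= 1.1.0: per-call grouping via `by`, no group_by() needed
sampled <- toy %>% slice_sample(n = 2, by = species)

# Two rows per species, and the result is not a grouped tibble
sampled %>% count(species)
```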
8. Sample n rows in each group with replacement
To randomly select n rows per group with replacement, we pass n and replace=TRUE to slice_sample() after the group_by() statement. In the example below, we randomly select 2 rows per group with replacement.
df %>%
  group_by(species) %>%
  slice_sample(n=2, replace=TRUE)
9. Sample a proportion within each group without replacement
Instead of a specific number per group, we can randomly sample a proportion per group using the prop argument of slice_sample() after grouping by a variable.
In this scenario, we may get a different number of rows per group, as each group can have a different number of rows.
df %>%
  group_by(species) %>%
  slice_sample(prop=.5, replace=FALSE)
## # A tibble: 5 × 2
## # Groups:   species [3]
##   species   body_mass
##   <chr>         <dbl>
## 1 Adelie         3400
## 2 Adelie         3600
## 3 Chinstrap      4200
## 4 Gentoo         4400
## 5 Gentoo         2800
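When the group size times prop is not a whole number, my understanding is that dplyr rounds the per-group count down. The sketch below checks this on a hand-built copy of the toy data (the name toy is for illustration only), where the groups have 5, 4, and 3 rows.

```r
library(dplyr)

# Toy data rebuilt by hand: 5 Adelie, 4 Chinstrap, 3 Gentoo
toy <- tibble(
  species = c(rep("Adelie", 5), rep("Chinstrap", 4), rep("Gentoo", 3)),
  body_mass_g = c(3550, 2850, 3475, 3950, 3600,
                  3650, 3775, 4150, 3700,
                  5150, 5500, 4800)
)

set.seed(2025)

# Sample half of each group, then count how many rows each group kept
per_group <- toy %>%
  group_by(species) %>%
  slice_sample(prop = 0.5) %>%
  ungroup() %>%
  count(species)

# With rounding down: 5 * 0.5 -> 2, 4 * 0.5 -> 2, 3 * 0.5 -> 1
per_group
```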
10. Sample a proportion within each group with replacement
To randomly select a proportion within each group with replacement, we pass replace=TRUE and prop to slice_sample() after group_by().
df %>%
  group_by(species) %>%
  slice_sample(prop=.5, replace=TRUE)
## # A tibble: 5 × 2
## # Groups:   species [3]
##   species   body_mass
##   <chr>         <dbl>
## 1 Adelie         3400
## 2 Adelie         5000
## 3 Chinstrap      2000
## 4 Gentoo         4400
## 5 Gentoo         2800
11. Sample random groups without replacement
Sometimes we might be interested in randomly sampling groups defined by one or more variables. The basic idea is to first list the groups with the count() function.
#sample groups randomly
df %>%
  count(species)
## # A tibble: 3 × 2
##   species       n
##   <chr>     <int>
## 1 Adelie        4
## 2 Chinstrap     3
## 3 Gentoo        5
And then use slice_sample() to select a fixed number of groups or a proportion of groups. In the example below, we select two groups randomly.
df %>%
  count(species) %>%
  select(-n) %>%
  slice_sample(n=2)
## # A tibble: 2 × 1
##   species
##   <chr>
## 1 Adelie
## 2 Chinstrap
12. Randomly Sample a proportion of rows from random groups without replacement
Another related scenario is to randomly select a proportion of rows from randomly selected groups. To do this we will use slice_sample() function twice. First to randomly select the groups and next to select rows randomly from the chosen groups.
In the example below, we randomly select two groups first and then join with original data to get all the data from those two groups. After that we use slice_sample() to select 50% of the rows randomly.
set.seed(123)
df %>%
  count(species) %>%
  select(-n) %>%
  slice_sample(n=2) %>%
  left_join(df, by="species") %>%
  slice_sample(prop=0.5)
## # A tibble: 4 × 2
##   species body_mass
##   <chr>       <dbl>
## 1 Gentoo       4000
## 2 Adelie       2200
## 3 Adelie       3400
## 4 Gentoo       4600
13. Randomly Sample n rows from random groups without replacement
Similarly, to randomly select n rows from randomly chosen groups, we need to use slice_sample() twice. In the example below, we randomly select two groups and then randomly select 5 rows in total. As before, the first slice_sample() selects random groups, and we then join the groups with the original data to get all the rows corresponding to the two groups. After that, the second slice_sample() randomly selects 5 rows.
set.seed(123)
df %>%
  count(species) %>%
  select(-n) %>%
  slice_sample(n=2) %>%
  left_join(df, by="species") %>%
  slice_sample(n=5)
## # A tibble: 5 × 2
##   species body_mass
##   <chr>       <dbl>
## 1 Gentoo       4000
## 2 Adelie       2200
## 3 Adelie       3400
## 4 Gentoo       4600
## 5 Adelie       5000
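In the last two examples, the second slice_sample() draws from the combined rows of the chosen groups, so one group can end up with more of the sampled rows than the other. If we instead want the sampling done separately inside each chosen group, one possible variation is sketched below. It is an alternative to, not a restatement of, the approach above: it uses distinct() to list the groups, semi_join() to keep only rows from the chosen groups, and group_by() before the second slice_sample(). The names toy, chosen, and sampled are made up for this sketch.

```r
library(dplyr)

# Toy data rebuilt by hand: 5 Adelie, 4 Chinstrap, 3 Gentoo
toy <- tibble(
  species = c(rep("Adelie", 5), rep("Chinstrap", 4), rep("Gentoo", 3)),
  body_mass_g = c(3550, 2850, 3475, 3950, 3600,
                  3650, 3775, 4150, 3700,
                  5150, 5500, 4800)
)

set.seed(123)

# Step 1: pick two groups at random
chosen <- toy %>%
  distinct(species) %>%
  slice_sample(n = 2)

# Step 2: keep rows from the chosen groups only,
# then sample 50% of the rows within each of those groups
sampled <- toy %>%
  semi_join(chosen, by = "species") %>%
  group_by(species) %>%
  slice_sample(prop = 0.5) %>%
  ungroup()

sampled %>% count(species)
```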