How to Compute Summary Statistics Across Multiple Columns in R

dplyr’s groupby() function lets you group a dataframe by one or more variables and compute summary statistics on the other variables in a dataframe using summarize function.

Sometimes you might want to compute some summary statistics like mean/median or some other thing on multiple columns. Naive approach is to compute summary statistics by manually doing it one by one. One can immediately see that this is pretty coumbersome and may not possible sometimes.

Thanks to dplyr version 1.0.0, we now have a new function across(), which makes it easy to apply same function or transformation on multiple columns. Let us see an example of using dplyr’s across() and compute on multiple columns.

Let us get started by loading tidyverse, suite of R packages from RStudio.

library("tidyverse")

As before, we will use our favorite fantastic Penguins dataset to illustrate groupby and summary() functions. Let us load the data from cmdlinetips.com’ github page.

path2data <- "https://raw.githubusercontent.com/cmdlinetips/data/master/palmer_penguins.csv"
penguins<- readr::read_csv(path2data)

penguins data has some missing values. Let us remove them using dplyr’s drop_na() function, which removes all rows with one or more missing values.

# remove rows with missing values
penguins <- penguins %>%
      drop_na()

Our dataframe contains both numerical and character values.

head(penguins, n=3)
## # A tibble: 3 x 7
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…

How to Apply Same Function Across Multiple Columns?

Let us consider an example of using across() function to compute summary statistics by specifying the first and last names of columns we want to use.

Here we apply mean function to compute mean values for each of the columns.

penguins %>%
  group_by(species) %>%
  summarise(across(bill_length_mm:body_mass_g, mean))

This approach worked in the above example, because the numerical variables are located continuously in the dataframe.

# # A tibble: 3 x 5
##   species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
## 1 Adelie              38.8          18.3              190.       3706.
## 2 Chinstrap           48.8          18.4              196.       3733.
## 3 Gentoo              47.6          15.0              217.       5092.

How to Compute Summary Statistics on Multiple Columns by Selecting Columns By Type?

A better way to use across() function to compute summary stats on multiple columns is to check the type of column and compute summary statistic.

In the example, below we compute the summary statistics mean if the column is of type numeric. To find all columns that are of type numeric we use “where(is.numeric)”.

penguins %>%
  group_by(species) %>%
  summarise(across(where(is.numeric), mean))

Now we get the same results as before, but this time we did not have think of the names of first and last columns or its order.

# # A tibble: 3 x 5
##   species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
## 1 Adelie              38.8          18.3              190.       3706.
## 2 Chinstrap           48.8          18.4              196.       3733.
## 3 Gentoo              47.6          15.0              217.       5092.

You might be tempted to use just “is.numeric” instead of where(is.numeric), but that option is deprecated and you will see useful warning as shown below.

Predicate functions must be wrapped in `where()`.

  # Bad
  data %>% select(is.numeric)

  # Good
  data %>% select(where(is.numeric))

How to Apply Same Function Across Multiple Columns and Specify Better Column Names?

In the above examples, we saw two ways to compute summary statistics using dplyr’s across() function. However, note that the column names of resulting tibble is same as the original dataframe and it is not meaningful. With dplyr’s across() function we can customize the column names on multiple columns easily and make them right.

In our examples, we applied mean functions on all columns and computed mean values. Therefore, it is meaningful to change column names to reflect that. We add “mean_” to each of the columns using “.names” argument to across() function as shown below.

# Use the .names argument to control the output names
penguins %>%
  group_by(species) %>%
  summarise(across(is.numeric, mean, .names = "mean_{col}"))

Note that each column has the prefix “mean_” now. Using across(), we have applied same function across multiple columns and have got names that are meaningful.

## # A tibble: 3 x 5
##   species  mean_bill_length… mean_bill_depth… mean_flipper_len… mean_body_mass_g
##   <chr>                <dbl>            <dbl>             <dbl>            <dbl>
## 1 Adelie                38.8             18.3              190.            3706.
## 2 Chinstr…              48.8             18.4              196.            3733.
## 3 Gentoo                47.6             15.0              217.            5092.