How To Compute Column Means in R with tidyverse

Compute Column Means in R with across() and colMeans()

In this bite-sized post, we will see how to compute column means in R using tidyverse. We will cover two scenarios: first a dataframe with no missing values, and then a dataframe that contains missing values.

We will use two R functions to compute column means. First, we will see how to use the across() function in dplyr 1.0.0+ to compute column means, and then we will use base R’s colMeans() function to do the same.

To get started, let us load tidyverse and the data set we will use to compute the mean of each numerical column in a data frame.

library(tidyverse)
library(palmerpenguins)

Let us create two dataframes. The first one has no missing data, because we drop every row that contains a missing value.

data_without_na <- penguins %>%
  select(-year) %>%
  drop_na()

The second dataframe keeps its missing values.

data_with_na <- penguins %>%
  select(-year)
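Before computing any means, it can help to confirm where the missing values are. The following sketch is not part of the original walkthrough. It rebuilds data_with_na and counts the NA entries in each column, using across() with is.na(). The name na_counts is only illustrative.

```r
library(dplyr)
library(palmerpenguins)

# Rebuild the data frame that keeps its missing values
data_with_na <- penguins %>%
  select(-year)

# Count the NA entries in every column (one-row tibble, one value per column)
na_counts <- data_with_na %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

na_counts
```

Any column with a count above zero will need special handling when we compute its mean.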

Computing Column Means on Data without Missing Values using the across() Function in dplyr

Our dataframe contains both numerical and factor variables. To compute the means of all numerical columns, we first use the select() function to keep only the numerical columns. Then we apply the across() function to all remaining columns to compute their mean values. Note that we use across() inside the summarise() function here.

data_without_na %>%
  select(where(is.numeric)) %>%
  summarise(across(everything(), mean))

Since our data does not contain any missing values, we get a tibble with a single row containing the column means.

## # A tibble: 1 x 4
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##            <dbl>         <dbl>             <dbl>       <dbl>
## 1           44.0          17.2              201.       4207.

We can also skip the separate select() step. Here, we select all numerical columns inside the across() function itself and compute their mean values.

data_without_na %>%
  summarise(across(where(is.numeric), mean))

As expected, we get the same results.

## # A tibble: 1 x 4
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##            <dbl>         <dbl>             <dbl>       <dbl>
## 1           44.0          17.2              201.       4207.
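To check that the two approaches really agree, rather than comparing the printed output by eye, one option is to store both results and compare them. This is a minimal sketch. The names means_select and means_across are only illustrative, and drop_na() comes from tidyr, which tidyverse loads.

```r
library(dplyr)
library(tidyr)
library(palmerpenguins)

data_without_na <- penguins %>%
  select(-year) %>%
  drop_na()

# Approach 1: select the numeric columns first, then summarise everything
means_select <- data_without_na %>%
  select(where(is.numeric)) %>%
  summarise(across(everything(), mean))

# Approach 2: pick the numeric columns inside across()
means_across <- data_without_na %>%
  summarise(across(where(is.numeric), mean))

# TRUE when both results hold the same column means
all.equal(as.data.frame(means_select), as.data.frame(means_across))
```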

How to Compute Column Means on Data with Missing Values using the across() Function in dplyr

When our data frame contains missing values, we have to tell mean() explicitly to ignore them. Otherwise, it cannot compute the mean values.

Let us first try to compute column means without asking for the missing values to be removed.

data_with_na %>%
  select(where(is.numeric)) %>%
  summarise(across(everything(), 
                   mean))

We get the following result, where every column’s mean is NA.

## # A tibble: 1 x 4
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##            <dbl>         <dbl>             <dbl>       <dbl>
## 1             NA            NA                NA          NA
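The reason is that mean() returns NA as soon as its input contains a single missing value. A tiny example with a made-up vector x (not part of the penguins data) shows this behavior and the effect of na.rm = TRUE.

```r
# A small hypothetical vector with one missing value
x <- c(1, NA, 3)

mean(x)                # NA: the missing value propagates to the result
mean(x, na.rm = TRUE)  # 2: the NA is dropped, leaving the mean of 1 and 3
```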

To ignore missing values, we add the “na.rm = TRUE” argument inside across(), which passes it on to mean().

data_with_na %>%
  select(where(is.numeric)) %>%
  summarise(across(everything(), 
                   mean,
                   na.rm = TRUE))

And we get the column means as expected.

## # A tibble: 1 x 4
##   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##            <dbl>         <dbl>             <dbl>       <dbl>
## 1           43.9          17.2              201.       4202.

As before, we can skip the separate select() statement and compute the means of the numerical columns directly with across().

data_with_na %>%
  summarise(across(where(is.numeric),
                   mean,
                   na.rm = TRUE))
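One caveat for newer setups: dplyr 1.1.0 and later deprecate passing extra arguments such as na.rm through across() and recommend wrapping the call in an anonymous function instead. A sketch of that form, assuming the same data_with_na as above (means_lambda is just an illustrative name):

```r
library(dplyr)
library(palmerpenguins)

data_with_na <- penguins %>%
  select(-year)

# Wrap mean() in an anonymous function so na.rm is set explicitly
means_lambda <- data_with_na %>%
  summarise(across(where(is.numeric),
                   function(x) mean(x, na.rm = TRUE)))

means_lambda
```

This should give the same column means as the version above, without the deprecation warning on recent dplyr versions.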

How to Compute Column Means with the colMeans() Function

Another easy approach is to use base R’s colMeans() function. Here we select the numerical columns first and then call colMeans() with the na.rm argument, so that missing data is ignored when the means are computed. Unlike summarise(), colMeans() returns a named numeric vector rather than a tibble.

data_with_na %>%
  select(where(is.numeric)) %>%
  colMeans(na.rm = TRUE)

##    bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
##          43.92193          17.15117         200.91520        4201.75439
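To confirm that colMeans() and across() agree, we can flatten the one-row tibble into a named vector and compare the two results. This is a sketch with illustrative names (numeric_cols, means_vec, means_tbl), not code from the walkthrough above.

```r
library(dplyr)
library(palmerpenguins)

numeric_cols <- penguins %>%
  select(-year) %>%
  select(where(is.numeric))

# colMeans() returns a named numeric vector
means_vec <- colMeans(numeric_cols, na.rm = TRUE)

# across() inside summarise() returns a one-row tibble
means_tbl <- numeric_cols %>%
  summarise(across(everything(), function(x) mean(x, na.rm = TRUE)))

# unlist() flattens the tibble into a named vector for the comparison
all.equal(means_vec, unlist(means_tbl))
```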

In summary, we saw how to use two functions in R, dplyr’s across() and base R’s colMeans(), to compute the means of numerical columns with and without missing data.