• Skip to secondary menu
  • Skip to main content
  • Skip to primary sidebar

Python and R Tips

Learn Data Science with Python and R

  • Home
  • Python
  • Pandas
    • Pandas 101
  • tidyverse
    • tidyverse 101
  • R
  • Linux
  • Conferences
  • Python Books
  • About
    • Privacy Policy
You are here: Home / R / tidyverse 101 / How to Compute Summary Statistics Across Multiple Columns in R

How to Compute Summary Statistics Across Multiple Columns in R

October 9, 2020 by cmdlinetips

dplyr’s groupby() function lets you group a dataframe by one or more variables and compute summary statistics on the other variables in a dataframe using summarize function.

Sometimes you might want to compute some summary statistics like mean/median or some other thing on multiple columns. Naive approach is to compute summary statistics by manually doing it one by one. One can immediately see that this is pretty coumbersome and may not possible sometimes.

Thanks to dplyr version 1.0.0, we now have a new function across(), which makes it easy to apply same function or transformation on multiple columns. Let us see an example of using dplyr’s across() and compute on multiple columns.

Let us get started by loading tidyverse, suite of R packages from RStudio.

library("tidyverse")

As before, we will use our favorite fantastic Penguins dataset to illustrate groupby and summary() functions. Let us load the data from cmdlinetips.com’ github page.

path2data <- "https://raw.githubusercontent.com/cmdlinetips/data/master/palmer_penguins.csv"
penguins<- readr::read_csv(path2data)

penguins data has some missing values. Let us remove them using dplyr’s drop_na() function, which removes all rows with one or more missing values.

# remove rows with missing values
penguins <- penguins %>%
      drop_na()

Our dataframe contains both numerical and character values.

head(penguins, n=3)
## # A tibble: 3 x 7
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…

How to Apply Same Function Across Multiple Columns?

Let us consider an example of using across() function to compute summary statistics by specifying the first and last names of columns we want to use.

Here we apply mean function to compute mean values for each of the columns.

penguins %>%
  group_by(species) %>%
  summarise(across(bill_length_mm:body_mass_g, mean))

This approach worked in the above example, because the numerical variables are located continuously in the dataframe.

# # A tibble: 3 x 5
##   species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
## 1 Adelie              38.8          18.3              190.       3706.
## 2 Chinstrap           48.8          18.4              196.       3733.
## 3 Gentoo              47.6          15.0              217.       5092.

How to Compute Summary Statistics on Multiple Columns by Selecting Columns By Type?

A better way to use across() function to compute summary stats on multiple columns is to check the type of column and compute summary statistic.

In the example, below we compute the summary statistics mean if the column is of type numeric. To find all columns that are of type numeric we use “where(is.numeric)”.

penguins %>%
  group_by(species) %>%
  summarise(across(where(is.numeric), mean))

Now we get the same results as before, but this time we did not have think of the names of first and last columns or its order.

# # A tibble: 3 x 5
##   species   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
## 1 Adelie              38.8          18.3              190.       3706.
## 2 Chinstrap           48.8          18.4              196.       3733.
## 3 Gentoo              47.6          15.0              217.       5092.

You might be tempted to use just “is.numeric” instead of where(is.numeric), but that option is deprecated and you will see useful warning as shown below.

Predicate functions must be wrapped in `where()`.

  # Bad
  data %>% select(is.numeric)

  # Good
  data %>% select(where(is.numeric))

How to Apply Same Function Across Multiple Columns and Specify Better Column Names?

In the above examples, we saw two ways to compute summary statistics using dplyr’s across() function. However, note that the column names of resulting tibble is same as the original dataframe and it is not meaningful. With dplyr’s across() function we can customize the column names on multiple columns easily and make them right.

In our examples, we applied mean functions on all columns and computed mean values. Therefore, it is meaningful to change column names to reflect that. We add “mean_” to each of the columns using “.names” argument to across() function as shown below.

# Use the .names argument to control the output names
penguins %>%
  group_by(species) %>%
  summarise(across(is.numeric, mean, .names = "mean_{col}"))

Note that each column has the prefix “mean_” now. Using across(), we have applied same function across multiple columns and have got names that are meaningful.

## # A tibble: 3 x 5
##   species  mean_bill_length… mean_bill_depth… mean_flipper_len… mean_body_mass_g
##   <chr>                <dbl>            <dbl>             <dbl>            <dbl>
## 1 Adelie                38.8             18.3              190.            3706.
## 2 Chinstr…              48.8             18.4              196.            3733.
## 3 Gentoo                47.6             15.0              217.            5092.

Share this:

  • Click to share on Facebook (Opens in new window) Facebook
  • Click to share on X (Opens in new window) X

Related posts:

4 ways to select columns with dplyr select()4 ways to select columns from a dataframe with dplyr’s select() Default Thumbnaildplyr groupby() and summarize(): Group By One or More Variables dplyr select(): How to Select Columns?dplyr select(): Select one or more variables from a dataframe Categorize columns with across()How To Categorize Multiple Numerical Columns in R

Filed Under: dplyr across(), tidyverse 101 Tagged With: dplyr across(), R, tidyverse 101

Primary Sidebar

Subscribe to Python and R Tips and Learn Data Science

Learn Pandas in Python and Tidyverse in R

Tags

Altair Basic NumPy Book Review Data Science Data Science Books Data Science Resources Data Science Roundup Data Visualization Dimensionality Reduction Dropbox Dropbox Free Space Dropbox Tips Emacs Emacs Tips ggplot2 Linux Commands Linux Tips Mac Os X Tips Maximum Likelihood Estimation in R MLE in R NumPy Pandas Pandas 101 Pandas Dataframe Pandas Data Frame pandas groupby() Pandas select columns Pandas select_dtypes Python Python 3 Python Boxplot Python Tips R rstats R Tips Seaborn Seaborn Boxplot Seaborn Catplot Shell Scripting Sparse Matrix in Python tidy evaluation tidyverse tidyverse 101 Vim Vim Tips

RSS RSS

  • How to convert row names to a column in Pandas
  • How to resize an image with PyTorch
  • Fashion-MNIST data from PyTorch
  • Pandas case_when() with multiple examples
  • An Introduction to Statistical Learning: with Applications in Python Is Here
  • 10 Tips to customize ggplot2 title text
  • 8 Plot types with Matplotlib in Python
  • PCA on S&P 500 Stock Return Data
  • Linear Regression with Matrix Decomposition Methods
  • Numpy’s random choice() function

Copyright © 2025 · Lifestyle Pro on Genesis Framework · WordPress · Log in

Go to mobile version