dplyr group_by() and summarize(): Group By One or More Variables

August 31, 2020 by cmdlinetips

dplyr is an R package that provides a great set of tools to manipulate datasets in tabular form. dplyr has a set of core functions for “data munging”, including select(), mutate(), filter(), group_by() & summarise(), and arrange().

dplyr’s group_by() function is at the core of Hadley Wickham’s Split-Apply-Combine paradigm, which is useful for most common data analyses.

Many data analysis problems involve the application of a split-apply-combine strategy, where you break up a big problem into manageable pieces, operate on each piece independently and then put all the pieces back together.

The original paper by Hadley Wickham introducing the strategy is a must read.

The group-by operation is at the heart of this useful data analysis strategy. In this tidyverse tutorial, we will learn how to use dplyr’s group_by() and summarise() functions to group a data frame by one or more variables and compute one or more summary statistics. Note that summarise() and summarize() are synonyms in dplyr.

First, we will see how to group a dataframe by a single variable and compute a single summary statistic. Then we will learn how to compute multiple summary values.

Let us get started by loading tidyverse, a suite of R packages from RStudio.

library("tidyverse")

We will use our favorite Palmer Penguins dataset to illustrate the group_by() and summarize() functions. Let us load the data from cmdlinetips.com’s GitHub page.

path2data <- "https://raw.githubusercontent.com/cmdlinetips/data/master/palmer_penguins.csv"
penguins <- readr::read_csv(path2data)
## Parsed with column specification:
## cols(
##   species = col_character(),
##   island = col_character(),
##   bill_length_mm = col_double(),
##   bill_depth_mm = col_double(),
##   flipper_length_mm = col_double(),
##   body_mass_g = col_double(),
##   sex = col_character()
## )
# remove rows with missing values
penguins <- penguins %>% 
  drop_na()
head(penguins)

## # A tibble: 6 x 7
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <chr>   <chr>           <dbl>         <dbl>            <dbl>       <dbl> <chr>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           36.7          19.3              193        3450 fema…
## 5 Adelie  Torge…           39.3          20.6              190        3650 male 
## 6 Adelie  Torge…           38.9          17.8              181        3625 fema…
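Why drop rows with missing values before summarizing? A minimal sketch, using a small made-up data frame (`toy` and its values are hypothetical, not taken from the penguins data): mean() returns NA as soon as one value is missing. drop_na() removes every row that has an NA in any column, which is not always the same as passing na.rm = TRUE to mean().

```r
library(tidyr)

# Toy data with made-up values; NA marks a missing measurement
toy <- data.frame(
  species = c("Adelie", "Gentoo", "Adelie"),
  body_mass_g = c(3750, NA, 3800),
  sex = c("male", "female", NA)
)

# A single NA makes the mean NA
mean_with_na <- mean(toy$body_mass_g)

# na.rm = TRUE ignores NAs in this one column only
mean_na_rm <- mean(toy$body_mass_g, na.rm = TRUE)

# drop_na() removes rows with an NA in ANY column, so only row 1 survives
toy_complete <- drop_na(toy)
mean_complete <- mean(toy_complete$body_mass_g)
```

Here the two approaches give different answers because the third row is missing sex but not body mass. Which one is appropriate depends on the analysis.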

group_by() with a single column

Let us first use group_by() on a single variable in our dataframe. Conceptually, group_by() splits the dataframe into multiple smaller dataframes, one for each distinct value of the grouping variable.

For example, when we use group_by() on the sex variable, which has two values, “male” and “female”, it splits the original dataframe into two smaller dataframes, one for “male” and the other for “female”.

Then the summarize() function computes the summary statistics on each smaller dataframe and gives us a new dataframe with one row per group.

penguins %>% 
  group_by(sex) %>%
  summarize(ave_bill_length_mm=mean(bill_length_mm))

In our example, we get the mean bill length for each value of sex.

## # A tibble: 2 x 2
##   sex    ave_bill_length_mm
##   <chr>               <dbl>
## 1 female               42.1
## 2 male                 45.9
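To make the split-apply-combine idea concrete, here is a rough sketch of the same computation done by hand in base R. It uses a small made-up data frame (`toy` is hypothetical), and it illustrates the concept only. It is not a description of how dplyr implements group_by() internally.

```r
# Toy data frame standing in for the penguins data (values are made up)
toy <- data.frame(
  sex = c("female", "male", "female", "male"),
  bill_length_mm = c(40, 44, 42, 46)
)

# Split: one smaller data frame per value of sex
pieces <- split(toy, toy$sex)

# Apply: compute the mean bill length on each piece
means <- sapply(pieces, function(piece) mean(piece$bill_length_mm))

# Combine: put the per-group results back into one data frame
by_hand <- data.frame(
  sex = names(means),
  ave_bill_length_mm = unname(means)
)
by_hand
```

The group_by() and summarize() pipeline expresses these three steps in two lines and keeps the result as a tibble.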

group_by() with a single variable and multiple summary stats

We can also use group_by() on a single variable and compute summaries of multiple variables. In this example, we group by the species variable and compute two summary statistics, mean flipper length and mean body mass.

penguins %>% 
  group_by(species) %>%
  summarize(ave_flipper_length_mm=mean(flipper_length_mm),
            ave_body_mass_g=mean(body_mass_g))
## # A tibble: 3 x 3
##   species   ave_flipper_length_mm ave_body_mass_g
##   <chr>                     <dbl>           <dbl>
## 1 Adelie                     190.           3706.
## 2 Chinstrap                  196.           3733.
## 3 Gentoo                     217.           5092.
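When the same function is applied to several columns, writing one line per column gets repetitive. One possible alternative, assuming dplyr 1.0.0 or later, is across(). The sketch below uses a made-up data frame (`toy` is hypothetical) and also adds a group size with n().

```r
library(dplyr)

# Toy data (made-up values), standing in for the penguins data
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  flipper_length_mm = c(180, 190, 210, 220),
  body_mass_g = c(3500, 3700, 5000, 5200)
)

toy_summary <- toy %>%
  group_by(species) %>%
  summarize(
    # number of rows in each group
    n_penguins = n(),
    # mean of both columns; .names controls the output column names
    across(c(flipper_length_mm, body_mass_g), mean, .names = "ave_{.col}")
  )
toy_summary
```

The result has one row per species, with the columns n_penguins, ave_flipper_length_mm, and ave_body_mass_g.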

group_by() with multiple variables

We can also use group_by() on multiple variables and use summarize() on multiple variables. In the example below, we group by species and sex and compute two summary stats for each combination of species and sex values.

penguins %>%
  group_by(species,sex) %>%
  summarize(ave_flipper_length_mm=mean(flipper_length_mm),
            ave_body_mass_g=mean(body_mass_g))

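Our resulting tibble has 6 rows, corresponding to the six combinations of species and sex values. The “Groups: species [3]” line in the output shows that the result is still grouped by species, because by default summarize() drops only the last grouping variable.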

## # A tibble: 6 x 4
## # Groups:   species [3]
##   species   sex    ave_flipper_length_mm ave_body_mass_g
##   <chr>     <chr>                  <dbl>           <dbl>
## 1 Adelie    female                  188.           3369.
## 2 Adelie    male                    192.           4043.
## 3 Chinstrap female                  192.           3527.
## 4 Chinstrap male                    200.           3939.
## 5 Gentoo    female                  213.           4680.
## 6 Gentoo    male                    222.           5485.
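If the leftover grouping is not wanted, one option (assuming dplyr 1.0.0 or later, where summarize() has a .groups argument) is to ask for an ungrouped result. A minimal sketch on a made-up data frame (`toy` is hypothetical):

```r
library(dplyr)

# Toy data (made-up values), standing in for the penguins data
toy <- data.frame(
  species = c("Adelie", "Adelie", "Gentoo", "Gentoo"),
  sex = c("female", "male", "female", "male"),
  body_mass_g = c(3400, 4000, 4700, 5500)
)

# Default behaviour: the last grouping variable (sex) is dropped,
# so the result is still grouped by species
still_grouped <- toy %>%
  group_by(species, sex) %>%
  summarize(ave_body_mass_g = mean(body_mass_g))
group_vars(still_grouped)

# .groups = "drop" returns a tibble with no grouping at all
ungrouped <- toy %>%
  group_by(species, sex) %>%
  summarize(ave_body_mass_g = mean(body_mass_g), .groups = "drop")
group_vars(ungrouped)
```

Calling ungroup() after summarize() is another way to get the same ungrouped result.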
