How To Make Grouped Boxplots with ggplot2?

Customizing Grouped Boxplot in R

Boxplots are a great way to visualize the distributions of multiple variables, and ggplot2 makes it quick to draw good-looking ones. Sometimes a variable of interest has multiple sub-groups. In those situations, it is very useful to visualize the data using “grouped boxplots”. In R, the ggplot2 package offers multiple options for making such grouped boxplots.

Let us load the tidyverse and the gapminder data package.

library(tidyverse)
library(gapminder)

Let us have a peek at the gapminder data set. The gapminder data frame has six columns or variables, therefore it is ideal to illustrate grouped boxplots with ggplot2.

print(head(gapminder, n=3))

  country      year   pop       continent  lifeExp  gdpPercap
  <fctr>       <int>  <dbl>     <fctr>     <dbl>    <dbl>
1 Afghanistan  1952   8425333   Asia       28.801   779.4453
2 Afghanistan  1957   9240934   Asia       30.332   820.8530
3 Afghanistan  1962   10267083  Asia       31.997   853.1007

Let us first make a simple boxplot showing the actual data with jitter.

gapminder %>% 
  ggplot(aes(x=continent,y=lifeExp, fill=continent)) +
  geom_boxplot() + geom_jitter(width=0.1,alpha=0.2) 

Note that we specify the x-axis and y-axis variables in the aesthetics. In addition, we specify “fill=continent” to color our boxplots by continent. Then we add two geom layers: geom_boxplot for showing the boxplot and geom_jitter for showing the data points with jitter.

Simple Boxplot with ggplot2

Making Grouped Boxplot with ggplot2: First Try

Let us make a grouped boxplot using the gapminder dataset with ggplot2. The key idea is to use the fill argument inside ggplot’s aesthetics.

Let us say we want to make a grouped boxplot showing the life expectancy over multiple years for each continent. Our gapminder data frame has a year variable with data from multiple years. Let us make a simpler data frame with data for just three years: 1952, 1987, and 2007. We will use the filter function in dplyr to keep the data for the three years of interest and feed the resulting data frame into the grouped boxplot.
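
As a quick sanity check, the filtering step on its own can be sketched as follows. The name gapminder_3yrs is a hypothetical name used only for this illustration; counting the rows per year should confirm that only the three selected years remain.

```r
library(dplyr)
library(gapminder)

# Keep only the three years of interest
# (gapminder_3yrs is an illustrative name, not part of the original code)
gapminder_3yrs <- gapminder %>%
  filter(year %in% c(1952, 1987, 2007))

# Count the observations per year to confirm the filter worked
gapminder_3yrs %>%
  count(year)
```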

Since we want to use year as the grouping variable, we can simply specify “fill=year” in addition to the x-axis and y-axis variables, and make a boxplot with geom_boxplot(), as shown below. For the sake of simplicity, we use just one geom layer: geom_boxplot().

gapminder %>% 
  filter(year %in% c(1952,1987,2007)) %>%
  ggplot(aes(x=continent, y=lifeExp, fill=year)) +
  geom_boxplot() 

However, the resulting plot is just a simple boxplot with one box per continent, not the grouped boxplot we wanted. The reason lies in the type of the variable “year”: as the output of head(gapminder) shows, year is of “int” type. ggplot2 treats a numeric variable mapped to fill as continuous, so it does not split the data into one box per year.

Grouped Boxplot: First Try
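
To confirm the diagnosis above, the type of the year column can be checked directly. This is a small sketch; year_class is a hypothetical variable name introduced only for this check.

```r
library(gapminder)

# Inspect how R stores the year column
year_class <- class(gapminder$year)
print(year_class)

# A numeric column is treated as continuous when mapped to fill
print(is.numeric(gapminder$year))
```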

Making Grouped Boxplot with ggplot2

In order to make a grouped boxplot using ggplot2, the grouping variable should be categorical, not numerical. We can tell ggplot2 that year is a categorical variable by wrapping it in factor(year) and giving that to the fill argument inside the aesthetics.

gapminder %>% 
  filter(year %in% c(1952,1987,2007)) %>%
  ggplot(aes(x=continent, y=lifeExp, fill=factor(year))) +
  geom_boxplot() 

Now we have a nice grouped boxplot, as we originally intended. For each continent, we have three boxplots, one for each year.

Grouped Boxplot in ggplot2
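
An alternative sketch, not used in the rest of this post, converts year to a factor with mutate() before plotting, so the aesthetic mapping can stay as fill=year. The object name p_grouped is a hypothetical name chosen for this example.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

p_grouped <- gapminder %>%
  filter(year %in% c(1952, 1987, 2007)) %>%
  # Convert year to a factor up front so ggplot2 sees it as categorical
  mutate(year = factor(year)) %>%
  ggplot(aes(x = continent, y = lifeExp, fill = year)) +
  geom_boxplot()

p_grouped
```

With this approach the default legend title is "year" rather than "factor(year)", because ggplot2 labels the legend with the expression mapped to fill.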

Customizing Grouped Boxplot with ggplot2

Let us customize the grouped boxplot a bit. Let us do three simple customizations.

First, note that the legend of the grouped boxplot we just made still says “factor(year)”. Let us fix that by replacing it with just “Year”. We can change the legend title to what we want by using the labs layer with the fill argument: labs(fill = "Year").

Second, let us show the actual data points on the boxplot. Like before, we will jitter the data points. If we simply add geom_point() as a layer, it will show the original data points, but they will not line up with each of the grouped boxplots. Similarly, if we use geom_jitter(), the width of the jitter is a bit hard to adjust. So we use geom_point(position=position_jitterdodge()) here, which dodges the points so that they sit over their own box and jitters them at the same time.

Third, we switch to the black-and-white theme with a larger base font size using theme_bw(base_size = 16).

gapminder %>% 
  filter(year %in% c(1952,1987,2007)) %>%
  ggplot(aes(x=continent,y=lifeExp, fill=factor(year))) +
  geom_boxplot() + 
  labs(fill = "Year") + 
  geom_point(position=position_jitterdodge(),alpha=0.3) +
  theme_bw(base_size = 16)
Customizing Grouped Boxplot in R
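
If the default spread of the points is not to your liking, position_jitterdodge() takes arguments to tune it. The following is a minimal sketch: the values for jitter.width and seed are arbitrary choices for illustration, p_points is a hypothetical object name, and outlier.shape = NA is an optional addition (not part of the code above) that hides the boxplot's own outlier markers so that outliers are not drawn twice.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

p_points <- gapminder %>%
  filter(year %in% c(1952, 1987, 2007)) %>%
  ggplot(aes(x = continent, y = lifeExp, fill = factor(year))) +
  # Hide the boxplot outlier markers; the raw points are drawn below anyway
  geom_boxplot(outlier.shape = NA) +
  geom_point(
    # jitter.width controls the horizontal spread within each box,
    # dodge.width = 0.75 is the default, and seed makes the jitter reproducible
    position = position_jitterdodge(jitter.width = 0.2,
                                    dodge.width = 0.75,
                                    seed = 42),
    alpha = 0.3
  ) +
  labs(fill = "Year") +
  theme_bw(base_size = 16)

p_points
```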

Grouped Boxplots with facets in ggplot2

Another way to make grouped boxplots is to use facets in ggplot2. The faceting functions in ggplot2 offer a general solution for splitting up the data by one or more variables and plotting the subsets of data together. In our case, we can use the function facet_wrap to make grouped boxplots.

Let us make a grouped boxplot such that we have boxplots of lifeExp vs continent for every year. To do that, we start with the aesthetics x=continent and y=lifeExp, make a boxplot with jittered data points, and then add facet_wrap as a layer with year as its argument.

gapminder %>% 
  ggplot(aes(x= continent, y=lifeExp, fill=continent)) +
  geom_boxplot() +
  geom_jitter(width=0.1,alpha=0.2) +
  xlab("Continent")+ 
  facet_wrap(~year) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This will group our data by year and make boxplots for each year.

Grouped Boxplot with facet_wrap: Example 1

Let us make another grouped boxplot with facet_wrap. This time, however, let us facet by selected continents and make boxplots for six selected years within each continent.

So we will use facet_wrap with continent as its argument. Here, we can also specify the number of columns or rows we want to show.


gapminder %>% 
  filter(year %in% c(1952,1962,1972,1982,1992,2002)) %>%
  filter(continent != 'Oceania') %>%
  ggplot(aes(x=factor(year),y=lifeExp, fill=continent)) +
  geom_boxplot() +
  geom_jitter(width=0.1,alpha=0.2) +
  xlab("Year")+ 
  facet_wrap(~continent,ncol = 4) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Grouped Boxplot with facet_wrap: Example 2
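
Since facet_wrap accepts a number of rows as well as a number of columns, here is a sketch of the same plot arranged with nrow instead of ncol. The value nrow = 2 and the name p_facets are arbitrary choices for illustration.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

p_facets <- gapminder %>%
  filter(year %in% c(1952, 1962, 1972, 1982, 1992, 2002)) %>%
  filter(continent != "Oceania") %>%
  ggplot(aes(x = factor(year), y = lifeExp, fill = continent)) +
  geom_boxplot() +
  geom_jitter(width = 0.1, alpha = 0.2) +
  xlab("Year") +
  # Lay out the four remaining continents in two rows instead of one
  facet_wrap(~continent, nrow = 2) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

p_facets
```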