Boxplots are great for visualizing the distributions of multiple variables, and ggplot2 makes it quick to produce beautiful ones. Sometimes a variable of interest has multiple sub-groups. In those situations, it is very useful to visualize the data with “grouped boxplots”. In R, the ggplot2 package offers multiple options for making such grouped boxplots.
Let us load the tidyverse and the gapminder data package.
library(tidyverse)
library(gapminder)
Let us have a peek at the gapminder data set. The gapminder data frame has six columns, or variables, including categorical ones such as continent and numerical ones such as lifeExp, so it is well suited to illustrating grouped boxplots with ggplot2.
print(head(gapminder, n=3))
  country      year      pop continent lifeExp gdpPercap
  <fctr>      <int>    <dbl> <fctr>      <dbl>     <dbl>
1 Afghanistan  1952  8425333 Asia       28.801  779.4453
2 Afghanistan  1957  9240934 Asia       30.332  820.8530
3 Afghanistan  1962 10267083 Asia       31.997  853.1007
Let us first make a simple boxplot showing the actual data with jitter.
gapminder %>%
  ggplot(aes(x=continent, y=lifeExp, fill=continent)) +
  geom_boxplot() +
  geom_jitter(width=0.1, alpha=0.2)
Note that we specify the x-axis and y-axis variables in the aesthetics. In addition, we specify “fill=continent” to color our boxplots by continent. Then we add two geom layers: geom_boxplot for showing the boxplot and geom_jitter for showing the data points with jitter.
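One detail worth noting when combining these two layers: geom_boxplot() draws outliers as points by default, and geom_jitter() then draws every observation, so outlying values show up twice. Here is a minimal sketch of one way to avoid that, assuming you are happy to hide the boxplot's own outlier markers with outlier.shape = NA. The object name p_box is only for illustration.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Hide the boxplot's outlier markers; the jitter layer already shows every point
p_box <- gapminder %>%
  ggplot(aes(x = continent, y = lifeExp, fill = continent)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(width = 0.1, alpha = 0.2)

p_box
```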
Making Grouped Boxplot with ggplot2: First Try
Let us make a grouped boxplot using the gapminder dataset with ggplot. The key idea in making a grouped boxplot is to use the fill argument inside ggplot’s aesthetics.
Let us say we want to make a grouped boxplot showing the life expectancy over multiple years for each continent. Our gapminder data frame has a year variable with data from multiple years. Let us make a simpler data frame with data for just three years: 1952, 1987, and 2007. We will use the filter function in dplyr to keep the data for the three years of interest and feed the resulting data frame into a grouped boxplot.
Since we want to use year as the grouping variable, we can simply specify “fill=year” in addition to the x-axis and y-axis, and make a boxplot with geom_boxplot(), as shown below. For the sake of simplicity, we have just one geom layer: geom_boxplot().
gapminder %>%
  filter(year %in% c(1952,1987,2007)) %>%
  ggplot(aes(x=continent, y=lifeExp, fill=year)) +
  geom_boxplot()
However, the resulting boxplot is just a simple boxplot, not the grouped boxplot we wanted. Something is definitely wrong here. If you look at the type of the variable “year” (see head(gapminder)), you can see that it is of “int” type. ggplot2 treats a numerical fill variable as continuous, and a continuous variable does not split the data into groups. That is the reason we did not get the grouped boxplot.
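A quick way to confirm this diagnosis is to check the column types directly rather than reading them off the printed data frame. The short sketch below uses only base R functions; the variable name year_class is just for illustration.

```r
library(gapminder)

# year is stored as an integer, so ggplot2 treats it as continuous
year_class <- class(gapminder$year)
year_class                       # "integer"
is.numeric(gapminder$year)       # TRUE

# continent, by contrast, is already a factor (categorical)
is.factor(gapminder$continent)   # TRUE
```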
Making Grouped Boxplot with ggplot2
To make a grouped boxplot with ggplot2, the grouping variable should be categorical, not numerical. We can tell ggplot2 that year is a categorical variable by wrapping it in factor(year) and giving that to the fill argument inside the aesthetics.
gapminder %>%
  filter(year %in% c(1952,1987,2007)) %>%
  ggplot(aes(x=continent, y=lifeExp, fill=factor(year))) +
  geom_boxplot()
Now we have a nice grouped boxplot, as we originally intended. For each continent, we have three boxplots, one for each year.
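An equivalent approach, sketched here as an alternative and not as the method used in the rest of this post, is to convert year to a factor with dplyr's mutate before plotting. The names gap_three_years and p_grouped are only illustrative.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

# Keep three years and convert year to a factor up front
gap_three_years <- gapminder %>%
  filter(year %in% c(1952, 1987, 2007)) %>%
  mutate(year = factor(year))

# The factor has one level per retained year
levels(gap_three_years$year)

# fill = year now refers to a categorical variable, so the boxes are grouped
p_grouped <- ggplot(gap_three_years,
                    aes(x = continent, y = lifeExp, fill = year)) +
  geom_boxplot()

p_grouped
```

With this variant the legend title reads “year”, because the aesthetic maps the column directly.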
Customizing Grouped Boxplot with ggplot2
Let us customize the grouped boxplot a bit. Let us do three simple customizations.
First, note that the legend of the grouped boxplot we just made still says “factor(year)”. Let us fix that by replacing it with just Year. We can change the legend title to what we want by using the labs layer with the fill argument: labs(fill = "Year").
Second, let us show the actual data points on the boxplot. Like before, we will jitter the data points. If we simply add geom_point() as a layer, it will show the original data points, but not separately for each grouped boxplot. Similarly, if we use geom_jitter(), the width of the jitter is a bit hard to adjust. So we use geom_point(position=position_jitterdodge()) here. Third, we use theme_bw(base_size = 16) to switch to a black-and-white theme with a larger base font size.
gapminder %>%
  filter(year %in% c(1952,1987,2007)) %>%
  ggplot(aes(x=continent, y=lifeExp, fill=factor(year))) +
  geom_boxplot() +
  labs(fill = "Year") +
  geom_point(position=position_jitterdodge(), alpha=0.3) +
  theme_bw(base_size = 16)
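If the default spread of the points does not suit your plot, position_jitterdodge() takes arguments to control it. The sketch below sets jitter.width, dodge.width, and seed explicitly; the specific values are assumptions chosen for illustration, not recommendations. The name p_points is likewise hypothetical.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

p_points <- gapminder %>%
  filter(year %in% c(1952, 1987, 2007)) %>%
  ggplot(aes(x = continent, y = lifeExp, fill = factor(year))) +
  geom_boxplot() +
  geom_point(
    position = position_jitterdodge(
      jitter.width = 0.2,  # horizontal spread of points within each box
      dodge.width = 0.75,  # shift between groups; 0.75 is the documented default
      seed = 42            # fixed seed so the jitter is reproducible
    ),
    alpha = 0.3
  ) +
  labs(fill = "Year") +
  theme_bw(base_size = 16)

p_points
```

Keeping dodge.width equal to the dodge used by the boxplot layer is what keeps each cloud of points centered on its own box.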
Grouped Boxplots with facets in ggplot2
Another way to make a grouped boxplot is to use facets in ggplot. The faceting functions in ggplot2 offer a general solution for splitting up the data by one or more variables and plotting the subsets together. In our case, we can use the function facet_wrap to make grouped boxplots.
Let us make a grouped boxplot such that we have boxplots of lifeExp vs continent for every year. To do that, we start with aesthetics x=continent, and y=lifeExp and make a boxplot with jittered data points, then we add facet_wrap as layer with year as its argument.
gapminder %>%
  ggplot(aes(x=continent, y=lifeExp, fill=continent)) +
  geom_boxplot() +
  geom_jitter(width=0.1, alpha=0.2) +
  xlab("Continent") +
  facet_wrap(~year) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
This will group our data by year and make boxplots for each year.
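Since facet_wrap(~year) draws one panel per distinct value of year, it can help to check how many panels to expect before plotting. A small sketch using dplyr follows; the names n_panels and rows_per_year are only for illustration.

```r
library(dplyr)
library(gapminder)

# Number of distinct years, which is the number of facet panels
n_panels <- n_distinct(gapminder$year)
n_panels   # 12

# Number of observations (countries) behind each panel
rows_per_year <- gapminder %>% count(year)
rows_per_year
```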
Let us make another grouped boxplot with facet_wrap. This time, however, let us group by a selection of continents and make boxplots for several years within each continent.
So we will use facet_wrap with continent as its argument. Here, we can also specify the number of columns or rows we want to show.
gapminder %>%
  filter(year %in% c(1952,1962,1972,1982,1992,2002)) %>%
  filter(continent != 'Oceania') %>%
  ggplot(aes(x=factor(year), y=lifeExp, fill=continent)) +
  geom_boxplot() +
  geom_jitter(width=0.1, alpha=0.2) +
  xlab("Year") +
  facet_wrap(~continent, ncol = 4) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
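To illustrate the row option, here is a sketch of the same plot arranged with nrow = 2, which lays the four continents out in a 2 x 2 grid. Dropping the legend with legend.position = "none" is an assumption on my part, made because the facet titles already name each continent. The name p_facets is hypothetical.

```r
library(ggplot2)
library(dplyr)
library(gapminder)

p_facets <- gapminder %>%
  filter(year %in% c(1952, 1962, 1972, 1982, 1992, 2002),
         continent != "Oceania") %>%
  ggplot(aes(x = factor(year), y = lifeExp, fill = continent)) +
  geom_boxplot() +
  geom_jitter(width = 0.1, alpha = 0.2) +
  xlab("Year") +
  facet_wrap(~continent, nrow = 2) +   # two rows of panels
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    legend.position = "none"           # facet titles already label the continents
  )

p_facets
```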