The boxplot, introduced by John Tukey in his classic book Exploratory Data Analysis close to 50 years ago, is great for visualizing data distributions from multiple groups. A boxplot captures a summary of the data efficiently with a simple box and whiskers and makes it easy to compare across groups. It summarizes a sample using the 25th, 50th, and 75th percentiles, also known as the lower quartile, median, and upper quartile. The advantage of comparing quartiles is that they are far less sensitive to outliers than summaries such as the mean.
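To make the point about outliers concrete, here is a minimal sketch using Python's standard statistics module. The numbers are made up for illustration and are not gapminder data. Replacing the largest value with an extreme one leaves all three quartiles unchanged, while the mean shifts a lot.

```python
import statistics

# Hypothetical life-expectancy-like values (illustrative only)
values = [61, 65, 70, 72, 74, 76, 79, 80, 82]
# Same sample, but the largest value is replaced by an extreme outlier
with_outlier = values[:-1] + [820]

# Lower quartile, median, and upper quartile (25th, 50th, 75th percentiles)
q_clean = statistics.quantiles(values, n=4, method="inclusive")
q_outlier = statistics.quantiles(with_outlier, n=4, method="inclusive")

print("quartiles without outlier:", q_clean)
print("quartiles with outlier:   ", q_outlier)
print("means:", statistics.mean(values), statistics.mean(with_outlier))
```

Note that different tools use slightly different interpolation rules for percentiles, so quartiles computed elsewhere may differ a little on small samples.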
If you are interested in learning more about the history and evolution of boxplots, check out Hadley Wickham’s 2011 paper 40 years of Boxplots.
In this post, we will see how to make boxplots using Python’s Pandas and Seaborn. Let us first load the packages needed to plot boxplots in Python.
# import pandas
import pandas as pd
# import matplotlib
import matplotlib.pyplot as plt
# import seaborn
import seaborn as sns
%matplotlib inline
Let us load the gapminder data to make boxplots. We will download it directly from the Software Carpentry GitHub page. Pandas’ read_csv can load the data from a URL straight into a dataframe.
data_url = 'http://bit.ly/2cLzoxH'
# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)
print(gapminder.head(3))
Let us filter the gapminder data to keep all countries but only the year 2007. We will use pandas to subset the original dataframe.
gapminder_2007 = gapminder[gapminder['year']==2007]
gapminder_2007.shape
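If you want to see what the filter does without downloading anything, here is a small offline sketch. The table below is a hand-written stand-in: the column names are assumed to mirror the gapminder file, and all values are invented.

```python
import io

import pandas as pd

# A tiny stand-in for the gapminder table (column names assumed, values made up)
csv_text = """country,year,pop,continent,lifeExp,gdpPercap
A,2002,100,Africa,50.0,900.0
A,2007,110,Africa,52.0,950.0
B,2002,200,Europe,78.0,30000.0
B,2007,205,Europe,79.5,32000.0
"""
toy = pd.read_csv(io.StringIO(csv_text))

# Boolean mask: True for the rows where year equals 2007
mask = toy["year"] == 2007

# Indexing with the mask keeps only the rows marked True
toy_2007 = toy[mask]

print(toy_2007.shape)  # (number of rows, number of columns)
```

The comparison produces a True/False value per row, and indexing the dataframe with it keeps every column but only the matching rows.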
We will plot boxplots in four ways: first with Pandas’ boxplot function, and then with the Seaborn plotting library in three ways to get a much improved boxplot.
How to Make Boxplots with Pandas
Pandas has some plotting capabilities of its own. Once you have created a pandas dataframe, you can use its plotting methods directly to plot things quickly. One way to make a boxplot from a dataframe is the boxplot function that is part of pandas. Say we want a boxplot of life expectancy by continent; we would use pandas like this:
gapminder_2007.boxplot(by='continent', column=['lifeExp'], grid=False)
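It can help to look at the numbers behind the boxes. The sketch below computes the three quartiles per group with groupby and quantile on a made-up dataframe (the values are illustrative, not gapminder data). With the default settings, these should correspond to the bottom edge, middle line, and top edge of each box.

```python
import pandas as pd

# Made-up life expectancy values for two groups (illustrative only)
toy = pd.DataFrame({
    "continent": ["Africa"] * 5 + ["Europe"] * 5,
    "lifeExp": [50, 54, 58, 62, 66, 74, 76, 78, 80, 82],
})

# 25th, 50th, and 75th percentiles per continent, one row per group
quartiles = (
    toy.groupby("continent")["lifeExp"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)

print(quartiles)
```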
The pandas boxplot looks okay for a first-pass analysis. One can clearly see the trend in the data. The key to making a good visualization is to start with something basic and iterate to make it better. Let us try Python’s Seaborn library to make boxplots.
How to Make Boxplot with Seaborn
To make a basic boxplot with Seaborn, we pass the pandas dataframe as input to Seaborn’s boxplot function. In addition to the data, we can specify several options to customize the boxplot. Let us choose a color palette for the boxplot. Here, we have chosen the colorblind-friendly palette “colorblind”. Other palette options available in Seaborn include deep, muted, bright, pastel, and dark. Let us also specify the width of the boxes.
bplot = sns.boxplot(y='lifeExp', x='continent', data=gapminder_2007, width=0.5, palette="colorblind")
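Another option worth knowing is order, which controls the left-to-right order of the boxes. As a minimal sketch on made-up data (not the gapminder values), we can sort the continents by median life expectancy so the trend is easier to read. The Agg backend line is only there so the snippet runs as a plain script; in a notebook you can drop it.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; omit in a notebook

import pandas as pd
import seaborn as sns

# Made-up values for three groups (illustrative only)
toy = pd.DataFrame({
    "continent": ["Europe"] * 4 + ["Africa"] * 4 + ["Asia"] * 4,
    "lifeExp": [74, 76, 78, 80, 50, 54, 58, 62, 60, 66, 72, 78],
})

# Continent names sorted by median life expectancy, lowest first
order = (
    toy.groupby("continent")["lifeExp"]
    .median()
    .sort_values()
    .index.tolist()
)

# Pass the sorted names to boxplot so the boxes follow that order
ax = sns.boxplot(x="continent", y="lifeExp", data=toy, order=order, width=0.5)

print(order)
```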
Boxplot with data points using Seaborn
A boxplot alone is extremely useful for summarizing data within and between groups. However, it is often good practice to overlay the actual data points on the boxplot. Using Seaborn, we can do that in a few ways. One way is to use Seaborn’s stripplot.
We will first use Seaborn’s boxplot as before, with no data points, and then add a layer of data points with stripplot. While plotting with stripplot, we can use its options to make the result look better. For example, we can specify which marker to use for the data points, and it is also better to use the jitter=True option to spread the data points horizontally.
# make boxplot with Seaborn
bplot=sns.boxplot(y='lifeExp', x='continent', data=gapminder_2007, width=0.5, palette="colorblind")
# add stripplot to boxplot with Seaborn
bplot=sns.stripplot(y='lifeExp', x='continent', data=gapminder_2007, jitter=True, marker='o', alpha=0.5, color='black')
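One detail to watch when overlaying points: the boxplot already draws outliers as separate markers, so with a stripplot on top those observations appear twice. A possible tweak, sketched here on made-up data, is to pass showfliers=False, a matplotlib boxplot option that Seaborn forwards, so that every observation is drawn only once, by the stripplot.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; omit in a notebook

import pandas as pd
import seaborn as sns

# Made-up values; 40 and 70 fall outside the whiskers of their groups
toy = pd.DataFrame({
    "continent": ["Africa"] * 6 + ["Europe"] * 6,
    "lifeExp": [40, 52, 54, 56, 58, 60, 70, 76, 77, 78, 79, 80],
})

# Boxplot without outlier markers
ax = sns.boxplot(x="continent", y="lifeExp", data=toy,
                 width=0.5, showfliers=False)

# Overlay every observation, outliers included, as jittered points
ax = sns.stripplot(x="continent", y="lifeExp", data=toy,
                   jitter=True, marker="o", alpha=0.5, color="black", ax=ax)

# Count the points drawn by the stripplot layer
n_points = sum(len(c.get_offsets()) for c in ax.collections)
print(n_points)
```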
Boxplot with Swarm plot using Seaborn
Adding the data points to the boxplot with stripplot definitely makes the boxplot look better. Another way to visualize the data points with a Seaborn boxplot is to add a swarmplot instead of a stripplot. A swarmplot positions the points so that they do not overlap. We will first plot the boxplot with Seaborn and then add the swarmplot to display the data points.
# plot boxplot with seaborn
bplot=sns.boxplot(y='lifeExp', x='continent', data=gapminder_2007, width=0.5, palette="colorblind")
# add swarmplot
bplot=sns.swarmplot(y='lifeExp', x='continent', data=gapminder_2007, color='black', alpha=0.75)
Adjust x-axis and y-axis label font sizes
Now that we have made much better looking boxplots with Seaborn, we can try to improve other aspects of the plot. One thing to notice is that the font sizes of the x-axis and y-axis labels are small and may not be clearly visible. Here is how to change the font sizes of the x-axis and y-axis labels and also add a title to the boxplot created by Seaborn.
bplot.axes.set_title("2007: Life Expectancy Vs Continent", fontsize=16)
bplot.set_xlabel("Continent", fontsize=14)
bplot.set_ylabel("Life Expectancy", fontsize=14)
bplot.tick_params(labelsize=10)
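If the whole plot feels cramped, another approach is to create the figure yourself with an explicit size and hand its axes to Seaborn through the ax argument. Here is a minimal sketch on made-up data; the figure size of 8 by 5 inches is just an assumption to adjust to taste.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; omit in a notebook

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values (illustrative only)
toy = pd.DataFrame({
    "continent": ["Africa"] * 5 + ["Europe"] * 5,
    "lifeExp": [50, 54, 58, 62, 66, 74, 76, 78, 80, 82],
})

# Create the figure first so its size is under our control
fig, ax = plt.subplots(figsize=(8, 5))

# Draw the boxplot on the axes we created
sns.boxplot(x="continent", y="lifeExp", data=toy, width=0.5, ax=ax)

# Title, axis labels, and tick labels with explicit font sizes
ax.set_title("Life Expectancy Vs Continent (toy data)", fontsize=16)
ax.set_xlabel("Continent", fontsize=14)
ax.set_ylabel("Life Expectancy", fontsize=14)
ax.tick_params(labelsize=10)
```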
How to Save the Boxplot as jpg file?
Once we have made a boxplot that we like, we can easily save it as a high-quality image file, such as a JPEG. Here is a way to save the boxplot as a jpg file at a specific resolution. By changing the dpi option, we can increase the resolution of the image.
# output file name
plot_file_name="boxplot_and_swarmplot_with_seaborn.jpg"
# save as jpeg
bplot.figure.savefig(plot_file_name, format='jpeg', dpi=100)
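As a variation, plots with sharp lines and text often look cleaner in a lossless format such as PNG. The sketch below uses a plain matplotlib boxplot on made-up numbers as a stand-in for the Seaborn figure. The file name, the temporary directory, and dpi=300 are all assumptions for illustration. The bbox_inches="tight" option trims the extra whitespace around the plot.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend for scripts; omit in a notebook

import matplotlib.pyplot as plt

# Stand-in figure: a basic matplotlib boxplot of two made-up groups
fig, ax = plt.subplots()
ax.boxplot([[50, 54, 58, 62, 66], [74, 76, 78, 80, 82]])
ax.set_ylabel("Life Expectancy")

# Hypothetical output location in a temporary directory
out_path = os.path.join(tempfile.mkdtemp(), "boxplot_sketch.png")

# Higher dpi gives more pixels; the format is inferred from the .png extension
fig.savefig(out_path, dpi=300, bbox_inches="tight")
plt.close(fig)

print(out_path)
```

With a Seaborn plot, the same savefig call should work on bplot.figure.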