Histograms are a great way to visualize the distribution of a single variable, and they are a must for initial exploratory analysis when there are only a few variables.
In Python, one can easily make histograms in many ways. Here we will see examples of making histograms with Pandas and Seaborn.
Let us first load Pandas, NumPy, pyplot from matplotlib, and Seaborn to make histograms in Python.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
We will use the gapminder dataset and download it directly from the Software Carpentry website.
data_url = 'http://bit.ly/2cLzoxH'
gapminder = pd.read_csv(data_url)
gapminder.head(n=3)
How To Plot Histogram with Pandas
Let us use Pandas’ hist function to make a histogram showing the distribution of life expectancy in years in our data. One of the key arguments to use while plotting histograms is the number of bins, specified here with the argument ‘bins’. The number of bins controls how finely the range of the data is divided, and so it largely determines the shape of the histogram. One should always experiment with a few different values of ‘bins’ while making a histogram.
gapminder['lifeExp'].hist(bins=100)
Let us change the number of bins to 10 and see what the histogram looks like.
gapminder['lifeExp'].hist(bins=10)
We can see immediately that the histogram with a small number of bins does not look that great; smaller details of the distribution can easily disappear. When the number of bins is high, one can see finer patterns in the histogram.
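To see what changing ‘bins’ does to the numbers behind the plot, here is a minimal sketch with NumPy’s histogram function. It uses made-up values with two peaks as a stand-in for life expectancy (an assumption for illustration, not the gapminder data).

```python
import numpy as np

# Synthetic stand-in for a life-expectancy column (an assumption, not gapminder):
# a mix of two groups, so the distribution has two peaks.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(50, 5, 800), rng.normal(72, 4, 900)])

# np.histogram returns the count in each bin and the bin edges.
counts_coarse, edges_coarse = np.histogram(values, bins=10)
counts_fine, edges_fine = np.histogram(values, bins=100)

# Every observation lands in exactly one bin, whatever the number of bins.
print(counts_coarse.sum(), counts_fine.sum())

# The bin width shrinks as the number of bins grows.
width_coarse = edges_coarse[1] - edges_coarse[0]
width_fine = edges_fine[1] - edges_fine[0]
print(width_coarse, width_fine)
```

The total count stays the same in both cases; only the width of each bin changes, which is why a few wide bins can hide a dip between two peaks.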
How To Customize Histograms with Pandas?
The default histogram that Pandas makes is pretty basic. It is okay for a quick first look at the distribution of the data, but not great for a full illustration of the data.
For example, the Pandas histogram does not have any labels for the x-axis and y-axis. Let us customize the histogram using Pandas.
First, let us remove the grid that we see in the histogram, using grid=False as one of the arguments to Pandas’ hist function. We can also specify the font size of the tick labels on the x and y-axis with xlabelsize/ylabelsize.
Then let us specify our x-axis and y-axis labels, each with a font size. We can also specify the range of the x-axis that we want to show in our histogram. For these options, we directly use matplotlib’s plt object as that is easier.
gapminder['lifeExp'].hist(bins=100, grid=False, xlabelsize=12, ylabelsize=12)
plt.xlabel("Life Expectancy", fontsize=15)
plt.ylabel("Frequency", fontsize=15)
plt.xlim([22.0, 90.0])
Now the histogram above is much better with easily readable labels.
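The same customization can also be sketched through the Axes object that Pandas’ hist returns for a Series, instead of through plt. The data below is made up as a stand-in for the lifeExp column (an assumption), and the Agg backend is set only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import numpy as np
import pandas as pd

# Synthetic stand-in for the lifeExp column (an assumption, not the gapminder data).
rng = np.random.default_rng(0)
life_exp = pd.Series(rng.normal(60, 12, 1700), name="lifeExp")

# Series.hist returns the matplotlib Axes it drew on.
ax = life_exp.hist(bins=100, grid=False, xlabelsize=12, ylabelsize=12)

# Set labels and the x-axis range on that Axes directly.
ax.set_xlabel("Life Expectancy", fontsize=15)
ax.set_ylabel("Frequency", fontsize=15)
ax.set_xlim(22.0, 90.0)
```

Working with the returned Axes may be handy when several plots are open at once, since it does not depend on which figure is currently active.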
Sometimes we may want to display our histogram on a log scale. Let us see how to put the x-axis on a log scale. We can use matplotlib’s plt object and set the scale of the x-axis with plt.xscale('log').
gapminder['gdpPercap'].hist(bins=1000, grid=False)
plt.xlabel("gdpPercap", fontsize=15)
plt.ylabel("Frequency", fontsize=15)
plt.xscale('log')
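With equal-width bins, the bars look narrower and narrower towards the right once the x-axis is on a log scale. One possible alternative, sketched here, is to pass bin edges that are evenly spaced in log space. The data is a made-up, right-skewed stand-in for gdpPercap, and the choice of 50 bins is arbitrary (both are assumptions).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import numpy as np
import pandas as pd

# Synthetic, right-skewed stand-in for gdpPercap (an assumption, not the real data).
rng = np.random.default_rng(1)
gdp = pd.Series(rng.lognormal(mean=8.0, sigma=1.2, size=1500), name="gdpPercap")

# 51 edges give 50 bins, spaced evenly in log10 between the min and max value.
log_bins = np.logspace(np.log10(gdp.min()), np.log10(gdp.max()), num=51)

# The bins argument also accepts a sequence of bin edges.
ax = gdp.hist(bins=log_bins, grid=False)
ax.set_xscale("log")
ax.set_xlabel("gdpPercap", fontsize=15)
ax.set_ylabel("Frequency", fontsize=15)
```

On the log-scaled axis these bins all appear with the same width, so far fewer bins are needed than the 1000 used above.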
How To Make Histogram with Seaborn in Python?
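The plotting library Seaborn has a built-in function to make histograms. The function is “distplot”, short for distribution plot. As usual, Seaborn’s distplot can take a column from a Pandas dataframe as its argument. Note that distplot is deprecated as of Seaborn 0.11, where histplot and displot take its place; the examples here use distplot and assume a version of Seaborn that still provides it.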
sns.distplot(gapminder['lifeExp'])
By default, the histogram from Seaborn has multiple elements built right into it. Seaborn can infer the x-axis label and its range, and it automatically chooses a bin size for the histogram. It also plots a density curve on top of the histogram.
Let us customize the histogram from Seaborn. Seaborn’s distplot function has a lot of options to choose from.
Let us first remove the density line that Seaborn plots automatically, change the color, and then increase the number of bins. We can use distplot’s argument ‘kde=False’ to remove the density line, the argument ‘color='red'’ to change the color of the histogram, and ‘bins=100’ to increase the number of bins. Then we get the following plot.
sns.distplot(gapminder['lifeExp'], kde=False, color='red', bins=100)
Let us use matplotlib’s plt object for more customization. Let us set the x-axis label, the y-axis label, and the title, each with its own font size. We can use plt’s xlabel, ylabel and title with the fontsize argument as follows.
sns.distplot(gapminder['lifeExp'], kde=False, color='red', bins=100)
plt.title('Life Expectancy', fontsize=18)
plt.xlabel('Life Exp (years)', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
And now the histogram looks like this, and it is way better than the first one we made.
How To Plot Multiple Histograms with Seaborn in Python?
So far, we visualized just a single variable as a histogram. Sometimes we would like to visualize the distributions of multiple variables as multiple histograms or density plots. Let us use Seaborn’s distplot to make histograms of multiple variables/distributions. Visualizing multiple variables as histograms is useful as long as the number of distributions is small.
Let us start with two distributions and visualize them as histograms, using our gapminder data.
The basic idea while plotting multiple histograms is to first make a histogram of one variable and then add the next histogram to the existing plot object.
In this example, we plot histograms of life expectancy for two continents, Africa and the Americas. To do that, we first subset the original data frame for Africa and make a histogram with distplot.
df = gapminder[gapminder.continent == 'Africa']
sns.distplot(df['lifeExp'], kde=False, label='Africa')
Then we subset the data frame for the Americas and add its histogram as an additional layer.
df = gapminder[gapminder.continent == 'Americas']
sns.distplot(df['lifeExp'], kde=False, label='Americas')
Then we can use the plt object to customize our histogram’s labels like before.
# Plot formatting
plt.legend(prop={'size': 12})
plt.title('Life Expectancy of Two Continents')
plt.xlabel('Life Exp (years)')
# With kde=False the bars show counts, not densities
plt.ylabel('Frequency')
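The layering idea above can be sketched as a loop over groups, here with matplotlib’s own hist so that both groups share the same bin edges and the bars line up. The data frame is a made-up stand-in with the same column names as gapminder, and the 30 shared bins and alpha=0.5 are arbitrary choices (all assumptions).

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in with gapminder-like column names (an assumption, not the real data).
rng = np.random.default_rng(2)
toy = pd.DataFrame({
    "continent": ["Africa"] * 600 + ["Americas"] * 300,
    "lifeExp": np.concatenate([rng.normal(49, 9, 600), rng.normal(65, 9, 300)]),
})

# 31 shared edges give 30 bins that are identical for both groups.
shared_bins = np.linspace(toy["lifeExp"].min(), toy["lifeExp"].max(), 31)

fig, ax = plt.subplots()
# One semi-transparent histogram layer per continent on the same Axes.
for name, group in toy.groupby("continent"):
    ax.hist(group["lifeExp"], bins=shared_bins, alpha=0.5, label=name)

# Plot formatting
ax.legend(prop={"size": 12})
ax.set_title("Life Expectancy of Two Continents")
ax.set_xlabel("Life Exp (years)")
ax.set_ylabel("Frequency")
```

A loop like this extends to more groups without repeating the subsetting code, though the plot gets hard to read beyond a handful of overlapping layers.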
How To Plot Multiple Density Curves with Seaborn in Python?
Sometimes simply plotting the density curves is more useful than the actual histograms. We can make density curves like above, but with the “hist=False” argument to Seaborn’s distplot.
df = gapminder[gapminder.continent == 'Africa']
sns.distplot(df['lifeExp'], hist=False, kde=True, label='Africa')
df = gapminder[gapminder.continent == 'Americas']
sns.distplot(df['lifeExp'], hist=False, kde=True, label='Americas')
# Plot formatting
plt.legend(prop={'size': 12})
plt.title('Life Expectancy vs Continents')
plt.xlabel('Life Exp (years)')
plt.ylabel('Density')
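To get a feel for what a density curve is, here is a minimal NumPy sketch of a Gaussian kernel density estimate: a small bell curve is placed on every data point and the curves are averaged. The helper name gaussian_kde_curve, the fixed bandwidth of 3.0, and the made-up data are all assumptions for illustration; Seaborn chooses its bandwidth automatically and its result will differ in detail.

```python
import numpy as np

def gaussian_kde_curve(samples, grid, bandwidth):
    """Average of Gaussian bumps of width `bandwidth`, one centered on each sample."""
    samples = np.asarray(samples, dtype=float)
    # Standardized distance from every grid point to every sample.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)
    return kernel.mean(axis=1) / bandwidth

# Synthetic stand-in for a lifeExp column (an assumption, not the gapminder data).
rng = np.random.default_rng(3)
samples = rng.normal(60, 10, 500)

grid = np.linspace(0, 120, 1201)
density = gaussian_kde_curve(samples, grid, bandwidth=3.0)

# The area under a density curve is approximately 1,
# which is why the y-axis is labeled 'Density' rather than 'Frequency'.
area = float(np.sum(density) * (grid[1] - grid[0]))
print(area)
```

A larger bandwidth gives a smoother curve and a smaller one follows the data more closely, much like the number of bins does for a histogram.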