Understanding the relationship between two or more variables is at the core of many aspects of data analysis and statistics. A correlation coefficient captures the association between two variables (in the simplest case) as a single number.
One of the most commonly used correlation measures is the Pearson correlation coefficient. Another is the Spearman correlation coefficient.
In this post, we will see examples of computing both Pearson and Spearman correlation in Python using Pandas, NumPy, and SciPy.
We will use the gapminder data and compute the correlation between gdpPercap and life expectancy values from multiple countries over time. In this case, we would expect life expectancy to increase as a country’s GDP per capita increases.
Let us see how to compute Pearson and Spearman correlation in Python. Let us first load the packages needed.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
Let us load the gapminder data as a Pandas data frame.
data_url = 'http://bit.ly/2cLzoxH'
# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)
# let us select two relevant columns
# (copy so that new columns can be added later without a SettingWithCopyWarning)
gapminder = gapminder[['gdpPercap', 'lifeExp']].copy()
print(gapminder.head(3))

    gdpPercap  lifeExp
0  779.445314   28.801
1  820.853030   30.332
2  853.100710   31.997
Pearson Correlation
Pearson correlation quantifies the linear relationship between two variables. Like other correlation measures, the Pearson correlation coefficient lies between -1 and +1. A positive Pearson correlation means that one variable’s value tends to increase as the other’s increases, and a negative Pearson correlation means that one variable tends to decrease as the other increases. A correlation coefficient of -1 or +1 means the relationship is exactly linear.
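To make the definition concrete, here is a minimal sketch that computes the Pearson coefficient by hand, as the covariance of two variables divided by the product of their standard deviations. The data is synthetic: the variables x and y and the random seed are made up for illustration and are not the gapminder values. The result is compared with NumPy’s corrcoef().

```python
import numpy as np

# Synthetic data for illustration only (not the gapminder values).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

# Center both variables around their means.
x_centered = x - x.mean()
y_centered = y - y.mean()

# Pearson r = cov(x, y) / (std(x) * std(y)); the 1/n factors cancel,
# so plain sums of products and squares are enough.
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered ** 2).sum() * (y_centered ** 2).sum()
)

# Compare with NumPy's built-in implementation.
r_numpy = np.corrcoef(x, y)[0, 1]
print(r_manual, r_numpy)
```

The two printed values should agree up to floating-point rounding.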
Pearson Correlation with Pandas
Pandas offers the corr() function, which we can use with Pandas series as shown below. We can see that gdpPercap and lifeExp are positively correlated: overall, life expectancy tends to be higher when gdpPercap is higher.
gapminder.gdpPercap.corr(gapminder.lifeExp, method="pearson")
0.5837062198659948
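The corr() method is also available on a whole data frame, where it returns the pairwise correlation matrix for all numeric columns. The sketch below uses a small synthetic data frame; the column names a, b and c are hypothetical and only serve the illustration.

```python
import numpy as np
import pandas as pd

# Synthetic data frame for illustration; column names are made up.
rng = np.random.default_rng(1)
a = rng.normal(size=100)
df = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=100),  # related to a
    "c": rng.normal(size=100),                 # unrelated noise
})

# Pairwise Pearson correlation between all columns.
corr_matrix = df.corr(method="pearson")
print(corr_matrix.round(2))

# A single entry matches the Series-level corr() call.
r_ab = df["a"].corr(df["b"], method="pearson")
```

The diagonal of the matrix is 1, since each column is perfectly correlated with itself.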
Pearson Correlation with NumPy
We can also use NumPy to compute the Pearson correlation coefficient. NumPy’s corrcoef() function can take multiple variables as a 2D NumPy array and return a correlation matrix.
np.corrcoef(gapminder.gdpPercap, gapminder.lifeExp)
In the simplest case with two variables it returns a 2×2 matrix with Pearson correlation values.
array([[1.        , 0.58370622],
       [0.58370622, 1.        ]])
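When the variables are stored as columns of a 2D array, corrcoef() needs rowvar=False, because by default it treats each row as a variable. Here is a short sketch on synthetic data (the array samples is made up for illustration) that also shows how to pull a single coefficient out of the matrix.

```python
import numpy as np

# Synthetic data: 50 observations (rows) of 3 variables (columns).
rng = np.random.default_rng(2)
samples = rng.normal(size=(50, 3))

# rowvar=False tells corrcoef that each column is a variable.
corr = np.corrcoef(samples, rowvar=False)
print(corr.shape)

# The off-diagonal entry [0, 1] is the correlation between columns 0 and 1.
r_01 = corr[0, 1]
print(r_01)
```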
Pearson Correlation with SciPy
We can also compute the Pearson correlation coefficient using SciPy’s stats module.
from scipy import stats
gdpPercap = gapminder.gdpPercap.values
life_exp = gapminder.lifeExp.values
SciPy’s stats module has a function called pearsonr() that takes two NumPy arrays and returns a tuple containing the Pearson correlation coefficient and the significance of the correlation as a p-value.
stats.pearsonr(gdpPercap, life_exp)
The first element of the tuple is the Pearson correlation and the second is the p-value.
(0.5837062198659948, 3.565724241051659e-156)
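In practice it is often convenient to unpack the two values directly. The sketch below does this on synthetic data; the arrays x and y and the threshold alpha are assumptions made for the example, not part of the gapminder analysis.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only.
rng = np.random.default_rng(3)
x = rng.normal(size=100)
y = 0.5 * x + rng.normal(size=100)

# Unpack the correlation coefficient and the p-value.
r, p_value = stats.pearsonr(x, y)

alpha = 0.05  # a conventional significance threshold (an assumption here)
print(f"r = {r:.3f}, p = {p_value:.3g}")
print("significant" if p_value < alpha else "not significant")
```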
Spearman Correlation
Pearson correlation assumes that the data we are comparing is normally distributed. When that assumption does not hold, the correlation value may not reflect the true association. Spearman correlation does not assume that the data comes from a specific distribution, so it is a non-parametric correlation measure. Spearman correlation is also known as Spearman’s rank correlation, as it computes the correlation coefficient on the rank values of the data.
Spearman Correlation with Pandas
We can use the corr() function with the parameter method="spearman" to compute Spearman correlation using Pandas.
gapminder.gdpPercap.corr(gapminder.lifeExp, method="spearman")
We can see that the Spearman correlation is higher than the Pearson correlation.
0.8264711811970715
Spearman Correlation with NumPy
NumPy does not have a specific function for computing Spearman correlation. However, we can use the definition of Spearman correlation, which is the Pearson correlation of the rank values of the variables. We compute the ranks of the two variables and pass them to the Pearson correlation function available in NumPy.
gapminder["gdpPercap_r"] = gapminder.gdpPercap.rank()
gapminder["lifeExp_r"] = gapminder.lifeExp.rank()
gapminder.head()
In this example, we created two new variables that hold the ranks of the original variables and used them with NumPy's corrcoef() function.
np.corrcoef(gapminder.gdpPercap_r, gapminder.lifeExp_r)
As we saw before, this returns a correlation matrix for all variables. Note that the Spearman correlation result from NumPy matches that from Pandas.
array([[1.        , 0.82647118],
       [0.82647118, 1.        ]])
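The same rank-then-correlate idea can be sketched without adding columns to the original data frame. The example below uses synthetic data (the columns x and y are made up) and assign(), which returns a new data frame, then checks that the rank-based result agrees with Pandas’ method="spearman".

```python
import numpy as np
import pandas as pd

# Synthetic, skewed data for illustration only.
rng = np.random.default_rng(4)
df = pd.DataFrame({"x": rng.exponential(size=100)})
df["y"] = df["x"] ** 2 + rng.normal(scale=0.5, size=100)

# assign() returns a new data frame and leaves df unchanged.
ranked = df.assign(x_r=df["x"].rank(), y_r=df["y"].rank())

# Spearman correlation = Pearson correlation of the ranks.
rho_ranks = np.corrcoef(ranked["x_r"], ranked["y_r"])[0, 1]

# Reference value from Pandas.
rho_pandas = df["x"].corr(df["y"], method="spearman")
print(rho_ranks, rho_pandas)
```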
Spearman Correlation with SciPy
Using SciPy, we can compute Spearman correlation using the function spearmanr() and we will get the same result as above.
stats.spearmanr(gdpPercap,life_exp)
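Like pearsonr(), spearmanr() returns both a correlation coefficient and a p-value, which can be unpacked. Here is a minimal sketch on synthetic data (the arrays x and y are assumptions for the example), cross-checked against the rank-based computation using SciPy’s rankdata().

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration only.
rng = np.random.default_rng(5)
x = rng.normal(size=80)
y = x ** 3 + rng.normal(scale=0.1, size=80)

# Unpack the Spearman coefficient and its p-value.
rho, p_value = stats.spearmanr(x, y)
print(f"rho = {rho:.3f}, p = {p_value:.3g}")

# Cross-check: Pearson correlation of the ranks.
rho_from_ranks = np.corrcoef(stats.rankdata(x), stats.rankdata(y))[0, 1]
```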
Understanding the Difference Between Pearson and Spearman Correlation
The first thing that stands out when comparing the Pearson and Spearman correlation coefficients between gdpPercap and lifeExp is the big difference between them. Why are they different? We can understand the difference if we understand the assumptions of each method.
As mentioned before, Pearson correlation assumes the data is normally distributed. However, Spearman does not make any assumption about the distribution of the data. That is the main reason for the difference.
Let us check if the variables are normally distributed. We can visualize the distributions using a histogram. Let us make a histogram of the life expectancy values from the gapminder data.
# distplot is deprecated in recent seaborn releases; histplot is its replacement for histograms
hplot = sns.histplot(gapminder['lifeExp'], kde=False, color='blue', bins=100)
plt.title('Life Expectancy', fontsize=18)
plt.xlabel('Life Exp (years)', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plot_file_name = "gapminder_life_expectancy_histogram.jpg"
# save as jpeg
hplot.figure.savefig(plot_file_name, format='jpeg', dpi=100)
Here is the distribution of life expectancy, and we can clearly see that it is not normally distributed. Although not shown here, gdpPercap is not normally distributed either. Therefore, the assumption behind the Pearson correlation coefficient is clearly violated, which can explain the difference we see.
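A histogram is a visual check. As a complement, one could also run a formal normality test. The sketch below uses SciPy’s shapiro() (the Shapiro-Wilk test) on two synthetic samples, one normal and one skewed; the samples and names are made up for illustration, and this test is not part of the original analysis. A small p-value is evidence against normality.

```python
import numpy as np
from scipy import stats

# Synthetic samples for illustration only.
rng = np.random.default_rng(6)
normal_sample = rng.normal(size=500)
skewed_sample = rng.exponential(size=500)

results = {}
for name, sample in [("normal", normal_sample), ("skewed", skewed_sample)]:
    # Shapiro-Wilk test: the null hypothesis is that the sample is normal.
    statistic, p_value = stats.shapiro(sample)
    results[name] = (statistic, p_value)
    print(f"{name}: W = {statistic:.3f}, p = {p_value:.3g}")
```

The same call could be applied to gapminder['lifeExp'] to back up what the histogram suggests.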
In addition, Pearson correlation captures the strength of the linear relationship between two variables, whereas Spearman rank correlation can also capture non-linear association, as long as it is monotonic. If we look at the scatterplot of gdpPercap against lifeExp, we can see that the relationship is not linear. This can explain the difference as well.
# recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x='lifeExp', y='gdpPercap', data=gapminder)
plt.title('lifeExp vs gdpPercap', fontsize=18)
plt.ylabel('gdpPercap', fontsize=16)
plt.xlabel('lifeExp', fontsize=16)
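To see how a non-linear relationship separates the two measures, here is a small sketch on a made-up example (not the gapminder data): y grows exponentially with x, so the relationship is perfectly monotonic but far from linear.

```python
import numpy as np
from scipy import stats

# Made-up example: y increases with x, but exponentially rather than linearly.
x = np.linspace(0, 5, 50)
y = np.exp(x)

pearson_r, _ = stats.pearsonr(x, y)
spearman_r, _ = stats.spearmanr(x, y)
print(f"Pearson:  {pearson_r:.3f}")
print(f"Spearman: {spearman_r:.3f}")
```

Because the ranks of x and y line up exactly, the Spearman coefficient is 1, while the Pearson coefficient is noticeably lower.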