Pearson and Spearman Correlation in Python

Understanding the relationship between two or more variables is at the core of many aspects of data analysis and statistics. A correlation coefficient captures the association between two variables (in the simplest case) as a single number.

One of the most commonly used correlation measures is the Pearson correlation coefficient. Another commonly used measure is the Spearman correlation coefficient.

In this post, we will see examples of computing both Pearson and Spearman correlation in Python using Pandas, NumPy and SciPy.

We will use gapminder data and compute the correlation between gdpPercap and life expectancy values from multiple countries over time. In this case, we would expect life expectancy to increase as a country’s GDP per capita increases.

Let us find out how to compute Pearson and Spearman correlation in Python. Let us first load the packages needed.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Let us load the gapminder data as a Pandas data frame.

data_url = 'http://bit.ly/2cLzoxH'
# read data from url as pandas dataframe
gapminder = pd.read_csv(data_url)
# let us select two relevant columns
gapminder = gapminder[['gdpPercap', 'lifeExp']]
print(gapminder.head(3))

    gdpPercap  lifeExp
0  779.445314   28.801
1  820.853030   30.332
2  853.100710   31.997

Pearson Correlation

Pearson correlation quantifies the linear relationship between two variables. The Pearson correlation coefficient lies between -1 and +1, like other correlation measures. A positive Pearson correlation means that one variable’s value increases as the other’s increases. A negative Pearson correlation means that one variable decreases as the other increases. Correlation coefficients of -1 or +1 mean the relationship is exactly linear.
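To make the definition concrete, here is a minimal sketch that computes the Pearson coefficient directly from its formula (covariance divided by the product of the standard deviations). The helper name pearson_r and the small toy arrays are assumptions for illustration only; they are not part of the gapminder analysis.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation from its definition (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xd = x - x.mean()  # deviations from the mean
    yd = y - y.mean()
    return (xd * yd).sum() / np.sqrt((xd ** 2).sum() * (yd ** 2).sum())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_up = pearson_r(x, 2 * x + 1)      # exactly linear, increasing
r_down = pearson_r(x, -3 * x + 10)  # exactly linear, decreasing
r_curve = pearson_r(x, x ** 2)      # increasing, but not linear

print(r_up, r_down, r_curve)
```

The two exactly linear cases give coefficients of +1 and -1 (up to floating point error), while the curved case stays below 1 even though y always increases with x.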

Pearson Correlation with Pandas

Pandas offers the corr() function, which we can use with Pandas series as shown below. We can see that gdpPercap and lifeExp are positively correlated: overall, higher gdpPercap tends to go together with higher life expectancy.

gapminder.gdpPercap.corr(gapminder.lifeExp, method="pearson")

0.5837062198659948

Pearson Correlation with NumPy

We can also use NumPy to compute the Pearson correlation coefficient. NumPy’s corrcoef() function can take multiple variables as a 2D NumPy array and return a correlation matrix.

np.corrcoef(gapminder.gdpPercap, gapminder.lifeExp)

In the simplest case with two variables it returns a 2×2 matrix with Pearson correlation values.

array([[1.        , 0.58370622],
       [0.58370622, 1.        ]])
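As a small sketch of the 2D-array form mentioned above, the example below passes three variables to corrcoef() at once. The arrays a, b and c are synthetic and made up for illustration. By default corrcoef() treats each row as a variable; when observations are in rows and variables in columns, rowvar=False can be passed instead.

```python
import numpy as np

# Synthetic data, for illustration only
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = a + rng.normal(scale=0.5, size=200)  # related to a
c = rng.normal(size=200)                 # unrelated to a and b

# Default: each ROW of the 2D array is one variable
corr_rows = np.corrcoef(np.vstack([a, b, c]))

# Observations in rows, variables in columns: use rowvar=False
corr_cols = np.corrcoef(np.column_stack([a, b, c]), rowvar=False)

print(corr_rows.shape)
print(np.round(corr_rows, 2))
```

Both calls give the same 3×3 matrix, with ones on the diagonal and entry [i, j] holding the Pearson correlation between variables i and j.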

Pearson Correlation with SciPy

We can also compute Pearson correlation coefficient using SciPy’s stats module.

from scipy import stats
gdpPercap = gapminder.gdpPercap.values
life_exp = gapminder.lifeExp.values

SciPy’s stats module has a function called pearsonr() that takes two NumPy arrays and returns the Pearson correlation coefficient together with the significance of the correlation as a p-value.

stats.pearsonr(gdpPercap,life_exp)

The first element of tuple is the Pearson correlation and the second is p-value.

(0.5837062198659948, 3.565724241051659e-156)
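The two values returned by pearsonr() can be unpacked directly into separate names. The sketch below uses synthetic arrays and an arbitrary 0.05 threshold, both chosen only for illustration.

```python
import numpy as np
from scipy import stats

# Synthetic data, for illustration only
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 0.8 * x + rng.normal(scale=0.5, size=100)

# Unpack the correlation coefficient and the p-value
r, p_value = stats.pearsonr(x, y)

alpha = 0.05  # assumed significance threshold
print(f"r = {r:.3f}, p-value = {p_value:.3g}")
print("significant at the 5% level:", p_value < alpha)
```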

Spearman Correlation

Pearson correlation is commonly described as assuming that the data we are comparing are normally distributed. When that assumption does not hold, the correlation value may not reflect the true association. Spearman correlation does not assume that the data come from a specific distribution, so it is a non-parametric correlation measure. Spearman correlation is also known as Spearman’s rank correlation, as it computes the correlation coefficient on the rank values of the data.
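Because Spearman correlation works on ranks, it reaches 1 whenever one variable always increases with the other, even if the relationship is curved. The sketch below illustrates this on a tiny made-up series (not gapminder data) where y is an exponential function of x.

```python
import numpy as np
import pandas as pd

# Toy data: y grows with x, but exponentially rather than linearly
x = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.exp(x)

pearson = x.corr(y, method="pearson")
spearman = x.corr(y, method="spearman")

# Spearman is the Pearson correlation of the ranks
rank_pearson = x.rank().corr(y.rank(), method="pearson")

print(pearson, spearman, rank_pearson)
```

Here the ranks of x and y are identical, so the Spearman coefficient is 1, while the Pearson coefficient on the raw values is lower because the points do not fall on a straight line.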

Spearman Correlation with Pandas

We can use the corr() function with the parameter method="spearman" to compute Spearman correlation using Pandas.

gapminder.gdpPercap.corr(gapminder.lifeExp, method="spearman")

We can see that the Spearman correlation is higher than the Pearson correlation.

0.8264711811970715

Spearman Correlation with NumPy

NumPy does not have a specific function for computing Spearman correlation. However, we can use the definition of Spearman correlation, which is the correlation of the rank values of the variables. We compute the ranks of the two variables and pass the ranks to the Pearson correlation function available in NumPy.

gapminder["gdpPercap_r"] = gapminder.gdpPercap.rank()
gapminder["lifeExp_r"] = gapminder.lifeExp.rank()
gapminder.head()

In this example, we created two new variables that hold the ranks of the original variables, and we use them with NumPy's corrcoef() function.

np.corrcoef(gapminder.gdpPercap_r, gapminder.lifeExp_r)

As we saw before, this returns a correlation matrix for all variables. Note that the Spearman correlation from NumPy matches that from Pandas.

array([[1.        , 0.82647118],
       [0.82647118, 1.        ]])
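The rank-based approach also works when the data contain ties, because Pandas' rank() assigns tied values their average rank by default. The sketch below uses a small made-up data frame with repeated values and compares the result with SciPy's spearmanr() as a cross-check.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Toy data with tied values in both columns (illustration only)
df = pd.DataFrame({
    "x": [10, 20, 20, 30, 40, 40, 50],
    "y": [1, 3, 2, 5, 4, 6, 6],
})

# rank() uses method="average" by default, so ties share their mean rank
ranks = df.rank()
print(ranks)

# Pearson correlation of the ranks, computed with NumPy
rho_ranks = np.corrcoef(ranks["x"], ranks["y"])[0, 1]

# Cross-check against SciPy's Spearman implementation
rho_scipy, p_value = stats.spearmanr(df["x"], df["y"])

print(rho_ranks, rho_scipy)
```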

Spearman Correlation with SciPy

Using SciPy, we can compute Spearman correlation using the function spearmanr() and we will get the same result as above.

stats.spearmanr(gdpPercap,life_exp)

Understanding the Difference Between Pearson and Spearman Correlation

The first thing that stands out when comparing the Pearson and Spearman correlation coefficients between gdpPercap and lifeExp is the big difference between them. Why are they different? We can understand the difference if we understand the assumptions of each method.

As mentioned before, Pearson correlation assumes the data are normally distributed, whereas Spearman correlation makes no assumption about the distribution of the data. That is one of the main reasons for the difference.

Let us check if the variables are normally distributed. We can visualize the distributions using a histogram. Let us make a histogram of the life expectancy values from the gapminder data.

# sns.distplot() is deprecated in recent seaborn versions; histplot() is the replacement
hplot = sns.histplot(gapminder['lifeExp'], kde=False, color='blue', bins=100)
plt.title('Life Expectancy', fontsize=18)
plt.xlabel('Life Exp (years)', fontsize=16)
plt.ylabel('Frequency', fontsize=16)
plot_file_name="gapminder_life_expectancy_histogram.jpg"
# save as jpeg
hplot.figure.savefig(plot_file_name,
                    format='jpeg',
                    dpi=100)

Here is the distribution of life expectancy, and we can clearly see that it is not normally distributed. Although not shown here, the distribution of gdpPercap is not normally distributed either. Therefore, the normality assumption of the Pearson correlation coefficient is clearly violated, which can explain the difference we see.

Distribution of Life Expectancy values from gapminder data
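A histogram can be complemented with a numerical check. The sketch below is one possible approach, not part of the original analysis: it computes the sample skewness and runs a Shapiro-Wilk test with SciPy. It uses two synthetic samples (one roughly symmetric, one strongly right-skewed) so that it runs without downloading data; with the gapminder data frame loaded, the same calls could be applied to its columns.

```python
import numpy as np
from scipy import stats

# Synthetic samples, for illustration only
rng = np.random.default_rng(1)
symmetric = rng.normal(loc=60, scale=10, size=500)  # bell-shaped
skewed = rng.lognormal(mean=8, sigma=1, size=500)   # long right tail

for name, values in [("symmetric", symmetric), ("skewed", skewed)]:
    skewness = stats.skew(values)            # near 0 for symmetric data
    w_stat, p_value = stats.shapiro(values)  # small p-value: evidence against normality
    print(f"{name}: skewness = {skewness:.2f}, Shapiro-Wilk p-value = {p_value:.3g}")
```

A skewness far from zero or a very small Shapiro-Wilk p-value suggests that the sample does not look normally distributed.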

In addition, Pearson correlation captures the strength of the linear relationship between two variables, whereas Spearman rank correlation can also capture non-linear associations, as long as they are monotonic. If we look at the scatterplot of gdpPercap against lifeExp, we can see that the relationship is not linear. This can explain the difference as well.

# recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x='lifeExp', y='gdpPercap', data=gapminder)
plt.title('lifeExp vs gdpPercap', fontsize=18)
plt.ylabel('gdpPercap', fontsize=16)
plt.xlabel('lifeExp', fontsize=16)

Non-linear relationship between gdpPercap and lifeExp
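One way to see the effect of non-linearity is to apply a monotonic transformation such as a logarithm. The sketch below uses synthetic data that only mimics the shape of the relationship: the column names gdp and life and the generating formula are assumptions for illustration, not gapminder values. A log transform leaves the ranks unchanged, so the Spearman correlation stays the same, while the Pearson correlation changes because the relationship becomes closer to linear.

```python
import numpy as np
import pandas as pd

# Synthetic data: "life" depends linearly on log(gdp), plus noise (illustration only)
rng = np.random.default_rng(7)
log_gdp = rng.normal(loc=8.0, scale=1.2, size=500)
df = pd.DataFrame({
    "gdp": np.exp(log_gdp),
    "life": 10 + 6 * log_gdp + rng.normal(scale=3.0, size=500),
})
df["log_gdp"] = np.log(df["gdp"])

pearson_raw = df["gdp"].corr(df["life"], method="pearson")
pearson_log = df["log_gdp"].corr(df["life"], method="pearson")
spearman_raw = df["gdp"].corr(df["life"], method="spearman")
spearman_log = df["log_gdp"].corr(df["life"], method="spearman")

print(f"Pearson:  raw = {pearson_raw:.3f}, log = {pearson_log:.3f}")
print(f"Spearman: raw = {spearman_raw:.3f}, log = {spearman_log:.3f}")
```

With the gapminder data frame loaded, the same comparison could be tried by correlating np.log(gapminder.gdpPercap) with gapminder.lifeExp.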