Computing Correlation with Numpy corrcoef()

In this post, we will learn how to use Numpy’s corrcoef() function to compute the correlation between two datasets stored in lists or arrays. Numpy’s corrcoef() function calculates the Pearson correlation coefficient, which measures the strength and direction of the linear relationship between two variables.

The resulting correlation coefficient ranges from -1 to 1. A coefficient of 1 indicates a perfect positive linear relationship (as one variable increases, the other increases along a straight line), while a coefficient of -1 indicates a perfect negative linear relationship (as one variable increases, the other decreases along a straight line). A coefficient of 0 indicates no linear relationship between the two variables, although they may still be related in a nonlinear way.
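To make the meaning of these values concrete, here is a minimal sketch that computes the Pearson coefficient by hand from its textbook definition (the sum of products of deviations, divided by the square root of the product of the summed squared deviations). The helper name pearson_r and the sample lists are made up for illustration and are not part of Numpy.

```python
import numpy as np

def pearson_r(a, b):
    """Pearson correlation of two equal-length 1D sequences (illustrative helper)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Deviations of each value from its own mean
    da = a - a.mean()
    db = b - b.mean()
    # Sum of products of deviations, scaled by the spread of both variables
    return (da * db).sum() / np.sqrt((da ** 2).sum() * (db ** 2).sum())

up = [1, 2, 3, 4, 5]
down = [10, 8, 6, 4, 2]        # falls on a straight line as `up` rises
symmetric = [-2, -1, 0, 1, 2]
squared = [4, 1, 0, 1, 4]      # exactly symmetric ** 2, a nonlinear relationship

r_same = pearson_r(up, up)                  # close to 1
r_opposite = pearson_r(up, down)            # close to -1
r_nonlinear = pearson_r(symmetric, squared) # close to 0 despite the exact dependence

print(r_same, r_opposite, r_nonlinear)
```

The last case is a reminder that a coefficient near 0 only rules out a linear relationship: squared is fully determined by symmetric, yet their Pearson correlation is zero.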

How to compute correlation between two variables in Numpy

To use the corrcoef() function, we pass in two sets of data as arguments. The function returns a symmetric matrix of correlation coefficients. The diagonal elements are 1, since each variable is perfectly correlated with itself, and the off-diagonal elements hold the correlation between the two variables. Because the matrix is symmetric, the upper triangular elements are the same as the lower triangular elements.

Let us consider a simple example to compute correlation using corrcoef(). We have two lists of numbers.

import numpy as np

x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]

Let us compute the correlation coefficients with the corrcoef() function.

correlation = np.corrcoef(x, y)
print(correlation)
[[ 1. -1.]
 [-1.  1.]]

As you can see, the corrcoef() function returns a 2×2 matrix of correlation coefficients, where the element in row i and column j is the correlation between variable i and variable j.

The first element in the matrix, correlation[0][0], is the correlation between x and x, which is always 1. The second element in the first row, correlation[0][1], is the correlation between x and y, which in this case is -1 because y decreases exactly linearly as x increases.
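In practice we often want just the single coefficient rather than the whole matrix. The following is a small sketch, reusing the same x and y lists, that pulls out the off-diagonal entry and checks the symmetry described above. The variable name r_xy is only illustrative.

```python
import numpy as np

x = [1, 2, 3, 4, 5]
y = [5, 4, 3, 2, 1]

correlation = np.corrcoef(x, y)

# Off-diagonal entry: the correlation between x and y as a single number
r_xy = correlation[0, 1]
print(r_xy)

# The matrix is symmetric, so [0, 1] and [1, 0] hold the same value
is_symmetric = np.allclose(correlation, correlation.T)
print(is_symmetric)
```

Indexing with correlation[0, 1] is equivalent to correlation[0][1] for a Numpy array and avoids creating an intermediate row.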

Computing correlation on 2D array with Numpy corrcoef

Numpy’s corrcoef() can also compute correlations on a matrix or 2D Numpy array. Here we just pass the 2D array as the input argument and get the correlation matrix as output.

Let us simulate some data in a 2D Numpy array. Here is a 2D array containing random integers. Since no random seed is set, the values will differ from run to run.

x = np.random.randint(20, size=(3,5))
x

array([[16, 17,  7,  9, 16],
       [ 3,  7,  6,  9, 10],
       [15, 12,  7, 11,  5]])

Here we compute the correlation coefficient matrix for all pairs of rows in the 2D array using Numpy’s corrcoef().

np.corrcoef(x)

Since we have 3 rows, we get a 3×3 correlation matrix with the self-correlations along the diagonal.

array([[ 1.        , -0.0984374 ,  0.29654013],
       [-0.0984374 ,  1.        , -0.6846532 ],
       [ 0.29654013, -0.6846532 ,  1.        ]])
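As a sanity check, here is a short sketch showing that each entry of this matrix matches what we would get by passing the corresponding pair of rows to corrcoef() separately. The array values are typed in by hand from the output above so that the result does not depend on the random number generator.

```python
import numpy as np

# Same values as the random array shown above, entered by hand
x = np.array([[16, 17, 7, 9, 16],
              [3, 7, 6, 9, 10],
              [15, 12, 7, 11, 5]])

corr = np.corrcoef(x)
print(corr.shape)

# Entry [0, 1] is the correlation between row 0 and row 1
pairwise = np.corrcoef(x[0], x[1])[0, 1]
matches = np.isclose(corr[0, 1], pairwise)
print(matches)
```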

Computing column wise correlation with Numpy corrcoef

Another argument of interest to Numpy’s corrcoef() is rowvar. By default, rowvar is True, so each row represents a variable, with observations in the columns. When rowvar is False, each column represents a variable, while the rows contain observations.

np.corrcoef(x, rowvar=False)

Therefore, Numpy’s corrcoef() function will compute the correlation for each pair of columns. For the example data, we get a 5×5 correlation matrix, as there are 5 columns.

array([[ 1.        ,  0.89851257,  0.99760861,  0.43894779,  0.12131025],
       [ 0.89851257,  1.        ,  0.8660254 ,  0.        ,  0.54470478],
       [ 0.99760861,  0.8660254 ,  1.        ,  0.5       ,  0.05241424],
       [ 0.43894779,  0.        ,  0.5       ,  1.        , -0.83862787],
       [ 0.12131025,  0.54470478,  0.05241424, -0.83862787,  1.        ]])
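One way to think about rowvar=False is that it treats the data as if it had been transposed first. The sketch below, again using the example values entered by hand, checks that both calls give the same matrix. The variable names are only illustrative.

```python
import numpy as np

# Same values as the example array, entered by hand
x = np.array([[16, 17, 7, 9, 16],
              [3, 7, 6, 9, 10],
              [15, 12, 7, 11, 5]])

# Columns as variables
by_columns = np.corrcoef(x, rowvar=False)

# Transposing turns the columns into rows, so the default rowvar=True applies
by_transpose = np.corrcoef(x.T)

print(by_columns.shape)
same = np.allclose(by_columns, by_transpose)
print(same)
```

Keep in mind that with only 3 observations per column, as in this example, the individual coefficients are very noisy and should not be over-interpreted.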