An Introduction to Statistical Learning with Applications in R (ISLR), one of the best books to learn statistical learning, now has a cousin: An Introduction to Statistical Learning with Applications in Python (ISLP).
An Introduction to Statistical Learning with Applications in Python, by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor, came out recently. And just like ISLR, the new ISLP is freely available for download at statlearning.com/.
Imputing missing data with SVD
This post is the result of a quick Sunday-afternoon dip into the new book An Introduction to Statistical Learning with Applications in Python. Most real-world datasets have missing values. When we know that the missingness is at random, we can try to impute the missing values instead of throwing away the observations, since commonly used statistical methods cannot handle missing values.
Another common approach is to replace a missing value with the mean value of the variable. For example, if an element of the Age column is missing, we can impute it with the mean of all observed Age values in the data (a minimal sketch of this naive strategy is shown below). ISLR and ISLP describe a better strategy, using our beloved SVD, that outperforms this naive approach. The image below shows the algorithm to impute missing values by SVD from the ISLP book.
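For reference, here is a minimal sketch of the naive mean-imputation strategy, using a hypothetical toy DataFrame with an Age column (not the Boston data used later in this post):

import pandas as pd
import numpy as np

# hypothetical toy data with one missing Age value
toy = pd.DataFrame({"Age": [23.0, 31.0, np.nan, 45.0]})

# naive imputation: fill the NA with the mean of the observed Age values
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())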
In this post, we will implement this algorithm to impute missing values with SVD. ISLP’s unsupervised learning chapter has a nice example of using SVD to impute values on the USArrests data with NAs introduced at random. And the chapter has an exercise problem on imputing with SVD on the Boston housing dataset. We will work through part of that exercise using the Boston housing data to illustrate imputation by SVD in Python.
First, let us load the basic Python packages and modules we need.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
Next, we will also load the sklearn modules we need.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
Quick Data summary
The Boston data used to be available directly from scikit-learn, but due to ethical concerns it was removed from sklearn. Here we are using a downloaded copy of the Boston data.
boston = pd.read_csv("Boston.tsv", sep="\t")
boston.head()

      crim    zn  indus  chas    nox     rm   age     dis  rad  tax  ptratio  lstat  medv
0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296     15.3   4.98  24.0
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8   9.14  21.6
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8   4.03  34.7
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222     18.7   2.94  33.4
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222     18.7   5.33  36.2
All of the columns of the Boston data are numerical. Let us check summary statistics, like the mean and variance of the variables.
boston.mean()

crim         3.613524
zn          11.363636
indus       11.136779
chas         0.069170
nox          0.554695
rm           6.284634
age         68.574901
dis          3.795043
rad          9.549407
tax        408.237154
ptratio     18.455534
lstat       12.653063
medv        22.532806
dtype: float64
boston.var()

crim          73.986578
zn           543.936814
indus         47.064442
chas           0.064513
nox            0.013428
rm             0.493671
age          792.358399
dis            4.434015
rad           75.816366
tax        28404.759488
ptratio        4.686989
lstat         50.994760
medv          84.586724
dtype: float64
Standardizing Boston Housing Data
Before applying SVD, we will standardize the data with sklearn’s StandardScaler() function.
scaler = StandardScaler(with_std=True, with_mean=True)
boston_scaled = scaler.fit_transform(boston)
Our scaled Boston data looks like this
boston_scaled[0:5, 0:5]

array([[-0.41978194,  0.28482986, -1.2879095 , -0.27259857, -0.14421743],
       [-0.41733926, -0.48772236, -0.59338101, -0.27259857, -0.74026221],
       [-0.41734159, -0.48772236, -0.59338101, -0.27259857, -0.74026221],
       [-0.41675042, -0.48772236, -1.30687771, -0.27259857, -0.83528384],
       [-0.41248185, -0.48772236, -1.30687771, -0.27259857, -0.83528384]])
And the scaled data has zero mean and unit variance.
np.mean(boston_scaled, axis=0)

array([-1.12338772e-16,  7.89881994e-17,  2.10635198e-16, -3.51058664e-17,
       -1.96592852e-16, -1.08828186e-16, -1.47444639e-16, -8.42540793e-17,
       -1.12338772e-16,  0.00000000e+00, -4.21270397e-16, -3.08931624e-16,
       -5.19566823e-16])
np.var(boston_scaled, axis=0)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
SVD with Numpy
Now we are ready to try out the algorithm to impute missing values by SVD. Let us save the scaled data into a new variable X.
X = boston_scaled
Let us perform SVD using Numpy’s linalg.svd() function and check the variance explained in the data by the singular vectors.
U, D, V = np.linalg.svd(X, full_matrices=False)
U.shape, D.shape, V.shape

((506, 13), (13,), (13, 13))
The scree plot shows that most of the variance is explained by the first singular vector or PC.
var_explained = np.round(D**2 / np.sum(D**2), decimals=3)
sns.barplot(x=list(range(1, len(var_explained) + 1)),
            y=var_explained, color="limegreen")
plt.xlabel('SVs', fontsize=16)
plt.ylabel('Percent Variance Explained', fontsize=16)
plt.savefig('boston_housing_svd_scree_plot.png', dpi=100)
Idea behind imputing with SVD
The key idea behind using SVD to impute missing values is that with SVD or PCA we can create an approximate version of the original data using just the top singular vectors or top PCs. Check out an earlier post on using SVD to reconstruct an image from the top singular vectors/PCs.
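To make this concrete, here is a quick sketch (not from the book) using the SVD factors U, D, and V computed above: a rank-1 reconstruction of the scaled Boston data already captures a sizable share of its total variation.

# rank-1 approximation of the scaled data from the top singular vector
X1 = np.outer(U[:, 0], V[0]) * D[0]

# fraction of the total squared Frobenius norm captured by the rank-1 approximation
captured = 1 - np.sum((X - X1)**2) / np.sum(X**2)
print(f"Fraction of variation captured by rank-1 approximation: {captured:.2f}")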
As the iterative algorithm from ISLP shows, we start by replacing the missing values with the corresponding column means. Then we perform SVD on this column-mean-filled data to get the singular vectors, and create an approximate version of the data using the top singular vectors. Finally, we compare the mean square of all the non-missing elements of the reconstructed matrix with that of the original data matrix.
Creating data with Missing Values
First, let us add some missing values to our Boston housing data and then apply the algorithm to impute them. We will introduce missing values into roughly 10% of the rows, one element per selected row (50 elements in total).
import math
n_omit = math.floor(X.shape[0] * (10/100))
n_omit

50
The approach we take to introduce NAs is to first randomly select row indices.
np.random.seed(15)
row_idx = np.random.choice(np.arange(X.shape[0]), n_omit, replace=False)
row_idx

array([301, 262, 172, 505, 111, 373, 211, 413, 239, 244, 135,  44, 345,
       450,  67, 427, 385, 398, 338,   8, 268,  25,  77, 128,  13, 323,
       285, 363, 131, 350,  58, 147, 330, 484, 206, 362, 294, 265, 395,
         9, 253, 276, 420, 288, 473,  11, 124,  89, 347, 377])
And then randomly select column indices to introduce NAs.
# column indices for NAs
col_idx = np.random.choice(np.arange(X.shape[1]), n_omit, replace=True)
col_idx

array([ 0,  3,  7, 10,  0,  8,  5,  7,  8,  2,  9,  0,  1,  8,  9,  9,  3,
       10,  4,  9,  1,  5,  7,  9,  5, 11, 10, 10,  4,  3,  9,  1,  0,  7,
        7, 12,  5, 11,  7,  5,  9, 10,  9,  9, 11, 11, 10,  6,  9,  0])
Now that we have the row and column indices where we want missing values, we can set those elements to NA in a copy of the data.
# create a new data matrix with NAs
Xna = X.copy()
Xna[row_idx, col_idx] = np.nan
Now our data matrix has missing values at 50 of its locations.
# data matrix with missing values
Xna

array([[-0.41978194,  0.28482986, -1.2879095 , ..., -1.45900038,
        -1.0755623 ,  0.15968566],
       [-0.41733926, -0.48772236, -0.59338101, ..., -0.30309415,
        -0.49243937, -0.10152429],
       [-0.41734159, -0.48772236, -0.59338101, ..., -0.30309415,
        -1.2087274 ,  1.32424667],
       ...,
       [-0.41344658, -0.48772236,  0.11573841, ...,  1.17646583,
        -0.98304761,  0.14880191],
       [-0.40776407, -0.48772236,  0.11573841, ...,  1.17646583,
        -0.86530163, -0.0579893 ],
       [-0.41500016, -0.48772236,  0.11573841, ...,         nan,
        -0.66905833, -1.15724782]])
Let us quickly visualize the missingness by making a heatmap with Seaborn’s heatmap() function. Note that we create a boolean matrix using the isnull() function in Pandas.
plt.figure(figsize=(6, 9))
sns.heatmap(pd.DataFrame(Xna).isnull(), cbar=False, cmap="YlGnBu")
plt.xlabel("Columns", size=16)
plt.ylabel("Rows", size=16)
plt.savefig('Missing_values_heatmap.png')
Function to Compute Low rank Approximation of a Matrix
One of the key operations in the iterative imputation algorithm is creating a low-rank approximation of the current data matrix using the top singular vectors from SVD. So, we write a function to create this low-rank version of the data. The function takes the current data matrix and the number of singular vectors to use to reconstruct the data matrix. By default, we use just the top singular vector to create the low-rank version of the data.
# reconstruct data matrix from M singular vectors from SVD
def low_rank(X, M=1):
    # perform SVD
    U, D, V = np.linalg.svd(X)
    # use the top M singular vectors to get a rank-M approximation of the data
    L = U[:, :M] * D[None, :M]
    return L.dot(V[:M])
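A quick sanity check on this helper (not in the book): reconstructing with all of the singular vectors should reproduce the original matrix up to floating-point error.

# using all 13 singular vectors should give back X (up to numerical precision)
np.allclose(low_rank(X, M=X.shape[1]), X)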
We create a new matrix Xhat, our current estimate of the data matrix, and compute the column means of the non-missing values.
Xhat = Xna.copy()
Xbar = np.nanmean(Xhat, axis=0)
Xbar

array([ 0.00184427,  0.00120234,  0.00152462,  0.00162584, -0.0005075 ,
        0.00699013,  0.00038553,  0.00445864, -0.00579413, -0.00045058,
       -0.00246606,  0.00097514,  0.00037345])
We then replace the missing elements of Xhat with the mean of the non-missing values in the corresponding columns.
Xhat[row_idx, col_idx] = Xbar[col_idx]
Xhat

array([[-0.41978194,  0.28482986, -1.2879095 , ..., -1.45900038,
        -1.0755623 ,  0.15968566],
       [-0.41733926, -0.48772236, -0.59338101, ..., -0.30309415,
        -0.49243937, -0.10152429],
       [-0.41734159, -0.48772236, -0.59338101, ..., -0.30309415,
        -1.2087274 ,  1.32424667],
       ...,
       [-0.41344658, -0.48772236,  0.11573841, ...,  1.17646583,
        -0.98304761,  0.14880191],
       [-0.40776407, -0.48772236,  0.11573841, ...,  1.17646583,
        -0.86530163, -0.0579893 ],
       [-0.41500016, -0.48772236,  0.11573841, ..., -0.00246606,
        -0.66905833, -1.15724782]])
Now, let us set up the iterative algorithm. We will compute the mean square of the non-missing elements of the current data matrix Xhat and compare it with that of the original matrix.
We iterate these steps until the mean square error converges. To do this, let us set some initial values and a threshold for convergence.
At the start both mss0 and mssold are the same.
thresh = 1e-10
rel_err = 1
count = 0
ismiss = np.isnan(Xna)
mssold = np.mean(Xhat[~ismiss]**2)
mss0 = np.mean(Xna[~ismiss]**2)
mss0

1.003648674500705
Let us implement the iterative part of the algorithm using the code as given in the book. The while loop here repeats the steps until the predefined threshold on the relative error is met.
First, we use the function we wrote above to reconstruct the data matrix with the first singular vector. Note that here we can change the number of singular vectors used to create the low-rank version of the data. We use Xhat, the data matrix where we used column means to fill in the missing values, as our starting point for the imputed values.
The reconstructed data Xapp gives us an estimate of the missing values based on all the other data. In the next step, we save these current estimates of the missing values in Xhat.
Now our updated Xhat has the new estimates for the missing values and the original data in the non-NA locations.
Finally, we compute the mean square over the non-missing values to get the relative error.
while rel_err > thresh:
    count += 1
    # Step 2(a)
    # reconstruct the data matrix with M singular vectors
    Xapp = low_rank(Xhat, M=1)
    # Step 2(b)
    # save the current estimates of the missing values from the reconstructed matrix
    Xhat[ismiss] = Xapp[ismiss]
    # Step 2(c)
    # compute the relative error from the mean square error of the
    # non-missing values between the original data and the current estimate
    mss = np.mean(((Xna - Xapp)[~ismiss])**2)
    rel_err = (mssold - mss) / mss0
    mssold = mss
    print("Iteration: {0}, MSS:{1:.3f}, Rel.Err {2:.2e}"
          .format(count, mss, rel_err))
Iteration: 1, MSS:0.516, Rel.Err 4.86e-01
Iteration: 2, MSS:0.516, Rel.Err 2.97e-04
Iteration: 3, MSS:0.516, Rel.Err 3.46e-06
Iteration: 4, MSS:0.516, Rel.Err 4.95e-08
Iteration: 5, MSS:0.516, Rel.Err 7.81e-10
Iteration: 6, MSS:0.516, Rel.Err 1.29e-11
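As a small convenience (not from the book), we could wrap these steps into a reusable function, which makes it easy to experiment with a different number of singular vectors M. A minimal sketch, assuming the low_rank() helper defined above:

def svd_impute(Xna, M=1, thresh=1e-7, max_iter=100):
    # impute missing values in Xna by iterative rank-M SVD reconstruction
    ismiss = np.isnan(Xna)
    # start by filling the NA slots with the column means of the observed values
    Xhat = np.where(ismiss, np.nanmean(Xna, axis=0), Xna)
    mss0 = np.mean(Xna[~ismiss]**2)
    mssold = np.mean(Xhat[~ismiss]**2)
    for _ in range(max_iter):
        Xapp = low_rank(Xhat, M=M)
        Xhat[ismiss] = Xapp[ismiss]
        mss = np.mean(((Xna - Xapp)[~ismiss])**2)
        rel_err = (mssold - mss) / mss0
        mssold = mss
        if rel_err < thresh:
            break
    return Xhat

# for example, impute using the top two singular vectors instead of one
# Xhat_2 = svd_impute(Xna, M=2)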
Once the algorithm converges, we have the imputed values at the missing value locations. And we can compare the imputed values with the original values. We can see that the correlation between the imputed values and the original data values is pretty good.
np.corrcoef(Xapp[ismiss], X[ismiss])[0, 1]

0.7183385956567256
Let us visualize the imputed values against the original values as a scatter plot with regression line using Seaborn’s regplot() function.
imp_dict = {"original": X[ismiss], "imputed": Xapp[ismiss]}
imputed_df = pd.DataFrame(imp_dict)
imputed_df.head()

   original   imputed
0 -0.577519  0.116175
1 -0.399808  0.207787
2  0.086478 -0.366252
3 -0.478164  0.140073
4 -0.976795 -0.049729
plt.figure(figsize=(8, 6))
sns.regplot(data=imputed_df, x="original", y="imputed")
plt.xlabel("Original Values", size=16)
plt.ylabel("Imputed Values", size=16)
plt.savefig('Performance_of_imputation_by_svd_ISLP.png')
We can see decent agreement between the two. Note that we could make this comparison only because this is a toy example where we had the full data; that is not the case in a real-world application.
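Since we do have the full data in this toy setting, one more check we could do (not part of the book's exercise) is to compare the SVD-based imputation against the naive column-mean imputation from the beginning of the post, using the mean squared error at the held-out locations. A rough sketch, assuming the X, Xna, Xapp, Xbar, and ismiss objects defined above:

# column-mean imputation: broadcast the column means Xbar into the NA slots
Xmean = np.where(ismiss, Xbar, Xna)

# mean squared error at the held-out locations for each strategy
mse_mean = np.mean((X[ismiss] - Xmean[ismiss])**2)
mse_svd = np.mean((X[ismiss] - Xapp[ismiss])**2)
print(f"MSE with column-mean imputation: {mse_mean:.3f}")
print(f"MSE with SVD imputation: {mse_svd:.3f}")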
I can’t wait to read more interesting chapters from ISLP and write a few more posts.