An Introduction to Statistical Learning with Applications in R (ISLR), one of the best books to learn statistical learning, now has a cousin: An Introduction to Statistical Learning with Applications in Python (ISLP).
An Introduction to Statistical Learning with Applications in Python, by Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, and Jonathan Taylor, came out recently. And just like ISLR, the new ISLP is freely available for download at statlearning.com/.
Imputing missing data with SVD
This post is the result of a quick Sunday-afternoon dip into the new book An Introduction to Statistical Learning with Applications in Python. Most real-world datasets have missing values. When we know that the missingness is at random, we can try to impute the missing values instead of throwing away the observations, since commonly used statistical methods cannot handle missing values.
Another common approach is to replace a missing value with the mean value of the variable. For example, if an element of the Age column is missing, we can impute it with the mean of all observed Age values in the data (a minimal sketch of this naive strategy is shown below). ISLR and ISLP describe a better strategy, using our beloved SVD, that outperforms this naive approach. The image below shows the algorithm to impute missing values by SVD from the ISLP book.
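For reference, here is a minimal sketch of the naive mean-imputation strategy, using a hypothetical toy DataFrame with an Age column (not the Boston data used later in this post):

import pandas as pd
import numpy as np

# hypothetical toy data with one missing Age value
toy = pd.DataFrame({"Age": [23.0, 31.0, np.nan, 45.0]})

# naive imputation: fill the NA with the mean of the observed Age values
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())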
In this post, we will implement this algorithm to impute missing values with SVD. ISLP’s unsupervised learning chapter has a nice example of using SVD to impute values on the USArrests data with NAs introduced at random. And the chapter has an exercise problem on imputing with SVD on the Boston housing dataset. We will work through part of that exercise using the Boston housing data to illustrate imputation by SVD in Python.
First, let us load the basic Python packages and modules we need.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
Next, we will also load the sklearn modules we need.
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
Quick Data summary
The Boston data used to be available directly from scikit-learn, but due to ethical concerns it was removed from sklearn. Here we are using a downloaded copy of the Boston data.
boston = pd.read_csv("Boston.tsv", sep="\t")
boston.head()

      crim    zn  indus  chas    nox     rm   age     dis  rad  tax  ptratio  lstat  medv
0  0.00632  18.0   2.31     0  0.538  6.575  65.2  4.0900    1  296     15.3   4.98  24.0
1  0.02731   0.0   7.07     0  0.469  6.421  78.9  4.9671    2  242     17.8   9.14  21.6
2  0.02729   0.0   7.07     0  0.469  7.185  61.1  4.9671    2  242     17.8   4.03  34.7
3  0.03237   0.0   2.18     0  0.458  6.998  45.8  6.0622    3  222     18.7   2.94  33.4
4  0.06905   0.0   2.18     0  0.458  7.147  54.2  6.0622    3  222     18.7   5.33  36.2
All of the columns of the Boston data are numerical. Let us check summary statistics, like the mean and variance of the variables.
boston.mean()

crim         3.613524
zn          11.363636
indus       11.136779
chas         0.069170
nox          0.554695
rm           6.284634
age         68.574901
dis          3.795043
rad          9.549407
tax        408.237154
ptratio     18.455534
lstat       12.653063
medv        22.532806
dtype: float64
boston.var()

crim          73.986578
zn           543.936814
indus         47.064442
chas           0.064513
nox            0.013428
rm             0.493671
age          792.358399
dis            4.434015
rad           75.816366
tax        28404.759488
ptratio        4.686989
lstat         50.994760
medv          84.586724
dtype: float64
Standardizing Boston Housing Data
Before applying SVD, we will standardize the data with sklearn’s StandardScaler() function.
scaler = StandardScaler(with_std=True, with_mean=True)
boston_scaled = scaler.fit_transform(boston)
Our scaled Boston data looks like this
boston_scaled[0:5, 0:5]

array([[-0.41978194,  0.28482986, -1.2879095 , -0.27259857, -0.14421743],
       [-0.41733926, -0.48772236, -0.59338101, -0.27259857, -0.74026221],
       [-0.41734159, -0.48772236, -0.59338101, -0.27259857, -0.74026221],
       [-0.41675042, -0.48772236, -1.30687771, -0.27259857, -0.83528384],
       [-0.41248185, -0.48772236, -1.30687771, -0.27259857, -0.83528384]])
And the scaled data has zero mean and unit variance.
np.mean(boston_scaled, axis=0)

array([-1.12338772e-16,  7.89881994e-17,  2.10635198e-16, -3.51058664e-17,
       -1.96592852e-16, -1.08828186e-16, -1.47444639e-16, -8.42540793e-17,
       -1.12338772e-16,  0.00000000e+00, -4.21270397e-16, -3.08931624e-16,
       -5.19566823e-16])
np.var(boston_scaled, axis=0)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
SVD with Numpy
Now we are ready to try out the algorithm to impute missing values by SVD. Let us save the scaled data into a new variable X.
X = boston_scaled
Let us perform SVD using Numpy’s linalg.svd() function and check the variance explained in the data by the singular vectors.
U, D, V = np.linalg.svd(X, full_matrices=False)
U.shape, D.shape, V.shape

((506, 13), (13,), (13, 13))
The scree plot shows that most of the variance is explained by the first singular vector or PC.
var_explained = np.round(D**2 / np.sum(D**2), decimals=3)
sns.barplot(x=list(range(1, len(var_explained) + 1)),
            y=var_explained, color="limegreen")
plt.xlabel('SVs', fontsize=16)
plt.ylabel('Percent Variance Explained', fontsize=16)
plt.savefig('boston_housing_svd_scree_plot.png', dpi=100)
Idea behind imputing with SVD
The key idea behind using SVD to impute missing values is that with SVD or PCA we can create an approximate version of the original data using just the top singular vectors or top PCs. Check out an earlier post on using SVD to reconstruct an image from the top singular vectors/PCs.
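To make this concrete, here is a quick sketch (not from the book) using the SVD factors U, D, and V computed above: a rank-1 reconstruction of the scaled Boston data already captures a sizable share of its total variation.

# rank-1 approximation of the scaled data from the top singular vector
X1 = np.outer(U[:, 0], V[0]) * D[0]

# fraction of the total squared Frobenius norm captured by the rank-1 approximation
captured = 1 - np.sum((X - X1)**2) / np.sum(X**2)
print(f"Fraction of variation captured by rank-1 approximation: {captured:.2f}")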
As the iterative algorithm from ISLP shows, we start by replacing the missing values with the corresponding column means. Then we perform SVD on this column-mean-filled data to get the singular vectors, and create an approximate version of the data using the top singular vectors. Finally, we compare the mean square of all the non-missing elements of the reconstructed matrix with that of the original data matrix.
Creating data with Missing Values
First, let us add some missing values to our Boston housing data and then apply the algorithm to impute them. We will introduce missing values into roughly 10% of the rows, one element per selected row (50 elements in total).
import math
n_omit = math.floor(X.shape[0] * (10/100))
n_omit

50
The approach we take to introduce NAs is to first randomly select row indices.
np.random.seed(15)
row_idx = np.random.choice(np.arange(X.shape[0]), n_omit, replace=False)
row_idx

array([301, 262, 172, 505, 111, 373, 211, 413, 239, 244, 135,  44, 345,
       450,  67, 427, 385, 398, 338,   8, 268,  25,  77, 128,  13, 323,
       285, 363, 131, 350,  58, 147, 330, 484, 206, 362, 294, 265, 395,
         9, 253, 276, 420, 288, 473,  11, 124,  89, 347, 377])
And then randomly select column indices to introduce NAs.
# column indices for NAs
col_idx = np.random.choice(np.arange(X.shape[1]), n_omit, replace=True)
col_idx

array([ 0,  3,  7, 10,  0,  8,  5,  7,  8,  2,  9,  0,  1,  8,  9,  9,  3,
       10,  4,  9,  1,  5,  7,  9,  5, 11, 10, 10,  4,  3,  9,  1,  0,  7,
        7, 12,  5, 11,  7,  5,  9, 10,  9,  9, 11, 11, 10,  6,  9,  0])
Now that we have the row and column indices where we want missing values, we can set those elements to NA in a copy of the data.
# create a new data matrix with NAs
Xna = X.copy()
Xna[row_idx, col_idx] = np.nan
Now our data matrix has missing values at 50 of its locations.
# data matrix with missing values
Xna

array([[-0.41978194,  0.28482986, -1.2879095 , ..., -1.45900038,
        -1.0755623 ,  0.15968566],
       [-0.41733926, -0.48772236, -0.59338101, ..., -0.30309415,
        -0.49243937, -0.10152429],
       [-0.41734159, -0.48772236, -0.59338101, ..., -0.30309415,
        -1.2087274 ,  1.32424667],
       ...,
       [-0.41344658, -0.48772236,  0.11573841, ...,  1.17646583,
        -0.98304761,  0.14880191],
       [-0.40776407, -0.48772236,  0.11573841, ...,  1.17646583,
        -0.86530163, -0.0579893 ],
       [-0.41500016, -0.48772236,  0.11573841, ...,         nan,
        -0.66905833, -1.15724782]])
Let us quickly visualize the missingness by making a heatmap with Seaborn’s heatmap() function. Note that we create a boolean matrix using the isnull() function in Pandas.
plt.figure(figsize=(6, 9))
sns.heatmap(pd.DataFrame(Xna).isnull(), cbar=False, cmap="YlGnBu")
plt.xlabel("Columns", size=16)
plt.ylabel("Rows", size=16)
plt.savefig('Missing_values_heatmap.png')
Function to Compute Low rank Approximation of a Matrix
One of the key operations in the iterative imputation algorithm is creating a low-rank approximation of the current data matrix using the top singular vectors from SVD. So, we write a function to create this low-rank version of the data. The function takes the current data matrix and the number of singular vectors to use to reconstruct the data matrix. By default, we use just the top singular vector to create the low-rank version of the data.
# reconstruct data matrix from M singular vectors from SVD
def low_rank(X, M=1):
    # perform SVD
    U, D, V = np.linalg.svd(X)
    # use the top M singular vectors to get a rank-M approximation of the data
    L = U[:, :M] * D[None, :M]
    return L.dot(V[:M])
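A quick sanity check on this helper (not in the book): reconstructing with all of the singular vectors should reproduce the original matrix up to floating-point error.

# using all 13 singular vectors should give back X (up to numerical precision)
np.allclose(low_rank(X, M=X.shape[1]), X)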
We create a new matrix Xhat, our current estimate of the data matrix, and compute the column means of the non-missing values.
Xhat = Xna.copy()
Xbar = np.nanmean(Xhat, axis=0)
Xbar

array([ 0.00184427,  0.00120234,  0.00152462,  0.00162584, -0.0005075 ,
        0.00699013,  0.00038553,  0.00445864, -0.00579413, -0.00045058,
       -0.00246606,  0.00097514,  0.00037345])
We then replace the missing elements of Xhat with the mean of the non-missing values in the corresponding columns.
Xhat[row_idx, col_idx] = Xbar[col_idx]
Xhat

array([[-0.41978194,  0.28482986, -1.2879095 , ..., -1.45900038,
        -1.0755623 ,  0.15968566],
       [-0.41733926, -0.48772236, -0.59338101, ..., -0.30309415,
        -0.49243937, -0.10152429],
       [-0.41734159, -0.48772236, -0.59338101, ..., -0.30309415,
        -1.2087274 ,  1.32424667],
       ...,
       [-0.41344658, -0.48772236,  0.11573841, ...,  1.17646583,
        -0.98304761,  0.14880191],
       [-0.40776407, -0.48772236,  0.11573841, ...,  1.17646583,
        -0.86530163, -0.0579893 ],
       [-0.41500016, -0.48772236,  0.11573841, ..., -0.00246606,
        -0.66905833, -1.15724782]])
Now, let us set up the iterative algorithm. We will compute the mean square of the non-missing elements of the current data matrix Xhat and compare it with that of the original matrix.
We iterate these steps until the mean square error converges. To do this, let us set some initial values and a threshold for convergence.
At the start both mss0 and mssold are the same.
thresh = 1e-10
rel_err = 1
count = 0
ismiss = np.isnan(Xna)
mssold = np.mean(Xhat[~ismiss]**2)
mss0 = np.mean(Xna[~ismiss]**2)
mss0

1.003648674500705
Let us implement the iterative part of the algorithm using the code as given in the book. The while loop here repeats the steps until the predefined threshold on the relative error is met.
First, we use the function we wrote above to reconstruct the data matrix with the first singular vector. Note that here we can change the number of singular vectors used to create the low-rank version of the data. We use Xhat, the data matrix where we used column means to fill in the missing values, as our starting point for the imputed values.
The reconstructed data Xapp gives us an estimate of the missing values based on all the other data. In the next step, we save these current estimates of the missing values in Xhat.
Now our updated Xhat has the new estimates for the missing values and the original data in the non-NA locations.
Finally, we compute the mean square over the non-missing values to get the relative error.
while rel_err > thresh:
    count += 1
    # Step 2(a)
    # reconstruct the data matrix with M singular vectors
    Xapp = low_rank(Xhat, M=1)
    # Step 2(b)
    # save the current estimates of the missing values from the reconstructed matrix
    Xhat[ismiss] = Xapp[ismiss]
    # Step 2(c)
    # compute the relative error from the mean square error of the
    # non-missing values between the original data and the current estimate
    mss = np.mean(((Xna - Xapp)[~ismiss])**2)
    rel_err = (mssold - mss) / mss0
    mssold = mss
    print("Iteration: {0}, MSS:{1:.3f}, Rel.Err {2:.2e}"
          .format(count, mss, rel_err))
Iteration: 1, MSS:0.516, Rel.Err 4.86e-01
Iteration: 2, MSS:0.516, Rel.Err 2.97e-04
Iteration: 3, MSS:0.516, Rel.Err 3.46e-06
Iteration: 4, MSS:0.516, Rel.Err 4.95e-08
Iteration: 5, MSS:0.516, Rel.Err 7.81e-10
Iteration: 6, MSS:0.516, Rel.Err 1.29e-11
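As a small convenience (not from the book), we could wrap these steps into a reusable function, which makes it easy to experiment with a different number of singular vectors M. A minimal sketch, assuming the low_rank() helper defined above:

def svd_impute(Xna, M=1, thresh=1e-7, max_iter=100):
    # impute missing values in Xna by iterative rank-M SVD reconstruction
    ismiss = np.isnan(Xna)
    # start by filling the NA slots with the column means of the observed values
    Xhat = np.where(ismiss, np.nanmean(Xna, axis=0), Xna)
    mss0 = np.mean(Xna[~ismiss]**2)
    mssold = np.mean(Xhat[~ismiss]**2)
    for _ in range(max_iter):
        Xapp = low_rank(Xhat, M=M)
        Xhat[ismiss] = Xapp[ismiss]
        mss = np.mean(((Xna - Xapp)[~ismiss])**2)
        rel_err = (mssold - mss) / mss0
        mssold = mss
        if rel_err < thresh:
            break
    return Xhat

# for example, impute using the top two singular vectors instead of one
# Xhat_2 = svd_impute(Xna, M=2)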
Once the algorithm converges, we have the imputed values at the missing value locations. And we can compare the imputed values with the original values. We can see that the correlation between the imputed values and the original data values is pretty good.
np.corrcoef(Xapp[ismiss], X[ismiss])[0, 1]

0.7183385956567256
Let us visualize the imputed values against the original values as a scatter plot with regression line using Seaborn’s regplot() function.
imp_dict = {"original": X[ismiss], "imputed": Xapp[ismiss]}
imputed_df = pd.DataFrame(imp_dict)
imputed_df.head()

   original   imputed
0 -0.577519  0.116175
1 -0.399808  0.207787
2  0.086478 -0.366252
3 -0.478164  0.140073
4 -0.976795 -0.049729
plt.figure(figsize=(8, 6))
sns.regplot(data=imputed_df, x="original", y="imputed")
plt.xlabel("Original Values", size=16)
plt.ylabel("Imputed Values", size=16)
plt.savefig('Performance_of_imputation_by_svd_ISLP.png')
We can see decent agreement between the two. Note that we could make this comparison only because this is a toy example where we had the full data; that is not the case in a real-world application.
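Since we do have the full data in this toy setting, one more check we could do (not part of the book's exercise) is to compare the SVD-based imputation against the naive column-mean imputation from the beginning of the post, using the mean squared error at the held-out locations. A rough sketch, assuming the X, Xna, Xapp, Xbar, and ismiss objects defined above:

# column-mean imputation: broadcast the column means Xbar into the NA slots
Xmean = np.where(ismiss, Xbar, Xna)

# mean squared error at the held-out locations for each strategy
mse_mean = np.mean((X[ismiss] - Xmean[ismiss])**2)
mse_svd = np.mean((X[ismiss] - Xapp[ismiss])**2)
print(f"MSE with column-mean imputation: {mse_mean:.3f}")
print(f"MSE with SVD imputation: {mse_svd:.3f}")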
I can’t wait to read more interesting chapters from ISLP and write a few more posts.