Imputing missing values with SVD

An Introduction to Statistical Learning with Applications in R
Finally got my hands on the physical copy of the amazing Introduction to Statistical Learning with Applications in R (ISLR), Second Edition by Gareth James, Daniela Witten, Trevor Hastie, and Rob Tibshirani. And it felt good to donate the first edition to a local library.

In this post, we learn how to use SVD, one of the most useful dimensionality reduction techniques, to impute missing values. ISLR’s unsupervised learning chapter has a nice example of using SVD to impute values missing at random in a data matrix.

Here we use the same code from the book, but apply it to the Palmer Penguins dataset instead of the USArrests data set, to impute randomly introduced missing values.

Let us get started by loading the packages needed: the meta-package tidyverse and palmerpenguins.

library(tidyverse)
library(palmerpenguins)

The penguins data has some missing values, and we remove them first.

penguins <- penguins  %>%
  drop_na()

We also keep only the four numerical features of the Palmer Penguins dataset.

penguins_data <- penguins %>%
  select(-year) %>%
  select(where(is.numeric))

Let us scale the data and convert the data frame to a data matrix for our computation.

X <-  penguins_data %>% 
  scale() %>%
  data.matrix()
X %>% head()

     bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
[1,]     -0.8946955     0.7795590        -1.4246077  -0.5676206
[2,]     -0.8215515     0.1194043        -1.0678666  -0.5055254
[3,]     -0.6752636     0.4240910        -0.4257325  -1.1885721
[4,]     -1.3335592     1.0842457        -0.5684290  -0.9401915
[5,]     -0.8581235     1.7444004        -0.7824736  -0.6918109
[6,]     -0.9312674     0.3225288        -1.4246077  -0.7228585

SVD to the rescue

The key idea in using SVD to impute missing values is that we can reconstruct the original data using the top singular vectors (or principal components). See an earlier post for an example of using SVD with the top PCs for image reconstruction.

To perform SVD we need data with no NAs. So we first replace the NAs with the column means, then perform SVD and reconstruct the data matrix from the top singular vector, fill in the missing entries with their reconstructed values, and compare the mean of squared values over the non-missing entries between the reconstructed matrix and the original matrix. We repeat these steps until the error converges.

SVD gives us three components, d, u and v: the vector of singular values d and the matrices of singular vectors u and v.

svd_obj <- svd(X)
names(svd_obj)

[1] "d" "u" "v"

With the singular values and singular vectors, we can reconstruct a lossy version of the original matrix.

round(svd_obj$u,3) %>% head()


       [,1]   [,2]   [,3]   [,4]
[1,] -0.061 -0.002  0.021  0.088
[2,] -0.044  0.028  0.002  0.067
[3,] -0.046  0.010 -0.017 -0.088
[4,] -0.062  0.001  0.057 -0.079
[5,] -0.064 -0.051  0.063 -0.033
[6,] -0.059  0.023 -0.003  0.084
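
As a quick check (a small addition to the book's code), multiplying the three components back together recovers X up to floating point error, and keeping only the first singular vector gives the rank-one lossy approximation we rely on later.

# Full reconstruction: X = U D V^T, up to floating point error
X_full <- svd_obj$u %*% diag(svd_obj$d) %*% t(svd_obj$v)
all.equal(X_full, X, check.attributes = FALSE)

# Rank-1 approximation from the first singular vector only
X_rank1 <- svd_obj$d[1] * svd_obj$u[, 1, drop = FALSE] %*% t(svd_obj$v[, 1, drop = FALSE])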

Randomly introducing missing values

Here we follow the same strategy used in the ISLR2 book to create a dataset with missing values. We start with the original penguin dataset with no missing values and randomly select 100 rows that will receive NAs, using the sample() function in R. This gives us the row indices that will have missing values.

n_omit <- 100
set.seed(15)
ind_x <- sample(seq(nrow(X)), n_omit)
ind_x

  [1]  37 162 294 177 261 217 204  84 230 193 140 245 333 215 281 218 213 181
 [19]  19 107 286  21 165  79 283  86   2 142 186  10 201 173 209 189 289 295
 [37] 188 330   6  98 279 123  75  49 184  39  88  81 211  20  64  44 287 240
 [55] 250 312 136 303  70 179 262 300 144  11 168   7 285  77  90 147 305 133
 [73]  13 203  25 238  54 131 117 122 104 332 291 276  87 198 115 306 116  73
 [91] 139 192 101 159 275  69  80 259 155   3

Next, for each of those 100 rows, we randomly select one of the four columns to receive the NA.

ind_y <- sample(1:4, n_omit, replace=TRUE)
ind_y

  [1] 2 2 3 4 1 1 4 2 3 4 2 4 3 2 2 2 3 1 3 1 4 1 3 2 3 4 4 3 2 4 4 3 3 1 1 2 1
 [38] 3 2 2 1 3 3 2 2 2 1 1 2 2 1 1 3 3 2 4 2 4 3 1 3 2 3 3 4 1 4 1 3 1 4 2 1 3
 [75] 2 4 4 3 1 4 1 4 3 3 3 4 4 4 2 4 1 4 2 2 1 2 4 3 4 4

Together, the row and column indices give us the exact elements where we would like to have NAs. For example, looking at the first pair of indices shown below, we will have a missing value at the 37th row and 2nd column.

index_na <- cbind(ind_x, ind_y)
index_na %>% head()

     ind_x ind_y
[1,]    37     2
[2,]   162     2
[3,]   294     3
[4,]   177     4
[5,]   261     1
[6,]   217     1
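
As a quick check (a small addition to the post), indexing the matrix with this two-column index matrix pulls out exactly those elements; these are the true values we are about to overwrite with NAs.

# True values at the sampled (row, column) positions, before replacement
X[index_na] %>% head()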

Since we know the locations of the NAs, we now actually set the elements at the randomly chosen indices to NA, and name the new dataset with missing values Xna.

Xna <- X
Xna[index_na] <- NA
Xna %>% head()

     bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
[1,]     -0.8946955     0.7795590        -1.4246077  -0.5676206
[2,]     -0.8215515     0.1194043        -1.0678666          NA
[3,]     -0.6752636     0.4240910        -0.4257325          NA
[4,]     -1.3335592     1.0842457        -0.5684290  -0.9401915
[5,]     -0.8581235     1.7444004        -0.7824736  -0.6918109
[6,]     -0.9312674            NA        -1.4246077  -0.7228585

To quickly see how many NAs we have added at random, here we visualize the data with missing values as a heatmap. The red color on the heatmap marks the NAs in the data.

Xna %>%
  as.data.frame() %>%
  mutate(row_id=row_number()) %>%
  pivot_longer(-row_id, names_to="feature", values_to="values") %>%
  ggplot(aes(x=feature,y=row_id,fill=values))+
  geom_tile()+
  scale_fill_continuous(na.value = 'red')
ggsave("Visualizing_Data_with_MissigingValues_as_heatmap.png", width=5, height=10)
Visualizing data with NAs as heatmap

Let us write a small function to reconstruct the original data using the top singular vectors. The function takes the data and the number of singular vectors to use as inputs and returns the reconstructed data.

# function to reconstruct data from singular vectors
fit_svd <- function(X, M = 1) {
  svdob <- svd(X)
  with(svdob,
       u[, 1:M, drop = FALSE] %*%
         (d[1:M] * t(v[, 1:M, drop = FALSE])) )
  }
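
As a quick sanity check (an addition, not from the book), reconstructing with all four singular vectors should reproduce the complete-data matrix up to floating point error, while M = 1 keeps only the leading singular vector.

# Using all singular vectors reproduces X exactly (up to rounding)
all.equal(fit_svd(X, M = 4), X, check.attributes = FALSE)

# M = 1 gives the rank-one approximation used in the imputation loop below
fit_svd(X, M = 1) %>% head()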

The first step of the iterative algorithm to impute missing values is to replace the NAs with column means and store the column-mean-replaced matrix as our current estimate Xhat.

Xhat <- Xna
xbar <- colMeans(Xna, na.rm = TRUE)
xbar

   bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
     -0.002680313      -0.005811732      -0.006474528      -0.006134745 

Xhat[index_na] <- xbar
Xhat %>% head()

    bill_length_mm bill_depth_mm flipper_length_mm  body_mass_g
[1,]     -0.8946955   0.779558953        -1.4246077 -0.567620576
[2,]     -0.8215515   0.119404276        -1.0678666 -0.006474528
[3,]     -0.6752636   0.424091050        -0.4257325 -0.006134745
[4,]     -1.3335592   1.084245726        -0.5684290 -0.940191505
[5,]     -0.8581235   1.744400403        -0.7824736 -0.691810886
[6,]     -0.9312674  -0.006474528        -1.4246077 -0.722858463

Our approach is to start with the current Xhat, reconstruct the matrix from the first singular vector, and compute the mean of squared differences between the original data and the reconstruction over the non-missing entries. We iterate these steps until this mean squared error converges. To do this, let us set some initial values and a threshold for convergence.

thresh <- 1e-7
rel_err <- 1
iter <- 0
ismiss <- is.na(Xna)
ismiss %>% head()

     bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
[1,]          FALSE         FALSE             FALSE       FALSE
[2,]          FALSE         FALSE             FALSE        TRUE
[3,]          FALSE         FALSE             FALSE        TRUE
[4,]          FALSE         FALSE             FALSE       FALSE
[5,]          FALSE         FALSE             FALSE       FALSE
[6,]          FALSE          TRUE             FALSE       FALSE

Above we also created a boolean matrix, ismiss, to easily identify the elements with missing values and the elements without. We will use this boolean matrix to pull the elements of interest from the matrix.
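
As a quick check (a small addition), the boolean matrix should flag exactly the 100 elements we replaced; since the rows were sampled without replacement, each sampled row carries exactly one NA.

sum(ismiss)            # 100 missing entries in total
table(rowSums(ismiss)) # 100 rows with one NA, the rest with none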

Let us compute the mean of the squared non-missing values. This takes all non-NA values, squares them, and computes the mean of the squares. Since the data was already centered before we introduced the NAs, the column means in xbar are close to zero, so the centered (mssold) and uncentered (mss0) versions are nearly identical.

mssold <- mean((scale(Xna, xbar, FALSE)[!ismiss])^2)
mssold  

[1] 0.9939627

mss0 <- mean(Xna[!ismiss]^2)
mss0

[1] 0.9939927

Our main imputation algorithm is shown below. We use a while loop to iterate the steps until the relative error is within our threshold. We first reconstruct the data matrix with the first singular vector, as that captures most of the variation in the data; for this we use the column-mean-substituted data. The reconstruction gives us an estimate of the missing values based on all the other data, and we save these current estimates in Xhat. The updated Xhat now holds the new estimates for the missing values and the original data for the non-NA entries. Finally, we compute the mean of squared differences over the non-missing values to get the relative error.

We can see that by the eighth iteration, it converges.

# Code for imputing random NAs with SVD
# copied from ISLR2 book 
while(rel_err > thresh){           
  iter <- iter + 1
  # reconstruct data matrix with singular vector
  Xapp <- fit_svd(Xhat, M=1)
  # Save current new estimates of missing values
  Xhat[ismiss] <- Xapp[ismiss]
  # Compute the mean square error of non missing values between original data and current data
  mss <- mean(((Xna-Xapp)[!ismiss])^2)
  rel_err <- (mssold - mss)/mss0
  mssold <- mss
  cat("Iter: ", iter, "MSS:", mss,
                    "Rel. err:", rel_err, "\n" )
}

Iter:  1 MSS: 0.3134024 Rel. err: 0.6846733 
Iter:  2 MSS: 0.3037529 Rel. err: 0.009707789 
Iter:  3 MSS: 0.302975 Rel. err: 0.0007826904 
Iter:  4 MSS: 0.3028986 Rel. err: 7.680355e-05 
Iter:  5 MSS: 0.3028897 Rel. err: 8.976388e-06 
Iter:  6 MSS: 0.3028885 Rel. err: 1.176544e-06 
Iter:  7 MSS: 0.3028884 Rel. err: 1.648001e-07 
Iter:  8 MSS: 0.3028883 Rel. err: 2.402205e-08 

Since we know the truth, we can compare the estimated values with the true values at the NA locations. The correlation is pretty decent.

cor(Xapp[ismiss], X[ismiss])

[1] 0.6860669
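
Besides the correlation, we could also look at the mean squared error of the imputed values against the truth (a small addition, not part of the book's code).

# MSE between imputed and true values at the NA positions
mean((Xapp[ismiss] - X[ismiss])^2)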

We can also visualize this by plotting the imputed values against the true values in a scatter plot, where we can see the nice correlation between the two.

imputed_df <- tibble(imputed=Xapp[ismiss],
                     truth =  X[ismiss])
imputed_df %>%
  ggplot(aes(imputed, truth))+
  geom_point()+
  geom_smooth(method="lm")+
  theme_bw(16)+
  labs(title="Imputed values vs the truth")
ggsave("comparing_imputed_values_with_truth.png")

Comparing imputed values by SVD to the truth

To summarize, we used the algorithm from the ISLR2 book to learn how to impute missing values, introduced by a random process, using SVD. We saw that this simple algorithm does pretty decently.