Just a quick rstat post on a simple imputation approach here for the future self. SVD/PCA is one of the first things I do for analyzing any new high dimensional data. Often such data are messy and have some missing values. Depending on the situation, I often resort to removing the rows with missing data or replace NAs with mean values before doing SVD/PCA.
Here is the #rstats tip to quickly replace all NAs in a dataframe with column means or row means using tidyverse functions.
Let us get started by loading the whole of tidyverse.
library(tidyverse)
Creating Data with NAs
First, let us create a toy dataset with some missing values using sample() functin in R. We store the data as a tibble.
set.seed(49) data_df <- tibble(a = sample(c(1:3,NA), 5, replace = TRUE), b = sample(c(1:3,NA), 5, replace = TRUE), c = sample(c(1:3,NA), 5, replace = TRUE))
Our toy dataframe with missing values (NAs) has 3 columns and 5 rows.
data_df ## # A tibble: 5 × 3 ## a b c ## <int> <int> <int> ## 1 3 3 NA ## 2 1 NA 3 ## 3 3 2 1 ## 4 1 3 NA ## 5 3 3 1
Let us also add add a unique row ID to the dataframe using row_number() function.
data_df <- data_df %>% mutate(row_id = row_number())
data_df ## # A tibble: 5 × 4 ## a b c row_id ## <int> <int> <int> <int> ## 1 3 3 NA 1 ## 2 1 NA 3 2 ## 3 3 2 1 3 ## 4 1 3 NA 4 ## 5 3 3 1 5
Replace NAs with column means in tidyverse
A simple way to replace NAs with column means is to use group_by() on the column names and compute means for each column and use the mean column value to replace where the element has NA.
To start with, let us convert the dataframe/tibble into tidy dataframe by keeping the row ID in one column, column names in another column and the actual values in the third column.
data_df %>% pivot_longer(-row_id, names_to="column_id", values_to="val") ## # A tibble: 15 × 3 ## row_id column_id val ## <int> <chr> <int> ## 1 1 a 3 ## 2 1 b 3 ## 3 1 c NA ## 4 2 a 1 ## 5 2 b NA ## 6 2 c 3 ## 7 3 a 3 ## 8 3 b 2 ## 9 3 c 1 ## 10 4 a 1 ## 11 4 b 3 ## 12 4 c NA ## 13 5 a 3 ## 14 5 b 3 ## 15 5 c 1
Now we can group by the column names, and update the value column with column mean if there is NA. We use mutate function to update the current values and use ifelse() to check if the element is NA or not. If it is NA, we replace with that group’s mean value. Note that we use na.rm=TRUE while using mean function.
data_df %>% pivot_longer(-row_id, names_to="column_id", values_to="val") %>% group_by(column_id) %>% mutate(val= ifelse(is.na(val), mean(val, na.rm = T), val))
Now we have all NAs replaced by column means in long tidy form.
## # A tibble: 15 × 3 ## # Groups: column_id [3] ## row_id column_id val ## <int> <chr> <dbl> ## 1 1 a 3 ## 2 1 b 3 ## 3 1 c 1.67 ## 4 2 a 1 ## 5 2 b 2.75 ## 6 2 c 3 ## 7 3 a 3 ## 8 3 b 2 ## 9 3 c 1 ## 10 4 a 1 ## 11 4 b 3 ## 12 4 c 1.67 ## 13 5 a 3 ## 14 5 b 3 ## 15 5 c 1
We can use pivot_wider() function from tidyr to convert the long tidy form to wider form data as we started with originally.
data_df %>% mutate(row_id = row_number()) %>% pivot_longer(-row_id, names_to="column_id", values_to="val") %>% group_by(column_id) %>% mutate(val= ifelse(is.na(val), mean(val,na.rm = T), val)) %>% pivot_wider(names_from="column_id", values_from = "val")
## # A tibble: 5 × 4 ## row_id a b c ## <int> <dbl> <dbl> <dbl> ## 1 1 3 3 1.67 ## 2 2 1 2.75 3 ## 3 3 3 2 1 ## 4 4 1 3 1.67 ## 5 5 3 3 1
Replace NAs with row means
We can use similar approach to replace missing values by row means. The big difference is we will be grouping by row ID instead of column ID. Everything else is the same.
data_df %>% pivot_longer(-row_id, names_to="column_id", values_to="val") %>% group_by(row_id) %>% mutate(val= ifelse(is.na(val), mean(val,na.rm = T), val)) %>% pivot_wider(names_from="column_id", values_from = "val")
Here is our result after replacing NAs with row means.
## # A tibble: 5 × 4 ## # Groups: row_id [5] ## row_id a b c ## <fct> <dbl> <dbl> <dbl> ## 1 1 3 3 3 ## 2 2 1 2 3 ## 3 3 3 2 1 ## 4 4 1 3 2 ## 5 5 3 3 1