Dimension reduction techniques are among the most useful methods for unsupervised learning on high-dimensional datasets. In this post, we will learn how to use R to perform six of the most commonly used dimensionality reduction techniques:
- PCA: Principal Component Analysis
- SVD: Singular Value Decomposition
- ICA: Independent Component Analysis
- NMF: Non-negative Matrix Factorization
- tSNE: t-Distributed Stochastic Neighbor Embedding
- UMAP: Uniform Manifold Approximation and Projection
We will not focus on how these dimension reduction techniques work or the theory behind them. Instead, we will focus on the practical side: applying these methods to a dataset of interest in R. We will use the Palmer penguins dataset and apply all six dimension reduction methods, which will help us build intuition for how they compare.
Let us get started by loading tidyverse and the palmerpenguins package.
library(tidyverse)
theme_set(theme_bw(16))
library(palmerpenguins)
The most common dimensionality reduction techniques, like PCA and SVD, are readily available in R. However, for the other dimension reduction techniques, NMF, ICA, tSNE, and UMAP, we need to install and load additional R packages. Here we load the packages NMF, fastICA, umap, and Rtsne.
library(NMF)
library(fastICA)
library(umap)
library(Rtsne)
Let us take a glimpse of the penguins data before selecting the variables of interest for the dimension reduction analysis.
penguins %>% glimpse()

## Rows: 344
## Columns: 8
## $ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
## $ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
## $ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
## $ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
## $ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
## $ sex               <fct> male, female, female, NA, female, male, female, male…
## $ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…
We will mainly use the numerical variables when applying the six dimension reduction techniques. Also, let us scale the data so that each variable is normalized.
scaled_data <- penguins %>%
  drop_na() %>%
  # convert year to a factor so it is excluded by the numeric selection below
  mutate(year = factor(year)) %>%
  select(where(is.numeric)) %>%
  scale()
Our normalized data looks like this, and we will be using it for most of the dimension reduction analyses.
scaled_data %>% head()

##      bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## [1,]     -0.8946955     0.7795590        -1.4246077  -0.5676206
## [2,]     -0.8215515     0.1194043        -1.0678666  -0.5055254
## [3,]     -0.6752636     0.4240910        -0.4257325  -1.1885721
## [4,]     -1.3335592     1.0842457        -0.5684290  -0.9401915
## [5,]     -0.8581235     1.7444004        -0.7824736  -0.6918109
## [6,]     -0.9312674     0.3225288        -1.4246077  -0.7228585
We know that most of the variation in our data is due to the species variable. Let us save it separately so that we can later check how well our unsupervised learning methods have performed.
species <- penguins %>%
  drop_na() %>%
  pull(species)
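As a quick sanity check (just a convenient aside, not part of the original analysis), we can tabulate the species counts that our plots should recover:

# how many penguins of each species remain after drop_na()
table(species)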
PCA Example in R
First, let us start with Principal Component Analysis (PCA). In R we have multiple options for PCA; here we use the prcomp() function, feeding it the scaled data.
pca_prcomp_res <- prcomp(scaled_data)
The resulting object from prcomp() contains several useful pieces of information. However, we will mainly focus on the top two principal components (PCs), available in the element “x” of the list object.
str(pca_prcomp_res)

## List of 5
##  $ sdev    : num [1:4] 1.657 0.882 0.607 0.328
##  $ rotation: num [1:4, 1:4] 0.454 -0.399 0.577 0.55 -0.6 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g"
##   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
##  $ center  : Named num [1:4] 3.55e-16 5.57e-16 1.83e-16 -9.27e-17
##   ..- attr(*, "names")= chr [1:4] "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g"
##  $ scale   : Named num [1:4] 5.47 1.97 14.02 805.22
##   ..- attr(*, "names")= chr [1:4] "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g"
##  $ x       : num [1:333, 1:4] -1.85 -1.31 -1.37 -1.88 -1.92 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
##  - attr(*, "class")= chr "prcomp"
Let us make a PCA plot using ggplot(), plotting PC1 versus PC2 and coloring the data points by the species variable we saved before.
p_pca <- pca_prcomp_res$x %>%
  as.data.frame() %>%
  ggplot(aes(x = PC1, y = PC2, color = species)) +
  geom_point() +
  labs(subtitle = "PCA")
print(p_pca)
ggsave("PCA_plot_in_R.png", width = 8, height = 6)
We can see that the first PC nicely captures the difference between Gentoo and the other two species, while the second PC captures the variation between Adelie and Chinstrap.
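Before moving on, it is worth checking how much of the total variance each PC explains. A minimal sketch using the sdev element returned by prcomp() (the pve name is just ours):

# proportion of variance explained by each principal component
pve <- pca_prcomp_res$sdev^2 / sum(pca_prcomp_res$sdev^2)
round(pve, 3)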
SVD Example in R
SVD, Singular Value Decomposition, is a dimensionality reduction technique closely related to PCA; in fact, prcomp() computes the principal components via an SVD under the hood. In R, we use the svd() function to perform singular value decomposition.
svd_res <- svd(scaled_data)
The resulting object has three components: the matrices u and v, and the vector d of singular values.
str(svd_res)

## List of 3
##  $ d: num [1:4] 30.19 16.07 11.06 5.98
##  $ u: num [1:333, 1:4] -0.0613 -0.0435 -0.0455 -0.0624 -0.0635 ...
##  $ v: num [1:4, 1:4] 0.454 -0.399 0.577 0.55 -0.6 ...
We are interested in the left singular vectors, the columns of the matrix u, which correspond to the principal components up to scaling. Here we plot the top two columns of u and color by species as before.
p_svd <- svd_res$u %>%
  as.data.frame() %>%
  rename(SV1 = V1, SV2 = V2) %>%
  ggplot(aes(x = SV1, y = SV2, color = species)) +
  geom_point() +
  labs(subtitle = "SVD")
print(p_svd)
ggsave("SVD_plot_in_R.png", width = 8, height = 6)
We get a scatter plot that is very similar to our PCA plot earlier.
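To make the PCA connection concrete: for a centered matrix, the PCA scores are the left singular vectors scaled by the singular values. A small sketch to verify this against our prcomp() results (individual columns may differ by a sign flip, since singular vectors are only defined up to sign):

# reconstruct the PCA scores from the SVD output
svd_scores <- svd_res$u %*% diag(svd_res$d)
# compare with the prcomp() scores (up to a possible sign flip per column)
head(svd_scores[, 1:2])
head(pca_prcomp_res$x[, 1:2])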
ICA Example in R
Independent Component Analysis (ICA) is another commonly used dimension reduction technique. In R we can use the fastICA package to perform ICA. In the simplest setting, we input our scaled data and the number of components to find.
ica_res <- fastICA(scaled_data, n.comp=2)
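Note that fastICA starts from a random initial unmixing matrix, so the components can change sign or order between runs. If you want reproducible results, set a seed before the call; a minimal sketch (the seed value is arbitrary):

set.seed(2021)  # fastICA initializes randomly; fixing the seed makes runs reproducible
ica_res <- fastICA(scaled_data, n.comp = 2)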
In the resulting ICA fit object, we are interested in the matrix S, and we can see that it has two components.
str(ica_res)

## List of 5
##  $ X: num [1:333, 1:4] -0.895 -0.822 -0.675 -1.334 -0.858 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:4] "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g"
##   ..- attr(*, "scaled:center")= Named num [1:4] 3.55e-16 5.57e-16 1.83e-16 -9.27e-17
##   .. ..- attr(*, "names")= chr [1:4] "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g"
##   ..- attr(*, "scaled:scale")= Named num [1:4] 5.47 1.97 14.02 805.22
##   .. ..- attr(*, "names")= chr [1:4] "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g"
##  $ K: num [1:4, 1:2] -0.274 0.241 -0.349 -0.332 -0.681 ...
##  $ W: num [1:2, 1:2] -0.427 -0.904 0.904 -0.427
##  $ A: num [1:2, 1:4] 0.798 -0.453 0.353 0.896 0.412 ...
##  $ S: num [1:333, 1:2] -0.444 -0.794 -0.52 -0.498 0.344 ...
We can make a scatter plot of the two independent components, colored by the species variable. We can clearly see that the two ICs have nicely captured the species variable.
p_ica <- ica_res$S %>%
  as.data.frame() %>%
  rename(IC1 = V1, IC2 = V2) %>%
  ggplot(aes(x = IC1, y = IC2, color = species)) +
  geom_point() +
  labs(subtitle = "ICA")
print(p_ica)
ggsave("ICA_plot_in_R.png", width = 8, height = 6)
NMF Example in R
Non-negative Matrix Factorization, as the name suggests, requires its input matrix to contain only non-negative values. If you take a peek at our scaled/normalized data, it has negative values. So we will instead use min-max normalization, which transforms each variable to the 0-1 range.
Here we get the numerical data from the penguins dataframe.
data <- penguins %>%
  drop_na() %>%
  mutate(year = factor(year)) %>%
  select(where(is.numeric))
Here is our little function to do the min-max scaling.
min_max_scale <- function(x, na.rm = TRUE) {
  # pass na.rm through to min()/max() so the argument is actually used
  return((x - min(x, na.rm = na.rm)) / (max(x, na.rm = na.rm) - min(x, na.rm = na.rm)))
}
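A quick sanity check on a toy vector shows the values landing in the 0-1 range:

min_max_scale(c(1, 5, 10))
## [1] 0.0000000 0.4444444 1.0000000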
Now, let us normalize the raw penguin data with the min-max scaler; we will use this transformed data as input to perform NMF with the nmf() function. We also specify that we want the first two factors (rank = 2).
nmf_res <- nmf(apply(data, 2, min_max_scale), rank=2)
From the resulting NMF fit object, we are interested in the basis matrix “W”, which we can access as nmf_res@fit@W. As before, we make a scatter plot of the two factors and color it by the species variable.
p_nmf <- nmf_res@fit@W %>%
  as.data.frame() %>%
  rename(NMF1 = V1, NMF2 = V2) %>%
  ggplot(aes(x = NMF1, y = NMF2, color = species)) +
  geom_point() +
  labs(subtitle = "NMF")
print(p_nmf)
ggsave("NMF_plot_in_R.png", width = 8, height = 6)
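As an aside, the NMF package also provides accessor functions that are friendlier than reaching into S4 slots; a small sketch (to the best of our knowledge these return the same matrices):

W <- basis(nmf_res)  # the basis matrix, same as nmf_res@fit@W
H <- coef(nmf_res)   # the coefficient matrix
dim(W)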
tSNE Example in R
t-Distributed Stochastic Neighbor Embedding (t-SNE) is a technique for dimensionality reduction commonly used for visualizing high-dimensional datasets. Unlike PCA/ICA/NMF, tSNE is a non-linear dimension reduction technique.
In R, we can use the Rtsne package to perform tSNE analysis. Here we simply use the scaled data as input to the Rtsne() function.
tsne_res <- Rtsne(scaled_data)
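Like ICA, tSNE is stochastic, so the layout will differ from run to run. Setting a seed makes it reproducible, and perplexity is the main knob to tune (30 is the Rtsne default); a minimal sketch, with an arbitrary seed:

set.seed(42)  # tSNE is stochastic; fix the seed for a reproducible layout
tsne_res <- Rtsne(scaled_data, perplexity = 30)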
The resulting tSNE object records a number of the default parameters it used; we are interested in the Y matrix, which contains the two tSNE components it estimated.
str(tsne_res)

## List of 14
##  $ N                  : int 333
##  $ Y                  : num [1:333, 1:2] -12.52 -10.14 -9.06 -16.67 -17.95 ...
##  $ costs              : num [1:333] 0.000942 0.002241 0.002099 0.001485 0.0006 ...
##  $ itercosts          : num [1:20] 50.4 49.3 49 48.9 48.8 ...
##  $ origD              : int 4
##  $ perplexity         : num 30
##  $ theta              : num 0.5
##  $ max_iter           : num 1000
##  $ stop_lying_iter    : int 250
##  $ mom_switch_iter    : int 250
##  $ momentum           : num 0.5
##  $ final_momentum     : num 0.8
##  $ eta                : num 200
##  $ exaggeration_factor: num 12
##  - attr(*, "class")= chr [1:2] "Rtsne" "list"
By making a tSNE scatter plot, we can clearly see that these components have nicely captured the latent “species” variable driving the variation in our data.
p_tsne <- tsne_res$Y %>%
  as.data.frame() %>%
  rename(tSNE1 = V1, tSNE2 = V2) %>%
  ggplot(aes(x = tSNE1, y = tSNE2, color = species)) +
  geom_point() +
  labs(subtitle = "tSNE")
print(p_tsne)
ggsave("tSNE_plot_in_R.png", width = 8, height = 6)
UMAP Example in R
UMAP, short for Uniform Manifold Approximation and Projection for Dimension Reduction, is a relatively new technique that, like t-SNE, is commonly used for visualizing high-dimensional data. UMAP is also a non-linear dimension reduction technique. In R, we can use the umap package; here we simply feed the scaled data to the umap() function.
umap_res <- umap(scaled_data)
str(umap_res)

## List of 4
##  $ layout: num [1:333, 1:2] -8.05 -7.13 -6.89 -9.32 -10.5 ...
##  $ data  : num [1:333, 1:4] -0.895 -0.822 -0.675 -1.334 -0.858 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:4] "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g"
##   ..- attr(*, "scaled:center")= Named num [1:4] 44 17.2 201 4207.1
##   .. ..- attr(*, "names")= chr [1:4] "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g"
##   ..- attr(*, "scaled:scale")= Named num [1:4] 5.47 1.97 14.02 805.22
##   .. ..- attr(*, "names")= chr [1:4] "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g"
##  $ knn   :List of 2
##   ..$ indexes  : int [1:333, 1:15] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ distances: num [1:333, 1:15] 0 0 0 0 0 0 0 0 0 0 ...
##   ..- attr(*, "class")= chr "umap.knn"
##  $ config:List of 24
##   ..$ n_neighbors : int 15
##   ..$ n_components: int 2
##   ..$ metric      : chr "euclidean"
##   ..$ n_epochs    : int 200
## ...
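The config element above holds the defaults the umap package used. To change them, one documented pattern is to copy umap.defaults, edit it, and pass it back in; a minimal sketch (the n_neighbors value of 30 is just an example):

custom.config <- umap.defaults      # start from the package defaults
custom.config$n_neighbors <- 30     # e.g., use a larger neighborhood
umap_res_custom <- umap(scaled_data, config = custom.config)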
In the UMAP scatter plot colored by species, we can clearly see that the UMAP dimensions have nicely captured the latent “species” variable driving the variation in our data.
p_umap <- umap_res$layout %>%
  as.data.frame() %>%
  rename(UMAP1 = V1, UMAP2 = V2) %>%
  ggplot(aes(x = UMAP1, y = UMAP2, color = species)) +
  geom_point() +
  labs(subtitle = "UMAP")
print(p_umap)
ggsave("UMAP_plot_in_R.png", width = 8, height = 6)
One striking thing about these six dimensionality reduction techniques is that, in a simple dataset like this one with a strong latent variable (species), all of the methods perform nicely and capture the latent “species” factor.
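If you would like to see all six embeddings side by side, the patchwork package (an extra dependency, not loaded above) can combine the ggplot objects we saved along the way; a minimal sketch:

library(patchwork)  # extra dependency, assumed installed
(p_pca + p_svd + p_ica) / (p_nmf + p_tsne + p_umap)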