Performing PCA in tidyverse framework is probably one of my go to approaches for any high-dimensional data. In R, we can easily perform Principal Component analysis (PCA) using prcomp() and a few other options. A common challenge I face is to quickly look at the PCA results using a few different plots.
With continual development of broom R package it has become much easier to do PCA in using tidyverse framework. The broom package takes the messy and complex output from common statistical model functions in R, like lm, glm, and pca and clean them up as tidy tibbles. And it is designed to fit PCA into tidyverse way of analysis.
The post written a while ago for fun (not for usability) on how to do PCA using tidyverse framework looks so ugly now (and not useful):).
So this post is for future self for most common things I do while doing PCA in tidyverse framework.
Let us load the packages needed.
library(tidyverse) theme_set(theme_bw(16)) library(palmerpenguins)
## - Attaching packages ----------- tidyverse 1.3.2 ?? ## - ggplot2 3.3.6 ? purrr 0.3.4 ## - tibble 3.1.7 ? dplyr 1.0.9 ## - tidyr 1.2.0 ? stringr 1.4.0 ## - readr 2.1.2 ? forcats 0.5.1 ## -- Conflicts ---------------- tidyverse_conflicts() ?? ## - dplyr::filter() masks stats::filter() ## - dplyr::lag() masks stats::lag()
One of the packages we will use to do PCA in tidyverse framework is broom.
library(broom) packageVersion(broom) ## [1] '1.0.1'
We will use our favorite palmer penguins dataset to perform PCA using tidyverse principles.
penguins %>% head() ## # A tibble: 6 × 8 ## species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex ## <fct> <fct> <dbl> <dbl> <int> <int> <fct> ## 1 Adelie Torge… 39.1 18.7 181 3750 male ## 2 Adelie Torge… 39.5 17.4 186 3800 fema… ## 3 Adelie Torge… 40.3 18 195 3250 fema… ## 4 Adelie Torge… NA NA NA NA <NA> ## 5 Adelie Torge… 36.7 19.3 193 3450 fema… ## 6 Adelie Torge… 39.3 20.6 190 3650 male ## # … with 1 more variable: year <int>
Scaling the data before doing PCA with tidyverse
First, let us select the numerical variables after dropping any rows with missing values (NAs). We will also remove the column year specifying the year the data was collected.
One of the first steps before doing PCA is to scale or standardize the data. Here we use scale() function to standardise each column.
penguins %>% drop_na() %>% select(where(is.numeric)) %>% select(-year) %>% scale() %>% head()
This is how our data looks after scaling.
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## [1,] -0.8946955 0.7795590 -1.4246077 -0.5676206 ## [2,] -0.8215515 0.1194043 -1.0678666 -0.5055254 ## [3,] -0.6752636 0.4240910 -0.4257325 -1.1885721 ## [4,] -1.3335592 1.0842457 -0.5684290 -0.9401915 ## [5,] -0.8581235 1.7444004 -0.7824736 -0.6918109 ## [6,] -0.9312674 0.3225288 -1.4246077 -0.7228585
By calculating means of the columns before and after scaling we can see how much difference standardising makes. Below, we have mean values of each column before scaling and note the range of mean values is really high, from 17 to 4300.
penguins %>% drop_na() %>% select(where(is.numeric)) %>% select(-year) %>% colMeans() ## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## 43.99279 17.16486 200.96697 4207.05706
After scaling, we see that mean values of each column is close to zero, as scale() function mean centers the data.
penguins %>% drop_na() %>% select(where(is.numeric)) %>% select(-year) %>% scale() %>% colMeans() ## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g ## 3.552797e-16 5.572578e-16 1.834379e-16 -9.274780e-17
PCA using tidyverse principles
Now we are ready to apply PCA using prcomp() function on the scaled data. The key argument or input for prcomp() function is the data. Here we just pipe the scaled data to prcomp() function.
We also save the result from PCA as pca_fit.
pca_fit <- penguins %>% drop_na() %>% select(where(is.numeric)) %>% select(-year) %>% scale() %>% prcomp()
By printing the structure of PCA result object, which is an object of class prcomp, we can immediately see the complexities in accessing different aspects of PCA results from the object.
str(pca_fit) ## List of 5 ## $ sdev : num [1:4] 1.657 0.882 0.607 0.328 ## $ rotation: num [1:4, 1:4] 0.454 -0.399 0.577 0.55 -0.6 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : chr [1:4] "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g" ## .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4" ## $ center : Named num [1:4] 3.55e-16 5.57e-16 1.83e-16 -9.27e-17 ## ..- attr(*, "names")= chr [1:4] "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g" ## $ scale : Named num [1:4] 5.47 1.97 14.02 805.22 ## ..- attr(*, "names")= chr [1:4] "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g" ## $ x : num [1:333, 1:4] -1.85 -1.31 -1.37 -1.88 -1.92 ... ## ..- attr(*, "dimnames")=List of 2 ## .. ..$ : NULL ## .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4" ## - attr(*, "class")= chr "prcomp"
We can quickly see the summary of PCA results
pca_fit ## Standard deviations (1, .., p=4): ## [1] 1.6569115 0.8821095 0.6071594 0.3284579 ## ## Rotation (n x k) = (4 x 4): ## PC1 PC2 PC3 PC4 ## bill_length_mm 0.4537532 -0.60019490 -0.6424951 0.1451695 ## bill_depth_mm -0.3990472 -0.79616951 0.4258004 -0.1599044 ## flipper_length_mm 0.5768250 -0.00578817 0.2360952 -0.7819837 ## body_mass_g 0.5496747 -0.07646366 0.5917374 0.5846861
Scree plot using broom tidy() function
Here is where broom package is a god sent. One of the key functions of broom R package is tidy(). As the name suggests, tidy() function cleans up the specified results from PCA and give us the results as tibbles.
Brooms’ tidy functions can take different commonly used notations to get the PCA results in different ways. It cab be confusing for new PCA users. For example, to get the information about the eigenvalues, we can use “d”, “eigenvalues”, and “pcs” with tidy() function.
To start with let us inspect the eigenvalues and make a scree plot understand the contribution of each PC to the data in terms of percent variance explained.
pca_fit %>% tidy("pcs") ## # A tibble: 4 × 4 ## PC std.dev percent cumulative ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1.66 0.686 0.686 ## 2 2 0.882 0.195 0.881 ## 3 3 0.607 0.0922 0.973 ## 4 4 0.328 0.0270 1
We will get the same results using “d” or “eigenvalues” as argument to tidy() function as shown below.
pca_fit %>% tidy("d") ## # A tibble: 4 × 4 ## PC std.dev percent cumulative ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1.66 0.686 0.686 ## 2 2 0.882 0.195 0.881 ## 3 3 0.607 0.0922 0.973 ## 4 4 0.328 0.0270 1
pca_fit %>% tidy(matrix="eigenvalues") ## # A tibble: 4 × 4 ## PC std.dev percent cumulative ## <dbl> <dbl> <dbl> <dbl> ## 1 1 1.66 0.686 0.686 ## 2 2 0.882 0.195 0.881 ## 3 3 0.607 0.0922 0.973 ## 4 4 0.328 0.0270 1
Let us make screeplot, a barplot, showing the percentage of variance explained by each PC. In this dataset we have four PCs as there are four columns of data.
pca_fit %>% tidy("pcs") %>% ggplot(aes(x=PC, y=percent))+ geom_col(fill="dodgerblue", alpha=0.7) + scale_y_continuous(labels=scales::label_percent(), breaks = scales::breaks_pretty(n=6))+ labs(y= "Variance explained", title="Scree plot")
Another variant of scree plot is plotting the cumulative variance explained vs the PCs as line or a dot plot.
pca_fit %>% tidy("pcs") %>% ggplot(aes(x=PC, y=cumulative))+ geom_point(size=4) + geom_line(color="dodgerblue", size=2)+ scale_y_continuous(labels=scales::label_percent(), breaks = scales::breaks_pretty(n=6))+ labs(y= "Cumulative Variance explained", title="Scree plot")
broom’s tidy() function: one function to get all things about PCs
We can use tidy() function to get other PCA results as tidy tibbles. For example, we can get “u”, “samples”, “scores”, or “x”, the
information about the map from the original space into principle components space
. using tidy() function as follows.
pca_fit %>% tidy("x") pca_fit %>% tidy("scores") pca_fit %>% tidy("u") pca_fit %>% tidy("samples")
All of the above code give us the same information about PCs in samples space.
## # A tibble: 1,332 × 3 ## row PC value ## <int> <dbl> <dbl> ## 1 1 1 -1.85 ## 2 1 2 -0.0320 ## 3 1 3 0.235 ## 4 1 4 0.528 ## 5 2 1 -1.31 ## 6 2 2 0.443 ## 7 2 3 0.0274 ## 8 2 4 0.401 ## 9 3 1 -1.37 ## 10 3 2 0.161 ## # … with 1,322 more rows
Similarly, we can use tidy() function with these four arguments “v”, “rotation”, “loadings” or “variables” to get the
information about the map from principle components space back into the original space.
pca_fit %>% tidy("v") pca_fit %>% tidy("rotation") pca_fit %>% tidy("loadings") pca_fit %>% tidy("variables")
All of the above use tidy() function with these four different arguments give us the following information.
# A tibble: 16 × 3 ## column PC value ## <chr> <dbl> <dbl> ## 1 bill_length_mm 1 0.454 ## 2 bill_length_mm 2 -0.600 ## 3 bill_length_mm 3 -0.642 ## 4 bill_length_mm 4 0.145 ## 5 bill_depth_mm 1 -0.399 ## 6 bill_depth_mm 2 -0.796 ## 7 bill_depth_mm 3 0.426 ## 8 bill_depth_mm 4 -0.160 ## 9 flipper_length_mm 1 0.577 ## 10 flipper_length_mm 2 -0.00579 ## 11 flipper_length_mm 3 0.236 ## 12 flipper_length_mm 4 -0.782 ## 13 body_mass_g 1 0.550 ## 14 body_mass_g 2 -0.0765 ## 15 body_mass_g 3 0.592 ## 16 body_mass_g 4 0.585
PCA plot with PCA using tidyverse framework
Screeplot gives us the hint on the number of PCs to consider for downstream analysis. Typically one starts focusing on the top 2 PCs as they explain the most of variation in the data.
Now let use another important function augment(), in broom package. The augment() function in broom helps you add the original data to the PCA results easily in a tidyverse framework.
In the example, below add the original penguin data without any missing values with PCA results using augment() function. We can see the in the result of augment() function, the first columns are the original data and last columns are PCs with column names “.fittedPC1”, “.fittedPC2”, and so on.
pca_fit %>% augment(penguins %>% drop_na()) %>% head() ## # A tibble: 6 × 13 ## .rownames species island bill_length_mm bill_depth_mm flipper_length_mm ## <chr> <fct> <fct> <dbl> <dbl> <int> ## 1 1 Adelie Torgersen 39.1 18.7 181 ## 2 2 Adelie Torgersen 39.5 17.4 186 ## 3 3 Adelie Torgersen 40.3 18 195 ## 4 4 Adelie Torgersen 36.7 19.3 193 ## 5 5 Adelie Torgersen 39.3 20.6 190 ## 6 6 Adelie Torgersen 38.9 17.8 181 ## # … with 7 more variables: body_mass_g <int>, sex <fct>, year <int>, ## # .fittedPC1 <dbl>, .fittedPC2 <dbl>, .fittedPC3 <dbl>, .fittedPC4 <dbl>
The augmented data easily allows us to look the PCs in the context of key variables from the original data. Before doing that, let us first simplify the names of “.fittedPC”. Here we use rename_with() function in dplyr in combination with gsub() to rename the columns “.fittedPC1” to “PC1”.
pca_fit %>% augment(penguins %>% drop_na()) %>% rename_with(function(x){gsub(".fitted","",x)}) %>% head() ## # A tibble: 6 × 13 ## .rownames species island bill_length_mm bill_depth_mm flipper_length_mm ## <chr> <fct> <fct> <dbl> <dbl> <int> ## 1 1 Adelie Torgersen 39.1 18.7 181 ## 2 2 Adelie Torgersen 39.5 17.4 186 ## 3 3 Adelie Torgersen 40.3 18 195 ## 4 4 Adelie Torgersen 36.7 19.3 193 ## 5 5 Adelie Torgersen 39.3 20.6 190 ## 6 6 Adelie Torgersen 38.9 17.8 181 ## # … with 7 more variables: body_mass_g <int>, sex <fct>, year <int>, PC1 <dbl>, ## # PC2 <dbl>, PC3 <dbl>, PC4 <dbl>
Also, let us save the percent variance explained by each Principal Component as a vector.
variance_exp <- pca_fit %>% tidy("pcs") %>% pull(percent)
Now we can plot PC1 vs PC2 and color by a key variable “species” data.
pca_fit %>% augment(penguins %>% drop_na()) %>% rename_with(function(x){gsub(".fitted","",x)}) %>% ggplot(aes(x = PC1, y = PC2, color=species))+ geom_point()+ labs(x = paste0("PC1: ",round(variance_exp[1]*100), "%"), y = paste0("PC2: ",round(variance_exp[2]*100), "%"))
We can clearly see that the first PC on x-axis, which explains about 70% of the variation, in the data nicely separates Gentoo penguins from the other two species. Similarly the second PC separates Adelie and Chinstrap.
PCs and other variables in the data
Another useful set of plots while doing PCA is looking at the association of each PC with other variables/features in the data.
In the example below, we make boxplots to understand the association between the first PC and species variable.
pca_fit %>% augment(penguins %>% drop_na()) %>% rename_with(function(x){gsub(".fitted","",x)}) %>% ggplot(aes(x = species, y = PC1, color=species))+ geom_boxplot(outlier.shape = NA)+ geom_jitter(width=0.15)+ labs(y = paste0("PC1: ",round(variance_exp[1]*100), "%"))+ theme(legend.position = "none")
As we saw in the PCA plot, we can clearly see the association between the first PC and the species.
Next we add second variable to mix, in this case sex variable in addition to species.
pca_fit %>% augment(penguins %>% drop_na()) %>% rename_with(function(x){gsub(".fitted","",x)}) %>% ggplot(aes(x = species, y = PC1, color=sex))+ geom_boxplot(outlier.shape = NA)+ geom_point(position=position_jitterdodge())+ labs(y = paste0("PC1: ",round(variance_exp[1]*100), "%"))+ theme(legend.position = "top") ggsave("PC1_vs_species_sex_PCA_with_tidyverse.png")
And we can clearly see how the first PC has also captured the effect of sex in the data.
Next we look at how PC1 is associated with one of the features of data. Here we look at the association of body mass with PC1. As both are numerical variables, we make a scatter plot and add colors to the data by species variable.
pca_fit %>% augment(penguins %>% drop_na()) %>% rename_with(function(x){gsub(".fitted","",x)}) %>% ggplot(aes(x = PC1, y = body_mass_g, color=species))+ geom_point()+ geom_smooth(method="lm")+ labs(x = paste0("PC1: ",round(variance_exp[1]*100), "%"))+ theme(legend.position = "top") ggsave("PC1_vs_body_mass_PCA_with_tidyverse.png")
Correlation of PCs and other numerical variables
Instead of looking at the correlation of each PC against all numerical variables separately, we can perform simple correlation analysis of all Principal components against all numerical variables at once using the R package corrr. Here we use correlate() function from corrr package to compute pearson correlation coefficient on all numerical variables and use shave() function to print the lower triangular values.
library(corrr) pca_fit %>% augment(penguins %>% drop_na()) %>% rename_with(function(x){gsub(".fitted","",x)}) %>% select(-.rownames) %>% select(where(is.numeric)) %>% select(-year) %>% correlate() %>% shave()
## Correlation computed with ## • Method: 'pearson' ## • Missing treated using: 'pairwise.complete.obs' ## # A tibble: 8 × 9 ## term bill_length_mm bill_depth_mm flipper_length_… body_mass_g PC1 ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 bill_leng… NA NA NA NA NA ## 2 bill_dept… -0.229 NA NA NA NA ## 3 flipper_l… 0.653 -0.578 NA NA NA ## 4 body_mass… 0.589 -0.472 0.873 NA NA ## 5 PC1 0.752 -0.661 0.956 0.911 NA ## 6 PC2 -0.529 -0.702 -0.00511 -0.0674 -2.55e-16 ## 7 PC3 -0.390 0.259 0.143 0.359 4.36e-16 ## 8 PC4 0.0477 -0.0525 -0.257 0.192 -3.44e-18 ## # … with 3 more variables: PC2 <dbl>, PC3 <dbl>, PC4 <dbl>
With corrr package, we can also quickly make a correlation plot using rplot() function as shown below.
library(corrr) pca_fit %>% augment(penguins %>% drop_na()) %>% rename_with(function(x){gsub(".fitted","",x)}) %>% select(-.rownames) %>% select(where(is.numeric)) %>% select(-year) %>% correlate() %>% shave() %>% rplot(shape = 15)+ scale_x_discrete(guide = guide_axis(n.dodge=2)) + theme_bw(16) ggsave("PCs_correlation_PCA_with_tidyverse.png", width=9, height=6)
2 comments
Comments are closed.