PCA in tidyverse framework - Python and R Tips

Performing PCA in tidyverse framework is probably one of my go to approaches for any high-dimensional data. In R, we can easily perform Principal Component analysis (PCA) using prcomp() and a few other options. A common challenge I face is to quickly look at the PCA results using a few different plots.

With continual development of broom R package it has become much easier to do PCA in using tidyverse framework. The broom package takes the messy and complex output from common statistical model functions in R, like lm, glm, and pca and clean them up as tidy tibbles. And it is designed to fit PCA into tidyverse way of analysis.

The post written a while ago for fun (not for usability) on how to do PCA using tidyverse framework looks so ugly now (and not useful):).

So this post is for future self for most common things I do while doing PCA in tidyverse framework.

Let us load the packages needed.

library(tidyverse)
theme_set(theme_bw(16))
library(palmerpenguins)

## - Attaching packages ----------- tidyverse 1.3.2 ??
## - ggplot2 3.3.6     ? purrr   0.3.4
## - tibble  3.1.7     ? dplyr   1.0.9
## - tidyr   1.2.0     ? stringr 1.4.0
## - readr   2.1.2     ? forcats 0.5.1
## -- Conflicts ---------------- tidyverse_conflicts() ??
## - dplyr::filter() masks stats::filter()
## - dplyr::lag()    masks stats::lag()

One of the packages we will use to do PCA in tidyverse framework is broom.

library(broom)
packageVersion(broom)
## [1] '1.0.1'

We will use our favorite palmer penguins dataset to perform PCA using tidyverse principles.

penguins %>% head()

## # A tibble: 6 × 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…           39.1          18.7              181        3750 male 
## 2 Adelie  Torge…           39.5          17.4              186        3800 fema…
## 3 Adelie  Torge…           40.3          18                195        3250 fema…
## 4 Adelie  Torge…           NA            NA                 NA          NA <NA> 
## 5 Adelie  Torge…           36.7          19.3              193        3450 fema…
## 6 Adelie  Torge…           39.3          20.6              190        3650 male 
## # … with 1 more variable: year <int>

Scaling the data before doing PCA with tidyverse

First, let us select the numerical variables after dropping any rows with missing values (NAs). We will also remove the column year specifying the year the data was collected.

One of the first steps before doing PCA is to scale or standardize the data. Here we use scale() function to standardise each column.

penguins %>%
  drop_na() %>%
  select(where(is.numeric)) %>%
  select(-year) %>%
  scale() %>%
  head()

This is how our data looks after scaling.

##      bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## [1,]     -0.8946955     0.7795590        -1.4246077  -0.5676206
## [2,]     -0.8215515     0.1194043        -1.0678666  -0.5055254
## [3,]     -0.6752636     0.4240910        -0.4257325  -1.1885721
## [4,]     -1.3335592     1.0842457        -0.5684290  -0.9401915
## [5,]     -0.8581235     1.7444004        -0.7824736  -0.6918109
## [6,]     -0.9312674     0.3225288        -1.4246077  -0.7228585

By calculating means of the columns before and after scaling we can see how much difference standardising makes. Below, we have mean values of each column before scaling and note the range of mean values is really high, from 17 to 4300.

penguins %>%
  drop_na() %>%
  select(where(is.numeric)) %>%
  select(-year) %>%
  colMeans()
##    bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
##          43.99279          17.16486         200.96697        4207.05706

After scaling, we see that mean values of each column is close to zero, as scale() function mean centers the data.

penguins %>%
  drop_na() %>%
  select(where(is.numeric)) %>%
  select(-year) %>%
  scale() %>%
  colMeans()
##    bill_length_mm     bill_depth_mm flipper_length_mm       body_mass_g 
##      3.552797e-16      5.572578e-16      1.834379e-16     -9.274780e-17

PCA using tidyverse principles

Now we are ready to apply PCA using prcomp() function on the scaled data. The key argument or input for prcomp() function is the data. Here we just pipe the scaled data to prcomp() function.

We also save the result from PCA as pca_fit.

pca_fit <- penguins %>%
  drop_na() %>%
  select(where(is.numeric)) %>%
  select(-year) %>%
  scale() %>%
  prcomp()

By printing the structure of PCA result object, which is an object of class prcomp, we can immediately see the complexities in accessing different aspects of PCA results from the object.

str(pca_fit)

## List of 5
##  $ sdev    : num [1:4] 1.657 0.882 0.607 0.328
##  $ rotation: num [1:4, 1:4] 0.454 -0.399 0.577 0.55 -0.6 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g"
##   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
##  $ center  : Named num [1:4] 3.55e-16 5.57e-16 1.83e-16 -9.27e-17
##   ..- attr(*, "names")= chr [1:4] "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g"
##  $ scale   : Named num [1:4] 5.47 1.97 14.02 805.22
##   ..- attr(*, "names")= chr [1:4] "bill_length_mm" "bill_depth_mm" "flipper_length_mm" "body_mass_g"
##  $ x       : num [1:333, 1:4] -1.85 -1.31 -1.37 -1.88 -1.92 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
##  - attr(*, "class")= chr "prcomp"

We can quickly see the summary of PCA results

pca_fit

## Standard deviations (1, .., p=4):
## [1] 1.6569115 0.8821095 0.6071594 0.3284579
## 
## Rotation (n x k) = (4 x 4):
##                          PC1         PC2        PC3        PC4
## bill_length_mm     0.4537532 -0.60019490 -0.6424951  0.1451695
## bill_depth_mm     -0.3990472 -0.79616951  0.4258004 -0.1599044
## flipper_length_mm  0.5768250 -0.00578817  0.2360952 -0.7819837
## body_mass_g        0.5496747 -0.07646366  0.5917374  0.5846861

Scree plot using broom tidy() function

Here is where broom package is a god sent. One of the key functions of broom R package is tidy(). As the name suggests, tidy() function cleans up the specified results from PCA and give us the results as tibbles.

Brooms’ tidy functions can take different commonly used notations to get the PCA results in different ways. It cab be confusing for new PCA users. For example, to get the information about the eigenvalues, we can use “d”, “eigenvalues”, and “pcs” with tidy() function.

To start with let us inspect the eigenvalues and make a scree plot understand the contribution of each PC to the data in terms of percent variance explained.

pca_fit %>%
  tidy("pcs") 

## # A tibble: 4 × 4
##      PC std.dev percent cumulative
##   <dbl>   <dbl>   <dbl>      <dbl>
## 1     1   1.66   0.686       0.686
## 2     2   0.882  0.195       0.881
## 3     3   0.607  0.0922      0.973
## 4     4   0.328  0.0270      1

We will get the same results using “d” or “eigenvalues” as argument to tidy() function as shown below.

pca_fit %>%
  tidy("d") 

## # A tibble: 4 × 4
##      PC std.dev percent cumulative
##   <dbl>   <dbl>   <dbl>      <dbl>
## 1     1   1.66   0.686       0.686
## 2     2   0.882  0.195       0.881
## 3     3   0.607  0.0922      0.973
## 4     4   0.328  0.0270      1

pca_fit %>%
  tidy(matrix="eigenvalues") 
## # A tibble: 4 × 4
##      PC std.dev percent cumulative
##   <dbl>   <dbl>   <dbl>      <dbl>
## 1     1   1.66   0.686       0.686
## 2     2   0.882  0.195       0.881
## 3     3   0.607  0.0922      0.973
## 4     4   0.328  0.0270      1

Let us make screeplot, a barplot, showing the percentage of variance explained by each PC. In this dataset we have four PCs as there are four columns of data.

pca_fit %>%
  tidy("pcs") %>%
  ggplot(aes(x=PC, y=percent))+
  geom_col(fill="dodgerblue", alpha=0.7) +
  scale_y_continuous(labels=scales::label_percent(),
                     breaks = scales::breaks_pretty(n=6))+
  labs(y= "Variance explained",
       title="Scree plot")

Screeplot showing variance explained by PCs

Another variant of scree plot is plotting the cumulative variance explained vs the PCs as line or a dot plot.

pca_fit %>%
  tidy("pcs") %>%
  ggplot(aes(x=PC, y=cumulative))+
  geom_point(size=4) +
  geom_line(color="dodgerblue", size=2)+
  scale_y_continuous(labels=scales::label_percent(),
                     breaks = scales::breaks_pretty(n=6))+
  labs(y= "Cumulative Variance explained",
       title="Scree plot")

Cumulative variance explained: PCA with tidyverse

broom’s tidy() function: one function to get all things about PCs

We can use tidy() function to get other PCA results as tidy tibbles. For example, we can get “u”, “samples”, “scores”, or “x”, the

information about the map from the original space into principle components space

. using tidy() function as follows.

pca_fit %>%
  tidy("x")

pca_fit %>%
  tidy("scores")

pca_fit %>%
  tidy("u")

pca_fit %>%
  tidy("samples")

All of the above code give us the same information about PCs in samples space.

## # A tibble: 1,332 × 3
##      row    PC   value
##    <int> <dbl>   <dbl>
##  1     1     1 -1.85  
##  2     1     2 -0.0320
##  3     1     3  0.235 
##  4     1     4  0.528 
##  5     2     1 -1.31  
##  6     2     2  0.443 
##  7     2     3  0.0274
##  8     2     4  0.401 
##  9     3     1 -1.37  
## 10     3     2  0.161 
## # … with 1,322 more rows

Similarly, we can use tidy() function with these four arguments “v”, “rotation”, “loadings” or “variables” to get the

information about the map from principle components space back into the original space.

pca_fit %>%
  tidy("v")

pca_fit %>%
  tidy("rotation")

pca_fit %>%
  tidy("loadings")

pca_fit %>%
  tidy("variables")

All of the above use tidy() function with these four different arguments give us the following information.

# A tibble: 16 × 3
##    column               PC    value
##    <chr>             <dbl>    <dbl>
##  1 bill_length_mm        1  0.454  
##  2 bill_length_mm        2 -0.600  
##  3 bill_length_mm        3 -0.642  
##  4 bill_length_mm        4  0.145  
##  5 bill_depth_mm         1 -0.399  
##  6 bill_depth_mm         2 -0.796  
##  7 bill_depth_mm         3  0.426  
##  8 bill_depth_mm         4 -0.160  
##  9 flipper_length_mm     1  0.577  
## 10 flipper_length_mm     2 -0.00579
## 11 flipper_length_mm     3  0.236  
## 12 flipper_length_mm     4 -0.782  
## 13 body_mass_g           1  0.550  
## 14 body_mass_g           2 -0.0765 
## 15 body_mass_g           3  0.592  
## 16 body_mass_g           4  0.585

PCA plot with PCA using tidyverse framework

Screeplot gives us the hint on the number of PCs to consider for downstream analysis. Typically one starts focusing on the top 2 PCs as they explain the most of variation in the data.

Now let use another important function augment(), in broom package. The augment() function in broom helps you add the original data to the PCA results easily in a tidyverse framework.

In the example, below add the original penguin data without any missing values with PCA results using augment() function. We can see the in the result of augment() function, the first columns are the original data and last columns are PCs with column names “.fittedPC1”, “.fittedPC2”, and so on.

pca_fit %>%
  augment(penguins %>% drop_na()) %>%
  head()
## # A tibble: 6 × 13
##   .rownames species island    bill_length_mm bill_depth_mm flipper_length_mm
##   <chr>     <fct>   <fct>              <dbl>         <dbl>             <int>
## 1 1         Adelie  Torgersen           39.1          18.7               181
## 2 2         Adelie  Torgersen           39.5          17.4               186
## 3 3         Adelie  Torgersen           40.3          18                 195
## 4 4         Adelie  Torgersen           36.7          19.3               193
## 5 5         Adelie  Torgersen           39.3          20.6               190
## 6 6         Adelie  Torgersen           38.9          17.8               181
## # … with 7 more variables: body_mass_g <int>, sex <fct>, year <int>,
## #   .fittedPC1 <dbl>, .fittedPC2 <dbl>, .fittedPC3 <dbl>, .fittedPC4 <dbl>

The augmented data easily allows us to look the PCs in the context of key variables from the original data. Before doing that, let us first simplify the names of “.fittedPC”. Here we use rename_with() function in dplyr in combination with gsub() to rename the columns “.fittedPC1” to “PC1”.

pca_fit %>%
  augment(penguins %>% drop_na()) %>%
  rename_with(function(x){gsub(".fitted","",x)}) %>%
  head()

## # A tibble: 6 × 13
##   .rownames species island    bill_length_mm bill_depth_mm flipper_length_mm
##   <chr>     <fct>   <fct>              <dbl>         <dbl>             <int>
## 1 1         Adelie  Torgersen           39.1          18.7               181
## 2 2         Adelie  Torgersen           39.5          17.4               186
## 3 3         Adelie  Torgersen           40.3          18                 195
## 4 4         Adelie  Torgersen           36.7          19.3               193
## 5 5         Adelie  Torgersen           39.3          20.6               190
## 6 6         Adelie  Torgersen           38.9          17.8               181
## # … with 7 more variables: body_mass_g <int>, sex <fct>, year <int>, PC1 <dbl>,
## #   PC2 <dbl>, PC3 <dbl>, PC4 <dbl>

Also, let us save the percent variance explained by each Principal Component as a vector.

variance_exp <- pca_fit %>%  
  tidy("pcs") %>% 
  pull(percent)

Now we can plot PC1 vs PC2 and color by a key variable “species” data.

pca_fit %>%
  augment(penguins %>% drop_na()) %>%
  rename_with(function(x){gsub(".fitted","",x)}) %>%
  ggplot(aes(x = PC1, y = PC2, color=species))+
  geom_point()+
  labs(x = paste0("PC1: ",round(variance_exp[1]*100), "%"),
       y = paste0("PC2: ",round(variance_exp[2]*100), "%"))

We can clearly see that the first PC on x-axis, which explains about 70% of the variation, in the data nicely separates Gentoo penguins from the other two species. Similarly the second PC separates Adelie and Chinstrap.

PCA plot: PC1 vs PC2 – PCA with tidyverse

PCs and other variables in the data

Another useful set of plots while doing PCA is looking at the association of each PC with other variables/features in the data.

In the example below, we make boxplots to understand the association between the first PC and species variable.

pca_fit %>%
  augment(penguins %>% drop_na()) %>%
  rename_with(function(x){gsub(".fitted","",x)}) %>%
  ggplot(aes(x = species, y = PC1, color=species))+
  geom_boxplot(outlier.shape = NA)+
  geom_jitter(width=0.15)+
  labs(y = paste0("PC1: ",round(variance_exp[1]*100), "%"))+
  theme(legend.position = "none")

As we saw in the PCA plot, we can clearly see the association between the first PC and the species.

Boxplot: PC1 vs species – PCA with tidyverse

Next we add second variable to mix, in this case sex variable in addition to species.

pca_fit %>%
  augment(penguins %>% drop_na()) %>%
  rename_with(function(x){gsub(".fitted","",x)}) %>%
  ggplot(aes(x = species, y = PC1, color=sex))+
  geom_boxplot(outlier.shape = NA)+
  geom_point(position=position_jitterdodge())+
  labs(y = paste0("PC1: ",round(variance_exp[1]*100), "%"))+
  theme(legend.position = "top")
ggsave("PC1_vs_species_sex_PCA_with_tidyverse.png")

And we can clearly see how the first PC has also captured the effect of sex in the data.

Association of PC1 with other variables: PCA with tidyverse

Next we look at how PC1 is associated with one of the features of data. Here we look at the association of body mass with PC1. As both are numerical variables, we make a scatter plot and add colors to the data by species variable.

pca_fit %>%
  augment(penguins %>% drop_na()) %>%
  rename_with(function(x){gsub(".fitted","",x)}) %>%
  ggplot(aes(x = PC1, y = body_mass_g, color=species))+
  geom_point()+
  geom_smooth(method="lm")+
  labs(x = paste0("PC1: ",round(variance_exp[1]*100), "%"))+
  theme(legend.position = "top")
ggsave("PC1_vs_body_mass_PCA_with_tidyverse.png")

Scatter plot to see the association of PCs with features

Correlation of PCs and other numerical variables

Instead of looking at the correlation of each PC against all numerical variables separately, we can perform simple correlation analysis of all Principal components against all numerical variables at once using the R package corrr. Here we use correlate() function from corrr package to compute pearson correlation coefficient on all numerical variables and use shave() function to print the lower triangular values.

library(corrr)
pca_fit %>%
  augment(penguins %>% drop_na()) %>%
  rename_with(function(x){gsub(".fitted","",x)}) %>%
  select(-.rownames) %>%
  select(where(is.numeric)) %>%
  select(-year) %>%
  correlate() %>% 
  shave()

## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
## # A tibble: 8 × 9
##   term       bill_length_mm bill_depth_mm flipper_length_… body_mass_g       PC1
##   <chr>               <dbl>         <dbl>            <dbl>       <dbl>     <dbl>
## 1 bill_leng…        NA            NA              NA           NA      NA       
## 2 bill_dept…        -0.229        NA              NA           NA      NA       
## 3 flipper_l…         0.653        -0.578          NA           NA      NA       
## 4 body_mass…         0.589        -0.472           0.873       NA      NA       
## 5 PC1                0.752        -0.661           0.956        0.911  NA       
## 6 PC2               -0.529        -0.702          -0.00511     -0.0674 -2.55e-16
## 7 PC3               -0.390         0.259           0.143        0.359   4.36e-16
## 8 PC4                0.0477       -0.0525         -0.257        0.192  -3.44e-18
## # … with 3 more variables: PC2 <dbl>, PC3 <dbl>, PC4 <dbl>

With corrr package, we can also quickly make a correlation plot using rplot() function as shown below.

library(corrr)
pca_fit %>%
  augment(penguins %>% drop_na()) %>%
  rename_with(function(x){gsub(".fitted","",x)}) %>%
  select(-.rownames) %>%
  select(where(is.numeric)) %>%
  select(-year) %>%
  correlate() %>% 
  shave() %>%
  rplot(shape = 15)+
  scale_x_discrete(guide = guide_axis(n.dodge=2)) +
  theme_bw(16)

ggsave("PCs_correlation_PCA_with_tidyverse.png", width=9, height=6)