How To Do PCA in tidyverse Framework?

In an earlier post, we saw a tutorial on how to do PCA in R using gapminder data set. Another interesting way of doing PCA is to follow the tidyverse framework.

In this post, we will see an example of doing PCA analysis using gapminder data in a tidy framework. Being the first attempt to do tidy PCA, it may not that tidy 🙂 Obviously, someone curious has already thought of this and done this. Check this awesome post on RStudio community a while ago.

There are two aspects to PCA analysis in a tidyverse framework. The first is to use tidyverse framework to perform PCA, say using prcomp. The second aspect is to get the results from PCA and analyzing it in tidyverse framework using broom package. We will do both here.

Let us first load the packages needed to tidy PCA.

library(tidyverse)
library(broom)
theme_set(theme_bw(base_size = 16))

Preparing Data for tidy PCA

Let us load the gapminder data in wide form.

gapminder_w_url <- "https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_wide.csv"
gapminder_wide <- read_csv(gapminder_w_url)
head(gapminder_wide)

Let us subset the gapminder data so that we have lifeExp data from three continents,Africa, Asia and Europe.

gapminder_life <- gapminder_wide %>%
  filter(continent %in% c("Africa","Asia","Europe")) %>%
  select(continent,country,starts_with('lifeExp'))

The data frame now contains country and continent as columns. Let us combine continent and country and assign as rownames.

gapminder_life <- gapminder_life %>% 
  unite("continent_country", c(continent,country)) %>%
  column_to_rownames("continent_country")
head(gapminder_life, n=3)
##                lifeExp_1952 lifeExp_1957 lifeExp_1962 lifeExp_1967
## Africa_Algeria       43.077       45.685       48.303       51.407
## Africa_Angola        30.015       31.999       34.000       35.985
## Africa_Benin         38.223       40.358       42.618       44.885

Now we have the data we need to do PCA analysis. The non-tidy way to PCA is pretty straight forward. We just use the prcomp on the data.


Performing PCA using tidyverse

To do PCA in a tidy way, we need to do bit more work. Let us keep the rownames, which contains country and continent information, for future plotting use. We will apply nest() function tidyr to save the data in a tibble. In short, nest() powerful grouping function available in tidyr. See Introduction to nest() in tidyr for a brief introduction.

gapminder_life %>% 
  rownames_to_column("continent_country") %>%
  nest()

We can see that the result of nest() is a tibble with a list and it contains our data. All we have done is kind of re-organised the data to different container that is powerful. Note that our data is stored in a list variable named “data”.

## # A tibble: 1 x 1
##   data               
##   <list>             
## 1 <tibble [115 × 13]>

Now we are ready to actually do PCA. To do PCA will apply map function purrr package. The arguments to map function is the data variable stored in the tibble and the second argument is the prcomp function that does PCA on the data. We will store the result in a new variable using mutate function.

tidy_pca_nest <- gapminder_life %>% 
  nest() %>%
  mutate(pca = map(data, ~ prcomp(.x, 
                                  center = TRUE, scale = TRUE)))

Now we have updated our original tibble to contain the results of PCA in addition to the original data.

## # A tibble: 1 x 2
##   data                pca         
##   <list>              <list>      
## 1 <tibble [115 × 13]> <S3: prcomp>

We have actually done PCA in a tidy framework and stored the results. We can use the PCA object and extract results. A nicer way to extract results in a tidy way is to use the broom package. broom package has many functions that one can apply on models and get the results in a tidy way.

In the case of PCA object, we can get the variance explained by each PC’s in a tidy way using tidy function from broom with “pcs” option.

tidy_pca_nest$pca %>%
  map(~tidy(.x, data = .y, "pcs")) %>%
  as.data.frame() 

##    PC    std.dev percent cumulative
## 1   1 3.32575141 0.92172    0.92172
## 2   2 0.82317518 0.05647    0.97819
## 3   3 0.40651818 0.01377    0.99196
## 4   4 0.22579761 0.00425    0.99621
## 5   5 0.12810112 0.00137    0.99757
## 6   6 0.10866005 0.00098    0.99856
## 7   7 0.09203518 0.00071    0.99926
## 8   8 0.07038804 0.00041    0.99968
## 9   9 0.03743852 0.00012    0.99979
## 10 10 0.03532150 0.00010    0.99990
## 11 11 0.02884124 0.00007    0.99997
## 12 12 0.01989124 0.00003    1.00000

We can apply map function to use the pca object and extract “pcs” and feed it to make scree plot. Scree plot nicely visualizes percentage variance explained by each principal components.

tidy_pca_nest$pca %>%
  map(~tidy(.x, data = .y, "pcs")) %>%
  as.data.frame() %>%
  ggplot(aes(x = PC, y=percent)) +
  geom_bar(stat="identity") + labs(x="PC",y="Variance Explained")

Screeplot clearly shows that the first principal component explains most of the variance in the data.

Scree plot from PCA in tidy framework

Now, let us actually take a look at the principal components. We can use broom package’s augment function to get the principal components with the original data from the PCA object.

Here we use map2 from purrr that can take two arguments, pca and data with mutate function..

tidy_pca_nest %>% 
  mutate(pca_aug = map2(pca, data, ~augment(.x, data = .y)))

We can see that we have added one more column to our tibble containing augmented pca.


## # A tibble: 1 x 3
##   data                pca          pca_aug            
##   <list>              <list>       <list>             
## 1 <tibble [115 × 13]> <S3: prcomp> <tibble [115 × 26]>

We can get the augmented PCA object, by unnesting it.

tidy_pca_nest %>% 
  mutate(pca_aug = map2(pca, data, ~augment(.x, data = .y))) %>%
  unnest(pca_aug)

The augmented PCA object contains original data including the country_continent variable that created and all the principal components.


## # A tibble: 115 x 26
##    .rownames continent_count… lifeExp_1952 lifeExp_1957 lifeExp_1962
##    <fct>     <chr>                   <dbl>        <dbl>        <dbl>
##  1 1         Africa_Algeria           43.1         45.7         48.3
##  2 2         Africa_Angola            30.0         32.0         34  

We can use the tibble to make the PCA plot between the first and second principal components. Note that the principal components are stored as .fittedPC1, ans so on.

tidy_pca_nest %>% 
  mutate(pca_aug = map2(pca, data, ~augment(.x, data = .y))) %>%
  unnest(pca_aug) %>%
  separate(continent_country, c("continent","country")) %>%
  ggplot(aes(x=.fittedPC1,y=.fittedPC2)) +geom_point(aes(color=continent),size=3, alpha=0.7) +
  labs(x="PC1", y="PC2") +
  theme(legend.position="top")

By coloring the data points in PCA plot by continent, we can see the structure in the data as before. It shows the pattern that countries from Africa clusters together, the countries from Europe clusters together and asian countries are in the middle.

tidy PCA plot: PC1 vs PC2

To summarize, we performed Principal Component Analysis (PCA) using tidyverse framework. We used functions from tidyverse’s dplyr, purrr and broom packages to do PCA on gapminder data.