In an earlier post, we saw a tutorial on how to do PCA in R using gapminder data set. Another interesting way of doing PCA is to follow the tidyverse framework.
In this post, we will see an example of doing PCA analysis using gapminder data in a tidy framework. Being the first attempt to do tidy PCA, it may not that tidy 🙂 Obviously, someone curious has already thought of this and done this. Check this awesome post on RStudio community a while ago.
There are two aspects to PCA analysis in a tidyverse framework. The first is to use tidyverse framework to perform PCA, say using prcomp. The second aspect is to get the results from PCA and analyzing it in tidyverse framework using broom package. We will do both here.
Let us first load the packages needed to tidy PCA.
library(tidyverse) library(broom) theme_set(theme_bw(base_size = 16))
Preparing Data for tidy PCA
Let us load the gapminder data in wide form.
gapminder_w_url <- "https://raw.githubusercontent.com/swcarpentry/r-novice-gapminder/gh-pages/_episodes_rmd/data/gapminder_wide.csv" gapminder_wide <- read_csv(gapminder_w_url) head(gapminder_wide)
Let us subset the gapminder data so that we have lifeExp data from three continents,Africa, Asia and Europe.
gapminder_life <- gapminder_wide %>% filter(continent %in% c("Africa","Asia","Europe")) %>% select(continent,country,starts_with('lifeExp'))
The data frame now contains country and continent as columns. Let us combine continent and country and assign as rownames.
gapminder_life <- gapminder_life %>% unite("continent_country", c(continent,country)) %>% column_to_rownames("continent_country") head(gapminder_life, n=3) ## lifeExp_1952 lifeExp_1957 lifeExp_1962 lifeExp_1967 ## Africa_Algeria 43.077 45.685 48.303 51.407 ## Africa_Angola 30.015 31.999 34.000 35.985 ## Africa_Benin 38.223 40.358 42.618 44.885
Now we have the data we need to do PCA analysis. The non-tidy way to PCA is pretty straight forward. We just use the prcomp on the data.
Performing PCA using tidyverse
To do PCA in a tidy way, we need to do bit more work. Let us keep the rownames, which contains country and continent information, for future plotting use. We will apply nest() function tidyr to save the data in a tibble. In short, nest() powerful grouping function available in tidyr. See Introduction to nest() in tidyr for a brief introduction.
gapminder_life %>% rownames_to_column("continent_country") %>% nest()
We can see that the result of nest() is a tibble with a list and it contains our data. All we have done is kind of re-organised the data to different container that is powerful. Note that our data is stored in a list variable named “data”.
## # A tibble: 1 x 1 ## data ## <list> ## 1 <tibble [115 × 13]>
Now we are ready to actually do PCA. To do PCA will apply map function purrr package. The arguments to map function is the data variable stored in the tibble and the second argument is the prcomp function that does PCA on the data. We will store the result in a new variable using mutate function.
tidy_pca_nest <- gapminder_life %>% nest() %>% mutate(pca = map(data, ~ prcomp(.x, center = TRUE, scale = TRUE)))
Now we have updated our original tibble to contain the results of PCA in addition to the original data.
## # A tibble: 1 x 2 ## data pca ## <list> <list> ## 1 <tibble [115 × 13]> <S3: prcomp>
We have actually done PCA in a tidy framework and stored the results. We can use the PCA object and extract results. A nicer way to extract results in a tidy way is to use the broom package. broom package has many functions that one can apply on models and get the results in a tidy way.
In the case of PCA object, we can get the variance explained by each PC’s in a tidy way using tidy function from broom with “pcs” option.
tidy_pca_nest$pca %>% map(~tidy(.x, data = .y, "pcs")) %>% as.data.frame() ## PC std.dev percent cumulative ## 1 1 3.32575141 0.92172 0.92172 ## 2 2 0.82317518 0.05647 0.97819 ## 3 3 0.40651818 0.01377 0.99196 ## 4 4 0.22579761 0.00425 0.99621 ## 5 5 0.12810112 0.00137 0.99757 ## 6 6 0.10866005 0.00098 0.99856 ## 7 7 0.09203518 0.00071 0.99926 ## 8 8 0.07038804 0.00041 0.99968 ## 9 9 0.03743852 0.00012 0.99979 ## 10 10 0.03532150 0.00010 0.99990 ## 11 11 0.02884124 0.00007 0.99997 ## 12 12 0.01989124 0.00003 1.00000
We can apply map function to use the pca object and extract “pcs” and feed it to make scree plot. Scree plot nicely visualizes percentage variance explained by each principal components.
tidy_pca_nest$pca %>% map(~tidy(.x, data = .y, "pcs")) %>% as.data.frame() %>% ggplot(aes(x = PC, y=percent)) + geom_bar(stat="identity") + labs(x="PC",y="Variance Explained")
Screeplot clearly shows that the first principal component explains most of the variance in the data.
Now, let us actually take a look at the principal components. We can use broom package’s augment function to get the principal components with the original data from the PCA object.
Here we use map2 from purrr that can take two arguments, pca and data with mutate function..
tidy_pca_nest %>% mutate(pca_aug = map2(pca, data, ~augment(.x, data = .y)))
We can see that we have added one more column to our tibble containing augmented pca.
## # A tibble: 1 x 3 ## data pca pca_aug ## <list> <list> <list> ## 1 <tibble [115 × 13]> <S3: prcomp> <tibble [115 × 26]>
We can get the augmented PCA object, by unnesting it.
tidy_pca_nest %>% mutate(pca_aug = map2(pca, data, ~augment(.x, data = .y))) %>% unnest(pca_aug)
The augmented PCA object contains original data including the country_continent variable that created and all the principal components.
## # A tibble: 115 x 26 ## .rownames continent_count… lifeExp_1952 lifeExp_1957 lifeExp_1962 ## <fct> <chr> <dbl> <dbl> <dbl> ## 1 1 Africa_Algeria 43.1 45.7 48.3 ## 2 2 Africa_Angola 30.0 32.0 34
We can use the tibble to make the PCA plot between the first and second principal components. Note that the principal components are stored as .fittedPC1, ans so on.
tidy_pca_nest %>% mutate(pca_aug = map2(pca, data, ~augment(.x, data = .y))) %>% unnest(pca_aug) %>% separate(continent_country, c("continent","country")) %>% ggplot(aes(x=.fittedPC1,y=.fittedPC2)) +geom_point(aes(color=continent),size=3, alpha=0.7) + labs(x="PC1", y="PC2") + theme(legend.position="top")
By coloring the data points in PCA plot by continent, we can see the structure in the data as before. It shows the pattern that countries from Africa clusters together, the countries from Europe clusters together and asian countries are in the middle.
To summarize, we performed Principal Component Analysis (PCA) using tidyverse framework. We used functions from tidyverse’s dplyr, purrr and broom packages to do PCA on gapminder data.