PCA example using prcomp in R

PCA plot
PCA plot: PC1 vs PC2 Scatter plot

In this tutorial, we will learn how to perform PCA in R using prcomp() function in R.

Principal Component Analysis, aka, PCA is one of the commonly used approaches to do unsupervised learning/ dimensionality reduction. It is a fantastic tool to have in your data science/Machine Learning arsenal. You will be surprised how often the use of PCA pops up, whenever working with high dimensional data.

Let us see a step-by-step example of applying PCA to gapminder data in R using prcomp() and understand what PCA does.

Data for doing PCA with prcomp in R

Let us load the packages we will be using.

library(tidyverse)

We will use gapminder data in wide form to do the PCA. We will read the data from Carpentries website.

# gapminder data in wide form from Carpentries
gapminder_w_url <- "https://bit.ly/2vEDq5b"
# read the data into dataframe
gapminder_wide <- read_csv(gapminder_w_url)
# first3 rows
head(gapminder_wide, n=3)

## # A tibble: 3 x 38
##   continent country gdpPercap_1952 gdpPercap_1957 gdpPercap_1962
##   <chr>     <chr>            <dbl>          <dbl>          <dbl>
## 1 Africa    Algeria          2449.          3014.          2551.
## 2 Africa    Angola           3521.          3828.          4269.
## 3 Africa    Benin            1063.           960.           949.
## # … with 33 more variables: gdpPercap_1967 <dbl>, gdpPercap_1972 <dbl

We will subset the gapminder data such that we keep only life expectancy data from all countries in two continents, Africa, and Europe. This will greatly help to make the interpretation of PCA from gapminder data simpler.

We first use filter statement to filter for two continents and then select continent, country and all columns starting with “lifeExp”.

gapminder_life <- gapminder_wide %>%
  filter(continent %in% c("Africa","Europe")) %>%
  select(continent,country,starts_with('lifeExp'))

On an average African countries have lower life expectancy than European countries. Since we have life expectancy data from over the years from all the countries, our PCA analysis can capture the variation in life expectancy between the two continents.

Now our data mainly contains the data we need, i.e. life expectancy for each country in from years 1952 to 2007. Let us combine the continent and country variables in the data into rownames using unite and column_to_rownames functions. Now, the dataframe contains life expectancy over time for all countries in the two continents.

gapminder_life <- gapminder_life %>% 
  unite("continent_country", c(continent,country)) %>%
  column_to_rownames("continent_country")
head(gapminder_life)

Now the dataframe only contains data and we are ready to do principal component analysis.

PCA example with prcomp

In R, we can do PCA in many ways. We will use prcomp in R to do PCA. The prcomp function takes in the data as input, and it is highly recommended to set the argument scale=TRUE. This standardize the input data so that it has zero mean and variance one before doing PCA.

pca_res <- prcomp(gapminder_life, scale=TRUE)

We have stored the results from prcomp and the resulting object has many useful variables associated with the analysis. Let us inspect the result object that we got from prcomp.

Let us first look at the summary of the prcomp result object.

summary(pca_res)

## Importance of components:
##                          PC1     PC2     PC3     PC4     PC5     PC6
## Standard deviation     3.360 0.69114 0.40463 0.19246 0.11371 0.10043
## Proportion of Variance 0.941 0.03981 0.01364 0.00309 0.00108 0.00084
## Cumulative Proportion  0.941 0.98083 0.99448 0.99756 0.99864 0.99948

The summary function on the result object gives us standard deviation, proportion of variance explained by each principal component, and the cumulative proportion of variance explained.

We can also use names function to find the name of the variables in the result object.

names(pca_res)

[1] "sdev"     "rotation" "center"   "scale"    "x"

We see that the resulting object has 5 variables. The principal components of interest are stored in x object. It will be of the same dimension as our data used for PCA. Here each column is Principal Component and each row is country.

pca_res$x[1:5,1:3]

##                            PC1        PC2        PC3
## Africa_Algeria       0.4518264 -1.3208553  0.3848907
## Africa_Angola       -5.2217443  0.2876153 -0.2574503
## Africa_Benin        -2.2956809 -0.2847236  0.0080484
## Africa_Botswana     -0.6460076  1.1788076  0.8922150
## Africa_Burkina Faso -3.3832822 -0.3287683  0.1598156

center has the mean of our data before scaling.

pca_res$center[1:5]

## lifeExp_1952 lifeExp_1957 lifeExp_1962 lifeExp_1967 lifeExp_1972 
##     48.38172     50.57246     52.54620     54.26249     55.98415

The square of scale gives us the variances of each column of our data before scaling.

head(pca_res$scale^2, n=5)

## lifeExp_1952 lifeExp_1957 lifeExp_1962 lifeExp_1967 lifeExp_1972 
##     181.2131     181.8949     177.7479     168.3315     157.4524

sdev gives the standard deviation of principal component. And we can use sdev to compute variance explained by each Principal Component.

var_explained <- pca_res$sdev^2/sum(pca_res$sdev^2)
var_explained[1:5]

##  [1] 9.410266e-01 3.980664e-02 1.364349e-02 3.086896e-03 1.077457e-03

Scree plot

Note that variance explained by each PC computed above is the same as the proportion of variance explained by each PC from the summary function.

Visualizing the variance explained by each component help understand more about the data. It helps us identifying visually, how many principal components are needed to explain the variation data. And the plot is widely known as Scree plot.

In our example, we see that the first principal component explains most of the variation in our data. Actually it explains 94% of the variance and the remaining 11 PCs explains the rest of the variation. It suggests that the first principal component is driving almost all of the variation in our data. In essence it helps us reducing the dimension of the data. In our example, with just one dominant principal component, we have reduced the dimension of the data from 84 x 12 to 84 x 1.

PCA plots

Let us make some plots using the principal components and see what we can learn from it. First, let us make a scatter plot using the first and second principal components.

pca_res$x %>% 
  as.data.frame %>%
  ggplot(aes(x=PC1,y=PC2)) + geom_point(size=4) +
  theme_bw(base_size=32) + 
  labs(x=paste0("PC1: ",round(var_explained[1]*100,1),"%"),
       y=paste0("PC2: ",round(var_explained[2]*100,1),"%")) +
  theme(legend.position="top")

Here we simply pulled the first two principal components from x variable from PCA results and made a scatter plot using ggplot. We also label the x and y-axis with the amount of variance explained by the two PCs.

The scatter plot immediately reveals that there are groups of countries; one clusters around 4 on x-axis (PC1) and the other clusters around -3 on x-axis. We can also see that the first group/cluster is pretty tight, while the second group has a lot of variability and spread out.

PCA Plot: PC1 vs PC2

Thinking a bit about the data we have, we can hypothesize that the clusters we see on PCA plot could be due to the differences in life expectancy between African and European countries. And we have the meta data associating each data point to a country and continent.

With that, let us remake the PCA scatter plot between the first PC and second PC, but this time let us color the data points based on the continent information. To do that we use the tidyverse functions rownames_to_column and separate the continent information from the row names of pca_res$x.

pca_res$x %>% 
  as.data.frame %>%
  rownames_to_column("continent_country") %>%
  separate(continent_country,c("continent")) %>%
  ggplot(aes(x=PC1,y=PC2)) + geom_point(aes(color=continent),size=4) +
  theme_bw(base_size=32) + 
  labs(x=paste0("PC1: ",round(var_explained[1]*100,1),"%"),
       y=paste0("PC2: ",round(var_explained[2]*100,1),"%")) +
  theme(legend.position="top")

The PCA scatter plot colored by continents clearly support our hypothesis that clusters we see on the plot is due to the differences in the life expectancies between the two continents.

An interesting thing can see in the PCA plot is that countries from each of the continent nicely clusters together. However, we see one or two data points straying a bit. For example, two African countries seem to be more similar to European countries and one European country is closer to African countries.

To identify these outliers for these two continents, we can label each point on the plot with country name using ggplot’s geom_label() function.

pca_res$x %>% 
  as.data.frame %>%
  rownames_to_column("continent_country") %>%
  separate(continent_country,c("continent", "country")) %>%
  ggplot(aes(x=PC1,y=PC2, label=country, color=continent)) +
  geom_label(aes(fill = continent), colour = "white", fontface = "bold")+
  theme_bw(base_size=32) +
  labs(x=paste0("PC1: ",round(var_explained[1]*100,1),"%"),
       y=paste0("PC2: ",round(var_explained[2]*100,1),"%"))+
  theme(legend.position="top")

Since we have labeled the country names we can see that Turkey’s life expectancy is more like some of the African countries. And the two small African island countries, Mauritius and Reunion are closer to some of the European countries.

PCA plot: First Principal Component vs Second Principal Component

To summarize, we saw a step-by-step example of PCA with prcomp in R using a subset of gapminder data. We learned the basics of interpreting the results from prcomp. Tune in for more on PCA examples with R later.

If you have this this far, you might also be interested in doing PCA using tidyverse framework.

Check out this new post on using prcomp in R to do PCA with tidyverse.

  • PCA in tidyverse