Canonical Correlation Analysis, or CCA, is a dimensionality reduction technique like Principal Component Analysis (PCA) or SVD. PCA/SVD aims to find the directions or projections that account for most of the observed variance in a single high-dimensional dataset.
In comparison, CCA deals with two high-dimensional datasets and aims to find directions or projections that account for most of the covariance between the two datasets. Interestingly, CCA was also developed by Hotelling, in 1936, just a few years after he independently developed PCA in 1933.
Intuition Behind Canonical Correlation Analysis (CCA)
In this post we will not go into the math or the algorithm behind Canonical Correlation Analysis. Instead, we will focus on performing CCA with R and try to understand the results. It would definitely be interesting to implement CCA from scratch in R using SVD. But that is for another blog post 🙂
Let us try to get the intuition behind the use of CCA. Let us say there are one or more variables generating two high-dimensional datasets X and Y. Here, the datasets X and Y are observable, but we don’t know the latent variable(s) behind them. By doing CCA, we can identify canonical variates that are highly correlated with the unknown latent variable. Basically, CCA helps us strip away the noise in the two datasets and get to the canonical variates that capture the hidden variable.
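To make this intuition concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the latent variable z, the loadings, the noise level, and the names X_sim and Y_sim are all hypothetical and not part of the penguins example. It generates two noisy datasets from one hidden variable and checks how well the first canonical variates track it.

```r
set.seed(42)                      # fixed seed so the sketch is reproducible
n <- 500
z <- rnorm(n)                     # hypothetical latent variable (unobserved in practice)

# Two observed datasets: each column is the latent signal times a loading, plus noise
X_sim <- cbind(x1 = 0.9 * z, x2 = -0.6 * z, x3 = 0.3 * z) + matrix(rnorm(n * 3), n, 3)
Y_sim <- cbind(y1 = 0.8 * z, y2 = 0.5 * z) + matrix(rnorm(n * 2), n, 2)

cc_sim <- cancor(X_sim, Y_sim)

# First pair of canonical variates (center the columns first, as cancor() does internally)
cv_x <- scale(X_sim, scale = FALSE) %*% cc_sim$xcoef[, 1]
cv_y <- scale(Y_sim, scale = FALSE) %*% cc_sim$ycoef[, 1]

cc_sim$cor          # canonical correlations between the two datasets
abs(cor(cv_x, z))   # how well the X-side variate tracks the latent variable
abs(cor(cv_y, z))   # how well the Y-side variate tracks the latent variable
```

In a real analysis z is of course not available, so the last two lines are only possible because the data were simulated.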
Canonical Correlation Analysis (CCA) Example in R
Let us first see an example of doing CCA with the penguins data. There are a few ways to do canonical correlation analysis in R. In this post we will use the cancor() function from base R’s stats package.
Let us get started with loading tidyverse.
library(tidyverse)
theme_set(theme_bw(16))
We will use the Palmer Penguins data for introducing CCA in R. Let us load the data and drop the rows with missing values.
link2data <- "https://raw.githubusercontent.com/cmdlinetips/data/master/palmer_penguins.csv"
penguins <- read_csv(link2data)
penguins <- penguins %>% drop_na()
penguins %>% head()
Preparing Two Datasets for Canonical Correlation Analysis (CCA)
We will split the penguins’ body measurements into two datasets. Just for illustration of CCA, we will assume species/island is the hidden variable and the two “split” sets of body measurements are our two data matrices. In this simple example, both data matrices clearly capture the underlying “species” variable.
We will then perform CCA, compute the canonical covariates, and show that they capture the species variable, our hidden factor.
Our data matrix X contains bill depth and bill length from the penguins data. We use the scale() function to center the columns and scale them to unit variance, which puts the variables on the same scale.
X <- penguins %>% select(bill_depth_mm, bill_length_mm) %>% scale()
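As a quick sanity check on what scale() does, the sketch below applies it to a small toy matrix (the values and the name toy are illustrative stand-ins, not the actual penguins table) and confirms that each column ends up with mean zero and standard deviation one.

```r
# Toy matrix standing in for the penguin measurements (illustrative values only)
toy <- cbind(bill_depth_mm  = c(18.7, 17.4, 18.0, 19.3, 20.6),
             bill_length_mm = c(39.1, 39.5, 40.3, 36.7, 39.3))
toy_scaled <- scale(toy)

colMeans(toy_scaled)        # numerically zero for every column
apply(toy_scaled, 2, sd)    # one for every column

# The same result by hand: subtract the column mean, then divide by the column sd
by_hand <- sweep(sweep(toy, 2, colMeans(toy)), 2, apply(toy, 2, sd), "/")
all.equal(by_hand, toy_scaled, check.attributes = FALSE)
```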
Our data matrix Y contains flipper length and body mass from the penguins data. We will also scale the columns in the Y data matrix.
Y <- penguins %>% select(flipper_length_mm, body_mass_g) %>% scale()
head(Y)
Our scaled Y matrix looks like this.
     flipper_length_mm body_mass_g
[1,]        -1.4246077  -0.5676206
[2,]        -1.0678666  -0.5055254
[3,]        -0.4257325  -1.1885721
[4,]        -0.5684290  -0.9401915
[5,]        -0.7824736  -0.6918109
[6,]        -1.4246077  -0.7228585
Canonical Correlation Analysis (CCA) with cancor() function in R
As explained above, CCA aims to find the associations between two data matrices (two sets of variables) X and Y. CCA’s goal is to find the linear projection of the first data matrix that is maximally correlated with the linear projection of the second data matrix.
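One way to see what "maximally correlated" means is to compare the first canonical correlation with the correlation obtained from randomly chosen projections. The sketch below is self-contained, so it uses the built-in iris data as a stand-in for the penguins (an assumption made for illustration; the names X_demo and Y_demo are hypothetical).

```r
set.seed(1)
X_demo <- scale(as.matrix(iris[, c("Sepal.Length", "Sepal.Width")]))
Y_demo <- scale(as.matrix(iris[, c("Petal.Length", "Petal.Width")]))

first_cc <- cancor(X_demo, Y_demo)$cor[1]

# Absolute correlation between 1,000 randomly chosen pairs of projections
random_cor <- replicate(1000, {
  a <- rnorm(ncol(X_demo))   # random weights for the columns of X_demo
  b <- rnorm(ncol(Y_demo))   # random weights for the columns of Y_demo
  abs(cor(X_demo %*% a, Y_demo %*% b))
})

first_cc          # correlation achieved by the canonical projections
max(random_cor)   # best random pair; it cannot exceed first_cc
```

No pair of linear projections can beat the first canonical correlation on the same data, so the random search only ever approaches it from below.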
To perform classical CCA, we use the cancor() function from base R’s stats package. The cancor() function computes canonical covariates between two input data matrices. By default, cancor() centers the columns of the data matrices.
# cancor() is part of the stats package, which R loads by default
cc_results <- cancor(X, Y)
The cancor() function returns a list containing the canonical correlations, the coefficients for each data matrix, and the column centers that were subtracted.
str(cc_results)
## List of 5
##  $ cor    : num [1:2] 0.7876 0.0864
##  $ xcoef  : num [1:2, 1:2] 0.0316 -0.0382 0.0467 0.0414
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "bill_depth_mm" "bill_length_mm"
##   .. ..$ : NULL
##  $ ycoef  : num [1:2, 1:2] -0.0562 0.00151 -0.09748 0.11251
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "flipper_length_mm" "body_mass_g"
##   .. ..$ : NULL
##  $ xcenter: Named num [1:2] 5.57e-16 3.55e-16
##   ..- attr(*, "names")= chr [1:2] "bill_depth_mm" "bill_length_mm"
##  $ ycenter: Named num [1:2] 1.83e-16 -9.27e-17
##   ..- attr(*, "names")= chr [1:2] "flipper_length_mm" "body_mass_g"
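The xcenter and ycenter values in this output are essentially zero because X and Y were already centered by scale(). As a side note, the canonical correlations themselves do not depend on that scaling step; only the coefficients change. The minimal sketch below illustrates this with the built-in iris data as a stand-in (an assumption made so the example runs on its own; X_raw and Y_raw are hypothetical names).

```r
X_raw <- as.matrix(iris[, c("Sepal.Length", "Sepal.Width")])
Y_raw <- as.matrix(iris[, c("Petal.Length", "Petal.Width")])

cc_raw    <- cancor(X_raw, Y_raw)                  # unscaled columns
cc_scaled <- cancor(scale(X_raw), scale(Y_raw))    # centered and scaled columns

# Canonical correlations agree up to numerical tolerance
all.equal(cc_raw$cor, cc_scaled$cor)

# The centers are the column means for the raw data, and about zero after scale()
cc_raw$xcenter
cc_scaled$xcenter
```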
Let us take a look at the coefficients from data matrix X.
cc_results$xcoef
##                       [,1]       [,2]
## bill_depth_mm   0.03157476 0.04670337
## bill_length_mm -0.03824761 0.04141607
Here are the coefficients from data matrix Y.
cc_results$ycoef
##                          [,1]        [,2]
## flipper_length_mm -0.05619966 -0.09747905
## body_mass_g        0.00151493  0.11250899
Understanding Canonical Correlation Analysis (CCA) Results
We can also check the correlations between the canonical variates. The correlation between the first pair of canonical variates from datasets X and Y is fairly high (about 0.79), suggesting that the two datasets covary strongly, while the second pair is only weakly correlated (about 0.09).
cc_results$cor
## [1] 0.78763151 0.08638695
We can use our datasets X and Y and the corresponding coefficients to get the canonical covariate pairs. In the code below, we multiply each data matrix by the first column of its coefficient matrix to get the first pair of canonical covariates, and then do the same with the second column to get the second pair.
CC1_X <- as.matrix(X) %*% cc_results$xcoef[, 1]
CC1_Y <- as.matrix(Y) %*% cc_results$ycoef[, 1]
CC2_X <- as.matrix(X) %*% cc_results$xcoef[, 2]
CC2_Y <- as.matrix(Y) %*% cc_results$ycoef[, 2]
We can also get all pairs of canonical covariates at once by multiplying each data matrix by its full coefficient matrix, instead of going column by column.
Let us look at the first pair of canonical covariates we computed. Their correlation is the same as the first value in the cor element returned by cancor().
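A minimal sketch of that all-at-once computation is shown here. To keep it runnable on its own, it uses the built-in iris data as a stand-in for the penguins matrices (an assumption; X_demo, Y_demo, cc_demo, CC_X and CC_Y are hypothetical names). With the post's objects, the same two matrix products would use X, Y and cc_results.

```r
X_demo <- scale(as.matrix(iris[, c("Sepal.Length", "Sepal.Width")]))
Y_demo <- scale(as.matrix(iris[, c("Petal.Length", "Petal.Width")]))
cc_demo <- cancor(X_demo, Y_demo)

# One matrix product per dataset gives every canonical covariate as a column
CC_X <- X_demo %*% cc_demo$xcoef
CC_Y <- Y_demo %*% cc_demo$ycoef
colnames(CC_X) <- paste0("CC", seq_len(ncol(CC_X)), "_X")
colnames(CC_Y) <- paste0("CC", seq_len(ncol(CC_Y)), "_Y")

# The diagonal of the cross-correlation matrix reproduces cancor()'s cor values,
# and the off-diagonal entries are numerically zero
round(cor(CC_X, CC_Y), 4)
all.equal(unname(diag(cor(CC_X, CC_Y))), cc_demo$cor)
```

Both datasets have two columns here, so the two coefficient matrices have the same number of columns. When the datasets differ in width, only the first min(p, q) columns form canonical pairs.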
cor(CC1_X, CC1_Y)
##           [,1]
## [1,] 0.7876315
Here we verify that the correlation we computed between the first pair of canonical covariates is the same as the first value of cancor()’s cor result.
assertthat::are_equal(cc_results$cor[1], cor(CC1_X, CC1_Y)[1])
## [1] TRUE
Now that we have done canonical correlation analysis, let us dig deeper to understand the canonical covariate pair we got as results.
In this toy example, we know that the two sets of measurements in our data matrices came from the same group of penguins. And we suspected earlier that the differences in these measurements are due to differences between penguin species. Therefore, a common latent variable behind these two sets of measurements is the species variable, and the main goal of our CCA is to capture that common variable. We also saw that the first pair of canonical variates is highly correlated.
Let us check whether the first canonical covariate actually reflects the species variable. First, let us create a dataframe with the penguins data and the first and second pairs of canonical covariates.
cca_df <- penguins %>% mutate(CC1_X=CC1_X, CC1_Y=CC1_Y, CC2_X=CC2_X, CC2_Y=CC2_Y)
Let us make a scatter plot of the first pair of canonical covariates. We can see that the two are clearly correlated.
cca_df %>% ggplot(aes(x=CC1_X,y=CC1_Y))+ geom_point()
To see whether each canonical variate is associated with the species variable in the penguins dataset, we make a boxplot of the canonical covariate against species.
cca_df %>% ggplot(aes(x=species,y=CC1_X, color=species))+ geom_boxplot(width=0.5)+ geom_jitter(width=0.15)+ theme(legend.position="none")
It is clear from the boxplots that the first pair of canonical covariates is strongly associated with species.
cca_df %>% ggplot(aes(x=species,y=CC1_Y, color=species))+ geom_boxplot(width=0.5)+ geom_jitter(width=0.15)
We could have come to the same conclusion by coloring the scatter plot of the first pair of canonical covariates by species.
cca_df %>% ggplot(aes(x=CC1_X,y=CC1_Y, color=species))+ geom_point()
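The plots show the association visually. One possible way to put a number on it is the share of a canonical covariate's variance explained by the grouping variable, i.e. the R-squared of a one-way ANOVA. The sketch below is only an illustration under stated assumptions: it uses the built-in iris data with Species playing the role of the penguins' species, and demo_df, r2_x and r2_y are hypothetical names.

```r
X_demo <- scale(as.matrix(iris[, c("Sepal.Length", "Sepal.Width")]))
Y_demo <- scale(as.matrix(iris[, c("Petal.Length", "Petal.Width")]))
cc_demo <- cancor(X_demo, Y_demo)

demo_df <- data.frame(
  group = iris$Species,                                # plays the role of `species`
  CC1_X = as.numeric(X_demo %*% cc_demo$xcoef[, 1]),   # first canonical covariate, X side
  CC1_Y = as.numeric(Y_demo %*% cc_demo$ycoef[, 1])    # first canonical covariate, Y side
)

# Fraction of each covariate's variance explained by the grouping (between 0 and 1)
r2_x <- summary(lm(CC1_X ~ group, data = demo_df))$r.squared
r2_y <- summary(lm(CC1_Y ~ group, data = demo_df))$r.squared
c(CC1_X = r2_x, CC1_Y = r2_y)
```

With the post's data, the analogous call would be lm(CC1_X ~ species, data = cca_df); values close to one would back up what the boxplots suggest.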
In this toy example for illustrating CCA, we knew the latent variable, i.e. species, beforehand. However, with real-world data we may not know the latent variable, and here CCA would tell us that our two datasets actually come from three groups/clusters.
Let us try to understand the meaning behind the second pair of canonical covariates. We know from the correlation values that the second pair is not highly correlated. In our penguin data, we have a sex variable that is common to all the body measurements. We can hypothesize that the second pair of canonical covariates captured the effect of sex in the datasets. To check, let us make a scatter plot of the second pair of canonical covariates and color the data points by sex.
cca_df %>% ggplot(aes(x=CC2_X,y=CC2_Y, color=sex))+ geom_point()
We can see that the modest effect of sex on the data is captured by the second pair of canonical covariates.