tidymodels, is one of the new suite of packages for doing machine learning analysis in R with tidy principles from RStudio.
Tidymodels, the metapackage, has a core set of packages for statistical/machine learning models like infer, parsnip, recipes, rsample, and dials in addition to the core tidyverse packages dplyr, ggplot2, purr, and broom.
In addition to providing tidy framework for common supervised learning algorithms, tidymodels also supports unsupervised learning algorithms like Principal Component Analysis (PCA).
Finally, got a chance to play with tidymodels. What better way to start tidymodels than with Principal Component Analysis (PCA). In this post we will see an example of doing PCA analysis using tidymodels. Under the hood, tidymodels uses prcomp in R for doing PCA.
Let us get started by loading tidymodels and tidyverse.
library(tidymodels) library(tidyverse) library(gapminder) theme_set(theme_bw(16))
We will use gapminder dataset to do PCA.
head(gapminder) ## # A tibble: 6 x 6 ## country continent year lifeExp pop gdpPercap ## <fct> <fct> <int> <dbl> <int> <dbl> ## 1 Afghanistan Asia 1952 28.8 8425333 779. ## 2 Afghanistan Asia 1957 30.3 9240934 821. ## 3 Afghanistan Asia 1962 32.0 10267083 853. ## 4 Afghanistan Asia 1967 34.0 11537966 836. ## 5 Afghanistan Asia 1972 36.1 13079460 740. ## 6 Afghanistan Asia 1977 38.4 14880372 786.
We will use lifeExp values over the years from all the countries to do the PCA. To do that we will first convert the gapminder data in tidy/long form to wide form. We will use tidyr’s pivot_wider() to reshape the data.
life_df <- gapminder %>% select(country, continent, year, lifeExp) %>% pivot_wider(values_from = lifeExp, names_from = year)
Now we have our life expectancy data in wide form with country and continent information. Note that the years are column names and the lifeExp values are rows now.
life_df %>% head() # # A tibble: 142 x 14 ## country continent `1952` `1957` `1962` `1967` `1972` `1977` `1982` `1987` ## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Afghan… Asia 28.8 30.3 32.0 34.0 36.1 38.4 39.9 40.8 ## 2 Albania Europe 55.2 59.3 64.8 66.2 67.7 68.9 70.4 72 ## 3 Algeria Africa 43.1 45.7 48.3 51.4 54.5 58.0 61.4 65.8 ## 4 Angola Africa 30.0 32.0 34 36.0 37.9 39.5 39.9 39.9
The first step in using tidymodels is to define a pipeline, i.e. a series of analysis steps, that we want to perform. R package recipies, a part of tidymodels lets you build a recipe (or a pipeline) for the analysis.
Let us start a recipe by providing the data we will be using first using recipe() function. Here we simply creating a new pipleine with reference to data.
pca_recipe <- recipe(~., data = life_df)
Now we have a bare bone recipe just pointing to the data. And then we can build a recipe with series of steps that we typically do in the analysis. Our simple recipe to do PCA analysis involve three steps.
In the case of PCA, we typically center the data. We can center the data, here columns, using step_center() function. In this example, we specify to center all numeric columns.
Next, we scale the data (columns again) using step_scale() function all numeric variables in the data.
And finally, we specify that we want to do PCA using step_pca().
pca_trans <- pca_recipe %>% # center the data step_center(all_numeric()) %>% # center the data step_scale(all_numeric()) %>% # pca on all numeric variables step_pca(all_numeric())
We have saved the recipe with the steps in a variable called pca_trans. So far we have not done any computation as such. We have just specified list of things to do. We can check the variable and learn a bit more about what we have specified.
pca_trans ## Data Recipe ## ## Inputs: ## ## role #variables ## predictor 14 ## ## Operations: ## ## Centering for all_numeric ## Scaling for all_numeric ## No PCA components were extracted.
Now we are ready for actual analysis. We will use the recipe we built so far as argument to the prep() function.
pca_estimates <- prep(pca_trans)
Now tidymodels actually performs the operations that we specified before. We can see from the message below that it centered the column variables, scaled them and finally performed PCA.
pca_estimates ## Data Recipe ## ## Inputs: ## ## role #variables ## predictor 14 ## ## Training data contained 142 data points and no missing data. ## ## Operations: ## ## Centering for 1952, 1957, 1962, 1967, 1972, 1977, 1982, ... [trained] ## Scaling for 1952, 1957, 1962, 1967, 1972, 1977, 1982, ... [trained] ## PCA extraction with 1952, 1957, 1962, 1967, 1972, 1977, 1982, ... [trained]
We stored the result of the PCA analysis in the variable pca_estimates and it contains lot of information about what we have done so far.
We can check the R objects content with names() and check on each of them.
names(pca_estimates) ## [1] "var_info" "term_info" "steps" "template" ## [5] "levels" "retained" "tr_info" "orig_lvls" ## [9] "last_term_info"
For example, we can check var_info and learn that it contains the information about the data we used.
pca_estimates$var_info # # A tibble: 14 x 4 ## variable type role source ## <chr> <chr> <chr> <chr> ## 1 country nominal predictor original ## 2 continent nominal predictor original ## 3 1952 numeric predictor original ## 4 1957 numeric predictor original
Our results from PCA recipe steps are available in “steps”. In our case we used three steps, two for normalizing the data and one for doing PCA. Therefore, the “steps” object is a list of size three, where the first two elements contain details on our normalization steps and the third element contains the results from PCA.
Let us check the first step and we can see that it contains information about mean centering step that we did.
pca_estimates$steps[[1]] ## $terms ## <list_of<quosure>> ## ## [[1]] ## <quosure> ## expr: ^all_numeric() ## env: 0x7ff0c4133af0 ## ## ## $role ## [1] NA ## ## $trained ## [1] TRUE ## ## $means ## 1952 1957 1962 1967 1972 1977 1982 1987 ## 49.05762 51.50740 53.60925 55.67829 57.64739 59.57016 61.53320 63.21261 ## 1992 1997 2002 2007 ## 64.16034 65.01468 65.69492 67.00742 ##
Let us access the third element to get the standard deviations from the PCA analysis and use them to compute percentage of variation explained by each PC.
# std deviation sdev <- pca_estimates$steps[[3]]$res$sdev
We can compute percentage of variation explained by each PC as shown below.
percent_variation <- sdev^2 / sum(sdev^2)
And now we can make scree plot to visualize the contribution of each PC in terms of the variance explained by PCs. Let us first create dataframe with varince explained by each PC.
var_df <- data.frame(PC=paste0("PC",1:length(sdev)), var_explained=percent_variation, stringsAsFactors = FALSE)
Then make a scree plot using geom_col().
var_df %>% mutate(PC = fct_inorder(PC)) %>% ggplot(aes(x=PC,y=var_explained))+geom_col()
We can see that the first PC explains most of the variation in the data.
Let us take a look at the Principal components computed by our recipe. Tidymodels’ recipes’ package has a function called “juice” that will return the results of a recipe created by prep.
juice(pca_estimates)
We see that juice gives us PCs with associated variables from our data.
## # A tibble: 142 x 7 ## country continent PC1 PC2 PC3 PC4 PC5 ## <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Afghanistan Asia -6.58 -0.377 -0.171 -0.0120 -0.0782 ## 2 Albania Europe 2.68 0.0350 0.0438 0.275 -0.0371 ## 3 Algeria Africa -0.131 1.21 0.207 -0.00852 0.106 ## 4 Angola Africa -6.46 -0.718 -0.235 0.0682 -0.182 ## 5 Argentina Americas 2.85 -0.362 -0.247 -0.128 0.0295 ## 6 Australia Oceania 4.52 -0.325 -0.304 -0.213 0.0728 ## 7 Austria Europe 4.05 -0.280 -0.270 -0.152 0.0329 ## 8 Bahrain Asia 1.85 0.712 0.299 0.0927 0.0428 ## 9 Bangladesh Asia -2.92 0.879 -0.446 -0.111 -0.0267 ## 10 Belgium Europe 4.22 -0.439 -0.244 -0.188 0.0395 ## # … with 132 more rows
We can use the results from juice to make standard PCA plot, scatter plot with PC1 on x-axis and PC2 on y-axis to see the structure in the data.
juice(pca_estimates) %>% ggplot(aes(PC1, PC2)) + geom_point(aes(color = continent), alpha = 0.7, size = 2)+ labs(title="PCA from tidymodels")
The first principal component captures the difference in life Expectancy across continents. We can add labels to the PCA plot to see the names of continents with geom_text().
juice(pca_estimates) %>% #separate(country_continent,c("country","continent"), sep="_") %>% ggplot(aes(PC1, PC2, label = continent)) + geom_point(aes(color = continent), alpha = 0.7, size = 2,show.legend = FALSE) + geom_text(check_overlap = TRUE) + labs(title="PCA with Tidymodels")
PCA with tidymodels was fun. Looking forward to playing with tidymodels more. Want to learn more on tidymodels? Check out the videocasts of awesome Julia Silge from RStudio.