Dimensionality reduction techniques like PCA, SVD, tSNE, and UMAP are a fantastic toolset for exploratory data analysis and unsupervised learning with high dimensional data. It has become really easy to use many of the available dimensionality reduction techniques in both R and Python while doing data science. However, it can often be a little challenging to interpret the low dimensions we get after reducing dimensions, and to use the techniques effectively.
Lan Huong Nguyen and Susan Holmes, a statistics professor at Stanford and the author of a fantastic new book on modern statistics, have published an article, “Ten quick tips for effective dimensionality reduction”. The article is freely available from a PLoS journal.
There are many gems in the 10 tips that are useful both for those who are new to dimensionality reduction techniques and for those who use them a lot.
I just had a chance to go through the article, and here are the three tips that I liked the most. Not because the others were not good, but because these three are rarely taught, yet extremely useful.
Correct aspect ratio for your visualizations
One of the tips I love the most is tip 6, on using the correct aspect ratio to visualize the reduced dimensions of the data:
Two-dimensional PCA plots with equal height and width are misleading but frequently encountered because popular software programs for analyzing biological data often produce square (2D) or cubical (3D) graphics by default. Instead, the height-to-width ratio of a PCA plot should be consistent with the ratio between the corresponding eigenvalues. Because eigenvalues reflect the variance in coordinates of the associated PCs, you only need to ensure that in the plots, one “unit” in direction of one PC has the same length as one “unit” in direction of another PC. (If you use ggplot2 R package for generating plots, adding + coord_fixed(1) will ensure a correct aspect ratio.)
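For Python users, here is a minimal sketch of the same idea, assuming NumPy and matplotlib. It is not code from the article: the helper names `pca` and `plot_pcs` and the simulated data are made up for illustration. The key line is `ax.set_aspect("equal")`, which plays the role that the fixed coordinate ratio plays in ggplot2.

```python
import numpy as np

def pca(X):
    """Return PC scores and eigenvalues (the variance along each PC)."""
    Xc = X - X.mean(axis=0)                      # center each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                               # sample coordinates on the PCs
    eigvals = s**2 / (X.shape[0] - 1)            # variance of each score column
    return scores, eigvals

def plot_pcs(scores, eigvals):
    """Scatter PC1 vs PC2 so that one unit has the same length on both axes."""
    # Imported here so the numeric part above runs without matplotlib installed.
    import matplotlib.pyplot as plt

    pct = 100 * eigvals / eigvals.sum()          # percent variance explained
    fig, ax = plt.subplots()
    ax.scatter(scores[:, 0], scores[:, 1], s=10)
    ax.set_aspect("equal")                       # same scale on x and y
    ax.set_xlabel(f"PC1 ({pct[0]:.1f}%)")
    ax.set_ylabel(f"PC2 ({pct[1]:.1f}%)")
    return fig, ax

# Simulated data (an assumption for illustration): 5 variables with unequal spread.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 1.0, 1.0])
scores, eigvals = pca(X)
```

Calling `fig, ax = plot_pcs(scores, eigvals)` and then `fig.savefig("pca.png")` should give a plot that is wider than it is tall, because PC1 carries more variance than PC2.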
Integrating Multi-Modal Datasets
Another tip I like a lot, but one that is commonly overlooked for many reasons, is tip 9, which deals with multiple high dimensional datasets from the same system. Increasingly, we have not one high dimensional dataset but many from the same set of samples. For example, you may have both audio and video datasets from the same set of individuals and be interested in capturing the commonalities between them. Techniques like Canonical Correlation Analysis can be useful in such scenarios. Tip 9 runs through a simple example of integrating 5 different high dimensional datasets from the same system using DiSTATIS, and it is a must read.
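To make the Canonical Correlation Analysis idea concrete, here is a small sketch in Python using only NumPy. Note that this is plain CCA on two simulated datasets, not the DiSTATIS example from the article. The function name `canonical_correlations` and the data are assumptions for illustration, and the approach (QR decomposition followed by an SVD) assumes each dataset has more samples than variables and full column rank.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Canonical correlations between two datasets measured on the same samples."""
    Xc = X - X.mean(axis=0)                      # center both datasets
    Yc = Y - Y.mean(axis=0)
    Qx, _ = np.linalg.qr(Xc)                     # orthonormal basis for columns of X
    Qy, _ = np.linalg.qr(Yc)                     # orthonormal basis for columns of Y
    # Singular values of Qx^T Qy are the cosines of the angles between the
    # two column spaces, i.e. the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)                  # guard against rounding above 1

# Simulated data (an assumption): one shared latent signal z plus noise.
rng = np.random.default_rng(0)
n = 300
z = rng.normal(size=(n, 1))
X = z @ np.ones((1, 4)) + 0.3 * rng.normal(size=(n, 4))   # "dataset 1", 4 variables
Y = z @ np.ones((1, 3)) + 0.3 * rng.normal(size=(n, 3))   # "dataset 2", 3 variables

corr = canonical_correlations(X, Y)
print(np.round(corr, 3))
```

With this setup the first canonical correlation should be close to 1, reflecting the shared signal, while the remaining ones should be small.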
Quantifying Uncertainties in Principal Components
Another very useful one is tip 10, which emphasizes quantifying the uncertainty in principal components. One of my pet peeves is when people using PCs in their analysis treat all PCs the same way, ignoring the percentage of variance explained by each component. A scree plot is a great way to visualize the importance of each principal component (see tip 5). As the article notes:
For some datasets, the PCA PCs are ill defined, i.e., two or more successive PCs may have very similar variances, and the corresponding eigenvalues are almost exactly the same.
One of the ways to deal with such scenarios and estimate uncertainties is to use bootstrap techniques, i.e., use random subsets of the data generated by resampling observations with replacement.
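Here is a rough sketch of what that bootstrap could look like in Python with NumPy. It is not the article's code: the function names, the number of resamples, and the simulated data (with two nearly equal variances) are all assumptions for illustration. It resamples the rows with replacement, recomputes the PCA eigenvalues each time, and reports percentile intervals.

```python
import numpy as np

def pca_eigenvalues(X):
    """Eigenvalues of the sample covariance matrix, largest first."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)      # singular values, descending
    return s**2 / (X.shape[0] - 1)

def bootstrap_eigenvalues(X, n_boot=500, seed=0):
    """Recompute the eigenvalues on rows resampled with replacement."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    boot = np.empty((n_boot, min(X.shape)))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)         # sample n row indices with replacement
        boot[b] = pca_eigenvalues(X[idx])
    return boot

# Simulated data (an assumption): the first two directions have nearly equal variance.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 3)) * np.array([3.0, 2.9, 1.0])

boot = bootstrap_eigenvalues(X, n_boot=500)
lower, upper = np.percentile(boot, [2.5, 97.5], axis=0)   # 95% percentile intervals
for k, (lo, hi) in enumerate(zip(lower, upper), start=1):
    print(f"PC{k}: eigenvalue interval [{lo:.2f}, {hi:.2f}]")
```

If the intervals for two successive PCs overlap heavily, that is a hint that those two components may not be well separated and should be interpreted together rather than one at a time.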
Quantifying uncertainties in PCs is an important topic, but you rarely see it discussed. It was really great to see it nicely explained with an example.
If you are curious about the other tips for effective dimensionality reduction, here they are.
- Tip 1: Choose an appropriate method
- Tip 2: Preprocess continuous and count input data
- Tip 3: Handle categorical input data appropriately
- Tip 4: Use embedding methods for reducing similarity and dissimilarity input data
- Tip 5: Consciously decide on the number of dimensions to retain
- Tip 6: Apply the correct aspect ratio for your visualizations
- Tip 7: Understand the meaning of the new dimensions
- Tip 8: Find the hidden signal
- Tip 9: Take advantage of multidomain data
- Tip 10: Check the robustness of your results and quantify uncertainties
Interested in dabbling with the examples of these 10 tips? No worries, check out the R Markdown document illustrating all 10 tips for effective dimensionality reduction, available as a supplement to the article.