Dimensionality reduction techniques are among the most useful methods for unsupervised learning on high-dimensional datasets. In this post, we will learn how to use Python to perform seven of the most commonly used dimensionality reduction techniques, by example:
- PCA: Principal Component Analysis
- SVD: Singular Value Decomposition
- ICA: Independent Component Analysis
- NMF: Non-negative Matrix Factorization
- FA: Factor Analysis
- tSNE: t-Distributed Stochastic Neighbor Embedding
- UMAP: Uniform Manifold Approximation and Projection
We will use the Palmer Penguins data for all seven dimensionality reduction techniques. One of the interesting things about using the same data for every technique is that we can see how similar the results are when the data has a strong pattern, as the Palmer Penguins dataset does.
In case you are interested in performing dimensionality reduction techniques in R by example, check out this post: 6 Dimensionality Reduction Techniques in R (with Examples).
Loading the Python Modules needed
Let us get started by loading the Python packages needed.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
We will mainly rely on scikit-learn for most of the techniques, except for SVD and UMAP. Let us also import scikit-learn's pipeline, preprocessing, and decomposition modules.
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA, KernelPCA, FastICA, NMF, FactorAnalysis
Seaborn has the Palmer penguins dataset available, and we will load it using Seaborn's load_dataset() function.
penguins = sns.load_dataset("penguins")
Preparing the penguins data for dimensionality reduction analysis
Since most of these dimensionality reduction techniques cannot handle missing values, let us remove any rows containing them.
penguins = penguins.dropna()
And this is what our data looks like.
penguins.head()

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g     sex
0  Adelie  Torgersen            39.1           18.7              181.0       3750.0    Male
1  Adelie  Torgersen            39.5           17.4              186.0       3800.0  Female
2  Adelie  Torgersen            40.3           18.0              195.0       3250.0  Female
4  Adelie  Torgersen            36.7           19.3              193.0       3450.0  Female
5  Adelie  Torgersen            39.3           20.6              190.0       3650.0    Male
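As an optional quick sanity check, we can confirm how much data remains after dropping the rows with missing values.

# verify the number of rows and columns after removing missing values
print(penguins.shape)
# (333, 7): 333 penguins and 7 columns
print(penguins.isnull().sum().sum())
# 0: no missing values remain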
We already know that in the Palmer penguins dataset the species and sex variables are the main drivers of the variation in the other features, such as body mass. Our goal in performing these dimensionality reduction techniques is to assess how well those drivers are captured by the first two latent variables from each method. Since the methods work on numerical data, let us keep only the numerical columns.
data = penguins.select_dtypes(np.number)
data.head()

   bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
0            39.1           18.7              181.0       3750.0
1            39.5           17.4              186.0       3800.0
2            40.3           18.0              195.0       3250.0
4            36.7           19.3              193.0       3450.0
5            39.3           20.6              190.0       3650.0
1. PCA Example in Python
To perform Principal Component Analysis (PCA), we can use scikit-learn's PCA() method. Here we use the make_pipeline() function in sklearn to first normalize the data with StandardScaler() and then apply PCA to get the first two Principal Components (PCs).
random_state = 0
pca_pl = make_pipeline(StandardScaler(),
                       PCA(n_components=2, random_state=random_state))
We can apply the pipeline on the data to get PCs. PCs from sklearn’s PCA are stored as NumPy arrays.
pcs = pca_pl.fit_transform(data)
pcs[0:5, :]

array([[-1.85359302,  0.03206938],
       [-1.31625406, -0.44352677],
       [-1.37660509, -0.16123048],
       [-1.88528838, -0.01235124],
       [-1.91998074,  0.81759813]])
Here we save the PCs in a Pandas dataframe along with the penguin species and sex data.
pcs_df = pd.DataFrame(data=pcs, columns=['PC1', 'PC2'])
pcs_df['Species'] = penguins.species.values
pcs_df['Sex'] = penguins.sex.values
pcs_df.head()

        PC1       PC2 Species     Sex
0 -1.853593  0.032069  Adelie    Male
1 -1.316254 -0.443527  Adelie  Female
2 -1.376605 -0.161230  Adelie  Female
3 -1.885288 -0.012351  Adelie  Female
4 -1.919981  0.817598  Adelie    Male
Now we can make a PCA plot with PC1 on the x-axis and PC2 on the y-axis, colored by species and with marker shapes for the sex variable.
plt.figure(figsize=(12, 10))
with sns.plotting_context("talk", font_scale=1.25):
    sns.scatterplot(x="PC1", y="PC2",
                    data=pcs_df,
                    hue="Species",
                    style="Sex",
                    s=100)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.title("PCA", size=24)
    plt.savefig("PCA_Example_in_Python.png", format='png', dpi=150)
We can see that the species variable drives most of the variation: PC1, the first principal component, which explains the largest share of the variation, nicely separates Gentoo from the other two species. We can also see that PC2 has captured the effect of sex nicely.
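If you want to know how much of the total variation the two PCs capture, one way (relying on the step names that make_pipeline generates from the lowercased class names) is to pull the fitted PCA object out of the pipeline and look at its explained_variance_ratio_ attribute.

# the PCA step inside the pipeline is named "pca" by make_pipeline
pca = pca_pl.named_steps["pca"]
print(pca.explained_variance_ratio_)
# each value is the fraction of total variance explained by the corresponding PC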
2. Singular Value Decomposition (SVD) Example in Python
Singular Value Decomposition (SVD) is another commonly used dimensionality reduction technique, and it is directly related to PCA. In Python, we can use NumPy's linalg module to perform SVD. First we use StandardScaler() to normalize the data and get the scaled data.
scaler = StandardScaler()
# transform data
scaled = scaler.fit_transform(data)
Then we use the normalized/scaled data to perform SVD. The result is the matrix U of left singular vectors, the singular values S, and the matrix V of right singular vectors (returned transposed).
U, S, V = np.linalg.svd(scaled, full_matrices=True)
The matrix U contains the left singular vectors; its first columns, scaled by the corresponding singular values, give us the PCs.
U

array([[-0.06130458, -0.00199226,  0.02120124, ...,  0.05959822,  0.0091116 ,  0.04840103],
       [-0.04353297,  0.0275534 ,  0.00247933, ..., -0.02109425, -0.06389134, -0.01298945],
       [-0.04552898,  0.01001619, -0.01712056, ..., -0.03946717, -0.01007025,  0.00511708],
       ...,
       [ 0.09100157, -0.01655935,  0.03784485, ...,  0.98856621, -0.00729967, -0.00928874],
       [ 0.05668293,  0.04509385,  0.02371596, ..., -0.00741851,  0.99157553, -0.0067731 ],
       [ 0.06675983, -0.02090787,  0.0140195 , ..., -0.00978679, -0.00674207,  0.99012727]])
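Because SVD and PCA are directly related, the PC scores we computed earlier should match the first two columns of U scaled by the corresponding singular values, up to a sign flip per component. A small optional sketch to check this:

# PCA scores of the scaled data equal U * S column-wise, up to sign conventions
svd_scores = U[:, :2] * S[:2]
print(np.allclose(np.abs(svd_scores), np.abs(pcs)))
# should print True if the only difference is the sign convention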
As before, we save the first two columns of U as a Pandas dataframe along with the species and sex variables.
SVs_df = pd.DataFrame(data=U[:, 0:2], columns=['SV1', 'SV2'])
SVs_df['Species'] = penguins.species.values
SVs_df['Sex'] = penguins.sex.values
SVs_df.head()

        SV1       SV2 Species     Sex
0 -0.061305 -0.001992  Adelie    Male
1 -0.043533  0.027553  Adelie  Female
2 -0.045529  0.010016  Adelie  Female
3 -0.062353  0.000767  Adelie  Female
4 -0.063500 -0.050792  Adelie    Male
When we make a scatter plot of the two singular vectors, with colors and shapes highlighting the latent variables they have captured, we can see how similar it looks to the PCA plot above.
plt.figure(figsize=(12, 10))
with sns.plotting_context("talk", font_scale=1.25):
    sns.scatterplot(x="SV1", y="SV2",
                    data=SVs_df,
                    hue="Species",
                    style="Sex",
                    s=100)
    plt.xlabel("SV1")
    plt.ylabel("SV2")
    plt.title("SVD", size=32)
    plt.savefig("SVD_Example_in_Python.png", format='png', dpi=150)
3. ICA: Independent Component Analysis Example in Python
Independent Component Analysis (ICA) is another commonly used dimensionality reduction technique. ICA aims to estimate latent components (ICs) that are statistically independent of each other.
In Python we can use FastICA() from scikit-learn to perform ICA. In the example below we use the scaled data to do ICA and get two independent components.
transformer = FastICA(n_components=2, random_state=0)
ICA_fit = transformer.fit_transform(scaled)
Then we save the ICs in a Pandas dataframe together with the species and sex variables from the dataset.
ICs_df = pd.DataFrame(data=ICA_fit, columns=['IC1', 'IC2'])
ICs_df['Species'] = penguins.species.values
ICs_df['Sex'] = penguins.sex.values
ICs_df.head()

        IC1       IC2 Species     Sex
0  0.056393  0.024125  Adelie    Male
1  0.027792  0.043381  Adelie  Female
2  0.037019  0.028334  Adelie  Female
3  0.056176  0.027069  Adelie  Female
4  0.079024 -0.019165  Adelie    Male
We can understand what the ICs have captured by making a scatter plot of the ICs, using color and shape for the species and sex variables.
plt.figure(figsize=(12, 10))
with sns.plotting_context("talk", font_scale=1.25):
    sns.scatterplot(x="IC1", y="IC2",
                    data=ICs_df,
                    hue="Species",
                    style="Sex",
                    s=100)
    plt.xlabel("IC1")
    plt.ylabel("IC2")
    plt.title("ICA")
    plt.savefig("ICA_Example_in_Python.png", format='png', dpi=150)
We can see that the two ICs have nicely captured the variation driven by species and sex variables.
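As an optional extra step, we can also see how each of the original measurements contributes to the two ICs by looking at the fitted FastICA object's mixing_ attribute, which maps the components back to the original features.

# mixing_ has shape (n_features, n_components);
# each row shows how strongly a feature is expressed in each IC
mixing_df = pd.DataFrame(transformer.mixing_,
                         index=data.columns,
                         columns=["IC1", "IC2"])
print(mixing_df)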
4. NMF: Non-Negative Matrix Factorization in Python
Non-negative Matrix Factorization (NMF) works with an input data matrix that contains only non-negative values and factorizes it into two lower-rank matrices. Our scaled data obtained with StandardScaler() has negative values, so we will instead use min-max scaling, which transforms the data to the 0-1 range, and use that for performing NMF.
MMscaler = MinMaxScaler()
# transform data
MM_scaled = MMscaler.fit_transform(data)
nmf_model = NMF(n_components=2, init='random', max_iter=500, random_state=0)
W = nmf_model.fit_transform(MM_scaled)
H = nmf_model.components_
We used the NMF() function available in scikit-learn to get two factors (W) and the corresponding loadings (H) from the input data. Then we save the factors as a Pandas dataframe with the species and sex values.
NMFs_df = pd.DataFrame(data=W, columns=['NMF1', 'NMF2'])
NMFs_df['Species'] = penguins.species.values
NMFs_df['Sex'] = penguins.sex.values
NMFs_df.head()
       NMF1      NMF2 Species     Sex
0  0.127601  0.518228  Adelie    Male
1  0.198881  0.397550  Adelie  Female
2  0.203366  0.454378  Adelie  Female
3  0.148316  0.556137  Adelie  Female
4  0.146132  0.685184  Adelie    Male
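Since NMF factorizes the non-negative data matrix into W and H, an optional sanity check is to see how well the product of W and H approximates the min-max scaled data.

# reconstruct the scaled data from the two factors and measure the error
reconstructed = W @ H
print(np.linalg.norm(MM_scaled - reconstructed))
# scikit-learn also stores this Frobenius-norm error after fitting
print(nmf_model.reconstruction_err_)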
A scatter plot similar to the PCA plot reveals the structure captured by NMF: the species and sex variables account for the hidden structure recovered in an unsupervised manner.
plt.figure(figsize=(12, 10))
with sns.plotting_context("talk", font_scale=1.25):
    sns.scatterplot(x="NMF1", y="NMF2",
                    data=NMFs_df,
                    hue="Species",
                    style="Sex",
                    s=100)
    plt.xlabel("NMF1")
    plt.ylabel("NMF2")
    plt.title("NMF")
    plt.savefig("NMF_Example_in_Python.png", format='png', dpi=150)
5. Factor Analysis
Factor analysis, another common dimensionality reduction technique, aims to explain the observed variables in terms of a smaller number of hidden factors.
In Python we can perform Factor Analysis using the FactorAnalysis() function available in scikit-learn.
transformer = FactorAnalysis(n_components=2, random_state=0)
FA_fit = transformer.fit_transform(scaled)
FA_fit.shape
FAs_df = pd.DataFrame(data=FA_fit, columns=['FA1', 'FA2'])
FAs_df['Species'] = penguins.species.values
FAs_df['Sex'] = penguins.sex.values
FAs_df.head()

        FA1       FA2 Species     Sex
0 -1.329200  0.051197  Adelie    Male
1 -0.983690 -0.415561  Adelie  Female
2 -0.530587 -0.066072  Adelie  Female
3 -0.683030  0.384215  Adelie  Female
4 -0.857335  1.050659  Adelie    Male
By plotting the two factors as a scatter plot, we see the familiar pattern driven by species and sex and captured by the inferred latent factors.
plt.figure(figsize=(12, 10))
with sns.plotting_context("talk", font_scale=1.25):
    sns.scatterplot(x="FA1", y="FA2",
                    data=FAs_df,
                    hue="Species",
                    style="Sex",
                    s=100)
    plt.xlabel("Factor 1")
    plt.ylabel("Factor 2")
    plt.title("Factor Analysis")
    plt.savefig("FactorAnalysis_Example_in_Python.png", format='png', dpi=150)
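To interpret what the two factors represent, we can also take an optional look at the factor loadings stored in the fitted FactorAnalysis object's components_ attribute.

# components_ has shape (n_components, n_features);
# larger absolute loadings mean the feature contributes more to that factor
loadings_df = pd.DataFrame(transformer.components_.T,
                           index=data.columns,
                           columns=["Factor1", "Factor2"])
print(loadings_df)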
6. tSNE Example in Python
t-Distributed Stochastic Neighbor Embedding (tSNE) is a relatively new dimensionality reduction technique that is commonly used for visualizing high-dimensional datasets. Unlike PCA/ICA/NMF, tSNE is a non-linear dimensionality reduction technique.
We can perform tSNE analysis using scikit-learn’s TSNE module as shown below.
from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, random_state=0)
tsne_fit = tsne.fit_transform(scaled)
tsne_fit[0:5, 0:2]

array([[  4.392819 ,  -9.887315 ],
       [  6.0089364, -12.803539 ],
       [  6.9806666, -15.289839 ],
       [  7.6817703,  -7.7815695],
       [  5.2514114,  -3.143555 ]], dtype=float32)
We save the tSNE results as a dataframe with the Palmer penguin species names and the sex variable.
tSNE_df = pd.DataFrame(data=tsne_fit, columns=['tSNE1', 'tSNE2'])
tSNE_df['Species'] = penguins.species.values
tSNE_df['Sex'] = penguins.sex.values
tSNE_df.head()

      tSNE1      tSNE2 Species     Sex
0  4.392819  -9.887315  Adelie    Male
1  6.008936 -12.803539  Adelie  Female
2  6.980667 -15.289839  Adelie  Female
3  7.681770  -7.781569  Adelie  Female
4  5.251411  -3.143555  Adelie    Male
A tSNE plot made with the two components, colored by species and shaped by sex, nicely shows the structure in the four features we used to perform tSNE.
plt.figure(figsize=(12, 10))
with sns.plotting_context("talk", font_scale=1.25):
    sns.scatterplot(x="tSNE1", y="tSNE2",
                    data=tSNE_df,
                    hue="Species",
                    style="Sex",
                    s=100)
    plt.xlabel("tSNE1")
    plt.ylabel("tSNE2")
    plt.title("tSNE")
    plt.savefig("tSNE_Example_in_Python.png", format='png', dpi=150)
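Keep in mind that tSNE results can change noticeably with the perplexity parameter (and with the random seed). A minimal sketch, using scikit-learn's TSNE with a few different perplexity values, looks like this:

# tSNE embeddings depend on perplexity; try a few values and compare the plots
for perplexity in [5, 30, 50]:
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embedding = tsne.fit_transform(scaled)
    print(perplexity, embedding.shape)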
7. UMAP Example in Python
UMAP, aka Uniform Manifold Approximation and Projection for Dimension Reduction, is similar in flavor to tSNE. UMAP is also a non-linear dimensionality reduction technique and is commonly used for visualizing high-dimensional data.
Currently scikit-learn does not include UMAP; it has to be installed separately with conda or pip, as shown below.
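The Python package is called umap-learn, and either of these commands installs it.

# install UMAP from the command line (pick one)
# pip install umap-learn
# conda install -c conda-forge umap-learn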
import umap

reducer = umap.UMAP()
umap_fit = reducer.fit_transform(scaled)
umap_fit.shape

(333, 2)
umap_df = pd.DataFrame(data=umap_fit, columns=['UMAP1', 'UMAP2'])
umap_df['Species'] = penguins.species.values
umap_df['Sex'] = penguins.sex.values
umap_df.head()
       UMAP1      UMAP2 Species     Sex
0  11.074565   8.976759  Adelie    Male
1  11.175638  10.158457  Adelie  Female
2  10.951249  10.764420  Adelie  Female
3  10.011297   8.642998  Adelie  Female
4  10.638319   7.325177  Adelie    Male
plt.figure(figsize=(12, 10))
with sns.plotting_context("talk", font_scale=1.25):
    sns.scatterplot(x="UMAP1", y="UMAP2",
                    data=umap_df,
                    hue="Species",
                    style="Sex",
                    s=100)
    plt.xlabel("UMAP1")
    plt.ylabel("UMAP2")
    plt.title("UMAP")
    plt.savefig("UMAP_Example_in_Python.png", format='png', dpi=150)
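Note that UMAP's output depends on its n_neighbors and min_dist parameters, and because we did not set a random seed above, the embedding can vary slightly between runs. A minimal sketch of a more explicit call:

# set the two main UMAP parameters explicitly and fix the random seed for reproducibility
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0)
umap_fit = reducer.fit_transform(scaled)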
One of the biggest takeaways from looking at the simple scatter plots of the first/top two components from the seven different dimensionality reduction techniques is how similar the results are. With a relatively simple dataset like this one, with strong latent variables (species and sex), all of the methods perform nicely and capture those latent factors.