7 Dimensionality Reduction Techniques by Examples in Python

Dimension reduction techniques are among the most useful methods for unsupervised learning on high-dimensional datasets. In this post, we will learn how to use Python to perform the 7 most commonly used dimensionality reduction techniques, by example:

  1. PCA: Principal Component Analysis
  2. SVD: Singular Value Decomposition
  3. ICA: Independent Component Analysis
  4. NMF: Non-negative Matrix Factorization
  5. FA: Factor Analysis
  6. tSNE: t-distributed Stochastic Neighbor Embedding
  7. UMAP: Uniform Manifold Approximation and Projection

We will use the Palmer penguins data for performing these 7 dimensionality reduction techniques. One of the interesting things about working with the same data across all the different techniques is that we can see how similar the results are when there is a strong pattern in the data, as in the Palmer penguin dataset.

In case you are interested in performing dimensionality reduction techniques in R by example, check out this post: 6 Dimensionality Reduction Techniques in R (with Examples).

Loading the Python Modules needed

Let us get started by loading the Python packages needed.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

We will mainly rely on scikit-learn for most of the techniques, except for SVD and UMAP. Let us also import the scikit-learn modules we need for pipelines, preprocessing, and decomposition.

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA, KernelPCA, FastICA, NMF, FactorAnalysis

Seaborn has the Palmer penguin dataset built in, and we will load it using Seaborn's load_dataset() function.

penguins = sns.load_dataset("penguins")

Preparing the penguins data for dimensionality reduction analysis

Since most of these dimensionality reduction techniques cannot handle missing values, let us remove any rows containing them.

penguins = (penguins.
            dropna())

And this is what our data looks like.

penguins.head()

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
0	Adelie	Torgersen	39.1	18.7	181.0	3750.0	Male
1	Adelie	Torgersen	39.5	17.4	186.0	3800.0	Female
2	Adelie	Torgersen	40.3	18.0	195.0	3250.0	Female
4	Adelie	Torgersen	36.7	19.3	193.0	3450.0	Female
5	Adelie	Torgersen	39.3	20.6	190.0	3650.0	Male

We already know that in the Palmer penguin dataset the variables species and sex are the main drivers of the variation in the other features, like body mass. Our goal in performing these dimensionality reduction techniques is to assess how well these two variables are captured by the first two latent variables from each method. Let us select the numeric columns to use as input.

data = (penguins.
        select_dtypes(np.number)
       )
data.head()

bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g
0	39.1	18.7	181.0	3750.0
1	39.5	17.4	186.0	3800.0
2	40.3	18.0	195.0	3250.0
4	36.7	19.3	193.0	3450.0
5	39.3	20.6	190.0	3650.0

1. PCA Example in Python

To perform Principal Component Analysis (PCA), we can use scikit-learn's PCA() method. Here we use the make_pipeline() function in sklearn to first normalize the data using StandardScaler() and then apply PCA to get the first two Principal Components (PCs).

random_state = 0
pca_pl = make_pipeline(
    StandardScaler(),
    PCA(n_components = 2,
        random_state = random_state)
)

We can apply the pipeline to the data to get the PCs. The PCs from sklearn's PCA are returned as a NumPy array.

pcs = pca_pl.fit_transform(data)
pcs[0:5,:]

array([[-1.85359302,  0.03206938],
       [-1.31625406, -0.44352677],
       [-1.37660509, -0.16123048],
       [-1.88528838, -0.01235124],
       [-1.91998074,  0.81759813]])

Here we save the PCs in a Pandas dataframe along with the penguin species and sex data.

pcs_df = pd.DataFrame(data = pcs , 
        columns = ['PC1', 'PC2'])
pcs_df['Species'] = penguins.species.values
pcs_df['Sex'] = penguins.sex.values
pcs_df.head()

             PC1	PC2	Species	Sex
0	-1.853593	0.032069	Adelie	Male
1	-1.316254	-0.443527	Adelie	Female
2	-1.376605	-0.161230	Adelie	Female
3	-1.885288	-0.012351	Adelie	Female
4	-1.919981	0.817598	Adelie	Male

Now we can make a PCA plot with PC1 on the x-axis and PC2 on the y-axis, colored by species and with shapes for the sex variable.

plt.figure(figsize=(12,10))
with sns.plotting_context("talk",font_scale=1.25):
    sns.scatterplot(x="PC1", y="PC2",
                    data=pcs_df, 
                    hue="Species",
                    style="Sex",
                    s=100)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.title("PCA", size=24)
plt.savefig("PCA_Example_in_Python.png",
                    format='png',dpi=150)

We can see that the species variable drives most of the variation: PC1, the first principal component that explains most of the variation, nicely separates Gentoo from the other two species. We can also see that PC2 has captured the effect of sex nicely.

PCA Example in Python
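
To quantify how much of the total variation each PC captures, we can look at the explained variance ratio of the fitted PCA step. A minimal sketch, assuming the pipeline fitted above (make_pipeline names the PCA step 'pca' by default):

# fraction of total variance explained by PC1 and PC2
pca_step = pca_pl.named_steps['pca']
print(pca_step.explained_variance_ratio_)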

2. Singular Value Decomposition (SVD) Example in Python

Singular Value Decomposition (SVD) is another commonly used dimensionality reduction technique, and it is directly related to PCA. In Python, we can use NumPy's linalg module to perform SVD. First we use StandardScaler() to normalize the data and get the scaled data.

scaler = StandardScaler()
# transform data
scaled = scaler.fit_transform(data)

Then we use the normalized/scaled data to perform SVD. The result is three arrays: U, the singular values S, and V transposed.

U, S, V = np.linalg.svd(scaled, full_matrices=True)

The matrix U contains the left singular vectors; scaled by the singular values in S, its columns correspond to the PCs.

U

array([[-0.06130458, -0.00199226,  0.02120124, ...,  0.05959822,
         0.0091116 ,  0.04840103],
       [-0.04353297,  0.0275534 ,  0.00247933, ..., -0.02109425,
        -0.06389134, -0.01298945],
       [-0.04552898,  0.01001619, -0.01712056, ..., -0.03946717,
        -0.01007025,  0.00511708],
       ...,
       [ 0.09100157, -0.01655935,  0.03784485, ...,  0.98856621,
        -0.00729967, -0.00928874],
       [ 0.05668293,  0.04509385,  0.02371596, ..., -0.00741851,
         0.99157553, -0.0067731 ],
       [ 0.06675983, -0.02090787,  0.0140195 , ..., -0.00978679,
        -0.00674207,  0.99012727]])
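
Since the PC scores from PCA are just the left singular vectors scaled by the singular values, we can verify the connection between SVD and PCA directly. A minimal sketch, assuming the pcs array computed in the PCA section above (the columns can differ by a sign flip, so we compare absolute values):

# PC scores = U * S; compare with sklearn's PCA scores up to sign
pcs_from_svd = U[:, 0:2] * S[0:2]
print(np.allclose(np.abs(pcs_from_svd), np.abs(pcs)))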

As before, we save the first two columns of U (the top two singular vectors) in a Pandas dataframe with the species and sex variables.

SVs_df = pd.DataFrame(data = U[:,0:2] , 
        columns = ['SV1', 'SV2'])
SVs_df['Species'] = penguins.species.values
SVs_df['Sex'] = penguins.sex.values
SVs_df.head()

SV1	SV2	Species	Sex
0	-0.061305	-0.001992	Adelie	Male
1	-0.043533	0.027553	Adelie	Female
2	-0.045529	0.010016	Adelie	Female
3	-0.062353	0.000767	Adelie	Female
4	-0.063500	-0.050792	Adelie	Male

When we make a scatter plot of the two singular vectors and use color/shape to highlight the latent variables they have captured, we can see how similar it looks to the PCA plot from earlier.

plt.figure(figsize=(12,10))
with sns.plotting_context("talk",font_scale=1.25):
    sns.scatterplot(x="SV1", y="SV2",
                    data=SVs_df, 
                    hue="Species",
                    style="Sex",
                    s=100)
    plt.xlabel("SV1")
    plt.ylabel("SV2")
    plt.title("SVD", size=32)
plt.savefig("SVD_Example_in_Python.png",
                    format='png',dpi=150)
SVD Example in Python

3. ICA: Independent Component Analysis Example in Python

Independent Component Analysis, or ICA, is another commonly used dimensionality reduction technique. ICA aims to estimate latent components (ICs) that are statistically independent of each other.

In Python we can use FastICA() from scikit-learn to perform ICA. In the example below we use the scaled data to do ICA and get two independent components.

transformer = FastICA(n_components=2,
                      random_state=0)
ICA_fit = transformer.fit_transform(scaled)

Then we save the ICs in a Pandas dataframe with both the species and sex variables from the dataset.

ICs_df = pd.DataFrame(data = ICA_fit , 
        columns = ['IC1', 'IC2'])
ICs_df['Species'] = penguins.species.values
ICs_df['Sex'] = penguins.sex.values
ICs_df.head()

IC1	IC2	Species	Sex
0	0.056393	0.024125	Adelie	Male
1	0.027792	0.043381	Adelie	Female
2	0.037019	0.028334	Adelie	Female
3	0.056176	0.027069	Adelie	Female
4	0.079024	-0.019165	Adelie	Male

We can understand what the ICs have captured by making a scatter plot of the two ICs and adding color/shape for the species/sex variables.

plt.figure(figsize=(12,10))
with sns.plotting_context("talk",font_scale=1.25):
    sns.scatterplot(x="IC1", y="IC2",
                    data=ICs_df, 
                    hue="Species",
                    style="Sex",
                    s=100)
    plt.xlabel("IC1")
    plt.ylabel("IC2")
    plt.title("ICA")
plt.savefig("ICA_Example_in_Python.png",
                    format='png',dpi=150)

We can see that the two ICs have nicely captured the variation driven by species and sex variables.

ICA Example in Python
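
If we want to see how the original features relate to the two ICs, we can inspect the fitted estimator. A minimal sketch, assuming the FastICA transformer fitted above (scikit-learn exposes the estimated unmixing matrix as components_ and the mixing matrix as mixing_):

# each row of components_ expresses one IC in terms of the original features
print(pd.DataFrame(transformer.components_,
                   index=['IC1', 'IC2'],
                   columns=data.columns))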

4. NMF: Non-Negative Matrix Factorization in Python

Non-negative Matrix Factorization (NMF) works with an input data matrix containing non-negative values and performs matrix factorization. Our scaled data obtained by StandardScaler() has negative values, so we will instead use min-max scaling, which transforms the data to the 0-1 range, and use that for performing NMF.

MMscaler = MinMaxScaler()
# transform data
MM_scaled = MMscaler.fit_transform(data)
nmf_model = NMF(n_components=2,
                init='random',
                max_iter=500,
                random_state=0)
W = nmf_model.fit_transform(MM_scaled)
H = nmf_model.components_

We used the NMF() function available in scikit-learn and got two factors from the input data. Then we save the results as a Pandas dataframe with the species and sex values.

NMFs_df = pd.DataFrame(data = W , 
        columns = ['NMF1', 'NMF2'])
NMFs_df['Species'] = penguins.species.values
NMFs_df['Sex'] = penguins.sex.values
NMFs_df.head()

NMF1	NMF2	Species	Sex
0	0.127601	0.518228	Adelie	Male
1	0.198881	0.397550	Adelie	Female
2	0.203366	0.454378	Adelie	Female
3	0.148316	0.556137	Adelie	Female
4	0.146132	0.685184	Adelie	Male
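
Before plotting, we can also check how well the two factors approximate the 0-1 scaled data. A minimal sketch comparing the reconstruction W @ H against the error reported by the fitted model:

# reconstruct the scaled data from the two factors and compare the
# Frobenius-norm error with the value reported by scikit-learn's NMF
reconstructed = W @ H
print(np.linalg.norm(MM_scaled - reconstructed))
print(nmf_model.reconstruction_err_)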

The scatter plot, similar to the PCA plot, reveals the nice structure captured by NMF: the species and sex variables account for the hidden structure obtained in an unsupervised manner.

plt.figure(figsize=(12,10))
with sns.plotting_context("talk",font_scale=1.25):
    sns.scatterplot(x="NMF1", y="NMF2",
                    data=NMFs_df, 
                    hue="Species",
                    style="Sex",
                    s=100)
    plt.xlabel("NMF1")
    plt.ylabel("NMF2")
    plt.title("NMF")
plt.savefig("NMF_Example_in_Python.png",
                    format='png',dpi=150)
NMF Example in Python

5. Factor Analysis

Factor analysis, another common dimensionality reduction technique, aims to capture the hidden structure in the data as latent factors.

In Python we can perform Factor Analysis using the FactorAnalysis() function available in scikit-learn.

transformer = FactorAnalysis(n_components=2, 
                             random_state=0)
FA_fit = transformer.fit_transform(scaled)
FA_fit.shape
FAs_df = pd.DataFrame(data = FA_fit , 
        columns = ['FA1', 'FA2'])
FAs_df['Species'] = penguins.species.values
FAs_df['Sex'] = penguins.sex.values
FAs_df.head()


FA1	FA2	Species	Sex
0	-1.329200	0.051197	Adelie	Male
1	-0.983690	-0.415561	Adelie	Female
2	-0.530587	-0.066072	Adelie	Female
3	-0.683030	0.384215	Adelie	Female
4	-0.857335	1.050659	Adelie	Male

By plotting the two factors as a scatter plot, we see the familiar pattern driven by species and sex, captured by the inferred latent factors.

plt.figure(figsize=(12,10))
with sns.plotting_context("talk",font_scale=1.25):
    sns.scatterplot(x="FA1", y="FA2",
                    data=FAs_df, 
                    hue="Species",
                    style="Sex",
                    s=100)
    plt.xlabel("Factor 1")
    plt.ylabel("Factor 2")
    plt.title("Factor Analysis")
plt.savefig("FactorAnalysis_Example_in_Python.png",
                    format='png',dpi=150)
Factor Analysis Example in Python
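
To interpret what the two factors represent, we can look at the factor loadings, i.e. how strongly each original feature loads on each factor. A minimal sketch, assuming the FactorAnalysis transformer fitted above (scikit-learn stores the loadings in the components_ attribute):

# each row gives the loadings of one factor on the four original features
loadings = pd.DataFrame(transformer.components_,
                        index=['Factor1', 'Factor2'],
                        columns=data.columns)
print(loadings)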

6. tSNE Example in Python

t-Distributed Stochastic Neighbor Embedding, or t-SNE, is a relatively new dimensionality reduction technique that is commonly used for visualizing high-dimensional datasets. Unlike PCA/ICA/NMF, tSNE is a non-linear dimensionality reduction technique.

We can perform tSNE analysis using scikit-learn’s TSNE module as shown below.

from sklearn.manifold import TSNE
tsne = TSNE(n_components=2, random_state=0)
tsne_fit = tsne.fit_transform(scaled)
tsne_fit[0:5,0:2]

array([[  4.392819 ,  -9.887315 ],
       [  6.0089364, -12.803539 ],
       [  6.9806666, -15.289839 ],
       [  7.6817703,  -7.7815695],
       [  5.2514114,  -3.143555 ]], dtype=float32)

We save the tSNE results as a dataframe with the Palmer penguin species names and the sex variable.

tSNE_df = pd.DataFrame(data = tsne_fit , 
        columns = ['tSNE1', 'tSNE2'])
tSNE_df['Species'] = penguins.species.values
tSNE_df['Sex'] = penguins.sex.values
tSNE_df.head()

	tSNE1	tSNE2	Species	Sex
0	4.392819	-9.887315	Adelie	Male
1	6.008936	-12.803539	Adelie	Female
2	6.980667	-15.289839	Adelie	Female
3	7.681770	-7.781569	Adelie	Female
4	5.251411	-3.143555	Adelie	Male

The tSNE plot made with the two components, colored by species and sex, nicely shows the structure in the four features we used to perform tSNE.

plt.figure(figsize=(12,10))
with sns.plotting_context("talk",font_scale=1.25):
    sns.scatterplot(x="tSNE1", y="tSNE2",
                    data=tSNE_df, 
                    hue="Species",
                    style="Sex",
                    s=100)
    plt.xlabel("tSNE1")
    plt.ylabel("tSNE2")
    plt.title("tSNE")
plt.savefig("tSNE_Example_in_Python.png",
                    format='png',dpi=150)
tSNE example in Python
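
Note that the tSNE embedding depends on hyperparameters, in particular perplexity (the scikit-learn default is 30). As a minimal sketch of checking how stable the structure is, we can re-run tSNE with a different perplexity (the value 5 here is just illustrative, not tuned):

# re-run tSNE with a smaller perplexity; the cluster structure
# should be broadly similar even if the coordinates change
tsne_p5 = TSNE(n_components=2, perplexity=5, random_state=0)
tsne_fit_p5 = tsne_p5.fit_transform(scaled)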

7. UMAP Example in Python

UMAP, aka Uniform Manifold Approximation and Projection for Dimension Reduction, is similar in flavor to tSNE. UMAP is also a non-linear dimensionality reduction technique and is commonly used for visualizing high-dimensional data.

Currently scikit-learn does not include UMAP. One has to install it separately, for example with pip (the package is named umap-learn) or from conda-forge.

import umap
reducer = umap.UMAP()
umap_fit = reducer.fit_transform(scaled)
umap_fit.shape

(333, 2)

umap_df = pd.DataFrame(data = umap_fit , 
        columns = ['UMAP1', 'UMAP2'])
umap_df['Species'] = penguins.species.values
umap_df['Sex'] = penguins.sex.values
umap_df.head()

	UMAP1	UMAP2	Species	Sex
0	11.074565	8.976759	Adelie	Male
1	11.175638	10.158457	Adelie	Female
2	10.951249	10.764420	Adelie	Female
3	10.011297	8.642998	Adelie	Female
4	10.638319	7.325177	Adelie	Male

plt.figure(figsize=(12,10))
with sns.plotting_context("talk",font_scale=1.25):
    sns.scatterplot(x="UMAP1", y="UMAP2",
                    data=umap_df, 
                    hue="Species",
                    style="Sex",
                    s=100)
    plt.xlabel("UMAP1")
    plt.ylabel("UMAP2")
    plt.title("UMAP")
plt.savefig("UMAP_Example_in_Python.png",
                    format='png',dpi=150)
UMAP Example in Python
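
Like tSNE, UMAP has hyperparameters that shape the embedding, notably n_neighbors and min_dist. As a minimal sketch (the values below are illustrative, not tuned), we can run UMAP with explicit settings and compare the resulting plot to the default one above:

# larger n_neighbors emphasizes more global structure;
# smaller min_dist packs the points within clusters more tightly
reducer2 = umap.UMAP(n_neighbors=30, min_dist=0.3, random_state=42)
umap_fit2 = reducer2.fit_transform(scaled)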

One of the biggest highlights of looking at the results as simple scatter plots of the first/top two components from 7 different dimensionality reduction techniques is how similar the results are. With a simple dataset like this one, which has strong latent variables (like species and sex), all of the methods perform nicely and capture the latent factors.