PCA Example in Python with scikit-learn

Principal Component Analysis (PCA) is one of the most useful techniques in Exploratory Data Analysis for understanding data, reducing its dimensionality, and for unsupervised learning in general.

Let us quickly walk through a simple example of doing PCA in Python. Here we will use scikit-learn to perform PCA on simulated data.

Let us load the basic packages needed for the PCA analysis.

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
%matplotlib inline

Simulating data for PCA with scikit-learn

We will simulate data using scikit-learn's make_blobs function from sklearn.datasets, and we will use the PCA implementation in scikit-learn to do the PCA analysis.

# load make_blobs to simulate data
from sklearn.datasets import make_blobs
# load decomposition to do PCA analysis with sklearn
from sklearn import decomposition

Let us make simulated data using make_blobs. make_blobs is one of the functions available in sklearn.datasets for constructing simulated data. It can easily generate data sets with multiple Gaussian clusters and is widely used to test clustering algorithms. Here we will use make_blobs to generate a 100 x 10 data matrix: 100 samples, each with 10 features. The 100 samples are drawn from four different clusters, and since the data is simulated, we know which cluster each sample belongs to.

X1, Y1 = make_blobs(n_features=10,
                    n_samples=100,
                    centers=4,
                    random_state=4,
                    cluster_std=2)
print(X1.shape)
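
As a quick sanity check (an extra step, not part of the original walkthrough), we can confirm that make_blobs split the 100 samples evenly across the four clusters:

labels, counts = np.unique(Y1, return_counts=True)
print(labels, counts)  # expect four cluster labels with 25 samples each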

Fitting PCA with scikit-learn

Here, X1 is the 100 x 10 data matrix and Y1 contains the cluster assignment for each of the 100 samples. Let us create a PCA model with 4 components using sklearn.decomposition.

pca = decomposition.PCA(n_components=4)
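
As a side note, scikit-learn's PCA also accepts a float between 0 and 1 for n_components; in that case it keeps however many components are needed to explain that fraction of the variance. A minimal sketch (not used in the rest of this example):

# keep as many components as needed to explain 90% of the variance
pca_90 = decomposition.PCA(n_components=0.90, svd_solver='full')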

The simulated features are already on a comparable scale (and scikit-learn's PCA centers the data internally), so we can go ahead and fit the PCA model. We will fit the model with the fit_transform function on our data X1; the result pc contains the principal component scores.

pc = pca.fit_transform(X1)
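
For real-world data whose features are on very different scales, a common pattern is to standardize before PCA. A minimal sketch, reusing the same X1 (not needed here, since our simulated features share the same scale):

from sklearn.preprocessing import StandardScaler
# center each feature to mean 0 and scale it to unit variance before PCA
X_scaled = StandardScaler().fit_transform(X1)
pc_scaled = decomposition.PCA(n_components=4).fit_transform(X_scaled)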

Let us make a pandas data frame with the principal components (PCs) and the known cluster assignments.

pc_df = pd.DataFrame(data=pc,
                     columns=['PC1', 'PC2', 'PC3', 'PC4'])
pc_df['Cluster'] = Y1
pc_df.head()

Variance Explained by PCs

Let us examine the variance explained by each principal component. We can clearly see that the first two principal components explain over 75% of the variation in the data.

pca.explained_variance_ratio_
array([ 0.41594854,  0.3391866 ,  0.1600729 ,  0.02016822])
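
We can verify that figure directly with the cumulative sum of the ratios; the second entry is the total for PC1 and PC2:

print(np.cumsum(pca.explained_variance_ratio_))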

Let us plot the variance explained by each principal component. This is also called a Scree plot.

df = pd.DataFrame({'var': pca.explained_variance_ratio_,
                   'PC': ['PC1', 'PC2', 'PC3', 'PC4']})
sns.barplot(x='PC', y='var',
            data=df, color='c');
Variance Explained by Principal Components

PCA plot

Now we can use the top two principal components to make a scatter plot. We will use Seaborn's lmplot to make the PCA plot, with the fit_reg=False option to suppress the regression line and hue='Cluster' to color points by cluster.

sns.lmplot(x="PC1", y="PC2",
           data=pc_df,
           fit_reg=False,
           hue='Cluster',  # color by cluster
           legend=True,
           scatter_kws={"s": 80})  # specify the point size

We can clearly see the four clusters in our data. The first two principal components are enough to separate the clusters completely.
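
If we want a number to back up that visual impression, one option (an extra step, not part of the original example) is the silhouette score of the known cluster labels in the 2-D PC space; values close to 1 indicate well-separated clusters:

from sklearn.metrics import silhouette_score
# silhouette score of the known cluster labels in PC1/PC2 space
print(silhouette_score(pc_df[['PC1', 'PC2']], pc_df['Cluster']))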