Python and R Tips

Learn Data Science with Python and R

PCA Example in Python with scikit-learn

March 18, 2018 by cmdlinetips

Principal Component Analysis (PCA) is one of the most useful techniques in exploratory data analysis for understanding data, reducing its dimensionality, and for unsupervised learning in general.

Let us quickly see a simple example of doing PCA in Python. Here we will use scikit-learn to do PCA on simulated data.

Let us load the basic packages needed for the analysis:

import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
# Jupyter/IPython magic to render plots inline; omit when running as a plain script
%matplotlib inline

Simulating data for PCA with scikit-learn

We will simulate data using scikit-learn’s make_blobs function from sklearn.datasets, and we will use the PCA implementation in sklearn.decomposition to do the analysis.

# load make_blobs to simulate data
from sklearn.datasets import make_blobs
# load decomposition to do PCA analysis with sklearn
from sklearn import decomposition

Let us make simulated data using make_blobs. make_blobs is one of the functions available in scikit-learn to construct simulated data. It makes it easy to build a data set with multiple Gaussian clusters and is widely used to test clustering algorithms. Here we will use make_blobs to generate a 100 x 10 data matrix, that is, 100 samples with 10 features each. These 100 samples are generated from four different clusters. Since the data is simulated, we know which cluster each sample belongs to.

X1, Y1 = make_blobs(n_features=10, 
         n_samples=100,
         centers=4, random_state=4,
         cluster_std=2)
print(X1.shape)
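
As a quick sanity check on the simulated data, the sketch below repeats the make_blobs call from above and counts the samples per cluster. The variable names labels and counts are introduced here only for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Same call as in the post: 100 samples, 10 features, 4 clusters
X1, Y1 = make_blobs(n_features=10,
                    n_samples=100,
                    centers=4, random_state=4,
                    cluster_std=2)

print(X1.shape)  # (100, 10)

# Count how many samples fall in each cluster (illustrative names)
labels, counts = np.unique(Y1, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```

With an integer n_samples, make_blobs splits the samples evenly across the centers, so each of the four clusters should contain 25 samples.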

Fitting PCA with scikit-learn

Here, X1 is the 100 x 10 data matrix and Y1 holds the cluster assignment for each of the 100 samples. Let us create a PCA model with 4 components from sklearn.decomposition.

pca = decomposition.PCA(n_components=4)

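scikit-learn’s PCA centers each feature internally before the decomposition, and all ten simulated features are generated with the same spread (cluster_std=2), so we can go ahead and fit the PCA model without further preprocessing. We will fit the PCA model to our data X1 using the fit_transform function; the result pc contains the samples projected onto the four principal components.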

pc = pca.fit_transform(X1)
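
To make concrete what fit_transform returns, here is a minimal sketch that reproduces the projection by hand. It relies on two fitted attributes of scikit-learn's PCA, mean_ and components_; the name manual is hypothetical and used only for this check.

```python
import numpy as np
from sklearn import decomposition
from sklearn.datasets import make_blobs

X1, _ = make_blobs(n_features=10, n_samples=100,
                   centers=4, random_state=4, cluster_std=2)

pca = decomposition.PCA(n_components=4)
pc = pca.fit_transform(X1)

# PCA subtracts the per-feature mean (stored in pca.mean_) and then
# projects onto the component directions (rows of pca.components_).
manual = (X1 - pca.mean_) @ pca.components_.T

print(pc.shape)                 # (100, 4)
print(np.allclose(pc, manual))  # True
```

Note that PCA only centers the data; it does not rescale features to unit variance. When features are on very different scales, standardizing them first is a common choice.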

Let us make a pandas data frame with the principal components (PCs) and the known cluster assignments.

pc_df = pd.DataFrame(data = pc , 
        columns = ['PC1', 'PC2','PC3','PC4'])
pc_df['Cluster'] = Y1
pc_df.head()

Variance Explained by PCs

Let us examine the variance explained by each principal component. We can clearly see that the first two principal components together explain about 75% of the variation in the data.

>>> pca.explained_variance_ratio_
array([ 0.41594854,  0.3391866 ,  0.1600729 ,  0.02016822])
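
The cumulative share of variance can be checked directly from the ratios printed above. This small sketch copies those four numbers into an array (rather than refitting the model) and sums them with NumPy's cumsum.

```python
import numpy as np

# Explained variance ratios as printed in the post
ratios = np.array([0.41594854, 0.3391866, 0.1600729, 0.02016822])

# Running total: variance explained by the first k components
cumulative = np.cumsum(ratios)
print(cumulative.round(3))  # [0.416 0.755 0.915 0.935]
```

The first two components account for roughly 75.5% of the variance, and all four together for about 93.5%.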

Let us plot the variance explained by each principal component. This is also called a scree plot.

df = pd.DataFrame({'var':pca.explained_variance_ratio_,
             'PC':['PC1','PC2','PC3','PC4']})
sns.barplot(x='PC',y="var", 
           data=df, color="c");

Figure: Variance Explained by Principal Components

PCA plot

Now we can use the top two principal components to make a scatter plot. We will use Seaborn’s lmplot to make the PCA plot, with the fit_reg=False option to suppress the regression line and ‘hue’ to color the points by cluster.

sns.lmplot( x="PC1", y="PC2",
  data=pc_df, 
  fit_reg=False, 
  hue='Cluster', # color by cluster
  legend=True,
  scatter_kws={"s": 80}) # specify the point size

We can clearly see the four clusters in our data. The two principal components are able to completely separate the clusters.
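
One way to back up the visual impression with numbers is to look at the average position of each cluster in the PC1/PC2 plane. The sketch below rebuilds the data frame from the steps above and groups by the known cluster labels; the name centroids is introduced here only for illustration, and this is a rough check rather than a formal test of separation.

```python
import pandas as pd
from sklearn import decomposition
from sklearn.datasets import make_blobs

X1, Y1 = make_blobs(n_features=10, n_samples=100,
                    centers=4, random_state=4, cluster_std=2)

pc = decomposition.PCA(n_components=4).fit_transform(X1)
pc_df = pd.DataFrame(data=pc, columns=['PC1', 'PC2', 'PC3', 'PC4'])
pc_df['Cluster'] = Y1

# Mean PC1/PC2 coordinates per known cluster (illustrative name)
centroids = pc_df.groupby('Cluster')[['PC1', 'PC2']].mean()
print(centroids.round(2))
```

Cluster means that sit far apart relative to the spread of the points are consistent with the clean separation seen in the scatter plot.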
