Python and R Tips

Learn Data Science with Python and R

Singular Value Decomposition (SVD) in Python

May 25, 2019 by cmdlinetips

Matrix decomposition by Singular Value Decomposition (SVD) is one of the most widely used methods for dimensionality reduction. For example, Principal Component Analysis (PCA) often uses SVD under the hood to compute principal components.
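To make the link between SVD and PCA concrete, here is a minimal sketch on made-up random data (not the gapminder data). It assumes the usual PCA definition based on the sample covariance matrix, and checks that the squared singular values of the mean-centered data, divided by n - 1, match the eigenvalues of that covariance matrix.

```python
import numpy as np

# Toy data: 50 samples, 4 features (random values, for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)  # mean-center each column

# Route 1: SVD of the centered data
u, s, vh = np.linalg.svd(Xc, full_matrices=False)
var_svd = s**2 / (Xc.shape[0] - 1)

# Route 2: eigenvalues of the sample covariance matrix, sorted descending
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]

print(np.allclose(var_svd, eigvals))
```

Both routes give the same variances, which is why a PCA implementation can work from the SVD of the data without forming the covariance matrix explicitly.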

In this post, we will work through an example of doing SVD in Python. We will use gapminder data in wide form to do the SVD analysis and use NumPy’s linalg.svd to do SVD.

Let us load the packages needed.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

We will apply SVD to the gapminder data, which we download from the Carpentries website.

data_url = "https://goo.gl/ioc2Td"
gapminder = pd.read_csv(data_url)
print(gapminder.head(3))


  continent  country  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
0    Africa  Algeria     2449.008185     3013.976023     2550.816880   
1    Africa   Angola     3520.610273     3827.940465     4269.276742   
2    Africa    Benin     1062.752200      959.601080      949.499064   

Let us filter the gapminder dataframe so that it contains only life expectancy values. We use Pandas’ pattern matching str.contains to select the columns. The pattern '^life|^c' keeps the columns whose names start with "life", along with the continent and country columns.

lifeExp = gapminder.loc[:, gapminder.columns.str.contains('^life|^c')]
lifeExp.head()

  continent  country  lifeExp_1952  lifeExp_1957  lifeExp_1962  lifeExp_1967  \
0    Africa  Algeria        43.077        45.685        48.303        51.407   
1    Africa   Angola        30.015        31.999        34.000        35.985   
2    Africa    Benin        38.223        40.358        42.618        44.885   
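Here is a small sketch of how this column selection behaves, using a toy data frame. The column names follow the gapminder wide format, but most of the values are made up for illustration. The last line shows DataFrame.filter with a regex as an alternative way to get the same columns.

```python
import pandas as pd

# Toy frame mimicking the wide gapminder layout (values are illustrative)
toy = pd.DataFrame({
    "continent": ["Africa", "Europe"],
    "country": ["Algeria", "Albania"],
    "gdpPercap_1952": [2449.0, 1601.0],
    "lifeExp_1952": [43.077, 55.23],
    "pop_1952": [9279525, 1282697],
})

# Boolean mask: True for column names starting with "life" or "c"
mask = toy.columns.str.contains("^life|^c")
selected = toy.loc[:, mask]
print(list(selected.columns))

# Alternative: DataFrame.filter applies the regex to the column names
selected_alt = toy.filter(regex="^(life|c)")
print(list(selected_alt.columns))
```

Note that '^c' would also match any other column starting with "c", so the pattern relies on continent and country being the only such columns.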

For now let us also filter the data so that it contains only countries from Africa and Europe.

lifeExp_AE = lifeExp[lifeExp.continent.isin(["Africa","Europe"])]
lifeExp_AE.shape

(82, 14)

Let us store the country and continent information separately as metadata.

lifeExp_meta = lifeExp_AE.loc[:, lifeExp_AE.columns.str.contains('^c')]
lifeExp_meta.head()

  continent       country
0    Africa       Algeria
1    Africa        Angola
2    Africa         Benin
3    Africa      Botswana
4    Africa  Burkina Faso

And let us drop the country and continent columns. Now the dataframe contains life expectancy values for each year.

lifeExp_AE = lifeExp_AE.drop(columns=['continent', 'country'])
lifeExp_AE.head(n=3)

   lifeExp_1952  lifeExp_1957  lifeExp_1962  lifeExp_1967  lifeExp_1972  \
0        43.077        45.685        48.303        51.407        54.518   
1        30.015        31.999        34.000        35.985        37.928   
2        38.223        40.358        42.618        44.885        47.014   

Before we actually perform SVD on our data set, let us mean center and scale so that each column of our data is on the same scale.

lifeExp_AE_scaled = (lifeExp_AE-lifeExp_AE.mean())/lifeExp_AE.std()
print(lifeExp_AE_scaled.head(n=3))


   lifeExp_1952  lifeExp_1957  lifeExp_1962  lifeExp_1967  lifeExp_1972  \
0     -0.394065     -0.362388     -0.318266     -0.220089     -0.116843   
1     -1.364384     -1.377154     -1.391081     -1.408750     -1.438966   
2     -0.754648     -0.757365     -0.744677     -0.722776     -0.714866   
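As a quick sanity check on this scaling step, the sketch below standardizes a toy data frame (random values, hypothetical column names) in the same way and confirms that each column ends up with mean 0 and standard deviation 1. One detail worth knowing: the Pandas std() method uses the sample standard deviation (ddof=1) by default, while NumPy's std() defaults to ddof=0.

```python
import numpy as np
import pandas as pd

# Toy data: 20 rows, 3 columns of random "life expectancy"-like values
rng = np.random.default_rng(1)
toy = pd.DataFrame(
    rng.normal(loc=50, scale=10, size=(20, 3)),
    columns=["lifeExp_1952", "lifeExp_1957", "lifeExp_1962"],
)

# Mean-center and scale each column, as in the post
scaled = (toy - toy.mean()) / toy.std()  # pandas std() uses ddof=1

print(np.allclose(scaled.mean(), 0.0))
print(np.allclose(scaled.std(), 1.0))

# The NumPy equivalent needs ddof=1 to agree with pandas
print(np.allclose(scaled.to_numpy().std(axis=0, ddof=1), 1.0))
```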

Now we are all set to perform SVD on the gapminder life expectancy data from two continents. We use NumPy’s linalg module’s svd function to do SVD. In addition to the scaled data, we also specify “full_matrices=True” to get all singular vectors.

# SVD with Numpy's linalg.svd()
u, s, v = np.linalg.svd(lifeExp_AE_scaled, 
                        full_matrices=True)

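The output of SVD is three arrays: u, s, and v. The columns of u are the left singular vectors, s is a one-dimensional array of singular values in descending order, and v holds the right singular vectors as its rows (NumPy returns the transposed matrix, often written vh). We can examine the dimensions of each with the shape attribute.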

print(u.shape)
(82, 82)
print(s.shape)
(12,)
print(v.shape)
(12, 12)
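To see what full_matrices changes, here is a sketch on a random matrix with the same 82 x 12 shape as our scaled data (the values are a stand-in, not the gapminder data). With full_matrices=False, NumPy returns the "thin" SVD, which is enough to rebuild the original matrix.

```python
import numpy as np

# Random stand-in with the same shape as the scaled life expectancy data
rng = np.random.default_rng(2)
A = rng.normal(size=(82, 12))

u_full, s, vh = np.linalg.svd(A, full_matrices=True)
u_thin, s_thin, vh_thin = np.linalg.svd(A, full_matrices=False)

print(u_full.shape, s.shape, vh.shape)  # (82, 82) (12,) (12, 12)
print(u_thin.shape)                     # (82, 12)

# Rebuild A from the thin SVD: scale the columns of u by s, then multiply by vh
A_rebuilt = (u_thin * s_thin) @ vh_thin
print(np.allclose(A, A_rebuilt))
```

Only the first 12 columns of the full u are paired with singular values, so the thin version loses nothing for the analysis in this post.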

Singular values help us compute the variance explained by each singular vector: the squared singular values are proportional to the variance captured along each direction. We can visualize the percent variance explained by each singular vector or PC to understand the structure in the data.

var_explained = np.round(s**2/np.sum(s**2), decimals=3)
var_explained

sns.barplot(x=list(range(1,len(var_explained)+1)),
            y=var_explained, color="limegreen")
plt.xlabel('SVs', fontsize=16)
plt.ylabel('Percent Variance Explained', fontsize=16)
plt.savefig('svd_scree_plot.png',dpi=100)

Here is the Scree plot giving us the percentage of variance explained by each singular vector. We can see that the first vector explains most of the variation in the data.

SVD Scree Plot
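A scree plot is often read together with the cumulative variance explained. The sketch below uses a random stand-in matrix (not the gapminder data) that is standardized the same way as above. It also checks a handy fact: for a standardized matrix with n rows and p columns, the squared singular values sum to (n - 1) * p.

```python
import numpy as np

# Random stand-in: 82 rows, 12 columns, standardized column-wise (ddof=1)
rng = np.random.default_rng(3)
X = rng.normal(size=(82, 12))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# compute_uv=False returns only the singular values
s = np.linalg.svd(X, compute_uv=False)

var_explained = s**2 / np.sum(s**2)   # proportions that sum to 1
cumulative = np.cumsum(var_explained)  # running total across components

print(np.round(var_explained, 3))
print(np.round(cumulative, 3))
print(np.isclose(np.sum(s**2), (X.shape[0] - 1) * X.shape[1]))
```

With random data the proportions are spread fairly evenly. With the life expectancy data the first component dominates because the yearly columns are strongly correlated.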

Let us create a data frame containing the first two singular vectors (PCs) and the meta data for the data.

labels= ['SV'+str(i) for i in range(1,3)]
svd_df = pd.DataFrame(u[:,0:2], index=lifeExp_meta["continent"].tolist(), columns=labels)
svd_df=svd_df.reset_index()
svd_df.rename(columns={'index':'Continent'}, inplace=True)
svd_df.head()

  Continent       SV1       SV2
0    Africa  0.014940 -0.212346
1    Africa -0.172656  0.046238
2    Africa -0.075906 -0.045773
3    Africa -0.021360  0.189510
4    Africa -0.111868 -0.052854
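The columns of u have unit length, so the values above are small. If you want principal component scores on the scale of the data, one common convention is to multiply each column of u by its singular value. The sketch below, on random stand-in data, checks that this gives the same result as projecting the centered data onto the right singular vectors.

```python
import numpy as np

# Random stand-in data, mean-centered column-wise
rng = np.random.default_rng(4)
X = rng.normal(size=(82, 12))
X = X - X.mean(axis=0)

u, s, vh = np.linalg.svd(X, full_matrices=False)

# Columns of u are unit vectors
print(np.allclose(np.linalg.norm(u[:, :2], axis=0), 1.0))

# PC scores: scale the first two left singular vectors by their singular values
scores_from_u = u[:, :2] * s[:2]

# Same scores by projecting the data onto the first two right singular vectors
scores_from_projection = X @ vh[:2].T

print(np.allclose(scores_from_u, scores_from_projection))
```

Scaling by the singular values stretches each axis but does not change which points are neighbors along a given axis. Also keep in mind that the sign of each singular vector is arbitrary, so a plot may appear mirrored between runs or libraries.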

We can use the data frame to make the PCA plot with the first vector on the x-axis and the second one on the y-axis. We will also color the data points based on the continent variable. We use Seaborn’s scatterplot to make the plot, with the palette argument specifying the colors.

# specify colors for each continent
color_dict = dict({'Africa':'Black',
                   'Europe': 'Red'})
# Scatter plot: SV1 and SV2
sns.scatterplot(x="SV1", y="SV2", hue="Continent", 
                palette=color_dict, 
                data=svd_df, s=100,
                alpha=0.7)
# format with one decimal place to avoid floating-point noise in the labels
plt.xlabel('SV 1: {0:.1f}%'.format(var_explained[0]*100), fontsize=16)
plt.ylabel('SV 2: {0:.1f}%'.format(var_explained[1]*100), fontsize=16)

We can clearly see the difference in life expectancy between Africa and Europe: with a few exceptions, countries from the same continent cluster together.

Singular Value Decomposition Plot

In summary, we saw a step-by-step example of using NumPy’s linalg.svd to do Singular Value Decomposition on the gapminder dataset. We learned how to find the singular vectors or principal components relevant to our data. In a future post we will see more examples of using SVD in Python.
