Singular Value Decomposition (SVD) in Python

Matrix decomposition by Singular Value Decomposition (SVD) is one of the most widely used methods for dimensionality reduction. For example, Principal Component Analysis (PCA) often uses SVD under the hood to compute principal components.
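
As a quick reminder of what the decomposition gives us, here is a minimal sketch on a small random matrix. The matrix `A` and the random seed are made up for illustration and are not part of the gapminder analysis. SVD factors `A` into `U`, a vector of singular values `s`, and `Vt`, and multiplying the three back together should recover `A`.

```python
import numpy as np

# Hypothetical toy matrix, used only to illustrate the factorization
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))

# Thin SVD: U is 6x3, s has 3 entries, Vt is 3x3
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rebuild A from its factors: A = U * diag(s) * Vt
A_rebuilt = U @ np.diag(s) @ Vt
print(np.allclose(A, A_rebuilt))   # True

# Singular values come back sorted from largest to smallest
print(np.all(np.diff(s) <= 0))     # True
```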

In this post, we will work through an example of doing SVD in Python. We will use the gapminder data in wide form and compute the SVD with NumPy’s linalg.svd.

Let us load the packages needed.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

We will apply SVD to the gapminder data, which we download from the Carpentries website.

data_url = "https://goo.gl/ioc2Td"
gapminder = pd.read_csv(data_url)
print(gapminder.head(3))


  continent  country  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
0    Africa  Algeria     2449.008185     3013.976023     2550.816880   
1    Africa   Angola     3520.610273     3827.940465     4269.276742   
2    Africa    Benin     1062.752200      959.601080      949.499064   

Let us filter the gapminder dataframe so that it contains only the life expectancy values, along with the continent and country columns. We use Pandas’ str.contains with a regular expression to select the columns whose names start with "life" or "c".

lifeExp = gapminder.loc[:, gapminder.columns.str.contains('^life|^c')]
lifeExp.head()

  continent  country  lifeExp_1952  lifeExp_1957  lifeExp_1962  lifeExp_1967  \
0    Africa  Algeria        43.077        45.685        48.303        51.407   
1    Africa   Angola        30.015        31.999        34.000        35.985   
2    Africa    Benin        38.223        40.358        42.618        44.885   
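
To see what the pattern '^life|^c' is doing, here is a small sketch on a toy dataframe. The column names follow the gapminder naming scheme, but the numbers are placeholder values made up for this example. The pattern keeps any column whose name starts with "life" or with "c", which is why continent and country survive the filter.

```python
import pandas as pd

# Toy dataframe with placeholder values (not real gapminder numbers)
toy = pd.DataFrame({
    "continent": ["Africa", "Europe"],
    "country": ["A", "B"],
    "gdpPercap_1952": [1.0, 2.0],
    "lifeExp_1952": [3.0, 4.0],
    "pop_1952": [5, 6],
})

# Boolean mask: True where the column name starts with "life" or "c"
mask = toy.columns.str.contains("^life|^c")
selected = toy.loc[:, mask]
print(list(selected.columns))
# ['continent', 'country', 'lifeExp_1952']
```

Note that '^c' would also match any other column starting with "c", so this shortcut works only because continent and country are the sole such columns here.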

For now, let us also restrict the data to countries from Africa and Europe.

lifeExp_AE = lifeExp[lifeExp.continent.isin(["Africa","Europe"])]
lifeExp_AE.shape

(82, 14)

Let us store the country and continent information separately as metadata.

lifeExp_meta = lifeExp_AE.loc[:, lifeExp_AE.columns.str.contains('^c')]
lifeExp_meta.head()

  continent       country
0    Africa       Algeria
1    Africa        Angola
2    Africa         Benin
3    Africa      Botswana
4    Africa  Burkina Faso

And let us drop the country and continent columns. Now the dataframe contains life expectancy values for each year.

lifeExp_AE = lifeExp_AE.drop(columns=['continent', 'country'])
lifeExp_AE.head(n=3)

   lifeExp_1952  lifeExp_1957  lifeExp_1962  lifeExp_1967  lifeExp_1972  \
0        43.077        45.685        48.303        51.407        54.518   
1        30.015        31.999        34.000        35.985        37.928   
2        38.223        40.358        42.618        44.885        47.014   

Before we actually perform SVD on our data set, let us mean-center each column and divide it by its standard deviation, so that all columns are on the same scale.

lifeExp_AE_scaled = (lifeExp_AE-lifeExp_AE.mean())/lifeExp_AE.std()
print(lifeExp_AE_scaled.head(n=3))


   lifeExp_1952  lifeExp_1957  lifeExp_1962  lifeExp_1967  lifeExp_1972  \
0     -0.394065     -0.362388     -0.318266     -0.220089     -0.116843   
1     -1.364384     -1.377154     -1.391081     -1.408750     -1.438966   
2     -0.754648     -0.757365     -0.744677     -0.722776     -0.714866   
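
A quick way to convince ourselves that the scaling worked is to check that every scaled column has mean 0 and standard deviation 1. The sketch below does this on a toy dataframe with made-up values. One detail worth knowing: Pandas’ std uses the sample standard deviation (ddof=1) by default, while NumPy’s std defaults to ddof=0, so we pass ddof=1 explicitly to get the same result with NumPy.

```python
import numpy as np
import pandas as pd

# Toy data with made-up values, two columns on very different scales
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                    "b": [10.0, 20.0, 40.0, 50.0]})

# Same standardization as in the post (pandas std uses ddof=1)
scaled = (toy - toy.mean()) / toy.std()

print(np.allclose(scaled.mean(), 0.0))  # True
print(np.allclose(scaled.std(), 1.0))   # True

# Equivalent NumPy version: ddof=1 must be requested explicitly
values = toy.to_numpy()
scaled_np = (values - values.mean(axis=0)) / values.std(axis=0, ddof=1)
print(np.allclose(scaled.to_numpy(), scaled_np))  # True
```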

Now we are all set to perform SVD on the gapminder life expectancy data from two continents. We use the svd function from NumPy’s linalg module. In addition to the scaled data, we also specify “full_matrices=True” to get the full square matrix of left singular vectors rather than only the first 12 columns.

# SVD with NumPy's linalg.svd()
# Note: the third return value is V transposed (its rows are the right singular vectors)
u, s, v = np.linalg.svd(lifeExp_AE_scaled, 
                        full_matrices=True)

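The output of linalg.svd is three arrays: u, s, and v. The columns of u are the left singular vectors, the rows of v are the right singular vectors (NumPy returns V transposed), and s is a one-dimensional array of singular values sorted in descending order. We can examine the dimensions of each with the shape attribute.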

print(u.shape)
(82, 82)
print(s.shape)
(12,)
print(v.shape)
(12, 12)
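
To make these shapes a bit more concrete, here is a small sketch comparing full_matrices=True with full_matrices=False. It uses a random matrix with the same shape as our scaled data (82 x 12); the values are random and only the shapes matter. With the full version, only the first 12 columns of u are needed to rebuild the data, which is why the thin version is usually enough for PCA-style analysis.

```python
import numpy as np

# Random stand-in with the same shape as the scaled gapminder data
rng = np.random.default_rng(1)
X = rng.normal(size=(82, 12))

u_full, s_full, vt_full = np.linalg.svd(X, full_matrices=True)
u_thin, s_thin, vt_thin = np.linalg.svd(X, full_matrices=False)

print(u_full.shape, u_thin.shape)    # (82, 82) (82, 12)
print(vt_full.shape, vt_thin.shape)  # (12, 12) (12, 12)

# Thin SVD reconstructs X directly
X_from_thin = u_thin @ np.diag(s_thin) @ vt_thin

# Full SVD: only the first 12 columns of u pair up with singular values
k = len(s_full)
X_from_full = u_full[:, :k] @ np.diag(s_full) @ vt_full

print(np.allclose(X, X_from_thin), np.allclose(X, X_from_full))  # True True
```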

Singular values help us compute the variance explained by each singular vector: the squared singular values are proportional to the variance along each component. We can visualize the percent variance explained by each singular vector or PC to understand the structure in the data.

var_explained = np.round(s**2/np.sum(s**2), decimals=3)
var_explained

# var_explained holds fractions, so multiply by 100 to plot percentages
sns.barplot(x=list(range(1,len(var_explained)+1)),
            y=var_explained*100, color="limegreen")
plt.xlabel('SVs', fontsize=16)
plt.ylabel('Percent Variance Explained', fontsize=16)
plt.savefig('svd_scree_plot.png',dpi=100)

Here is the scree plot showing the percentage of variance explained by each singular vector. We can see that the first singular vector explains most of the variation in the data.

SVD Scree Plot
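
Why do squared singular values measure variance? For mean-centered data, the squared singular values divided by n - 1 are the eigenvalues of the covariance matrix, which is exactly what PCA works with. The sketch below checks this on a random matrix (made-up data, not gapminder): the variance fractions from SVD match the ones from the covariance eigenvalues.

```python
import numpy as np

# Random correlated data, used only for illustration
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)   # mean-center the columns

# Variance fractions from the singular values
s = np.linalg.svd(Xc, compute_uv=False)
var_from_svd = s**2 / np.sum(s**2)

# Variance fractions from the covariance eigenvalues (the PCA route)
cov = np.cov(Xc, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh is ascending; reverse it
var_from_eig = eigvals / eigvals.sum()

print(np.allclose(var_from_svd, var_from_eig))   # True
print(np.allclose(s**2 / (len(Xc) - 1), eigvals))  # True
```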

Let us create a data frame containing the first two left singular vectors (PCs) and the continent metadata.

labels= ['SV'+str(i) for i in range(1,3)]
svd_df = pd.DataFrame(u[:,0:2], index=lifeExp_meta["continent"].tolist(), columns=labels)
svd_df=svd_df.reset_index()
svd_df.rename(columns={'index':'Continent'}, inplace=True)
svd_df.head()

	Continent	SV1	SV2
0	Africa	0.014940	-0.212346
1	Africa	-0.172656	0.046238
2	Africa	-0.075906	-0.045773
3	Africa	-0.021360	0.189510
4	Africa	-0.111868	-0.052854
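
A small note on terminology: the columns of u are unit-length singular vectors, while the principal component scores that PCA tools usually report are u multiplied by the singular values, which is the same as projecting the centered data onto the right singular vectors. The sketch below checks this relationship on random data (made up for this example). For a scatter plot of the first two columns this only rescales the axes, so the pattern of points stays the same.

```python
import numpy as np

# Random data, used only to illustrate the relationship
rng = np.random.default_rng(3)
X = rng.normal(size=(20, 5))
Xc = X - X.mean(axis=0)

u, s, vt = np.linalg.svd(Xc, full_matrices=False)

# PC scores, computed two ways
scores_from_u = u * s                 # scales column j of u by s[j]
scores_from_projection = Xc @ vt.T    # project the data onto the rows of vt

print(np.allclose(scores_from_u, scores_from_projection))  # True

# Columns of u have unit length, so u alone carries no scale information
print(np.allclose(np.linalg.norm(u, axis=0), 1.0))  # True
```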

We can use the data frame to make the PCA plot with the first singular vector on the x-axis and the second one on the y-axis. We will also color the data points by continent. We use Seaborn’s scatterplot to make the plot, with the palette argument specifying the colors.

# specify colors for each continent
color_dict = dict({'Africa':'Black',
                   'Europe': 'Red'})
# Scatter plot: SV1 and SV2
sns.scatterplot(x="SV1", y="SV2", hue="Continent", 
                palette=color_dict, 
                data=svd_df, s=100,
                alpha=0.7)
# format to one decimal place to avoid floating-point noise in the labels
plt.xlabel('SV 1: {0:.1f}%'.format(var_explained[0]*100), fontsize=16)
plt.ylabel('SV 2: {0:.1f}%'.format(var_explained[1]*100), fontsize=16)

We can clearly see the difference in life expectancy between Africa and Europe: countries from the same continent cluster together, with a few exceptions.

SVD Plot

In summary, we saw a step-by-step example of using NumPy’s linalg.svd to do Singular Value Decomposition on the gapminder dataset. We learned how to find the singular vectors, or principal components, relevant to our data. In a future post we will see more examples of using SVD in Python.