Singular Value Decomposition (SVD) is one of the most widely used matrix decomposition methods for dimensionality reduction. For example, Principal Component Analysis (PCA) is often computed with SVD under the hood.
In this post, we will work through an example of doing SVD in Python. We will use the gapminder data in wide form and compute the decomposition with NumPy’s linalg.svd.
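As a quick aside on the link between PCA and SVD, here is a minimal sketch on synthetic random data (an assumption for illustration, not the gapminder values). It checks that the squared singular values of a mean-centered matrix, divided by n - 1, match the eigenvalues of its sample covariance matrix, which are the variances along the principal components.

```python
import numpy as np

# Synthetic data (illustrative assumption): 50 samples, 4 features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Xc = X - X.mean(axis=0)  # mean-center each column

# Route 1: eigenvalues of the sample covariance matrix (n - 1 denominator)
cov = np.cov(Xc, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]  # eigvalsh is ascending; flip to descending

# Route 2: singular values of the centered data (already descending)
s = np.linalg.svd(Xc, compute_uv=False)
var_from_svd = s**2 / (Xc.shape[0] - 1)

# Both routes should agree up to floating-point error
print(np.allclose(eigvals, var_from_svd))
```

The SVD route never forms the covariance matrix explicitly, which is one reason it is commonly preferred in practice.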
Let us load the packages needed.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
We will apply SVD to the gapminder data, which we download from the Carpentries website.
data_url = "https://goo.gl/ioc2Td"
gapminder = pd.read_csv(data_url)
print(gapminder.head(3))

  continent  country  gdpPercap_1952  gdpPercap_1957  gdpPercap_1962  \
0    Africa  Algeria     2449.008185     3013.976023     2550.816880
1    Africa   Angola     3520.610273     3827.940465     4269.276742
2    Africa    Benin     1062.752200      959.601080      949.499064
Let us filter the gapminder dataframe so that it contains only life expectancy values. We use Pandas’ pattern matching str.contains to select the columns for life expectancy.
lifeExp = gapminder.loc[:, gapminder.columns.str.contains('^life|^c')]
lifeExp.head()

  continent  country  lifeExp_1952  lifeExp_1957  lifeExp_1962  lifeExp_1967  \
0    Africa  Algeria        43.077        45.685        48.303        51.407
1    Africa   Angola        30.015        31.999        34.000        35.985
2    Africa    Benin        38.223        40.358        42.618        44.885
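To make the pattern matching concrete, here is a small sketch on a one-row toy data frame. The column names mimic the wide gapminder layout, but the frame itself and its values are made up for illustration. The regular expression '^life|^c' keeps any column whose name starts with "life" or with "c", which is why continent and country survive along with the life expectancy columns.

```python
import pandas as pd

# Toy frame with gapminder-like column names (values are illustrative only)
toy = pd.DataFrame({
    "continent": ["Africa"],
    "country": ["Algeria"],
    "gdpPercap_1952": [1.0],
    "lifeExp_1952": [2.0],
    "pop_1952": [3.0],
})

# Boolean mask over column names: True where the name starts with "life" or "c"
mask = toy.columns.str.contains(r"^life|^c")
selected = toy.loc[:, mask]

print(list(selected.columns))
# → ['continent', 'country', 'lifeExp_1952']
```

Note that '^c' would also match any other column starting with "c", so this shortcut relies on continent and country being the only such columns.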
For now, let us also filter the data to keep only countries from Africa and Europe.
lifeExp_AE = lifeExp[lifeExp.continent.isin(["Africa","Europe"])]
lifeExp_AE.shape

(82, 14)
Let us store the country and continent information separately as metadata.
lifeExp_meta = lifeExp_AE.loc[:, lifeExp_AE.columns.str.contains('^c')]
lifeExp_meta.head()

  continent       country
0    Africa       Algeria
1    Africa        Angola
2    Africa         Benin
3    Africa      Botswana
4    Africa  Burkina Faso
And let us drop the country and continent columns. Now the dataframe contains life expectancy values for each year.
lifeExp_AE = lifeExp_AE.drop(columns=['continent', 'country'])
lifeExp_AE.head(n=3)

   lifeExp_1952  lifeExp_1957  lifeExp_1962  lifeExp_1967  lifeExp_1972  \
0        43.077        45.685        48.303        51.407        54.518
1        30.015        31.999        34.000        35.985        37.928
2        38.223        40.358        42.618        44.885        47.014
Before we actually perform SVD on our data set, let us mean-center and scale the data so that every column has mean zero and unit standard deviation.
lifeExp_AE_scaled = (lifeExp_AE-lifeExp_AE.mean())/lifeExp_AE.std()
print(lifeExp_AE_scaled.head(n=3))

   lifeExp_1952  lifeExp_1957  lifeExp_1962  lifeExp_1967  lifeExp_1972  \
0     -0.394065     -0.362388     -0.318266     -0.220089     -0.116843
1     -1.364384     -1.377154     -1.391081     -1.408750     -1.438966
2     -0.754648     -0.757365     -0.744677     -0.722776     -0.714866
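A quick way to sanity-check this kind of scaling is to confirm that each column ends up with mean zero and standard deviation one. The sketch below does so on a small random data frame (an assumption for illustration; the column names a, b, c are hypothetical). One detail worth knowing: pandas' std() uses the sample standard deviation (ddof=1) by default, whereas NumPy's np.std defaults to ddof=0, so the two give slightly different scalings.

```python
import numpy as np
import pandas as pd

# Random toy data (illustrative assumption): 20 rows, 3 columns
rng = np.random.default_rng(1)
toy = pd.DataFrame(rng.normal(loc=50, scale=10, size=(20, 3)),
                   columns=["a", "b", "c"])

# Same recipe as in the post: subtract column means, divide by column std
scaled = (toy - toy.mean()) / toy.std()

# Column means should be ~0 and pandas (ddof=1) standard deviations should be 1
print(np.allclose(scaled.mean(), 0))
print(np.allclose(scaled.std(), 1))

# NumPy's default (ddof=0) standard deviation of the same columns is slightly below 1
numpy_std = np.std(scaled.to_numpy(), axis=0)
print(numpy_std)
```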
Now we are all set to perform SVD on the gapminder life expectancy data from two continents. We use NumPy’s linalg module’s svd function to do SVD. In addition to the scaled data, we also specify “full_matrices=True” to get all singular vectors.
# SVD with Numpy's linalg.svd()
u, s, v = np.linalg.svd(lifeExp_AE_scaled, full_matrices=True)
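The output of SVD is three arrays: u, s, and v. The columns of u are the left singular vectors, s is a one-dimensional array of singular values in descending order, and v holds the right singular vectors as its rows (NumPy returns the transposed matrix, usually written V^T). We can examine the dimensions of each with the shape attribute.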
print(u.shape)
(82, 82)
print(s.shape)
(12,)
print(v.shape)
(12, 12)
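The shapes above follow from full_matrices=True. As a minimal sketch on a random matrix with the same 82 x 12 shape (random values, an assumption for illustration), the example below compares the full and reduced forms and confirms that the three pieces multiply back to the original matrix.

```python
import numpy as np

# Random matrix with the same shape as the scaled data (values are illustrative)
rng = np.random.default_rng(2)
A = rng.normal(size=(82, 12))

# Full SVD: u is square (82 x 82); reduced SVD: u keeps only 12 columns
u_full, s, vt = np.linalg.svd(A, full_matrices=True)
u_thin, s_thin, vt_thin = np.linalg.svd(A, full_matrices=False)
print(u_full.shape, u_thin.shape)

# Reconstruct A from the reduced form: scale columns of u by s, then apply vt
A_rebuilt = (u_thin * s_thin) @ vt_thin
print(np.allclose(A, A_rebuilt))

# With the full u, only the first 12 columns pair with a singular value
A_rebuilt_full = (u_full[:, :12] * s) @ vt
print(np.allclose(A, A_rebuilt_full))
```

Since the extra 70 columns of the full u do not contribute to the reconstruction, full_matrices=False is usually enough for dimensionality reduction and uses less memory.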
Singular values help us compute the variance explained by each singular vector: the squared singular values are proportional to the variance captured along each vector. We can visualize the percent variance explained by each singular vector, or PC, to understand the structure in the data.
var_explained = np.round(s**2/np.sum(s**2), decimals=3)
var_explained

sns.barplot(x=list(range(1,len(var_explained)+1)), y=var_explained, color="limegreen")
plt.xlabel('SVs', fontsize=16)
plt.ylabel('Percent Variance Explained', fontsize=16)
plt.savefig('svd_scree_plot.png',dpi=100)
Here is the Scree plot giving us the percentage of variance explained by each singular vector. We can see that the first vector explains most of the variation in the data.
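A scree plot is often read together with the cumulative variance explained. The sketch below, on synthetic data built from one shared trend plus noise (an assumption chosen so that the first vector dominates), computes the cumulative proportion and the smallest number of singular vectors needed to reach a 95% threshold. The threshold and the name k are illustrative choices, not part of the original analysis.

```python
import numpy as np

# Synthetic data (illustrative assumption): one shared trend plus noise
rng = np.random.default_rng(3)
trend = rng.normal(size=(82, 1))
A = trend @ np.ones((1, 12)) + 0.3 * rng.normal(size=(82, 12))

# Mean-center and scale each column, as in the post (sample std, ddof=1)
A = (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)

# Proportion of variance explained by each singular vector
s = np.linalg.svd(A, compute_uv=False)
var_explained = s**2 / np.sum(s**2)

# Running total, and the smallest k whose cumulative proportion reaches 0.95
cumulative = np.cumsum(var_explained)
k = int(np.searchsorted(cumulative, 0.95) + 1)

print(np.round(var_explained[:3], 3))
print(k)
```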
Let us create a data frame containing the first two singular vectors (PCs) and the metadata for the data.
labels= ['SV'+str(i) for i in range(1,3)]
svd_df = pd.DataFrame(u[:,0:2], index=lifeExp_meta["continent"].tolist(), columns=labels)
svd_df=svd_df.reset_index()
svd_df.rename(columns={'index':'Continent'}, inplace=True)
svd_df.head()

  Continent       SV1       SV2
0    Africa  0.014940 -0.212346
1    Africa -0.172656  0.046238
2    Africa -0.075906 -0.045773
3    Africa -0.021360  0.189510
4    Africa -0.111868 -0.052854
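One point worth keeping in mind: the columns of u have unit length, so they show the same pattern as PCA scores but on a different scale. The scores themselves are the columns of u multiplied by the corresponding singular values, which equals the data projected onto the right singular vectors. A minimal sketch on a random centered matrix (an assumption for illustration) checks this identity.

```python
import numpy as np

# Random centered matrix (illustrative assumption), same shape as the scaled data
rng = np.random.default_rng(4)
A = rng.normal(size=(82, 12))
A = A - A.mean(axis=0)

u, s, vt = np.linalg.svd(A, full_matrices=False)

# Scores for the first two components, computed two ways
scores_from_u = u[:, :2] * s[:2]         # scale left singular vectors
scores_from_projection = A @ vt[:2].T    # project data onto right singular vectors

print(np.allclose(scores_from_u, scores_from_projection))

# Each column of u has unit length
print(np.allclose(np.linalg.norm(u[:, :2], axis=0), 1.0))
```

Plotting u[:, 0:2] directly, as in the post, shows the same grouping of points; only the axis scales differ from a score plot.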
We can use the data frame to make the PCA plot with the first vector on the x-axis and the second one on the y-axis. We will also color the data points by the continent variable. We use Seaborn’s scatterplot to make the plot, with the palette argument specifying the colors.
# specify colors for each continent
color_dict = dict({'Africa':'Black', 'Europe': 'Red'})
# Scatter plot: SV1 and SV2
sns.scatterplot(x="SV1", y="SV2", hue="Continent", palette=color_dict, data=svd_df, s=100, alpha=0.7)
plt.xlabel('SV 1: {0}%'.format(var_explained[0]*100), fontsize=16)
plt.ylabel('SV 2: {0}%'.format(var_explained[1]*100), fontsize=16)
We can clearly see the difference in life expectancy between Africa and Europe: with a few exceptions, countries from the same continent cluster together.
In summary, we saw a step-by-step example of using NumPy’s linalg.svd to do Singular Value Decomposition on the gapminder dataset. We learned how to find the singular vectors, or principal components, relevant to our data. In a future post we will see more examples of using SVD in Python.