Who does not love PCA with Penguins in Python. Sorry, could not resist saying this :). If you are tired of seeing Iris data for introducing all things Machine Learning, Data Science algorithms and Data Visualization examples, you are in for much needed treat in the form of Penguins.
Thanks to Alison Horst, who has made the wonderful dataset readily available for introducing exploratory data analysis, machine learning algorithms and data visualization.
The Penguin Data
were collected and made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a member of the Long Term Ecological Research Network.
Thank you to Dr. Gorman, Palmer Station LTER and the LTER Network! Special thanks to Marty Downs (Director, LTER Network Office) for help regarding the data license & use.
Let us get started with loading the packages we need. We first load our regular libraries Pandas, numpy, Seaborn, and matplotlib.
import matplotlib.pyplot as plt import pandas as pd import numpy as np import seaborn as sns
We will be using scikit-learn to do Principal Component Analysis with Penguins data. Let us load PCA module from scikit-learn. We will be using scikit-learn’s ability chain together multiple steps of analysis using “pipeline”.
from sklearn.decomposition import PCA from sklearn.pipeline import make_pipeline from sklearn.preprocessing import StandardScaler
Penguins Data: A great dataset to learn Data Visualization, Data Science and Machine Learning
Let us load the raw penguins data from Allison Horst’s github page.
# path to Penguins data p2data = "https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/data-raw/penguins_raw.csv"
We can see that the data contains a lot of information about Penguin species.
penguins_raw = pd.read_csv(p2data) penguins_raw.head() studyName Sample Number Species Region Island Stage Individual ID 0 PAL0708 1 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A1 1 PAL0708 2 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N1A2 2 PAL0708 3 Adelie Penguin (Pygoscelis adeliae) Anvers Torgersen Adult, 1 Egg Stage N2A1
We will be mainly using selected columns from the data for our PCA. Let us subset the data
columns_of_interest = ['Species', "Culmen Length (mm)", "Culmen Length (mm)", "Flipper Length (mm)", "Body Mass (g)", "Sex"] penguins_df = penguins_raw.loc[:,columns_of_interest]
Note that the names of Penguin species is pretty long. Let us just use simple name describing the Penguin species. And we will also remove any row with missing data.
# shorten penguin specie name penguins_df[['Species']]=penguins_df.Species.str.split(" ",expand=True,).loc[:,0] # replace "." to missing value penguins_df=penguins_df.replace(".", np.nan) # drop all rows containing missing value penguins_df=penguins_df.dropna()
Now, we have the data we need for doing PCA with sckit-learn.
penguins_df.head() Species Culmen Length (mm) Culmen Length (mm) Flipper Length (mm) Body Mass (g) Sex 0 Adelie 39.1 39.1 181.0 3750.0 MALE 1 Adelie 39.5 39.5 186.0 3800.0 FEMALE 2 Adelie 40.3 40.3 195.0 3250.0 FEMALE 4 Adelie 36.7 36.7 193.0 3450.0 FEMALE 5 Adelie 39.3 39.3 190.0 3650.0 MALE
Let us subset the data to contain only numeric data for PCA.
penguins_data=penguins_df.select_dtypes(np.number) penguins_data.head()
And also keep the Penguin species and sex information separate.
penguins_info=penguins_df.select_dtypes(exclude='float') penguins_info.head() Species Sex 0 Adelie MALE 1 Adelie FEMALE 2 Adelie FEMALE 4 Adelie FEMALE 5 Adelie MALE
penguins_info.Species.unique() array(['Adelie', 'Gentoo', 'Chinstrap'], dtype=object)
sex=penguins_info.Sex.tolist() species=penguins_info.Species.tolist()
PCA with Raw Data
We will first perform PCA with raw data and then do PCA with scaled data to illustrate the importance of scaling the data before doing PCA.
Let us use scikit-learn’s PCA fucntion to do the analysis. We first create a PCA model with 4 components. And then apply fit_transform() function providing the penguins data and perform PCA on the data.
pca = PCA(n_components=4) penguins_pca= pca.fit_transform(penguins_data)
We have the principal components ready after calling fit_transform() on the PCA model with the data. Let us create a dataframe with principal component
pc_df = pd.DataFrame(data = penguins_pca , columns = ['PC1', 'PC2','PC3', 'PC4']) pc_df.head() PC1 PC2 PC3 PC4 0 -457.339529 12.941050 4.560271 2.259745e-14 1 -407.266928 9.418435 2.184189 -4.475489e-16 2 -957.051463 -6.895631 -5.102509 1.679067e-16 3 -757.136970 0.900180 -6.930255 4.321797e-16 4 -557.188031 4.110899 -1.217727 1.447654e-16
And also add the sample level information to the data frame with PCs.
pc_df['Sex']=sex pc_df['Species']=species pc_df.head() PC1 PC2 PC3 PC4 Sex Species 0 -457.339529 12.941050 4.560271 2.259745e-14 MALE Adelie 1 -407.266928 9.418435 2.184189 -4.475489e-16 FEMALE Adelie 2 -957.051463 -6.895631 -5.102509 1.679067e-16 FEMALE Adelie 3 -757.136970 0.900180 -6.930255 4.321797e-16 FEMALE Adelie 4 -557.188031 4.110899 -1.217727 1.447654e-16 MALE Adelie
Let us first check the variance explained by each Principal Component. We can get the variance explained by each PC from explained_variance_ratio_ method on PCA model. A quick look at the variance show that, the first PC explains all of the variation.
pca.explained_variance_ratio_ array([9.99867796e-01, 8.99895963e-05, 4.22139074e-05, 2.47920196e-36])
Typically, just one PC explaining all the variation is a red flag. You might see such cases, when the features in the data are of very different ranges. Because of that just one variable with huge range could bias the PCA analysis. Clearly, this the case in our example as we have not scaled our data. Note, you might also see such behaviour when all features are very highly correlated.
Let us just go ahead and making PCA scatter plot with PC1 on x-axis and PC2 on y-axis. We can see that PC1 can separate the species in general.
import seaborn as sns plt.figure(figsize=(12,10)) with sns.plotting_context("notebook",font_scale=1.25): sns.scatterplot(x="PC1", y="PC2", data=pc_df, hue="Species", style="Sex", s=100)
PCA with Scaled Data
Now that we have seen an example of PCA on raw data without scaling, Let us do PCA on data set that is scaled. In Scikit-learn we can use StandardScalar() function to scale the data into data with mean zero and variance one. We will do PCA on the scaled data.
We can use Scikit-learn’s make_pipeline() to create a pipeline with these two steps.
random_state = 0 pca_scaled = make_pipeline(StandardScaler(), PCA(n_components=4, random_state=random_state))
Let us use fit_transform() on the pipeline for PCA with scaled data.
penguins_pc_scaled=pca_scaled.fit_transform(penguins_data)
We can take a quick peek at the scaled data that is used for PCA
pca_scaled.named_steps['standardscaler'].fit_transform(penguins_data) array([[-0.89604189, -0.89604189, -1.42675157, -0.56847478], [-0.82278787, -0.82278787, -1.06947358, -0.50628618], [-0.67627982, -0.67627982, -0.42637319, -1.1903608 ], ..., [ 1.02687621, 1.02687621, -0.56928439, -0.53738048], [ 1.24663828, 1.24663828, 0.64546078, -0.13315457], [ 1.13675725, 1.13675725, -0.2120064 , -0.53738048]])
Also we can also check what is in the PCA step of the pipeline.
pca_scaled.named_steps['pca'] PCA(copy=True, iterated_power='auto', n_components=4, random_state=0, svd_solver='auto', tol=0.0, whiten=False)
Let us get the proportion of variation explained by each principal component.
pca_scaled.named_steps['pca'].explained_variance_ratio_*100 array([7.95338843e+01, 1.73923807e+01, 3.07373502e+00, 2.79398725e-35])
We can see that the first PC explains 80 percent of the variation in the data and the second PC explains about 18% of the variation.
Let us create data frame with PCs from scaled data and also add the Penguin infomation for each sample.
pc_scaled_df = pd.DataFrame(data = penguins_pc_scaled , columns = ['PC1', 'PC2','PC3', 'PC4']) pc_scaled_df['Species'] = species pc_scaled_df['Sex'] = sex pc_scaled_df.head() PC1 PC2 PC3 PC4 Species Sex 0 -1.899358 0.105560 0.588102 8.403066e-18 Adelie MALE 1 -1.616865 -0.022060 0.373257 4.885022e-19 Adelie FEMALE 2 -1.472415 0.213019 -0.547795 1.644519e-18 Adelie FEMALE 3 -2.101064 -0.501786 -0.334550 -4.239310e-19 Adelie FEMALE 4 -1.601048 -0.082743 0.033770 8.429747e-19 Adelie MALE
Now we are ready to make visualization using PCA result. Let us first make scatter plot between PC1 and PC2, the two PCs that captures most the variations in the data. We also color the data points by species and change shape of data points by Sex.
plt.figure(figsize=(12,10)) with sns.plotting_context("talk",font_scale=1.25): sns.scatterplot(x="PC1", y="PC2", data=pc_scaled_df, hue="Species", style="Sex", s=100) plt.xlabel("PC1: "+f'{var_explained[0]:.0f}'+"%") plt.ylabel("PC2: "+f'{var_explained[1]:.0f}'+"%") plt.title("PCA with Scaled Penguins Data") plt.savefig("PCA_plot_PC1_vs_PC2_Penguins_scaled_data.png", format='png',dpi=150)
We can see that the PCA plot with scaled data nicely captures the structure in the data. The data points corresponding to each Penguin species is clearly clustered and well separated compared to the PCA plot with unscaled data. We can also see the effect of Sex more clearly now, as females weigh less compared to males on an average..
With little of squinting the scatter plot between PC1 and PC2 we can see how PC1 and PC2 independently has captured the variation between the species and sex. We can do a better job of understanding this by visualizing the PCs together with variables of interest.
At first, let us make a boxplot with species on x-axis and PC1 on y-axis. This would tell how much of the variation captured by PC1 is due to Species level differences in our data.
sns.plotting_context("talk",font_scale=1.25) plt.figure(figsize=(8,6)) sns.boxplot(x="Species",y="PC1", width=0.5, data=pc_scaled_df) plt.xlabel("Species", size=14) plt.ylabel("PC1: "+f'{var_explained[0]:.0f}'+"%", size=14) plt.title("PC1 vs Species: Penguins Data", size=16) plt.savefig("PCA_plot_PC1_vs_Species_Penguins_scaled_data.png", format='png',dpi=150)
We can clearly see how PC1 has captured the variation at Species level. FIrst principal component is telling how Adlie Penguins are different from the other two species.
Let us make boxplot between PC1 and Sex. Remember, PC2 captures about 18% pf the variation in the data.
plt.figure(figsize=(8,6)) sns.boxplot(x="Species",y="PC2", width=0.5, data=pc_scaled_df) plt.xlabel("Species", size=14) plt.ylabel("PC2: "+f'{var_explained[1]:.0f}'+"%", size=14) plt.title("PC2 vs Species: Penguins Data", size=16) plt.savefig("PCA_plot_PC2_vs_Species_Penguins_scaled_data.png", format='png',dpi=150)
We can see that PC2 captures the difference between how Chinstrap species is different from the other two species.
As principal components are linear combination of the original variables, PCs can also be correlated with other variables in the data. For example, if we make a boxplot between PC1 and Sex, we can see that Sex is correlated with PC1, showing that PC1 also captures the variation due to Sex.
We can get this information in our PCA plot as well, by squinting 🙂
If we make a boxplot bweet Sex and PC2, we can see that they is no association suggesting that PC2 does not explain Sex.
In summary, in this post we used a fantastic new data set on Penguin species. And showed how to do PCA with Python’s Scikit-learn tootlkit. More importantly, we learned how scaling the data matters by performing PCA without scaling the data and after scaling the data. And we also learned of ways to interpret the PCs obtained from the PCA analysis.
[…] a wide dataframe, but this time with real data and keep two columns as identifiers. We will use Penguins dataset to illustrate Pandas melt() usage in our example and load the data from github […]