Pandas pipe function: performing PCA

The Pandas pipe() function lets us chain together functions that take either a dataframe or a series as input. In this introductory tutorial, we will learn how to use the pipe() method to simplify data analysis code. We start with a dataframe and run a series of analysis steps, each of which takes the output of the previous step as its input. An additional benefit of using pipe() is that it pushes us to modularize each step by writing it as a function that takes a dataframe as input.
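
As a minimal sketch of the idea, df.pipe(f) is just another way of writing f(df), so nested calls can be read top to bottom instead of inside out. The toy dataframe and the helper names add_total and keep_large are made up for illustration and are not part of the PCA pipeline in this tutorial.

```python
import pandas as pd

# Toy data, made up for illustration.
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def add_total(df):
    # Return a copy with a new column holding the row-wise sum.
    return df.assign(total=df["a"] + df["b"])

def keep_large(df, threshold):
    # Keep only the rows whose total exceeds the threshold.
    return df[df["total"] > threshold]

# Nested calls read inside out ...
nested = keep_large(add_total(df), 15)

# ... while pipe() reads in the order the steps run.
piped = df.pipe(add_total).pipe(keep_large, 15)
```

Both versions give the same dataframe; the piped form simply makes the order of the steps explicit.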

Let us get started by loading the Python packages needed to illustrate the benefit of using Pandas pipe method.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn import decomposition

As an example to illustrate the Pandas pipe method, we will perform Principal Component Analysis (PCA) in Python and make a PCA plot. When we do PCA, we typically start with a dataframe containing both numerical and categorical variables. In such a scenario, the steps of doing PCA are:

1. Select the numerical columns from the input dataframe
2. Remove any row with missing values
3. Center and scale the data before doing PCA
4. Perform PCA with scikit-learn's decomposition module
5. Combine the original data and the PCs
6. Make a scatter plot of PC1 against PC2 to get the PCA plot

We will use the Palmer Penguins dataset available from Seaborn.

penguins = sns.load_dataset("penguins")
penguins.head()

To make our code easy to read and to use Pandas pipe, let us write each step as a function that takes a dataframe as input. Note that most of these steps are really simple, and we are writing them as functions only to illustrate the use of the pipe method.

Step 1: Function to select numerical columns using select_dtypes()

The first step is to select numerical columns alone from a data frame containing different data types. With Pandas’ select_dtypes() function, we can select numerical columns in a dataframe.

def select_numeric_cols(df):
    # 'number' keeps every numeric dtype (floats and integers), not only floats
    return(df.select_dtypes(include='number'))
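
A small sketch of how select_dtypes() filters columns by type. The toy dataframe and its n_sightings column are made up for illustration. Selecting 'float' keeps only floating-point columns, while 'number' also keeps integer columns; for the penguins data the two should agree, assuming all of its numeric columns are stored as floats.

```python
import pandas as pd

# Toy data, made up for illustration; n_sightings is a hypothetical integer column.
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],
    "bill_length_mm": [39.1, 46.1],
    "n_sightings": [3, 5],
})

only_float = toy.select_dtypes("float")            # floating-point columns only
all_numeric = toy.select_dtypes(include="number")  # floats and integers

print(list(only_float.columns))
print(list(all_numeric.columns))
```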

Step 2: Remove any rows with missing data with dropna()

PCA does not work if we have any missing values in our data. Here we simply remove the rows containing any missing values using Pandas dropna() function.

def remove_rows_with_NA(df):
    return(df.dropna())

Step 3: Normalize the data by centering and scaling

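Normalization is a key step in doing PCA, because variables measured on larger scales would otherwise dominate the components. Here we standardize the data by centering each variable on its mean and scaling it to unit standard deviation.

def center_and_scale(df):
    # subtract each column's mean, then divide by its standard deviation
    df_centered = df.subtract(df.mean())
    df_scaled = df_centered / df.std()
    return(df_scaled)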
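
As a quick sanity check of what centering and scaling to unit variance produces, the sketch below applies a z-score helper to a toy dataframe and confirms that each column ends up with mean 0 and standard deviation 1. The name standardize and the toy values are assumptions made for illustration.

```python
import pandas as pd

# Toy data on two very different scales, made up for illustration.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0],
                    "y": [10.0, 30.0, 50.0, 70.0]})

def standardize(df):
    # Center each column on its mean and divide by its sample standard deviation.
    return (df - df.mean()) / df.std()

z = standardize(toy)

# After standardizing, every column has mean ~0 and standard deviation ~1.
print(z.mean().round(10))
print(z.std().round(10))
```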

Step 4: perform PCA

With all the necessary preprocessing done, we are now ready to perform PCA. We use scikit-learn's decomposition module to do PCA and obtain the top 2 principal components.

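def do_PCA(data):
    pca = decomposition.PCA(n_components=2)
    pc = pca.fit_transform(data)
    # keep the input's row labels so the PCs can be matched back to the original rows
    pc_df = pd.DataFrame(data = pc,
        columns = ['PC1', 'PC2'],
        index = data.index)
    return(pc_df)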
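
To make concrete what the principal component scores are, here is a rough NumPy-only sketch of PCA via the singular value decomposition. This is a swapped-in illustration on random data, not the scikit-learn call used in this tutorial, and the signs of the components may differ from scikit-learn's output.

```python
import numpy as np

# Random toy data with one strongly correlated pair of columns (illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
X[:, 1] += 2 * X[:, 0]

# Center the columns, then take the SVD of the centered matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Project onto the first two right singular vectors to get PC1 and PC2 scores.
scores = Xc @ Vt[:2].T

# Share of the total variance carried by each component, in decreasing order.
explained = S**2 / np.sum(S**2)
print(scores.shape)
print(explained)
```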

Step 5: Combine PCs with original data

Combining the PCs with the original data, we can further understand the relationship between PCs and the variables that are part of the original data.

def pcs_with_data(pcs, data):
    pc_aug = pd.concat([pcs, data], axis=1)
    return(pc_aug)
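
One thing to watch for: pd.concat() with axis=1 matches rows by index label, not by position. The toy frames below (made up for illustration) sketch what happens when one frame has a fresh 0..n-1 index while the other kept its original labels after a row was dropped.

```python
import pandas as pd

# Original data after a row was dropped: labels 0, 2, 3 remain.
data = pd.DataFrame({"species": ["A", "B", "C"]}, index=[0, 2, 3])

# PCs built with a fresh default index (0, 1, 2) ...
pcs_reset = pd.DataFrame({"PC1": [0.1, 0.2, 0.3]})
# ... versus PCs that reuse the labels of the rows they were computed from.
pcs_kept = pd.DataFrame({"PC1": [0.1, 0.2, 0.3]}, index=data.index)

misaligned = pd.concat([pcs_reset, data], axis=1)  # union of labels, with NaN gaps
aligned = pd.concat([pcs_kept, data], axis=1)      # rows line up one to one

print(misaligned)
print(aligned)
```

Keeping the index of the rows that went into PCA is what lets the PCs be matched back to the right species.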

Step 6: Make PCA plot

Finally, we make the PCA plot: a scatter plot with PC1 on the x-axis and PC2 on the y-axis, with points colored by one of the variables in the original data. In this example, we make the scatter plot using Seaborn’s scatterplot() function and color the points by the “species” variable.

def pca_plot(pc_data):
    p1 = sns.scatterplot(x="PC1", y="PC2", hue="species", data=pc_data)
    return(p1)

Now, using the Pandas pipe() function, we can chain the functions we just wrote to perform PCA and make the PCA plot. The code using pipe() looks like this: we pass the function for each step to pipe(), and each pipe() call receives the output of the previous step as its input.

(penguins.
 pipe(select_numeric_cols).
 pipe(remove_rows_with_NA).
 pipe(center_and_scale).
 pipe(do_PCA).
 pipe(pcs_with_data, penguins.dropna()).
 pipe(pca_plot))

And voila, at the end we get the nice PCA plot that we aimed for.
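
A closing note on how pipe() forwards arguments, as used in the pcs_with_data step: anything after the function is passed along after the dataframe. When the dataframe is not the function's first parameter, pipe() also accepts a (callable, keyword) tuple naming the parameter that should receive it. The helpers scale_col and label_rows below are hypothetical and exist only for this sketch.

```python
import pandas as pd

def scale_col(df, col, factor=1.0):
    # Hypothetical helper: multiply one column by a factor.
    return df.assign(**{col: df[col] * factor})

def label_rows(prefix, df):
    # Hypothetical helper whose dataframe is NOT the first argument.
    return df.assign(label=[f"{prefix}{i}" for i in range(len(df))])

df = pd.DataFrame({"x": [1.0, 2.0]})

out = (
    df
    .pipe(scale_col, "x", factor=10)   # extra arguments are forwarded after df
    .pipe((label_rows, "df"), "row")   # the tuple routes the dataframe to df=
)
print(out)
```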