Python and R Tips

Learn Data Science with Python and R

Pandas pipe function in Pandas: performing PCA

June 15, 2022 by cmdlinetips

The Pandas pipe function helps us chain together functions that take either a dataframe or a series as input. In this introductory tutorial, we will learn how to use the Pandas pipe method to simplify code for data analysis. We start with a dataframe as input and run a series of analysis steps such that each step takes the output of the previous step. An additional benefit of using pipe is that we modularize each step by writing it as a function that takes a dataframe as input.
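
As a minimal sketch of the idea, calling df.pipe(f) is the same as calling f(df), and any extra arguments given to pipe() are passed on to the function. The dataframe and the two helpers below (add_total and keep_large) are made up for illustration and are not part of Pandas.

```python
import pandas as pd

# Toy data, made up for illustration
df = pd.DataFrame({"a": [1, 2, 3], "b": [10, 20, 30]})

def add_total(df):
    # Hypothetical helper: add a column holding the row-wise sum of a and b
    return df.assign(total=df["a"] + df["b"])

def keep_large(df, threshold):
    # Hypothetical helper: keep rows whose total exceeds the threshold
    return df[df["total"] > threshold]

# Nested calls read inside-out ...
nested = keep_large(add_total(df), 15)

# ... while pipe() lets the same steps read top to bottom
piped = df.pipe(add_total).pipe(keep_large, 15)
```

Both versions should give the same dataframe; the piped form just lists the steps in the order they run.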

Let us get started by loading the Python packages needed to illustrate the benefit of using Pandas pipe method.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
from sklearn import decomposition

As an example to illustrate the Pandas pipe method, we will perform Principal Component Analysis (PCA) in Python and make a PCA plot. When we do PCA, we typically start with a data frame containing both numerical and categorical variables. In such a scenario, the steps of doing PCA are:

  1. Select the numerical columns from the input dataframe
  2. Remove any row with missing values
  3. Center and scale the data before doing PCA
  4. Perform PCA with scikit-learn’s decomposition module
  5. Combine the original data and the PCs
  6. Make a scatter plot of PC1 against PC2 to get the PCA plot

We will use the Palmer Penguins dataset available from Seaborn.

penguins = sns.load_dataset("penguins")
penguins.head()

To make our code easy to read and to use Pandas pipe, let us write each step as a function that takes a data frame as input. Note that most of these steps are really simple ones, and we are writing them as functions only to illustrate the use of the pipe method.

Step 1: Function to select numerical columns using select_dtypes()

The first step is to select numerical columns alone from a data frame containing different data types. With Pandas’ select_dtypes() function, we can select numerical columns in a dataframe.

def select_numeric_cols(df):
    # 'number' selects both float and integer columns
    return df.select_dtypes(include='number')
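
To see how the dtype selector behaves, here is a small sketch on a made-up dataframe (the values and the integer year column are assumptions for illustration, not the Seaborn penguins data). Selecting "float" keeps only floating-point columns, while "number" also keeps integer columns.

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],   # object (string) column
    "bill_length_mm": [39.1, 46.1],    # float column
    "year": [2007, 2008],              # integer column
})

# Only floating-point columns
floats_only = toy.select_dtypes("float")

# All numeric columns, integers included
all_numeric = toy.select_dtypes(include="number")

print(list(floats_only.columns))
print(list(all_numeric.columns))
```

In the penguins data every numerical column is stored as a float, so the two selectors should pick the same columns there.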

Step 2: Remove any rows with missing data with dropna()

PCA does not work if we have any missing values in our data. Here we simply remove the rows containing any missing values using Pandas dropna() function.

def remove_rows_with_NA(df):
    return(df.dropna())
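
A quick sketch on made-up data shows what dropna() does to the rows. Note that the surviving rows keep their original index labels, so the index can have gaps afterwards.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in the second row
toy = pd.DataFrame({"x": [1.0, np.nan, 3.0], "y": [4.0, 5.0, 6.0]})

# Drop every row that contains at least one missing value
cleaned = toy.dropna()

print(cleaned.shape)
print(list(cleaned.index))  # labels of the rows that survived
```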

Step 3: Normalize the data by centering and scaling

Normalization is a key step in doing PCA. Here we normalize the data by mean centering and scaling the variables.

def center_and_scale(df):
    # Center each column at zero, then scale it to unit standard deviation
    df_centered = df.subtract(df.mean())
    df_scaled = df_centered / df.std()
    return df_scaled
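
One common way to center and scale is z-score standardization: subtract each column's mean and divide by its standard deviation. The sketch below, on made-up data, checks that every column ends up with mean 0 and standard deviation 1. Note that Pandas' std() uses the sample standard deviation (ddof=1) by default.

```python
import pandas as pd

# Toy data with columns on very different scales
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0],
    "y": [10.0, 30.0, 50.0, 70.0],
})

# Z-score standardization, column by column
standardized = (toy - toy.mean()) / toy.std()

col_means = standardized.mean()  # should be (numerically) zero
col_stds = standardized.std()    # should be one

print(col_means)
print(col_stds)
```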

Step 4: perform PCA

With all the necessary preprocessing done, we are now ready to perform PCA. We use scikit-learn’s decomposition module to do PCA and obtain the top 2 principal components.

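def do_PCA(data):
    pca = decomposition.PCA(n_components=2)
    pc = pca.fit_transform(data)
    # Reuse the input's index so that the PCs stay aligned with the
    # original rows when the two are combined later
    pc_df = pd.DataFrame(data=pc,
        columns=['PC1', 'PC2'],
        index=data.index)
    return pc_df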
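
It can also help to check how much of the variance the two components capture. As a sketch on randomly generated data (not the penguins data), the fitted PCA object exposes this through its explained_variance_ratio_ attribute. The do_PCA function above returns only the scores, so the PCA object is kept separately here.

```python
import numpy as np
import pandas as pd
from sklearn import decomposition

# Random toy data: y is nearly a multiple of x, z is independent noise
rng = np.random.default_rng(0)
x = rng.normal(size=100)
toy = pd.DataFrame({
    "x": x,
    "y": 2 * x + rng.normal(scale=0.1, size=100),
    "z": rng.normal(size=100),
})

pca = decomposition.PCA(n_components=2)
pcs = pca.fit_transform(toy)  # NumPy array with one row per observation

# Fraction of the total variance explained by PC1 and PC2
var_ratio = pca.explained_variance_ratio_
print(pcs.shape)
print(var_ratio)
```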

Step 5: Combine PCs with original data

Combining the PCs with the original data, we can further understand the relationship between PCs and the variables that are part of the original data.

def pcs_with_data(pcs, data):
    pc_aug = pd.concat([pcs, data], axis=1)
    return(pc_aug)
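
One thing to watch for here: pd.concat() with axis=1 matches rows by index label, not by position. The sketch below uses made-up values to show what happens when the two dataframes carry different index labels, as can happen after dropna() removes rows from one of them.

```python
import pandas as pd

# PCs with a fresh 0..n-1 index, data with a gap in its index
pcs = pd.DataFrame({"PC1": [0.1, 0.2]})                      # labels 0, 1
data = pd.DataFrame({"species": ["A", "B"]}, index=[0, 2])   # labels 0, 2

# Labels 1 and 2 each exist on one side only, so they are padded with NaN
misaligned = pd.concat([pcs, data], axis=1)

# Giving the PCs the same labels as the data lines the rows up again
pcs_fixed = pcs.copy()
pcs_fixed.index = data.index
aligned = pd.concat([pcs_fixed, data], axis=1)

print(misaligned)
print(aligned)
```

If the combined dataframe has more rows than expected or contains unexpected missing values, mismatched index labels are a likely cause.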

Step 6: Make PCA plot

Finally, we make the PCA plot: a scatter plot with PC1 on the x-axis and PC2 on the y-axis, with points colored by one of the variables in the original data. In this example, we make the scatter plot using Seaborn’s scatterplot() function and color the points by the “species” variable.

def pca_plot(pc_data):
    p1 = sns.scatterplot(x="PC1", y="PC2", hue="species", data=pc_data)
    return(p1)

Now, using the Pandas pipe() function, we can chain the functions we just wrote to perform PCA and make the PCA plot. The code using pipe() looks like this, where we provide the function corresponding to each step as input. Each pipe() call uses the output of the previous function as its input.

(penguins.
 pipe(select_numeric_cols).
 pipe(remove_rows_with_NA).
 pipe(center_and_scale).
 pipe(do_PCA).
 pipe(pcs_with_data, penguins.dropna()).
 pipe(pca_plot))

And voila, at the end we get the nice PCA plot that we aimed for.

[Figure: PCA plot made with the Pandas pipe method]
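
pipe() passes any extra positional and keyword arguments on to the function, which is how pcs_with_data receives its second argument in the chain above. When the dataframe is not the function's first parameter, pipe() also accepts a (function, keyword) tuple naming the parameter that should receive the dataframe. A small sketch, using a hypothetical helper scale_by that is not part of Pandas:

```python
import pandas as pd

# Toy data, made up for illustration
df = pd.DataFrame({"a": [1, 2, 3]})

def scale_by(factor, data):
    # Hypothetical helper whose dataframe argument comes second
    return data * factor

# The tuple tells pipe() to pass the dataframe as the 'data' keyword
via_tuple = df.pipe((scale_by, "data"), 10)

# The same call written with a lambda instead
via_lambda = df.pipe(lambda d: scale_by(10, d))

print(via_tuple)
```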

H/T to Matt Harrison’s tweet introducing the Pandas pipe function.
