• Skip to secondary menu
  • Skip to main content
  • Skip to primary sidebar

Python and R Tips

Learn Data Science with Python and R

  • Home
  • Python
  • Pandas
    • Pandas 101
  • tidyverse
    • tidyverse 101
  • R
  • Linux
  • Conferences
  • Python Books
  • About
    • Privacy Policy
You are here: Home / Python / Computing Quantile Normalization in Python

Computing Quantile Normalization in Python

June 5, 2020 by cmdlinetips

Quantile Normalization in Python
Quantile Normalization in Python
When working with high-dimensional data, preprocessing and normalizing the data are key important steps in doing data analysis. Quantile normalization is one such statistical methods that can be useful in analyzing high-dimensional datasets. One of the main goals performing normalization like Quantile normalization is to transform the raw data such that we can remove any unwanted variation due to technical artifacts and preserve the actual variation that we are interested in studying. Quantile normalization is widely adopted in fields like genomics, but it can be useful in any high-dimensional setting.

In this post, we will learn how to implement quantile normalization in Python using Pandas and Numpy. We will implement the quantile normalization algorithm step-by-by with a toy data set. Then we will wrap that as a function to apply a simulated dataset. Finally we will examples of couple of visualizations to see how the data looked before and after quantile normalization.

Let us first load the packages needed for implementing Quantile Normalization in Python and illustrating the steps to compute quantile normalization.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy.stats import poisson

Say, you have hundreds or thousands of observations from multiple samples. Quantile normalization is a normalization method that assumes statistical distribution of each sample is exactly the same.

Normalization is achieved by forcing the observed distributions to be the same and the average distribution, obtained by taking the average of each quantile across samples, is used as the reference.

The figure below nicely illustrates the steps needed to perform quantile normalization. And we will follow the steps to implement it in Python. The figure is taken from a recent paper in bioRxiv, titled “When to Use Quantile Normalization?”. Check out the paper for more details on quantile normalization.

Algorithm to Compute Quantile Normalization
Algorithm to Compute Quantile Normalization (Credit: Hick and Irizarry, bioRxiv, 2014)

Let us create a dataframe with some toy data to do quantile normalization. The dataframe here contains the same data as the WikiPedia page on quantile normalization.

df = pd.DataFrame({'C1': {'A': 5, 'B': 2, 'C': 3, 'D': 4},
                   'C2': {'A': 4, 'B': 1, 'C': 4, 'D': 2},
                   'C3': {'A': 3, 'B': 4, 'C': 6, 'D': 8}})

Our toy dataframe has three columns and four rows.

print(df)

       C1	C2	C3
A	5	4	3
B	2	1	4
C	3	4	6
D	4	2	8

Step 1: Order values in each column

The first step in performing quantile normalization is to sort each column (each sample) independently. To sort all the columns independently, we use NumPy sort() function on the values from the dataframe. Since we lose the column and index names with Numpy, we create a new sorted dataframe using the sorted results with index and column names.

df_sorted = pd.DataFrame(np.sort(df.values, axis=0), index=df.index, columns=df.columns)

The dataframe after sorting each column looks like this. By doing this, we are grouping observations with high/low values together.

df_sorted

	C1	C2	C3
A	2	1	3
B	3	2	4
C	4	4	6
D	5	4	8

Step2: Compute Row Means

Since we have sorted each sample’s data independently, the average value each obeservation i.e. each row is in ascending order.

Next step is compute the average of each observartion. We use the sorted dataframe and compute mean of each row using Panda’s mean() with axis=1 argument.

df_mean = df_sorted.mean(axis=1)

We get mean values of each row after sorting with original index.

print(df_mean)

A    2.000000
B    3.000000
C    4.666667
D    5.666667
dtype: float64

These mean values will replace the orginal data in each column, such that we preserve the order of each observation or featur in Samples/Columns. This basically forces all the samples to have the same distributions.

Note that the mean values in ascending order, the first value is lowest rank and the last is of highest rank. Let us change the index to reflect that the mean we computed is ranked from from low to high. To do do that we use index function assign ranks sorting from 1. Note our index starts at 1, reflecting that it is a rank.

df_mean.index = np.arange(1, len(df_mean) + 1)
df_mean
1    2.000000
2    3.000000
3    4.666667
4    5.666667
dtype: float64

Step3: Use Average Values to each sample in the original order

The third and final step is to use the row average values (mean quantile) and replace them in place of raw data in the right order. What this means is, if the original data of first sample at first element is the smallest in the sample, we will replace the original value with new smallest value of row mean.

In our toy example, we can see that the first element of the third column C3 is 2 and it is the smallest in column C3. So we will use the smallest row mean 2 as its replacement. Similarly, the second element of C3 in original data has 4 and it is the second smallest in C3, so we will replace with 3.0, which is the second smallest in row mean.

To implement this, we need to get rank of original data for each column independently. We can use Pandas’ rank function to get that.

df.rank(method="min").astype(int)

C1	C2	C3
A	4	3	1
B	1	1	2
C	2	3	3
D	3	2	4

Now that we have the rank dataframe, we can use the rank to replace it with average values. One way to do that is to conver the rank dataframe in wide to rank data frame in tidy long form. We can use stack() function to reshape the data in wide form to tidy/long form.

df.rank(method="min").stack().astype(int)
A  C1    4
   C2    3
   C3    1
B  C1    1
   C2    1
   C3    2
C  C1    2
   C2    3
   C3    3
D  C1    3
   C2    2
   C3    4
dtype: int64

Then all we need to do is to map our row mean data with rank as index to rank colum of the tidy data. We can nicely chain each operation and get data that is quantile normalized. In the code below, we have reshaped the tidy normalized data to wide form as need.

df_qn =df.rank(method="min").stack().astype(int).map(df_mean).unstack()
df_qn

Now we have our quantile normalized dataframe.

	C1	C2	C3
A	5.666667	4.666667	2.000000
B	2.000000	2.000000	3.000000
C	3.000000	4.666667	4.666667
D	4.666667	3.000000	5.666667

Python Function to Compute Quantile Normalization

Step by step code for the toy example is helpful to understand how quantile normalization is implemented. Let us wrap the statements in to a function and try on slightly realistic data set.

def quantile_normalize(df):
    """
    input: dataframe with numerical columns
    output: dataframe with quantile normalized values
    """
    df_sorted = pd.DataFrame(np.sort(df.values,
                                     axis=0), 
                             index=df.index, 
                             columns=df.columns)
    df_mean = df_sorted.mean(axis=1)
    df_mean.index = np.arange(1, len(df_mean) + 1)
    df_qn =df.rank(method="min").stack().astype(int).map(df_mean).unstack()
    return(df_qn)

Let us generate dataset with three columns and 5000 rows/observation. We use Poisson random distribution with different mean to generate the three columns of data.

c1= poisson.rvs(mu=10, size=5000)
c2= poisson.rvs(mu=15, size=5000)
c3= poisson.rvs(mu=20, size=5000)
df=pd.DataFrame({"C1":c1,
                 "C2":c2,
                 "C3":c3})

Visualizaing the Effect of Quantile Normalization

One of the ways to viusalize the original raw data is to make density plot. Here we use Pandas’ plotting capability to make multiple density plots of the raw data.


df.plot.density(linewidth=4)

We can see that each distribution is distinct as we intended.

Density Plot of Raw Data
Density Plot of Raw Data: Before Quantile Normalization

Let us apply our function to compute quantile normalized data.

# compute quantile normalized data
df_qn=quantile_normalize(df)

Let us make the density plot again, but this time with the quantile normalized data.

df_qn.plot.density(linewidth=4)
plt.title("Density plot after Quantile Normalization")
plt.savefig('Density_plot_after_Quantile_Normalization_Pandas.png',dpi=150)

We can see that the density plot of quantile normalized data looks very similar to each other as we exprected.

Density plot after Quantile Normalization
Density plot after Quantile Normalization

Another way visualize the effect of quantile normalization to a data set is to use boxplot of each column/variable.

Let u make boxplots of original data before normalization. We use Seaborn’s boxplot to make boxplot using the wide form of data.

sns.boxplot(data=df)
# set x-axis label
plt.xlabel("Samples", size=18)
# set y-axis label
plt.ylabel("Measurement", size=18)
plt.title("Boxplot of raw data before Quantile Normalization")
plt.savefig('Boxplot_before_Quantile_Normalization_Seaborn.png',dpi=150)

We can see that the three distributions have different mean/median.

Boxplot before Quantile Normalization Seaborn
Boxplot before Quantile Normalization Seaborn

Now let us make boxplots using quantile normalized data.

sns.boxplot(data=df_qn)
# set x-axis label
plt.xlabel("Samples", size=18)
# set y-axis label
plt.ylabel("Measurement", size=18)
plt.title("Boxplot after Quantile Normalization")
plt.savefig('Boxplot_after_Quantile_Normalization_Seaborn.png',dpi=150)

By design we can see that all three boxplots corresponding to the three column look very similar.

Boxplot after Quantile Normalization Seaborn
Boxplot after Quantile Normalization Seaborn

Share this:

  • Click to share on Facebook (Opens in new window) Facebook
  • Click to share on X (Opens in new window) X

Related posts:

Correlation Heatmap: Lower Triangle with SeabornHow To Make Lower Triangle Heatmap with Correlation Matrix in Python? Default ThumbnailPearson and Spearman Correlation in Python How To Select Columns in Python Pandas?How To Select One or More Columns in Pandas? Default ThumbnailPandas Groupby and Computing Median

Filed Under: Python Tagged With: Python, Quantile Normalization

Primary Sidebar

Subscribe to Python and R Tips and Learn Data Science

Learn Pandas in Python and Tidyverse in R

Tags

Altair Basic NumPy Book Review Data Science Data Science Books Data Science Resources Data Science Roundup Data Visualization Dimensionality Reduction Dropbox Dropbox Free Space Dropbox Tips Emacs Emacs Tips ggplot2 Linux Commands Linux Tips Mac Os X Tips Maximum Likelihood Estimation in R MLE in R NumPy Pandas Pandas 101 Pandas Dataframe Pandas Data Frame pandas groupby() Pandas select columns Pandas select_dtypes Python Python 3 Python Boxplot Python Tips R rstats R Tips Seaborn Seaborn Boxplot Seaborn Catplot Shell Scripting Sparse Matrix in Python tidy evaluation tidyverse tidyverse 101 Vim Vim Tips

RSS RSS

  • How to convert row names to a column in Pandas
  • How to resize an image with PyTorch
  • Fashion-MNIST data from PyTorch
  • Pandas case_when() with multiple examples
  • An Introduction to Statistical Learning: with Applications in Python Is Here
  • 10 Tips to customize ggplot2 title text
  • 8 Plot types with Matplotlib in Python
  • PCA on S&P 500 Stock Return Data
  • Linear Regression with Matrix Decomposition Methods
  • Numpy’s random choice() function

Copyright © 2025 · Lifestyle Pro on Genesis Framework · WordPress · Log in

Go to mobile version