
Python and R Tips

Learn Data Science with Python and R


How To Compute Z-scores in Python

December 7, 2020 by cmdlinetips

Computing standardized values of one or more columns is an important step in many machine learning analyses. For example, if we are using a dimensionality reduction technique like Principal Component Analysis (PCA), we will typically standardize all the variables first.

To standardize a variable, we subtract the mean of the variable from each value and divide by the standard deviation of the variable. This transforms the variable to have zero mean and unit variance; note that it does not change the shape of the distribution.

Standardizing A Variable in Python

Standardization of a variable is also called computing z-scores. A z-score is basically “the number of standard deviations by which the value is away from the mean value of the variable”. When the raw value is above the mean, the standardized value or z-score is positive; when it is below the mean, the z-score is negative.
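As a quick illustration of the definition above, here is a minimal, pure-Python sketch; the small list of numbers is made up for the example:

```python
# A small made-up sample
values = [2.0, 4.0, 6.0, 8.0]

n = len(values)
mean = sum(values) / n  # 5.0

# population standard deviation (divide by n)
std = (sum((x - mean) ** 2 for x in values) / n) ** 0.5

# z-score of each value: how many standard deviations it is from the mean
z_scores = [(x - mean) / std for x in values]
print(z_scores)
```

Values below the mean (2.0 and 4.0) get negative z-scores, and values above it get positive ones, matching the description above.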

In this post, we will see three ways to compute standardized scores for multiple variables in a Pandas dataframe.

  1. First, we will use Pandas functionalities to manually compute standardized scores for all columns at the same time.
  2. Next, we will use NumPy and compute standardized scores.
  3. And finally, we will use scikit-learn’s module to compute standardized scores or z-scores of all columns in a data frame.

Let us import the packages needed for computing standardized scores and visualizing them in Python.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

We will use the Palmer penguins dataset available from Seaborn’s built-in datasets and remove rows with missing values to keep it simple.

# load data from Seaborn
penguins = sns.load_dataset("penguins")
# remove rows with missing values
penguins = penguins.dropna()

Since we are only interested in numerical variables, we select the columns that are numeric.

data = penguins.select_dtypes(float)
data.head()

bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g
0	39.1	18.7	181.0	3750.0
1	39.5	17.4	186.0	3800.0
2	40.3	18.0	195.0	3250.0
4	36.7	19.3	193.0	3450.0
5	39.3	20.6	190.0	3650.0

We can see that each column has a very different range. We can quickly check the average value of each variable and see how different they are.

df = data.mean().reset_index(name="avg")
df

index	avg
0	bill_length_mm	43.992793
1	bill_depth_mm	17.164865
2	flipper_length_mm	200.966967
3	body_mass_g	4207.057057
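Another quick way to compare the columns side by side is Pandas’ describe() method, which summarizes count, mean, standard deviation, and quartiles for every numeric column. Here is a self-contained sketch on a tiny made-up frame standing in for the penguins data:

```python
import pandas as pd

# a tiny made-up frame standing in for the penguins data
df = pd.DataFrame({
    "bill_length_mm": [39.1, 39.5, 40.3, 36.7],
    "body_mass_g": [3750.0, 3800.0, 3250.0, 3450.0],
})

# count, mean, std, min, quartiles, and max for every numeric column
print(df.describe())
```

The very different means and standard deviations in the describe() output are exactly what standardization will even out.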

Using density plots, we can also see how different their distributions are. Feeding raw, unscaled data into many machine learning methods can bias the results toward the variables with the largest ranges.

Multiple Density plot of raw data

Standardizing multiple variables with Pandas

We can standardize all numerical variables in the dataframe using Pandas’ vectorized operations. Here we compute the column means with the mean() function and the standard deviations with the std() function for all the columns/variables in the data frame. Then we subtract the column means and divide by the standard deviations to compute standardized values for all columns at the same time.

data_z = (data-data.mean())/(data.std())

Our standardized values should have zero mean and unit variance for all columns. We can verify that by making a density plot as shown below.

sns.kdeplot(data=data_z)
Density plot of Standardized Variables with Pandas

Let us also check by computing the mean and standard deviation of each variable.

data_z.mean()
bill_length_mm      -2.379811e-15
bill_depth_mm       -1.678004e-15
flipper_length_mm    2.110424e-16
body_mass_g          1.733682e-17
dtype: float64

The means above are effectively zero; the tiny nonzero values are floating-point error. Let us also check the standard deviations of the standardized scores.

data_z.std()

bill_length_mm       1.0
bill_depth_mm        1.0
flipper_length_mm    1.0
body_mass_g          1.0
dtype: float64

How To Compute Standardized Values or Z-scores with NumPy?

We can also use NumPy to compute standardized scores on multiple columns using vectorized operations. First, let us convert the Pandas dataframe into a NumPy array using the to_numpy() function available in Pandas.

data_mat = data.to_numpy()

We can use NumPy’s mean() and std() functions to compute the means and standard deviations and use them to compute the standardized scores. Note that we specify axis=0 so that the mean and standard deviation are computed per column.

data_z_np = (data_mat - np.mean(data_mat, axis=0)) / np.std(data_mat, axis=0)

With NumPy, we get our standardized scores as a NumPy array. Let us convert the NumPy array into a Pandas dataframe using the DataFrame() function.

data_z_np_df = pd.DataFrame(data_z_np, 
                            index=data.index, 
                            columns=data.columns)

This is our new standardized data, and we can check its mean and standard deviation as shown before.

data_z_np_df.head()
bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g
0	-0.896042	0.780732	-1.426752	-0.568475
1	-0.822788	0.119584	-1.069474	-0.506286
2	-0.676280	0.424729	-0.426373	-1.190361
4	-1.335566	1.085877	-0.569284	-0.941606
5	-0.859415	1.747026	-0.783651	-0.692852

How To Standardize Multiple Variables with scikit-learn?

We can standardize one or more variables using scikit-learn’s preprocessing module. For standardizing variables, we use StandardScaler from sklearn.preprocessing.

from sklearn.preprocessing import StandardScaler

We follow the typical scikit-learn approach: first we create an instance of StandardScaler(), then we fit and transform the data to compute standardized scores for all variables.

nrmlzd = StandardScaler()
data_std = nrmlzd.fit_transform(data)
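After fitting, a StandardScaler stores the per-column means and standard deviations it used in its mean_ and scale_ attributes (standard scikit-learn API), which is handy for inspecting or inverting the transformation. Here is a minimal self-contained sketch on made-up data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up data: two columns with very different scales
X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0],
              [4.0, 400.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(scaler.mean_)   # per-column means: [2.5, 250.0]
print(scaler.scale_)  # per-column (population) standard deviations

# inverse_transform maps standardized scores back to the original units
X_back = scaler.inverse_transform(X_std)
```

inverse_transform recovers the original values exactly, which is useful when you need predictions or components back on the original scale.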

scikit-learn also returns the results as a NumPy array, and we can create a Pandas dataframe as before.

data_std = pd.DataFrame(data_std, 
                       index=data.index,
                       columns=data.columns)
data_std

bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g
0	-0.896042	0.780732	-1.426752	-0.568475
1	-0.822788	0.119584	-1.069474	-0.506286
2	-0.676280	0.424729	-0.426373	-1.190361
4	-1.335566	1.085877	-0.569284	-0.941606
5	-0.859415	1.747026	-0.783651	-0.692852

Let us verify the mean and standard deviation of the standardized scores.

data_std.mean()
bill_length_mm       1.026873e-16
bill_depth_mm        3.267323e-16
flipper_length_mm    5.697811e-16
body_mass_g          2.360474e-16
dtype: float64
data_std.std()
bill_length_mm       1.001505
bill_depth_mm        1.001505
flipper_length_mm    1.001505
body_mass_g          1.001505
dtype: float64

You might notice that the standardized scores computed by Pandas differ slightly from the scores computed by NumPy and scikit-learn. This is because Pandas’ std() computes the sample standard deviation (dividing by n − 1), while NumPy’s np.std() and scikit-learn’s StandardScaler use the population standard deviation (dividing by n) by default.

The difference is small: with n = 333 penguins, the two standard deviations differ only by a factor of sqrt(n/(n − 1)) ≈ 1.001505, which is exactly the standard deviation we see printed above instead of 1. Here is the density plot of the standardized scores from scikit-learn, and we can verify that it has zero mean and looks the same as the one computed with Pandas.

sns.kdeplot(data=data_std)
Density plot of Standardized Variables: sklearn StandardScaler
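The Pandas-versus-NumPy difference in standard deviation comes down to the ddof (delta degrees of freedom) argument, and it is easy to check directly. This short sketch on a small made-up series shows that passing ddof=1 to NumPy reproduces the Pandas result:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])  # small made-up sample

print(s.std())                       # sample std (ddof=1): ~1.2910
print(np.std(s.to_numpy()))          # population std (ddof=0): ~1.1180
print(np.std(s.to_numpy(), ddof=1))  # matches Pandas: ~1.2910
```

So if you need the Pandas-style scores from NumPy, compute the standard deviation with np.std(..., ddof=1, axis=0) in the formula above.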

Are you wondering how much difference it makes to an analysis whether or not you standardize the variables? Check out the relevance of standardizing the data while doing PCA here.
