How To Compute Z-scores in Python

Standardizing A Variable in Python

Computing standardized values of one or more columns is an important step in many machine learning analyses. For example, if we are using dimensionality reduction techniques like Principal Component Analysis (PCA), we will typically standardize all the variables first.

To standardize a variable, we subtract the mean of the variable from each value and divide by the standard deviation of the variable. This transforms the variable to have zero mean and unit variance (note that it does not change the shape of the distribution).
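As a quick illustration with made-up numbers (not part of the penguins data used below), the z-score formula looks like this:

```python
import numpy as np

# five made-up measurements (hypothetical values for illustration only)
values = np.array([10.0, 12.0, 14.0, 16.0, 18.0])

# z-score: subtract the mean, then divide by the standard deviation
z = (values - values.mean()) / values.std()

print(z)  # the standardized values now have mean 0 and standard deviation 1
```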

Standardizing a variable is also called computing z-scores. A z-score is the number of standard deviations by which a value is away from the mean of the variable. When the raw value is above the mean, the standardized value or z-score is positive. When the raw value is below the mean, the z-score is negative.

In this post, we will see three ways to compute standardized scores for multiple variables in a Pandas dataframe.

  1. First, we will use Pandas functionalities to manually compute standardized scores for all columns at the same time.
  2. Next, we will use NumPy and compute standardized scores.
  3. And finally, we will use scikit-learn's preprocessing module to compute standardized scores or z-scores of all columns in a data frame.

Let us import the packages needed for computing standardized scores and visualizing them in Python.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

We will use the Palmer penguins dataset available from Seaborn's built-in datasets and remove rows with missing values to keep it simple.

# load data from Seaborn
penguins = sns.load_dataset("penguins")
# remove rows with missing values
penguins = penguins.dropna()

Since we are only interested in numerical variables, we select the columns that are numeric.

data = penguins.select_dtypes(float)
data.head()

bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g
0	39.1	18.7	181.0	3750.0
1	39.5	17.4	186.0	3800.0
2	40.3	18.0	195.0	3250.0
4	36.7	19.3	193.0	3450.0
5	39.3	20.6	190.0	3650.0

We can see that the columns have very different ranges. We can quickly check the average value of each variable and see how different they are.

df = data.mean().reset_index(name="avg")
df

index	avg
0	bill_length_mm	43.992793
1	bill_depth_mm	17.164865
2	flipper_length_mm	200.966967
3	body_mass_g	4207.057057

Using density plots, we can also see how different their distributions are. Feeding raw, unscaled data into many machine learning methods can bias the results.

Multiple Density plot of raw data

Standardizing multiple variables with Pandas

We can standardize all numerical variables in the dataframe using Pandas vectorized functions. Here we compute column means with the mean() function and standard deviations with the std() function for all the columns/variables in the data frame. We can then subtract the column means and divide by the standard deviations to compute standardized values for all columns at the same time.

data_z = (data-data.mean())/(data.std())

Our standardized values should have zero mean and unit variance for all columns. We can verify that by making a density plot as shown below.

sns.kdeplot(data=data_z)
Density plot of Standardized Variables with Pandas

Let us also check by computing mean and standard deviation on each variable.

data_z.mean()
bill_length_mm      -2.379811e-15
bill_depth_mm       -1.678004e-15
flipper_length_mm    2.110424e-16
body_mass_g          1.733682e-17
dtype: float64

Let us check the standard deviations of the standardized scores.

data_z.std()

bill_length_mm       1.0
bill_depth_mm        1.0
flipper_length_mm    1.0
body_mass_g          1.0
dtype: float64
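The same check can be done programmatically. Here is a small self-contained sketch with made-up data (the column names are hypothetical) confirming that the Pandas one-liner produces zero-mean, unit-variance columns:

```python
import numpy as np
import pandas as pd

# small made-up dataframe standing in for the numeric penguin columns
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0],
                   "b": [10.0, 20.0, 30.0, 40.0]})

# standardize all columns at once, as above
df_z = (df - df.mean()) / df.std()

# every column should now have mean ~0 and standard deviation ~1
assert np.allclose(df_z.mean(), 0.0)
assert np.allclose(df_z.std(), 1.0)
```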

How To Compute Standardized Values or Z-scores with NumPy?

We can also use NumPy and compute standardized scores on multiple columns using vectorized operations. First, let us convert the Pandas dataframe into a NumPy array using the to_numpy() function available in Pandas.

data_mat = data.to_numpy()

We can use NumPy's mean() and std() functions to compute the means and standard deviations and use them to compute the standardized scores. Note that we specify axis=0 to compute column-wise means and standard deviations.

data_z_np = (data_mat - np.mean(data_mat, axis=0)) / np.std(data_mat, axis=0)

With NumPy, we get our standardized scores as a NumPy array. Let us convert the NumPy array into a Pandas dataframe using the DataFrame() function.

data_z_np_df = pd.DataFrame(data_z_np, 
                            index=data.index, 
                            columns=data.columns)

This is our new standardized data, and we can check the mean and standard deviation as shown before.

data_z_np_df.head()
bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g
0	-0.896042	0.780732	-1.426752	-0.568475
1	-0.822788	0.119584	-1.069474	-0.506286
2	-0.676280	0.424729	-0.426373	-1.190361
4	-1.335566	1.085877	-0.569284	-0.941606
5	-0.859415	1.747026	-0.783651	-0.692852

How To Standardize Multiple Variables with scikit-learn?

We can standardize one or more variables using scikit-learn’s preprocessing module. For standardizing variables, we use StandardScaler from sklearn.preprocessing.

from sklearn.preprocessing import StandardScaler

We follow the typical scikit-learn approach: first we create an instance of StandardScaler(), then we fit and transform the data to compute standardized scores for all variables.

nrmlzd = StandardScaler()
data_std = nrmlzd.fit_transform(data)
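As a side note, the fitted scaler keeps the column means and standard deviations it used in its mean_ and scale_ attributes. A minimal sketch with made-up data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# small made-up data matrix with two columns
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0],
              [4.0, 40.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(scaler.mean_)   # per-column means used for centering
print(scaler.scale_)  # per-column (population) standard deviations used for scaling
```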

scikit-learn also returns the results as a NumPy array, and we can create a Pandas dataframe from it as before.

data_std = pd.DataFrame(data_std,
                        index=data.index,
                        columns=data.columns)
data_std

bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g
0	-0.896042	0.780732	-1.426752	-0.568475
1	-0.822788	0.119584	-1.069474	-0.506286
2	-0.676280	0.424729	-0.426373	-1.190361
4	-1.335566	1.085877	-0.569284	-0.941606
5	-0.859415	1.747026	-0.783651	-0.692852

Let us verify the mean and standard deviation of the standardized scores.

data_std.mean()
bill_length_mm       1.026873e-16
bill_depth_mm        3.267323e-16
flipper_length_mm    5.697811e-16
body_mass_g          2.360474e-16
dtype: float64
data_std.std()
bill_length_mm       1.001505
bill_depth_mm        1.001505
flipper_length_mm    1.001505
body_mass_g          1.001505
dtype: float64

You might notice that the standardized scores computed by Pandas differ slightly from those computed by NumPy and scikit-learn. This is because Pandas' std() uses the sample standard deviation (ddof=1) by default, while NumPy's np.std() and scikit-learn's StandardScaler use the population standard deviation (ddof=0).
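Pandas' std() defaults to the sample standard deviation (ddof=1), while NumPy's np.std() defaults to the population standard deviation (ddof=0). On a small made-up series, the two results agree once the ddof settings match:

```python
import numpy as np
import pandas as pd

x = pd.Series([1.0, 2.0, 3.0, 4.0])

print(x.std())               # sample standard deviation (ddof=1)
print(np.std(x.to_numpy()))  # population standard deviation (ddof=0)

# the two agree once the ddof settings match
assert np.isclose(x.std(ddof=0), np.std(x.to_numpy()))
```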

However, they are not wildly different; the standard deviations differ only in the third decimal place. Here is the density plot of standardized scores from scikit-learn, and we can verify that it has zero mean and looks the same as the one computed with Pandas.

sns.kdeplot(data=data_std)
Density plot of Standardized Variables: sklearn StandardScaler

Are you wondering how much of a difference standardizing the variables can make in an analysis? Check out the relevance of standardizing the data while doing PCA here.