Computing standardized values of one or more columns is an important step in many machine learning analyses. For example, if we are using a dimensionality reduction technique like Principal Component Analysis (PCA), we will typically standardize all the variables.
To standardize a variable, we subtract the variable's mean from each value and divide by the variable's standard deviation. This transforms the variable to have zero mean and unit variance.
Standardization of a variable is also called computing z-scores. A z-score is the number of standard deviations by which a value lies away from the mean of the variable. When the raw value is above the mean, the standardized value or z-score is positive. When the original value is below the mean, the standardized value or z-score is negative.
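As a quick sanity check of the definition, here is a toy computation with made-up numbers (the values are purely illustrative):

```python
# Toy example: a sample value of 12 from a variable with mean 10 and
# standard deviation 2 lies one standard deviation above the mean.
value = 12.0
mean = 10.0
std = 2.0

z = (value - mean) / std
print(z)  # 1.0, i.e. a positive z-score because the value is above the mean
```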
In this post, we will see three ways to compute standardized scores for multiple variables in a Pandas dataframe.
- First, we will use Pandas functionalities to manually compute standardized scores for all columns at the same time.
- Next, we will use NumPy to compute standardized scores.
- And finally, we will use scikit-learn's preprocessing module to compute standardized scores or z-scores of all columns in a data frame.
Let us import the packages needed for computing standardized scores and visualizing them in Python.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
We will use palmer penguins dataset available from Seaborn’s built-in datasets and remove missing data to keep it simple.
# load data from Seaborn
penguins = sns.load_dataset("penguins")
# remove rows with missing values
penguins = penguins.dropna()
Since we are only interested in numerical variables, we select the columns that are numeric.
data = penguins.select_dtypes(float)
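The column selection above can be sketched on a small hand-made dataframe (the values below are illustrative, not from the penguins data):

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["Adelie", "Gentoo"],      # object column, dropped by the selection
    "bill_length_mm": [39.1, 46.5],       # float column, kept
    "body_mass_g": [3750.0, 5000.0],      # float column, kept
})

# select_dtypes(float) keeps only the float-typed columns
numeric = df.select_dtypes(float)
print(list(numeric.columns))  # ['bill_length_mm', 'body_mass_g']
```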
data.head()

   bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
0            39.1           18.7              181.0       3750.0
1            39.5           17.4              186.0       3800.0
2            40.3           18.0              195.0       3250.0
4            36.7           19.3              193.0       3450.0
5            39.3           20.6              190.0       3650.0
We can see that the columns have very different ranges. We can quickly check the average value of each variable and see how different they are.
df = data.mean().reset_index(name="avg")
df

               index          avg
0     bill_length_mm    43.992793
1      bill_depth_mm    17.164865
2  flipper_length_mm   200.966967
3        body_mass_g  4207.057057
Using density plots, we can also see how different their distributions are. Using the raw data as it is can bias many machine learning methods.
Standardizing multiple variables with Pandas
We can standardize all numerical variables in the dataframe using Pandas vectorized functions. Here we compute column means with the mean() function and standard deviations with the std() function for all the columns/variables in the data frame. We then subtract the column means and divide by the standard deviations to compute standardized values for all columns at the same time.
data_z = (data-data.mean())/(data.std())
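The same one-liner can be checked on a tiny hand-made dataframe (illustrative numbers only):

```python
import pandas as pd

df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [100.0, 200.0, 300.0]})

# Subtract column means and divide by column standard deviations
df_z = (df - df.mean()) / df.std()

print(df_z["a"].tolist())  # [-1.0, 0.0, 1.0]
print(df_z["b"].tolist())  # [-1.0, 0.0, 1.0]
```

Both columns end up on the same scale even though their raw ranges differed by a factor of 100.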
Our standardized values should have zero mean and unit variance for all columns. We can verify that by making a density plot as shown below.
sns.kdeplot(data=data_z)
Let us also check by computing mean and standard deviation on each variable.
data_z.mean()

bill_length_mm      -2.379811e-15
bill_depth_mm       -1.678004e-15
flipper_length_mm    2.110424e-16
body_mass_g          1.733682e-17
dtype: float64
Let us check the standard deviations of the standardized scores.
data_z.std()

bill_length_mm       1.0
bill_depth_mm        1.0
flipper_length_mm    1.0
body_mass_g          1.0
dtype: float64
How To Compute Standardized Values or Z-scores with NumPy?
We can also use NumPy to compute standardized scores on multiple columns with vectorized operations. First, let us convert the Pandas dataframe into a NumPy array using the to_numpy() function available in Pandas.
data_mat = data.to_numpy()
We can use NumPy's mean() and std() functions to compute the means and standard deviations and use them to compute the standardized scores. Note that we have specified axis=0 to compute the mean and std() per column.
data_z_np = (data_mat - np.mean(data_mat, axis=0)) / np.std(data_mat, axis=0)
With NumPy, we get our standardized scores as a NumPy array. Let us convert the NumPy array into a Pandas dataframe using the DataFrame() function.
data_z_np_df = pd.DataFrame(data_z_np, index=data.index, columns=data.columns)
And this is our new standardized data; we can check its mean and standard deviation as shown before.
data_z_np_df.head()

   bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
0       -0.896042       0.780732          -1.426752    -0.568475
1       -0.822788       0.119584          -1.069474    -0.506286
2       -0.676280       0.424729          -0.426373    -1.190361
4       -1.335566       1.085877          -0.569284    -0.941606
5       -0.859415       1.747026          -0.783651    -0.692852
How To Standardize Multiple Variables with scikit-learn?
We can standardize one or more variables using scikit-learn’s preprocessing module. For standardizing variables, we use StandardScaler from sklearn.preprocessing.
from sklearn.preprocessing import StandardScaler
We follow the typical scikit-learn approach: first create an instance of StandardScaler(), then fit and transform the data to compute standardized scores for all variables.
nrmlzd = StandardScaler()
data_std = nrmlzd.fit_transform(data)
scikit-learn also returns the results as a NumPy array, and we can create a Pandas dataframe as before.
data_std = pd.DataFrame(data_std, index=data.index, columns=data.columns)
data_std
   bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g
0       -0.896042       0.780732          -1.426752    -0.568475
1       -0.822788       0.119584          -1.069474    -0.506286
2       -0.676280       0.424729          -0.426373    -1.190361
4       -1.335566       1.085877          -0.569284    -0.941606
5       -0.859415       1.747026          -0.783651    -0.692852
Let us verify the mean and standard deviation of the standardized scores.
data_std.mean()

bill_length_mm       1.026873e-16
bill_depth_mm        3.267323e-16
flipper_length_mm    5.697811e-16
body_mass_g          2.360474e-16
dtype: float64
data_std.std()

bill_length_mm       1.001505
bill_depth_mm        1.001505
flipper_length_mm    1.001505
body_mass_g          1.001505
dtype: float64
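As a side note, the fitted StandardScaler keeps the statistics it used (the mean_ and scale_ attributes) and can map standardized scores back to the original units with inverse_transform(). A minimal sketch on made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])

scaler = StandardScaler()
X_std = scaler.fit_transform(X)

print(scaler.mean_)   # per-column means: [  2. 200.]
print(scaler.scale_)  # per-column standard deviations (population, i.e. ddof=0)

# Round-trip back to the original units
X_back = scaler.inverse_transform(X_std)
print(np.allclose(X_back, X))  # True
```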
You might notice that the standardized scores computed by Pandas differ slightly from those computed by NumPy and scikit-learn. This is because Pandas' std() computes the sample standard deviation (dividing by n - 1 by default), while NumPy's std() and scikit-learn's StandardScaler use the population standard deviation (dividing by n) by default.
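The difference can be made explicit with NumPy's ddof argument: ddof=0 (the default) divides by n, while ddof=1 reproduces Pandas' sample standard deviation. A small sketch with illustrative numbers:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])

pop_std = np.std(x)             # ddof=0: divide by n (NumPy/scikit-learn default)
sample_std = np.std(x, ddof=1)  # ddof=1: divide by n - 1 (what Pandas' std() uses)

print(pop_std)     # ~1.1180
print(sample_std)  # ~1.2910
```

Passing ddof=1 in the NumPy computation of the standard deviation would make the NumPy standardized scores match the Pandas ones exactly.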
However, they are not wildly different; as we can see, they differ only around the third decimal digit. Here is the density plot of the standardized scores from scikit-learn, and we can verify that it has zero mean and looks the same as the one computed with Pandas.
sns.kdeplot(data=data_std)
Are you wondering how much of a difference standardizing the variables can make in an analysis? Check out the relevance of standardizing the data while doing PCA here.