In this post, we will learn how to compare two Pandas dataframes and summarize their differences using Pandas compare() function.
Sometimes you may have two similar dataframes and would like to know exactly what those differences are between the two data frames. Starting from Pandas 1.1.0 version, Pandas has a new function compare() that lets you compare two data frames or Series and identify the differences between them and nicely tabulate them.
Let us load Pandas and Numpy.
import pandas as pd import numpy as np
Let us check Pandas version and make sure we have Pandas version 1.1.0 and above. Otherwise make sure to install the latest version of Pandas using conda/pip install.
pd.__version__ '1.1.0'
We will first create a toy dataframe with three columns and four rows.
# create dataframe df1 = pd.DataFrame( { "col1": ["a", "v", "x", "y"], "col2": [1.0, 2.0, 3.0, np.nan], "col3": [7.0, 8.0, 9.0, 3.0] }, columns=["col1", "col2", "col3"], )
And this is how our data looks like.
df1 col1 col2 col3 0 a 1.0 7.0 1 v 2.0 8.0 2 x 3.0 9.0 3 y NaN 3.0
Let us create second dataframe by copying the first dataframe and making two changes. In this example, we change the values of first row in first column and second row in second column.
# create a copy of dataframe df2 = df1.copy() # change the values of few elements df2.loc[0, 'col1'] = 'zz' df2.loc[1, 'col2'] = '10'
If we compare it to the first data frame, it differs in just two places in the first two rows.
df2 col1 col2 col3 0 zz 1 7.0 1 v 10 8.0 2 x 3 9.0 3 y NaN 3.0
Let us use Pandas compare() function to summarize the differences between the two dataframes. Pandas compare() function outputs the columns and elements that is different between the dataframes. “self” label in the result correspond to the dataframe that we compare and “other” label corresponds to the dataframe that we compare to.
By default, if two corresponding values are equal, they will be shown as NaN. Furthermore, if all values in an entire row / column, the row / column will be omitted from the result. The remaining differences will be aligned on columns.
df1.compare(df2) col1 col2 self other self other 0 a zz NaN NaN 1 NaN NaN 2.0 10
We can also change how Pandas compare display result. Here with align_axis=”rows” or align_axis=0, Pandas compare() function displays self and other in separate rows.
df1.compare(df2, align_axis='rows') col1 col2 0 self a NaN other zz NaN 1 self NaN 2 other NaN 10