How To Compare Two Dataframes with Pandas compare?

In this post, we will learn how to compare two Pandas dataframes and summarize their differences using the Pandas compare() function.

Sometimes you have two similar dataframes and want to know exactly where they differ. Starting with version 1.1.0, Pandas has a compare() function that compares two dataframes or two Series, identifies the differences between them, and tabulates those differences neatly.

Let us load Pandas and NumPy.

import pandas as pd
import numpy as np

Let us check the Pandas version and make sure it is 1.1.0 or above. If it is older, install the latest version of Pandas using conda or pip.

pd.__version__
'1.1.0'
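
Reading a version string by eye is easy to get wrong. As a minimal sketch, an alternative is to check whether the method is available at all. The variable name has_compare is only illustrative.

```python
import pandas as pd

# compare() was added in pandas 1.1.0, so older versions lack this attribute
has_compare = hasattr(pd.DataFrame, "compare") and hasattr(pd.Series, "compare")
print(pd.__version__, has_compare)
```

If has_compare is False, upgrading Pandas as described above should make the function available.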

We will first create a toy dataframe with three columns and four rows.

# create dataframe
df1 = pd.DataFrame(
    {
        "col1": ["a", "v", "x", "y"],
        "col2": [1.0, 2.0, 3.0, np.nan],
        "col3": [7.0, 8.0, 9.0, 3.0],
    },
    columns=["col1", "col2", "col3"],
)

And this is what our data looks like.

df1

  col1  col2  col3
0    a   1.0   7.0
1    v   2.0   8.0
2    x   3.0   9.0
3    y   NaN   3.0

Let us create a second dataframe by copying the first one and making two changes. In this example, we change the value in the first row of the first column and the value in the second row of the second column.

# create a copy of dataframe
df2 = df1.copy()
# change the values of two elements
df2.loc[0, 'col1'] = 'zz'
# note that '10' is a string, not a number
df2.loc[1, 'col2'] = '10'

Compared with the first dataframe, it differs in just two places, both in the first two rows.

df2

  col1 col2  col3
0   zz    1   7.0
1    v   10   8.0
2    x    3   9.0
3    y  NaN   3.0
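
Before tabulating the differences, it can help to confirm that the two dataframes differ at all. The sketch below uses DataFrame.equals() for that. The frames left and right are small stand-ins assumed for illustration, not the dataframes above, and the numeric change is kept as a float.

```python
import numpy as np
import pandas as pd

# Stand-in frames (assumed data) with one text change and one numeric change
left = pd.DataFrame({"col1": ["a", "v", "x"], "col2": [1.0, 2.0, np.nan]})
right = left.copy()
right.loc[0, "col1"] = "zz"
right.loc[1, "col2"] = 10.0

# equals() returns a single boolean; NaNs in the same position count as equal
same_as_copy = left.equals(left.copy())
same_as_changed = left.equals(right)
print(same_as_copy, same_as_changed)
```

equals() only says whether the frames match. It does not say where they differ, which is the gap compare() fills.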

Let us use the Pandas compare() function to summarize the differences between the two dataframes. compare() outputs only the columns and elements that differ between the dataframes. The "self" label in the result corresponds to the dataframe on which compare() is called (df1 here), and the "other" label corresponds to the dataframe passed as the argument (df2 here).

By default, if two corresponding values are equal, they are shown as NaN. Furthermore, if all values in an entire row or column are equal, that row or column is omitted from the result. The remaining differences are aligned on columns.

df1.compare(df2)

  col1       col2
  self other self other
0    a    zz  NaN   NaN
1  NaN   NaN  2.0    10
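
The defaults described above can be changed with the keep_shape and keep_equal arguments of compare(). The following is a minimal sketch on small stand-in frames (left and right are assumed data, not the dataframes above).

```python
import numpy as np
import pandas as pd

# Stand-in frames (assumed data); the numeric change is kept as a float
left = pd.DataFrame({"col1": ["a", "v", "x"], "col2": [1.0, 2.0, np.nan]})
right = left.copy()
right.loc[0, "col1"] = "zz"
right.loc[1, "col2"] = 10.0

# Default: only rows and columns with at least one difference are kept
diff = left.compare(right)

# keep_shape=True keeps every row and column; equal cells are shown as NaN
full = left.compare(right, keep_shape=True)

# keep_equal=True shows the equal values themselves instead of NaN
full_equal = left.compare(right, keep_shape=True, keep_equal=True)

print(diff)
print(full)
print(full_equal)
```

Here the third row is dropped from the default result because both of its values match, including the NaN in col2, which compare() treats as equal to the NaN in the same position.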

We can also change how compare() displays the result. With align_axis="rows" or align_axis=0, compare() shows self and other in separate rows instead of separate columns.

df1.compare(df2, align_axis='rows')

        col1 col2
0 self     a  NaN
  other   zz  NaN
1 self   NaN    2
  other  NaN   10
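
As noted at the start, compare() also works on Series. The sketch below uses two small Series, s1 and s2, made up for illustration, and shows both alignments.

```python
import pandas as pd

# Two example Series (assumed data) that differ at positions 1 and 3
s1 = pd.Series(["a", "b", "c", "d"])
s2 = pd.Series(["a", "x", "c", "y"])

# Default align_axis=1: a dataframe with "self" and "other" columns
side_by_side = s1.compare(s2)

# align_axis=0: self and other are stacked on the index instead
stacked = s1.compare(s2, align_axis=0)

print(side_by_side)
print(stacked)
```

With the default alignment the result is a dataframe with one row per differing position. With align_axis=0 the result is a Series with a two-level index.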
