How to compare two dataframes in Python

Overview

The compare method in pandas shows the differences between two DataFrames. It compares two data frames, row-wise and column-wise, and presents the differences side by side.

The compare method can only compare DataFrames of the same shape, that is, with the same dimensions and identical row and column labels. If the labels do not match, it raises a ValueError.
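
As a small illustration of this restriction, the sketch below calls compare on two DataFrames with different numbers of rows and catches the resulting error. The frames left and right are made-up names for this illustration and are not part of the example further down.

```python
import pandas as pd

# Hypothetical frames for illustration: same columns, different number of rows.
left = pd.DataFrame({"Name": ["dom", "celeste"], "Age": [10, 14]})
right = pd.DataFrame({"Name": ["dom", "celeste", "abhi"], "Age": [10, 14, 17]})

try:
    left.compare(right)
    raised = False
except ValueError as err:
    # pandas refuses to compare frames whose labels do not match exactly.
    raised = True
    print("compare failed:", err)
```

If the frames may have different rows, they need to be aligned first (for example by reindexing or joining) before compare can be used.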

Syntax

DataFrame.compare(other, align_axis=1, keep_shape=False, keep_equal=False)

Parameters

The compare method accepts the following parameters:

  • other: This is the DataFrame for comparison.
  • align_axis: This sets the axis along which the two versions are aligned. With 1, the default value, the values from the two DataFrames are placed in adjacent columns; with 0, they are placed in alternating rows.
  • keep_shape: This is a boolean parameter. Setting this to True keeps every row and column in the result. With the default value False, compare drops the rows and columns whose elements are all the same in the two DataFrames.
  • keep_equal: This is another boolean parameter. Setting this to True shows the values that are equal in the two DataFrames. With the default value False, compare shows the positions with equal values as NaN.
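
To make the effect of align_axis concrete, here is a minimal sketch that runs compare with both settings on the same data as the example below. The variable names side_by_side and stacked are chosen here for illustration only.

```python
import pandas as pd

df = pd.DataFrame([["dom", 10], ["chibuge", 15], ["celeste", 14]],
                  columns=["Name", "Age"])
df1 = pd.DataFrame([["dom", 11], ["abhi", 17], ["celeste", 14]],
                   columns=["Name", "Age"])

# align_axis=1 (default): "self" and "other" appear as adjacent columns.
side_by_side = df.compare(df1)

# align_axis=0: "self" and "other" appear as alternating rows.
stacked = df.compare(df1, align_axis=0)

print(side_by_side)
print(stacked)
```

In both cases the identical third row is dropped, because keep_shape is left at its default value False.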

Example

import pandas as pd

data = [['dom', 10], ['chibuge', 15], ['celeste', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])

data1 = [['dom', 11], ['abhi', 17], ['celeste', 14]]
df1 = pd.DataFrame(data1, columns = ['Name', 'Age'])

print("Dataframe 1 -- \n")
print(df)

print("-"*5)
print("Dataframe 2 -- \n")
print(df1)

print("-"*5)
print("Dataframe difference -- \n")
print(df.compare(df1))

print("-"*5)
print("Dataframe difference keeping equal values -- \n")
print(df.compare(df1, keep_equal=True))

print("-"*5)
print("Dataframe difference keeping same shape -- \n")
print(df.compare(df1, keep_shape=True))
 
print("-"*5)
print("Dataframe difference keeping same shape and equal values -- \n")
print(df.compare(df1, keep_shape=True, keep_equal=True))

Explanation

  • Line 1: We import the pandas module.
  • Lines 3–4: We construct a Pandas DataFrame called df from the list called data. df has two columns: Name and Age.
  • Lines 6–7: We construct another Pandas DataFrame called df1 from the list called data1. df1 has two columns: Name and Age.
  • Lines 9–14: We print df and df1.
  • Line 18: We use compare to obtain the difference between the two DataFrames df and df1.
  • Line 22: We use compare to obtain the difference between the two DataFrames, df and df1, while setting keep_equal to True. In the rows and columns that remain, equal values are printed as they are instead of as NaN.
  • Line 26: We use compare to obtain the difference between the two DataFrames, df and df1, while setting keep_shape to True. The row with the same values in the two DataFrames is not dropped from the printed difference; it appears filled with NaN.
  • Line 30: We use compare to obtain the difference between the two DataFrames, df and df1, while setting keep_shape and keep_equal to True. The row with the same values in the two DataFrames is not dropped, and the positions with equal values show those values instead of NaN.
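
Beyond printing, the result of compare is itself a DataFrame and can be inspected in code. The following sketch, which reuses the data from the example, shows one possible way to list the rows and columns that differ; the names changed_rows, changed_cols, and age_changes are illustrative.

```python
import pandas as pd

df = pd.DataFrame([["dom", 10], ["chibuge", 15], ["celeste", 14]],
                  columns=["Name", "Age"])
df1 = pd.DataFrame([["dom", 11], ["abhi", 17], ["celeste", 14]],
                   columns=["Name", "Age"])

diff = df.compare(df1)

# Row labels where at least one column differs.
changed_rows = diff.index.tolist()

# Columns where at least one row differs (outer level of the column labels).
changed_cols = diff.columns.get_level_values(0).unique().tolist()

# Both versions of a single column, as "self" and "other".
age_changes = diff["Age"]

print(changed_rows, changed_cols)
print(age_changes)
```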

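There is a simpler and faster solution that, when the numbers differ, also gives you the difference in quantities. It indexes both DataFrames on the columns that identify a row and joins them (df1 and df2 here are the two example tables built further down):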

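# Index both frames on the columns that identify a row.
df1_i = df1.set_index(['Date', 'Fruit', 'Color'])
df2_i = df2.set_index(['Date', 'Fruit', 'Color'])
# The outer join keeps rows from either frame; df2's Num becomes Num_, and missing values become 0.
df_diff = df1_i.join(df2_i, how='outer', rsuffix='_').fillna(0)
# Per-row difference in Num: 0 where the two frames agree.
df_diff = df_diff['Num'] - df_diff['Num_']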

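Here df_diff is a summary of the differences: it is 0 for rows that match in both DataFrames and nonzero otherwise, so you can also use it to find the differences in quantities. For the example tables defined below, the four rows present in both DataFrames give 0, and the two rows that appear only in df2 give -22.1 and -8.6.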

Explanation: As with comparing two lists, to do this efficiently we should first order both and then compare them. Converting the lists to sets (hashing) would also be fast. Both are a huge improvement over the simple O(N^2) double comparison loop.
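
The hashing idea mentioned above can be sketched as follows. This is not the join-based code from this answer but a separate illustration: each row is turned into a tuple and the rows are compared as Python sets. It assumes the rows contain no NaN values and that duplicate rows do not matter, since sets discard duplicates.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Date': ['2013-11-24'] * 4,
    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery'],
    'Num': [22.1, 8.6, 7.6, 10.2],
    'Color': ['Yellow', 'Orange', 'Green', 'Green'],
})
df2 = pd.DataFrame({
    'Date': ['2013-11-24'] * 4 + ['2013-11-25'] * 2,
    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery', 'Apple', 'Orange'],
    'Num': [22.1, 8.6, 7.6, 10.2, 22.1, 8.6],
    'Color': ['Yellow', 'Orange', 'Green', 'Green', 'Red', 'Orange'],
})

# Turn every row into a hashable tuple (Date, Fruit, Num, Color).
rows1 = set(df1.itertuples(index=False, name=None))
rows2 = set(df2.itertuples(index=False, name=None))

# Set differences take roughly linear time instead of a nested loop.
only_in_df1 = rows1 - rows2
only_in_df2 = rows2 - rows1

print(sorted(only_in_df2))
```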

Note: the following code produces the tables:

import pandas as pd

df1 = pd.DataFrame({
    'Date': ['2013-11-24', '2013-11-24', '2013-11-24', '2013-11-24'],
    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery'],
    'Num': [22.1, 8.6, 7.6, 10.2],
    'Color': ['Yellow', 'Orange', 'Green', 'Green'],
})
df2 = pd.DataFrame({
    'Date': ['2013-11-24', '2013-11-24', '2013-11-24', '2013-11-24', '2013-11-25', '2013-11-25'],
    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery', 'Apple', 'Orange'],
    'Num': [22.1, 8.6, 7.6, 10.2, 22.1, 8.6],
    'Color': ['Yellow', 'Orange', 'Green', 'Green', 'Red', 'Orange'],
})
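
Putting the pieces together, the sketch below repeats the two tables so that it runs on its own, applies the join-based comparison from this answer, and keeps only the rows where the difference is nonzero. The names keys, joined, and changed are introduced here for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Date': ['2013-11-24'] * 4,
    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery'],
    'Num': [22.1, 8.6, 7.6, 10.2],
    'Color': ['Yellow', 'Orange', 'Green', 'Green'],
})
df2 = pd.DataFrame({
    'Date': ['2013-11-24'] * 4 + ['2013-11-25'] * 2,
    'Fruit': ['Banana', 'Orange', 'Apple', 'Celery', 'Apple', 'Orange'],
    'Num': [22.1, 8.6, 7.6, 10.2, 22.1, 8.6],
    'Color': ['Yellow', 'Orange', 'Green', 'Green', 'Red', 'Orange'],
})

keys = ['Date', 'Fruit', 'Color']

# Outer join on the key columns; values missing on one side become 0.
joined = df1.set_index(keys).join(df2.set_index(keys), how='outer', rsuffix='_').fillna(0)
df_diff = joined['Num'] - joined['Num_']

# Keep only the rows where the two frames disagree.
changed = df_diff[df_diff != 0]
print(changed)
```

Note that because missing values are filled with 0, a row that exists in only one table is indistinguishable here from a row whose Num is 0 in the other table.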

How do you compare data types in two DataFrames in Python?

Steps to compare values between two pandas DataFrames:
Step 1: Prepare the datasets to be compared. To start, let's say that you have the following two datasets that you want to compare: ...
Step 2: Create the two DataFrames. ...
Step 3: Compare the values between the two pandas DataFrames.
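
The three steps above can be sketched as follows. The datasets, the column names product and price, and the result columns are all hypothetical, since the original datasets are not shown. The sketch also checks whether the column data types of the two DataFrames agree.

```python
import pandas as pd

# Steps 1 and 2: hypothetical datasets, loaded into two DataFrames.
first = pd.DataFrame({"product": ["laptop", "printer", "tablet"],
                      "price": [1200, 150, 300]})
second = pd.DataFrame({"product": ["laptop", "printer", "tablet"],
                       "price": [1200, 175, 300]})

# Data types: True only if every column has the same dtype in both frames.
same_dtypes = bool((first.dtypes == second.dtypes).all())

# Step 3: element-wise comparison of one column, row by row.
first["price_matches"] = first["price"] == second["price"]
first["price_diff"] = first["price"] - second["price"]

print(same_dtypes)
print(first)
```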

How do you check whether two DataFrames are the same?

The equals() function is used to test whether two objects contain the same elements. This function allows two Series or DataFrames to be compared against each other to see if they have the same shape and elements. NaNs in the same location are considered equal.
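
A minimal sketch of this behavior, using made-up frames a, b, and c, contrasts equals() with an element-wise == check, which treats NaN as unequal to itself:

```python
import numpy as np
import pandas as pd

a = pd.DataFrame({"x": [1.0, np.nan], "y": ["p", "q"]})
b = a.copy()

# equals() treats NaNs in the same location as equal.
same = a.equals(b)

# An element-wise check does not, because NaN != NaN.
elementwise = bool((a == b).all().all())

# Changing a single value makes equals() return False.
c = b.copy()
c.loc[0, "y"] = "z"
different = a.equals(c)

print(same, elementwise, different)
```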

How do I compare two large DataFrames in Python?

Comparison of two data sets using Python:
datacompy is a package to compare two DataFrames. ...
datacompy takes two DataFrames as input and gives us a human-readable report containing statistics that let us know the similarities and dissimilarities between the two DataFrames.