Overview
The compare method in pandas shows the differences between two DataFrames. It compares them element by element and presents the differing values side by side.
The compare method can only compare identically labeled DataFrames: both must have the same shape and identical row and column labels. Otherwise, it raises a ValueError.
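As a minimal sketch of this restriction, the snippet below uses two small hypothetical DataFrames (left and right are illustrative names, not part of the example later in this article) that have the same shape but different column labels. In the pandas versions we are aware of, compare raises a ValueError in this case, and renaming the column so the labels match lets the comparison go through.

```python
import pandas as pd

left = pd.DataFrame({"Name": ["dom", "celeste"], "Age": [10, 14]})
# Same shape as `left`, but one column label differs ("Years" instead of "Age").
right = pd.DataFrame({"Name": ["dom", "celeste"], "Years": [10, 14]})

try:
    left.compare(right)
    error_message = None
except ValueError as exc:
    # compare refuses to run when the labels are not identical.
    error_message = str(exc)

print(error_message)

# After aligning the labels, the comparison works. The values are all equal,
# so the result is an empty DataFrame.
aligned = right.rename(columns={"Years": "Age"})
result = left.compare(aligned)
print(result.empty)
```

If the two DataFrames come from different sources, aligning their labels first (for example with rename or reindex) is usually needed before compare can be used.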
Syntax
DataFrame.compare(other, align_axis=1, keep_shape=False, keep_equal=False)
Parameters
The compare method accepts the following parameters:
- other: This is the DataFrame to compare with.
- align_axis: This indicates the axis along which the comparison is aligned. With 1, the default value, the differing values are placed side by side in columns. With 0, they are stacked in alternating rows.
- keep_shape: This is a boolean parameter. With the default value False, compare drops the rows and columns in which all elements are the same for the two DataFrames. Setting this to True keeps every row and column.
- keep_equal: This is another boolean parameter. With the default value False, compare shows the positions with the same values for the two DataFrames as NaN. Setting this to True shows the equal values themselves.
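To make the effect of align_axis concrete, here is a short sketch that runs compare on the same two DataFrames with both settings. The variable names side_by_side and stacked are illustrative. The shape comments follow from the data: two rows differ and both columns contain at least one difference.

```python
import pandas as pd

df = pd.DataFrame([["dom", 10], ["chibuge", 15], ["celeste", 14]],
                  columns=["Name", "Age"])
df1 = pd.DataFrame([["dom", 11], ["abhi", 17], ["celeste", 14]],
                   columns=["Name", "Age"])

# Default (align_axis=1): self/other appear as a second column level.
side_by_side = df.compare(df1)
print(side_by_side.shape)   # 2 differing rows, 2 columns x (self, other)

# align_axis=0: self/other appear as a second index level instead.
stacked = df.compare(df1, align_axis=0)
print(stacked.shape)        # 2 differing rows x (self, other), 2 columns
print(stacked)
```

The stacked layout can be easier to read when the DataFrames have many columns, since the result does not become twice as wide.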
Example
import pandas as pd

data = [['dom', 10], ['chibuge', 15], ['celeste', 14]]
df = pd.DataFrame(data, columns = ['Name', 'Age'])

data1 = [['dom', 11], ['abhi', 17], ['celeste', 14]]
df1 = pd.DataFrame(data1, columns = ['Name', 'Age'])

print("Dataframe 1 -- \n")
print(df)
print("-"*5)

print("Dataframe 2 -- \n")
print(df1)
print("-"*5)

print("Dataframe difference -- \n")
print(df.compare(df1))
print("-"*5)

print("Dataframe difference keeping equal values -- \n")
print(df.compare(df1, keep_equal=True))
print("-"*5)

print("Dataframe difference keeping same shape -- \n")
print(df.compare(df1, keep_shape=True))
print("-"*5)

print("Dataframe difference keeping same shape and equal values -- \n")
print(df.compare(df1, keep_shape=True, keep_equal=True))
Explanation
- Line 1: We import the pandas module.
- Lines 3–4: We construct a pandas DataFrame called df from the list called data. df has two columns: Name and Age.
- Lines 6–7: We construct another pandas DataFrame called df1 from the list called data1. df1 has the same two columns: Name and Age.
- Lines 9–14: We print df and df1.
- Line 18: We use compare to obtain the difference between the two DataFrames df and df1.
- Line 22: We use compare to obtain the difference between the two DataFrames, df and df1, while setting keep_equal to True. We can see that equal values in the retained rows are displayed instead of being replaced with NaN.
- Line 26: We use compare to obtain the difference between the two DataFrames, df and df1, while setting keep_shape to True. We see that the row with the same values for the two DataFrames is not omitted in the printed difference.
- Line 30: We use compare to obtain the difference between the two DataFrames, df and df1, while setting both keep_shape and keep_equal to True. We see that the row with the same values for the two DataFrames is retained, and the equal values are displayed rather than replaced with NaN.
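The behavior described in these steps can also be checked programmatically rather than by reading the printed output. The sketch below rebuilds the same two DataFrames and inspects individual cells of each result. The variable names default, kept_equal, and kept_shape are illustrative.

```python
import pandas as pd

df = pd.DataFrame([["dom", 10], ["chibuge", 15], ["celeste", 14]],
                  columns=["Name", "Age"])
df1 = pd.DataFrame([["dom", 11], ["abhi", 17], ["celeste", 14]],
                   columns=["Name", "Age"])

default = df.compare(df1)
kept_equal = df.compare(df1, keep_equal=True)
kept_shape = df.compare(df1, keep_shape=True)

# Row 0 has the same Name in both frames: masked by default, shown with keep_equal.
print(pd.isna(default.loc[0, ("Name", "self")]))
print(kept_equal.loc[0, ("Name", "self")])

# Row 2 is identical in both frames: dropped by default, retained with keep_shape.
print(2 in default.index)
print(2 in kept_shape.index)
```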
There is a simpler and faster solution that, when the numbers differ, can even give you the differences in quantity:
df1_i = df1.set_index(['Date','Fruit','Color'])
df2_i = df2.set_index(['Date','Fruit','Color'])
df_diff = df1_i.join(df2_i,how='outer',rsuffix='_').fillna(0)
df_diff = (df_diff['Num'] - df_diff['Num_'])

Here df_diff is a synopsis of the differences. You can even use it to find the differences in quantities. In your example:
Explanation: As with comparing two lists, to do this efficiently we should first sort them and then compare them (converting the lists to sets, i.e. hashing, would also be fast). Both are an enormous improvement over the simple O(N^2) double comparison loop.
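To illustrate the list analogy, here is a rough sketch in plain Python that contrasts the double loop with the hashing approach. The function names and the sample rows are made up for illustration. It also assumes the items are hashable (tuples here), which the set-based version requires.

```python
def diff_double_loop(a, b):
    """Items of a that are absent from b, using the naive nested scan."""
    missing = []
    for x in a:
        found = False
        for y in b:
            if x == y:
                found = True
                break
        if not found:
            missing.append(x)
    return missing


def diff_hashed(a, b):
    """Same result, but b is hashed once so each lookup is O(1) on average."""
    b_seen = set(b)
    return [x for x in a if x not in b_seen]


old_rows = [("2013-11-24", "Banana"), ("2013-11-24", "Orange")]
new_rows = [("2013-11-24", "Banana"), ("2013-11-24", "Orange"),
            ("2013-11-25", "Apple")]

# Rows present in new_rows but not in old_rows.
added = diff_hashed(new_rows, old_rows)
print(added)
```

Setting an index and joining on it plays a similar role for DataFrames: rows are matched by key lookup instead of by scanning every pair.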
Note: the following code produces the tables:
import pandas as pd

df1=pd.DataFrame({
    'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24'],
    'Fruit':['Banana','Orange','Apple','Celery'],
    'Num':[22.1,8.6,7.6,10.2],
    'Color':['Yellow','Orange','Green','Green'],
})
df2=pd.DataFrame({
    'Date':['2013-11-24','2013-11-24','2013-11-24','2013-11-24','2013-11-25','2013-11-25'],
    'Fruit':['Banana','Orange','Apple','Celery','Apple','Orange'],
    'Num':[22.1,8.6,7.6,10.2,22.1,8.6],
    'Color':['Yellow','Orange','Green','Green','Red','Orange'],
})
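In the tables above, every row that appears in both DataFrames has the same Num, so the quantity differences do not show up. As a sketch of that case, the snippet below applies the same join-and-subtract approach to two small hypothetical DataFrames (old and new, with invented values) in which one quantity changes and one row is added. The final filtering step is an addition for readability, not part of the approach as stated.

```python
import pandas as pd

# Hypothetical data: Orange changes from 8.6 to 9.0, and a red Apple row is new.
old = pd.DataFrame({
    "Date": ["2013-11-24", "2013-11-24"],
    "Fruit": ["Banana", "Orange"],
    "Num": [22.1, 8.6],
    "Color": ["Yellow", "Orange"],
})
new = pd.DataFrame({
    "Date": ["2013-11-24", "2013-11-24", "2013-11-25"],
    "Fruit": ["Banana", "Orange", "Apple"],
    "Num": [22.1, 9.0, 7.6],
    "Color": ["Yellow", "Orange", "Red"],
})

keys = ["Date", "Fruit", "Color"]
# Outer join on the key columns; the right-hand Num column becomes "Num_".
joined = old.set_index(keys).join(new.set_index(keys), how="outer", rsuffix="_")
# A row missing on one side is treated as quantity 0.
joined = joined.fillna(0)

df_diff = joined["Num"] - joined["Num_"]
# Keep only the keys whose quantities differ.
changed = df_diff[df_diff != 0]
print(changed)
```

Note that fillna(0) makes a missing row indistinguishable from a row whose quantity is genuinely 0, so this works best when 0 is not a meaningful value in the data.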