Pandas "Can only compare identically-labeled DataFrame objects" error

Posted by carvr3hs on 2022-11-05 in Other


I'm using Pandas to compare the outputs of two files loaded into two DataFrames (uat, prod): ...

uat = uat[['Customer Number','Product']]
prod = prod[['Customer Number','Product']]
print(uat['Customer Number'] == prod['Customer Number'])
print(uat['Product'] == prod['Product'])
print(uat == prod)

The first two match exactly:
74357    True
74356    True
Name: Customer Number, dtype: bool
74357    True
74356    True
Name: Product, dtype: bool

For the third print, I get an error: Can only compare identically-labeled DataFrame objects. If the first two compared fine, what's wrong with the third?
Thanks

pandas

Source: https://stackoverflow.com/questions/18548370/pandas-can-only-compare-identically-labeled-dataframe-objects-error

8 Answers


2ekbmq32 #1

Here's a small example to demonstrate this (it only applied to DataFrames, not Series, until Pandas 0.19, where it applies to both):

In [1]: df1 = pd.DataFrame([[1, 2], [3, 4]])

In [2]: df2 = pd.DataFrame([[3, 4], [1, 2]], index=[1, 0])

In [3]: df1 == df2
Exception: Can only compare identically-labeled DataFrame objects

One solution is to sort the index first (note: some functions require sorted indexes):

In [4]: df2.sort_index(inplace=True)

In [5]: df1 == df2
Out[5]: 
      0     1
0  True  True
1  True  True

Note: == is also sensitive to the order of columns, so you may have to use sort_index(axis=1):

In [11]: df1.sort_index().sort_index(axis=1) == df2.sort_index().sort_index(axis=1)
Out[11]: 
      0     1
0  True  True
1  True  True

Note: This can still raise (if the index/columns aren't identically labelled after sorting).
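
To make that last caveat concrete, here is a minimal sketch of a guard that checks the labels after sorting and only then uses ==. The helper name safe_eq and the frame df3 are hypothetical and only serve the illustration; they are not part of pandas or of the answer above.

```python
import pandas as pd

def safe_eq(left: pd.DataFrame, right: pd.DataFrame):
    """Hypothetical helper: element-wise == after sorting both axes.

    Returns None instead of raising when the labels still differ after sorting.
    """
    left = left.sort_index().sort_index(axis=1)
    right = right.sort_index().sort_index(axis=1)
    # == requires identical labels on both axes, so check them first.
    if not (left.index.equals(right.index) and left.columns.equals(right.columns)):
        return None
    return left == right

df1 = pd.DataFrame([[1, 2], [3, 4]])
df2 = pd.DataFrame([[3, 4], [1, 2]], index=[1, 0])  # same labels, different order
df3 = pd.DataFrame([[1, 2], [3, 4]], index=[0, 2])  # label 2 has no partner in df1

same_order_result = safe_eq(df1, df2)        # a boolean DataFrame
different_labels_result = safe_eq(df1, df3)  # None: sorting cannot fix this
print(same_order_result)
print(different_labels_result)
```

Returning None is just one possible design; raising a clearer error, or falling back to DataFrame.eq, would work as well.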


pn9klfpd #2

If the index does not need to take part in the comparison, you can also try dropping it:

print(df1.reset_index(drop=True) == df2.reset_index(drop=True))

I have used the same technique in a unit test, like so:

from pandas.testing import assert_frame_equal  # pandas.util.testing is deprecated

assert_frame_equal(actual.reset_index(drop=True), expected.reset_index(drop=True))
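
One thing to keep in mind with reset_index(drop=True) is that the comparison becomes purely positional. The sketch below uses made-up frames named actual and expected to show this, and shows the check_like=True option of assert_frame_equal, which ignores the order of index and columns instead of discarding the labels.

```python
import pandas as pd
from pandas.testing import assert_frame_equal

expected = pd.DataFrame({"Product": ["A", "B"]}, index=[0, 1])
actual = pd.DataFrame({"Product": ["B", "A"]}, index=[1, 0])  # same rows, reordered

# Dropping the index compares by position, so reordered rows count as different.
positional = actual.reset_index(drop=True) == expected.reset_index(drop=True)
print(positional["Product"].tolist())

# check_like=True makes assert_frame_equal ignore the order of index and columns,
# so this assertion passes without touching the labels.
assert_frame_equal(actual, expected, check_like=True)
```

Which of the two is appropriate depends on whether the row labels carry meaning in your data.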


zaqlnxep #3

At the time this question was asked, Pandas had no other function for testing equality, but one was added a while ago: DataFrame.equals
You can use it like this:

df1.equals(df2)

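Some differences from == are:

You do not get the error described in the question.
It returns a simple boolean.
NaN values in the same location are considered equal.
The two DataFrames need to have the same dtypes to be considered equal (this is discussed in a separate Stack Overflow question).

EDIT:

As pointed out in @paperskilltrees' answer, index alignment matters. Besides the solution given there, another option is to sort the index of both DataFrames before comparing them. For df1, that would be df1.sort_index(inplace=True).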
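
The NaN and dtype differences listed above can be checked with a short sketch. The frames df_nan, df_int and df_float are made up for the illustration.

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({"a": [1.0, np.nan]})
df_int = pd.DataFrame({"a": [1, 2]})
df_float = pd.DataFrame({"a": [1.0, 2.0]})

# == compares element-wise, and NaN never equals NaN.
elementwise_all = bool((df_nan == df_nan.copy()).all().all())
# equals treats NaNs in the same position as equal and returns one boolean.
nan_equal = df_nan.equals(df_nan.copy())
# Same values but int64 vs float64 columns: equals reports a difference.
dtype_equal = df_int.equals(df_float)
# == only looks at the values, so the dtype difference does not matter here.
dtype_elementwise = bool((df_int == df_float).all().all())

print(elementwise_all, nan_equal, dtype_equal, dtype_elementwise)
```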


pprl5pva #4

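When comparing two DataFrames, you must make sure that the number of records in the first DataFrame matches the number of records in the second. In our example, each of the two DataFrames has 4 records, with 4 products and 4 prices.
If, for example, one of the DataFrames had 5 products while the other had 4, and you tried to run the comparison, you would get the following error:

ValueError: Can only compare identically-labeled Series objects

This should work: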

import pandas as pd
import numpy as np

firstProductSet = {'Product1': ['Computer','Phone','Printer','Desk'],
                   'Price1': [1200,800,200,350]
                   }
df1 = pd.DataFrame(firstProductSet,columns= ['Product1', 'Price1'])

secondProductSet = {'Product2': ['Computer','Phone','Printer','Desk'],
                    'Price2': [900,800,300,350]
                    }
df2 = pd.DataFrame(secondProductSet,columns= ['Product2', 'Price2'])

df1['Price2'] = df2['Price2'] #add the Price2 column from df2 to df1

df1['pricesMatch?'] = np.where(df1['Price1'] == df2['Price2'], 'True', 'False')  #create new column in df1 to check if prices match
df1['priceDiff?'] = np.where(df1['Price1'] == df2['Price2'], 0, df1['Price1'] - df2['Price2']) #create new column in df1 for price diff 
print (df1)

Example from https://datatofish.com/compare-values-dataframes/
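
Strictly speaking, == checks the labels rather than only the row count, but a different number of rows always means different labels. A minimal sketch of the failing case, with made-up Series named prices_a and prices_b, and of Series.eq as one way to compare anyway:

```python
import pandas as pd

prices_a = pd.Series([1200, 800, 200, 350, 99])  # 5 records
prices_b = pd.Series([1200, 800, 300, 350])      # 4 records

try:
    prices_a == prices_b
    raised = False
except ValueError:
    # The labels 0..4 and 0..3 differ, so == refuses to compare.
    raised = True

# Series.eq aligns on the index first; a label missing on one side compares as False.
aligned = prices_a.eq(prices_b)
print(raised)
print(aligned.tolist())
```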


ru9i0ody #5

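Flyingdutchman's answer is great but wrong: it uses DataFrame.equals, which will return False in your case, whereas you want DataFrame.eq, which will return True.
It seems that DataFrame.equals does not align the two frames on their index (it compares labels and values in the order they appear), while DataFrame.eq uses the DataFrames' indexes for alignment and then compares the aligned values. This is a good occasion to quote the central gotcha of Pandas:
"Here is a basic tenet to keep in mind: data alignment is intrinsic. The link between labels and data will not be broken unless done so explicitly by you."
As we see in the following examples, data alignment is neither broken nor enforced unless explicitly requested.
1. No explicit instruction given as to the alignment: == aka DataFrame.__eq__,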

   In [1]: import pandas as pd
   In [2]: df1 = pd.DataFrame(index=[0, 1, 2], data={'col1':list('abc')})
   In [3]: df2 = pd.DataFrame(index=[2, 0, 1], data={'col1':list('cab')})
   In [4]: df1 == df2
   ---------------------------------------------------------------------------
   ...
   ValueError: Can only compare identically-labeled DataFrame objects

2. Alignment is explicitly broken: DataFrame.equals, DataFrame.values,

    In [5]: df1.equals(df2)
    Out[5]: False

    In [9]: df1.values == df2.values
    Out[9]: 
    array([[False],
           [False],
           [False]])

    In [10]: (df1.values == df2.values).all().all()
    Out[10]: False

3. Alignment is explicitly enforced: DataFrame.eq, DataFrame.sort_index(),

    In [6]: df1.eq(df2)
    Out[6]: 
       col1
    0  True
    1  True
    2  True

    In [8]: df1.eq(df2).all().all()
    Out[8]: True

My answer is as of pandas version 1.0.3.
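
The third case names DataFrame.sort_index() without showing it. A minimal sketch with the same df1 and df2 as above (the variable df2_sorted is introduced here only for the illustration):

```python
import pandas as pd

df1 = pd.DataFrame(index=[0, 1, 2], data={"col1": list("abc")})
df2 = pd.DataFrame(index=[2, 0, 1], data={"col1": list("cab")})

# Sorting the index puts the rows of df2 in the same label order as df1 ...
df2_sorted = df2.sort_index()
# ... so the strict == no longer raises,
elementwise = df1 == df2_sorted
# and equals, which does not align on its own, now sees identical frames.
same = df1.equals(df2_sorted)

print(elementwise["col1"].tolist())
print(same)
```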


zlwx9yxi #6

Here I show a complete example of how to handle this error. I have added rows filled with zeros. You can get your DataFrames from a csv or any other source.

import pandas as pd
import numpy as np

# df1 with 9 rows

df1 = pd.DataFrame({'Name':['John','Mike','Smith','Wale','Marry','Tom','Menda','Bolt','Yuswa',],
    'Age':[23,45,12,34,27,44,28,39,40]})

# df2 with 8 rows

df2 = pd.DataFrame({'Name':['John','Mike','Wale','Marry','Tom','Menda','Bolt','Yuswa',],
    'Age':[25,45,14,34,26,44,29,42]})

# get lengths of df1 and df2

df1_len = len(df1)
df2_len = len(df2)

diff = df1_len - df2_len

rows_to_be_added1 = rows_to_be_added2 = 0

# rows_to_be_added1 = np.zeros(diff)

if diff < 0:
    rows_to_be_added1 = abs(diff)
else:
    rows_to_be_added2 = diff

# add empty rows to df1

if rows_to_be_added1 > 0:
    df1 = pd.concat([df1, pd.DataFrame(np.zeros((rows_to_be_added1, len(df1.columns))), columns=df1.columns)])  # DataFrame.append was removed in pandas 2.0

# add empty rows to df2

if rows_to_be_added2 > 0:
    df2 = pd.concat([df2, pd.DataFrame(np.zeros((rows_to_be_added2, len(df2.columns))), columns=df2.columns)])  # DataFrame.append was removed in pandas 2.0

# at this point we have two dataframes with the same number of rows, and maybe different indexes

# drop the indexes of both, so we can compare the dataframes and other operations like update etc.

df2.reset_index(drop=True, inplace=True)
df1.reset_index(drop=True, inplace=True)

# add a new column to df1

df1['New_age'] = None

# compare the Age column of df1 and df2, and update the New_age column of df1 with the Age column of df2 if they match, else None

df1['New_age'] = np.where(df1['Age'] == df2['Age'], df2['Age'], None)

# drop rows where Name is 0.0

df2 = df2.drop(df2[df2['Name'] == 0.0].index)

# now we don't get the error ValueError: Can only compare identically-labeled Series objects
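
Note that padding and resetting the index compares rows by position, so a name that is missing from one frame shifts every row after it. As an alternative technique (not what the code above does), here is a sketch that pairs rows by Name with a merge. It uses a shortened version of the data above, and the column name Age_df2 is an assumption chosen for the illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"Name": ["John", "Mike", "Smith", "Wale"],
                    "Age": [23, 45, 12, 34]})
df2 = pd.DataFrame({"Name": ["John", "Mike", "Wale"],
                    "Age": [25, 45, 14]})

# A left merge keeps every row of df1 and pairs it with the df2 row of the same Name.
merged = df1.merge(df2, on="Name", how="left", suffixes=("", "_df2"))

# Names missing from df2 (here "Smith") get NaN in Age_df2 and never match.
merged["New_age"] = np.where(merged["Age"] == merged["Age_df2"],
                             merged["Age_df2"], np.nan)
print(merged)
```

This assumes Name is unique in both frames; with duplicate names the merge would produce extra rows.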


pbwdgjma #7

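I found where the error was coming from in my case:
The list of column names had accidentally been wrapped in another list.
Consider the following example: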

import numpy as np
import pandas as pd

column_names = ['warrior', 'eat', 'ok', 'monkeys']

df_good = pd.DataFrame(np.ones(shape=(6,4)),columns=column_names)
df_good['ok'] < df_good['monkeys']

>>> 0    False
    1    False
    2    False
    3    False
    4    False
    5    False

df_bad = pd.DataFrame(np.ones(shape=(6,4)),columns=[column_names])
df_bad['ok'] < df_bad['monkeys']

>>> ValueError: Can only compare identically-labeled DataFrame objects

The problem is that you cannot tell the bad DataFrame from the good one just by looking at them.
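
Although the printed frames look alike, the difference can be detected in code. A minimal sketch, assuming the nested list produces MultiIndex columns (which is how current pandas versions build the frame); df_fixed is a name introduced here for the illustration.

```python
import numpy as np
import pandas as pd

column_names = ["warrior", "eat", "ok", "monkeys"]
df_good = pd.DataFrame(np.ones(shape=(6, 4)), columns=column_names)
df_bad = pd.DataFrame(np.ones(shape=(6, 4)), columns=[column_names])

# The nested list produces MultiIndex columns, even though the printout looks the same.
good_is_multi = isinstance(df_good.columns, pd.MultiIndex)
bad_is_multi = isinstance(df_bad.columns, pd.MultiIndex)
print(good_is_multi, bad_is_multi)

# Flattening the columns back to plain labels makes the Series comparison work again.
df_fixed = df_bad.copy()
df_fixed.columns = df_bad.columns.get_level_values(0)
result = df_fixed["ok"] < df_fixed["monkeys"]
print(result.tolist())
```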


svujldwt #8

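In my case, I simply passed the columns parameter directly when creating the DataFrame, because the data from one SQL query came with column names while the data from the other did not.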
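
A minimal sketch of that situation, with made-up rows standing in for the two query results (named_rows, raw_rows and prod_unnamed are hypothetical and not taken from the answer):

```python
import pandas as pd

# Hypothetical query results: one source returned column names, the other only raw tuples.
named_rows = [{"Customer Number": 1, "Product": "A"},
              {"Customer Number": 2, "Product": "B"}]
raw_rows = [(1, "A"), (2, "B")]

uat = pd.DataFrame(named_rows)
prod_unnamed = pd.DataFrame(raw_rows)               # columns default to 0, 1
prod = pd.DataFrame(raw_rows, columns=uat.columns)  # pass the names explicitly

labels_match_before = uat.columns.equals(prod_unnamed.columns)
labels_match_after = uat.columns.equals(prod.columns)
comparison = uat == prod  # works once both frames carry the same column labels
print(labels_match_before, labels_match_after)
print(comparison)
```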
