pandas: identifying differences between non-identical DataFrames that share columns, linked by a key

Posted by zc0qhyus on 2022-12-16 in Other
import pandas as pd

data = {'A': ['123','456','789'], 'B': ['D1','D4','D7'], 'C':['D2','D5','D8'], 'D':['D3','D6','D9']}
df1 = pd.DataFrame(data)

data2 = {'A': ['123','789','111','222'], 'B': ['D1','D7','D11','D14'], 'C':['D10','D8','D12','D15'], 'D':['D3','D9','D13','16']}
df2 = pd.DataFrame(data2)

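The key that links the two frames is column "A".
Expected output:

  1. Rows in df1 that are not in df2 ("A" = 456)
  2. Rows in df2 that are not in df1 ("A" = 111 and "A" = 222)
  3. Common rows that differ ("A" = 123: column "C" is D2 in df1 vs D10 in df2)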
#1 yzckvree

**Q1:** Rows in df1 that are not in df2:

q1=df1[~df1['A'].isin(df2['A'])]

'''
     A   B   C   D
1  456  D4  D5  D6
'''

**Q2:** Rows in df2 that are not in df1:

q2=df2[~df2['A'].isin(df1['A'])]

'''
     A    B    C    D
2  111  D11  D12  D13
3  222  D14  D15   16
'''
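As an alternative to the two `isin` filters, the same split can be sketched with a single outer merge on the key column and `indicator=True`. This is only an illustration, assuming the values in `A` are unique within each frame; the names `keys`, `only_in_df1`, `only_in_df2` and `in_both` are made up for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['123', '456', '789'],
                    'B': ['D1', 'D4', 'D7'],
                    'C': ['D2', 'D5', 'D8'],
                    'D': ['D3', 'D6', 'D9']})
df2 = pd.DataFrame({'A': ['123', '789', '111', '222'],
                    'B': ['D1', 'D7', 'D11', 'D14'],
                    'C': ['D10', 'D8', 'D12', 'D15'],
                    'D': ['D3', 'D9', 'D13', '16']})

# Outer-join only the key column; the `_merge` column added by
# indicator=True records whether each key came from the left frame,
# the right frame, or both.
keys = df1[['A']].merge(df2[['A']], on='A', how='outer', indicator=True)

only_in_df1 = keys.loc[keys['_merge'] == 'left_only', 'A'].tolist()
only_in_df2 = keys.loc[keys['_merge'] == 'right_only', 'A'].tolist()
in_both = keys.loc[keys['_merge'] == 'both', 'A'].tolist()

# The full rows can then be pulled back with isin, as in Q1 and Q2.
q1_alt = df1[df1['A'].isin(only_in_df1)]
q2_alt = df2[df2['A'].isin(only_in_df2)]
```

The row order of an outer merge can vary between pandas versions, so sort the key lists if a stable order matters.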

**Q3:** Common rows that differ:

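# Stack the common rows of both frames, drop rows that are identical in both,
# then collect each column's values into a list per key
q3=pd.concat([df1[df1['A'].isin(df2['A'])],df2[df2['A'].isin(df1['A'])]]).drop_duplicates().groupby('A').agg(list)
# Flag cells whose list holds more than one value, all of them distinct
mask=q3.applymap(lambda x: pd.Series(x).is_unique if len(x) > 1 else False)
# Keep only the flagged cells and reshape to one row per (key, column)
q3=q3[mask].stack().reset_index()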

'''
     A    level_1          0
0  123       C        [D2, D10]
'''
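For comparison, Q3 could also be approached with `DataFrame.compare` (available since pandas 1.1), which reports only the cells that differ between two identically labelled frames. The sketch below is not part of the answer above; it assumes the values in `A` are unique within each frame and that both frames have the same non-key columns. The names `a1`, `a2`, `common` and `diff` are chosen for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['123', '456', '789'],
                    'B': ['D1', 'D4', 'D7'],
                    'C': ['D2', 'D5', 'D8'],
                    'D': ['D3', 'D6', 'D9']})
df2 = pd.DataFrame({'A': ['123', '789', '111', '222'],
                    'B': ['D1', 'D7', 'D11', 'D14'],
                    'C': ['D10', 'D8', 'D12', 'D15'],
                    'D': ['D3', 'D9', 'D13', '16']})

# Index both frames by the key so rows can be aligned by label.
a1 = df1.set_index('A')
a2 = df2.set_index('A')

# Keys present in both frames.
common = a1.index.intersection(a2.index)

# compare() requires identical row and column labels, which selecting
# the same `common` keys from both frames guarantees here.
# The result keeps only rows and columns that contain a difference,
# with 'self' (df1) and 'other' (df2) sub-columns.
diff = a1.loc[common].compare(a2.loc[common])
```

Here `diff` has a single row for key `123` and the two sub-columns `('C', 'self')` and `('C', 'other')`.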
#2 g6baxovj

For the first two questions my answer is similar to @Bushmaster's, but for rows where A is common I show the differences in a slightly different way.

df1x2 = df1.loc[~df1['A'].isin(df2['A'])]
df2x1 = df2.loc[~df2['A'].isin(df1['A'])]
dfcom = df1.merge(df2, on='A', suffixes=['_1', '_2'])

Then, to highlight the differences in each row of dfcom:

def show_diffs(s):
    a = {k[:-2]: v for k, v in s.items() if k.endswith('_1')}
    b = {k[:-2]: v for k, v in s.items() if k.endswith('_2')}
    return {k: (a[k], b[k]) for k in a if a[k] != b[k]}

>>> dfcom.assign(diffs=dfcom.apply(show_diffs, axis=1))
     A B_1 C_1 D_1 B_2  C_2 D_2                 diffs
0  123  D1  D2  D3  D1  D10  D3  {'C': ('D2', 'D10')}
1  789  D7  D8  D9  D7   D8  D9                    {}

# or, just the differences
>>> dfcom[['A']].assign(diffs=dfcom.apply(show_diffs, axis=1))
     A                 diffs
0  123  {'C': ('D2', 'D10')}
1  789                    {}
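If only the rows that actually differ are of interest, the empty dictionaries can be filtered out. A minimal sketch, repeating the `show_diffs` helper from above so that it runs on its own; the names `diffs`, `has_diff` and `changed` are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['123', '456', '789'],
                    'B': ['D1', 'D4', 'D7'],
                    'C': ['D2', 'D5', 'D8'],
                    'D': ['D3', 'D6', 'D9']})
df2 = pd.DataFrame({'A': ['123', '789', '111', '222'],
                    'B': ['D1', 'D7', 'D11', 'D14'],
                    'C': ['D10', 'D8', 'D12', 'D15'],
                    'D': ['D3', 'D9', 'D13', '16']})
dfcom = df1.merge(df2, on='A', suffixes=['_1', '_2'])

def show_diffs(s):
    # Split the merged row into its df1 ('_1') and df2 ('_2') halves,
    # keyed by the original column name, and keep the pairs that differ.
    a = {k[:-2]: v for k, v in s.items() if k.endswith('_1')}
    b = {k[:-2]: v for k, v in s.items() if k.endswith('_2')}
    return {k: (a[k], b[k]) for k in a if a[k] != b[k]}

# One dict of differences per common row.
diffs = dfcom.apply(show_diffs, axis=1)

# A row has a difference when its dict is non-empty.
has_diff = diffs.map(len) > 0

# Keep the key plus the differences, only for rows that differ.
changed = dfcom.loc[has_diff, ['A']].assign(diffs=diffs[has_diff])
```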

If dfcom is not too large, another possibility is to present the differences visually:

def highlight(df):
    cols_1 = [k for k in df.columns if k.endswith('_1')]
    cols_2 = [k for k in df.columns if k.endswith('_2')]
    a = df[cols_1].set_axis([k[:-2] for k in cols_1], axis=1)
    b = df[cols_2].set_axis([k[:-2] for k in cols_2], axis=1)
    h = pd.concat([
        # just in case the columns are somehow in a different order
        (a != b[a.columns]).set_axis(cols_1, axis=1),
        (b != a[b.columns]).set_axis(cols_2, axis=1),
    ], axis=1).applymap(
        lambda v: 'color:red; font-weight:bold' if v else 'color:lightblue')
    return h

dfcom.style.apply(highlight, axis=None)

#3 0pizxfdo

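Here is one way to do what you asked:

  • Change the index to A
  • Precompute the subsets of common and distinct values of A
  • For the rows of df1 and df2 that are not common, look up the appropriate rows by index (i.e., by the value of A)
  • For the common rows, create a mask that marks mismatched values, and use it to build "V1 vs V2" style strings where the values differ; matching values are marked with NaN.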
# pre-calculation of partitioning of `A` values, to avoid duplicate operations
a1, a2 = df1.set_index('A'), df2.set_index('A')
s1, s2 = set(a1.index), set(a2.index)
sCommon = s1 & s2
lCommon = list(sCommon)

# calculation of outputs (rows only in df1, rows only in df2, and common rows)
out1, out2 = a1.loc[list(s1 - sCommon),], a2.loc[list(s2 - sCommon),]
mask = a1.loc[lCommon,] != a2.loc[lCommon,]
out3 = (a1.loc[lCommon,].astype(str)[mask] + ' vs ' + a2.loc[lCommon,].astype(str)[mask])

Output:

out1
      B   C   D
A
456  D4  D5  D6

out2
       B    C    D
A
222  D14  D15   16
111  D11  D12  D13

out3
       B          C    D
A
123  NaN  D2 vs D10  NaN
789  NaN        NaN  NaN
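Because Python sets are unordered, the row order of `out1`, `out2` and `out3` above can change from run to run (note that `222` is listed before `111` in `out2`). One possible variation, sketched here under the assumption that the values in `A` are unique within each frame, uses pandas `Index` set operations instead; `only1`, `only2`, `common`, `left` and `right` are names introduced for this example.

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['123', '456', '789'],
                    'B': ['D1', 'D4', 'D7'],
                    'C': ['D2', 'D5', 'D8'],
                    'D': ['D3', 'D6', 'D9']})
df2 = pd.DataFrame({'A': ['123', '789', '111', '222'],
                    'B': ['D1', 'D7', 'D11', 'D14'],
                    'C': ['D10', 'D8', 'D12', 'D15'],
                    'D': ['D3', 'D9', 'D13', '16']})

a1, a2 = df1.set_index('A'), df2.set_index('A')

# Index.difference sorts its result when the values are sortable,
# so the order no longer depends on set iteration.
only1 = a1.index.difference(a2.index)
only2 = a2.index.difference(a1.index)
common = a1.index.intersection(a2.index)

out1, out2 = a1.loc[only1], a2.loc[only2]

# Align the common rows, then build "V1 vs V2" strings and blank out
# (NaN) every cell where the two frames agree.
left, right = a1.loc[common].astype(str), a2.loc[common].astype(str)
mask = left != right
out3 = (left + ' vs ' + right).where(mask)
```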
