pandas 根据特定列选择dataframe第一行,如果特定列中的列值以前出现过,则删除行

q3aa0525  于 2023-04-04  发布在  其他
关注(0)|答案(1)|浏览(151)

我有3个 Dataframe
DF1

CAT1    CAT2    CAT3    ID_X
A1        B      C       X1
A1        B      C       X2
A2        B      C       X3
A2        B      C       X4
A2        B      C       X5
A3        B      C       X6
A4        B      C       X7

DF2

CAT1    CAT2    CAT3    ID_Y
A1       B       C       Y1
A1       B       C       Y2
A1       B       C       Y3
A2       B       C       Y4
A2       B       C       Y5
A3       B       C       Y6
A5       B       C       Y7

DF3

ID_X    ID_Y    ID_XY
X1      Y1      X1Y1
X2      Y3      X2Y3
X3      Y4      X3Y4
X4      Y5      X4Y5
X6      Y6      X6Y6

有三个步骤可以得到最终结果

Step1:在CAT 1、CAT 2和CAT 3上内连接df 1和df 2,根据ID_X和ID_Y创建ID_XY列,得到新的 Dataframe df_merge

脚本

df_merge = pd.merge(df1, df2, how="inner", on=["CAT1", "CAT2","CAT3"])
df_merge['ID_XY'] = df_merge['ID_X']+merge1['ID_Y']

Step2:使用df 3从df_merge中删除具有公共ID_XY的行

脚本

df_merge1 = df_merge[~df_merge.ID_XY.isin(df3.ID_XY)]

df_merge1

CAT1    CAT2    CAT3    ID_X    ID_Y    ID_XY
A1       B       C        X1     Y2     X1Y2
A1       B       C        X1     Y3     X1Y3
A1       B       C        X2     Y1     X2Y1
A1       B       C        X2     Y2     X2Y2
A2       B       C        X3     Y5     X3Y5
A2       B       C        X4     Y4     X4Y4
A2       B       C        X5     Y4     X5Y4
A2       B       C        X5     Y5     X5Y5

Step 3:根据CAT 1、CAT 2、CAT 3、ID_X、ID_Y选择 Dataframe 第一行,如果ID_Y中的列值之前出现过,则删除行
最终输出将是步骤3的最终结果:输出如下所示
df_final

CAT1    CAT2    CAT3    ID_X    ID_Y    ID_XY
A1        B      C      X1       Y2     X1Y2    
A1        B      C      X2       Y1     X2Y1
A2        B      C      X3       Y5     X3Y5
A2        B      C      X4       Y4     X4Y4

编辑1

**注意:**将ID_X视为JOB,ID_Y视为候选项!在df_merge1中,如果我为X1选择Y2(给定其他列常量),则无法为相同的X1选择Y3。类似地,如果X4分配给Y 4(给定相同的CAT 1,CAT 2和CAT 3),则Y 4无法分配给X5。

编辑二:
我试试

merge3 = df_merge1.copy()
df_X1 = merge3[merge3['ID_X']=='X1']
df_X2 = merge3[merge3['ID_X']=='X2']
df_X3 = merge3[merge3['ID_X']=='X3']
df_X4 = merge3[merge3['ID_X']=='X4']
df_X5 = merge3[merge3['ID_X']=='X5']

selected_list = []

df_X1 = df_X1.iloc[:1]
selected_list.append(df_X1['ID_Y'].values[0])
df_X2 = df_X2[~df_X2.ID_Y.isin(selected_list)].iloc[:1]
selected_list.append(df_X2['ID_Y'].values[0])
df_X3 = df_X3[~df_X3.ID_Y.isin(selected_list)].iloc[:1]
selected_list.append(df_X3['ID_Y'].values[0])
df_X4 = df_X4[~df_X4.ID_Y.isin(selected_list)].iloc[:1]
selected_list.append(df_X4['ID_Y'].values[0])
df_X5 = df_X5[~df_X5.ID_Y.isin(selected_list)].iloc[:1]
df_output = pd.concat([df_X1,df_X2,df_X3,df_X4,df_X5])
  • 任何帮助都将不胜感激 *
gmxoilav

gmxoilav1#

如果需要通过CAT1/CAT2/CAT3/ID_X删除重复项,然后通过ID_Y用途:

df = df_merge1.drop_duplicates(['CAT1','CAT2','CAT3','ID_X']).drop_duplicates('ID_Y')
print (df)
  CAT1 CAT2 CAT3 ID_X ID_Y ID_XY
0   A1    B    C   X1   Y2  X1Y2
2   A1    B    C   X2   Y1  X2Y1
4   A2    B    C   X3   Y5  X3Y5
5   A2    B    C   X4   Y4  X4Y4

或者:

df = df_merge1.drop_duplicates(['CAT1','CAT2','CAT3','ID_Y']).drop_duplicates('ID_X')
print (df)
  CAT1 CAT2 CAT3 ID_X ID_Y ID_XY
0   A1    B    C   X1   Y2  X1Y2
2   A1    B    C   X2   Y1  X2Y1
4   A2    B    C   X3   Y5  X3Y5
5   A2    B    C   X4   Y4  X4Y4

相关问题