pandas 使用列表列和包含操作进行panda合并

idfiyjo8  于 2023-01-19  发布在  其他
关注(0)|答案(1)|浏览(75)

下面给出了两个 Dataframe

multi_df = pd.DataFrame({'multi_project_ID': ["Combo_1","Combo_2","Combo_3","Combo_4"],
               'multi_items':[['Chips','Biscuits','Chocolates'],['Alcoholic Drinks','Juices','Fruits'],['Plants','Veggies','Chips'],['Cars']],
                'multi_labels':[[1,2,3],[4,5,6],[8,9,10],[11]]})

single_df = pd.DataFrame({'single_project_ID': ["ABC_1","DEF_2","JKL_3","MNO_3"],
                      'single_items':[['Chips'],['Alcoholic Drinks'],['Biscuits'],['Smoking']],
                      'single_labels':[[1],[4],[8],[9]]})

我想做以下几点
a)检查multi_items清单项目下是否存在single_items清单项目。
b)如果是,则提取对应匹配项的multi_labelsmulti_project_id
c)如果没有项目存在/匹配,则输入NA
所以,我尝试了下面的方法,但是不起作用。我不知道从哪里开始。

print(single_df.groupby('single_labels').sum()['single_items'].apply(lambda x: list(set(x))).reset_index())

我希望我的输出如下所示

ryhaxcpt

ryhaxcpt1#

创建helper列tmp并将DataFrame.explode用于移除列表,然后使用自定义函数聚合列表-仅用于重复值(长度更大,如1):

f = lambda x: list(x) if len(x) > 1 else x
df = multi_df.assign(tmp=multi_df['multi_items']).explode('tmp').groupby('tmp').agg(f)
print (df)

                    multi_project_ID  \
tmp                                    
Alcoholic Drinks             Combo_2   
Biscuits                     Combo_1   
Cars                         Combo_4   
Chips             [Combo_1, Combo_3]   
Chocolates                   Combo_1   
Fruits                       Combo_2   
Juices                       Combo_2   
Plants                       Combo_3   
Veggies                      Combo_3   

                                                        multi_items  \
tmp                                                                   
Alcoholic Drinks                 [Alcoholic Drinks, Juices, Fruits]   
Biscuits                              [Chips, Biscuits, Chocolates]   
Cars                                                         [Cars]   
Chips             [[Chips, Biscuits, Chocolates], [Plants, Veggi...   
Chocolates                            [Chips, Biscuits, Chocolates]   
Fruits                           [Alcoholic Drinks, Juices, Fruits]   
Juices                           [Alcoholic Drinks, Juices, Fruits]   
Plants                                     [Plants, Veggies, Chips]   
Veggies                                    [Plants, Veggies, Chips]   

                             multi_labels  
tmp                                        
Alcoholic Drinks                [4, 5, 6]  
Biscuits                        [1, 2, 3]  
Cars                                 [11]  
Chips             [[1, 2, 3], [8, 9, 10]]  
Chocolates                      [1, 2, 3]  
Fruits                          [4, 5, 6]  
Juices                          [4, 5, 6]  
Plants                         [8, 9, 10]  
Veggies                        [8, 9, 10]

然后通过str[0]single_items转换为标量并赋值给index,因此可以使用DataFrame.join

out = single_df.set_index(single_df['single_items'].str[0]).join(df).reset_index(drop=True)
print (out)
  single_project_ID        single_items single_labels    multi_project_ID  \
0             ABC_1             [Chips]           [1]  [Combo_1, Combo_3]   
1             DEF_2  [Alcoholic Drinks]           [4]             Combo_2   
2             JKL_3          [Biscuits]           [8]             Combo_1   
3             MNO_3           [Smoking]           [9]                 NaN   

                                         multi_items             multi_labels  
0  [[Chips, Biscuits, Chocolates], [Plants, Veggi...  [[1, 2, 3], [8, 9, 10]]  
1                 [Alcoholic Drinks, Juices, Fruits]                [4, 5, 6]  
2                      [Chips, Biscuits, Chocolates]                [1, 2, 3]  
3                                                NaN                      NaN

相关问题