pandas: how to create a conditional column for two sets within a window over a key?

new9mtju  posted 12 months ago in Other

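I have two tables with information about the retailers that users used in a pre period and a post period. My goal is to find which retailers are new to each user in the post period. I would like to know how to do this within a user_id window, because the isin() and loc[] solutions do not fit my task. I also considered filtering joins (an anti-join in particular), but that did not work well. Here is the sample data: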

import pandas as pd

sample1 = pd.DataFrame(
    {
        'user_id': [45, 556, 556, 556, 556, 556, 556, 1344, 1588, 2063, 2063, 2063, 2673, 2982, 2982],
        'retailer': ['retailer_1', 'retailer_1', 'retailer_2', 'retailer_3', 'retailer_4', 'retailer_5', 'retailer_6', 
                     'retailer_3', 'retailer_2', 'retailer_2', 'retailer_3', 'retailer_7', 'retailer_1', 'retailer_1', 'retailer_2']
    }
)

sample2 = pd.DataFrame(
    {
        'user_id': [45, 45, 556, 556, 556, 556, 556, 556, 1344, 1588, 2063, 2063, 2063, 2673, 2673, 2982, 2982],
        'retailer': ['retailer_1', 'retailer_6', 'retailer_1', 'retailer_2', 'retailer_3', 'retailer_4', 'retailer_5', 'retailer_6', 
                     'retailer_3', 'retailer_2', 'retailer_2', 'retailer_3', 'retailer_7', 'retailer_1', 'retailer_2', 'retailer_1', 'retailer_2']
    }
)

The result I want looks like this:

{'user_id': {0: 45, 1: 45, 2: 556, 3: 556, 4: 556, 5: 556, 6: 556, 7: 556, 8: 1344, 9: 1588, 10: 2063, 11: 2063, 12: 2063, 13: 2673, 14: 2673, 15: 2982, 16: 2982}, 'retailer': {0: 'retailer_1', 1: 'retailer_6', 2: 'retailer_1', 3: 'retailer_2', 4: 'retailer_3', 5: 'retailer_4', 6: 'retailer_5', 7: 'retailer_6', 8: 'retailer_3', 9: 'retailer_2', 10: 'retailer_2', 11: 'retailer_3', 12: 'retailer_7', 13: 'retailer_1', 14: 'retailer_2', 15: 'retailer_1', 16: 'retailer_2'}, 'is_new_retailer': {0: 0, 1: 1, 2: 0, 3: 0, 4: 0, 5: 0, 6: 0, 7: 0, 8: 0, 9: 0, 10: 0, 11: 0, 12: 0, 13: 0, 14: 1, 15: 0, 16: 0}}

4xy9mtcn  #1

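One idea is to use the indicator parameter in a left join, which makes it possible to identify the rows that exist only in the left DataFrame. This works perfectly as long as neither DataFrame contains duplicate (user_id, retailer) pairs.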

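# Left join on the shared columns; the indicator column records where each row came from.
out = sample2.merge(sample1, indicator='is_new_retailer', how='left')
# 'left_only' means the (user_id, retailer) pair is absent from sample1, i.e. the retailer is new.
out['is_new_retailer'] = out['is_new_retailer'].eq('left_only').astype(int)
print(out)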
    user_id    retailer  is_new_retailer
0        45  retailer_1                0
1        45  retailer_6                1
2       556  retailer_1                0
3       556  retailer_2                0
4       556  retailer_3                0
5       556  retailer_4                0
6       556  retailer_5                0
7       556  retailer_6                0
8      1344  retailer_3                0
9      1588  retailer_2                0
10     2063  retailer_2                0
11     2063  retailer_3                0
12     2063  retailer_7                0
13     2673  retailer_1                0
14     2673  retailer_2                1
15     2982  retailer_1                0
16     2982  retailer_2                0
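
The duplicate caveat above can be illustrated with a small sketch. The frames `pre` and `post` and the `keys` list are hypothetical names that are not part of the original question. The sketch assumes the pre-period frame repeats one (user_id, retailer) pair. One possible workaround is to drop duplicate pairs from the right side before merging.

```python
import pandas as pd

# Hypothetical toy data: the pre-period frame repeats the pair (45, 'retailer_1').
pre = pd.DataFrame({
    'user_id': [45, 45, 556],
    'retailer': ['retailer_1', 'retailer_1', 'retailer_2'],
})
post = pd.DataFrame({
    'user_id': [45, 45, 556],
    'retailer': ['retailer_1', 'retailer_6', 'retailer_2'],
})
keys = ['user_id', 'retailer']

# Without deduplication, the left join emits one row per matching pair in `pre`,
# so the result has more rows than `post`.
naive = post.merge(pre, on=keys, how='left')

# Dropping duplicate key pairs on the right keeps one output row per row of `post`.
flagged = post.merge(
    pre[keys].drop_duplicates(), on=keys, how='left', indicator='is_new_retailer'
)
flagged['is_new_retailer'] = flagged['is_new_retailer'].eq('left_only').astype(int)

print(len(post), len(naive), len(flagged))
print(flagged)
```

Deduplicating only the right-hand frame is enough here, because duplicates in the left frame do not multiply rows by themselves; they only do so when the right frame has several matches.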

A more general solution, which does not depend on the pairs being unique, is to test the two MultiIndexes with Index.isin:

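# Build one MultiIndex of (user_id, retailer) pairs per DataFrame.
mux1 = pd.MultiIndex.from_frame(sample2[['user_id','retailer']])
mux2 = pd.MultiIndex.from_frame(sample1[['user_id','retailer']])
# A pair from sample2 that is missing from sample1 marks a new retailer for that user.
sample2['is_new_retailer'] = (~mux1.isin(mux2)).astype(int)
print(sample2)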
    user_id    retailer  is_new_retailer
0        45  retailer_1                0
1        45  retailer_6                1
2       556  retailer_1                0
3       556  retailer_2                0
4       556  retailer_3                0
5       556  retailer_4                0
6       556  retailer_5                0
7       556  retailer_6                0
8      1344  retailer_3                0
9      1588  retailer_2                0
10     2063  retailer_2                0
11     2063  retailer_3                0
12     2063  retailer_7                0
13     2673  retailer_1                0
14     2673  retailer_2                1
15     2982  retailer_1                0
16     2982  retailer_2                0
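
Since the question also mentions anti-joins, the same membership mask can be reused to keep only the new rows or to count new retailers per user. This is a minimal sketch on hypothetical toy frames (`pre`, `post`), not the asker's data, and the per-user count is an illustrative extra rather than something the question asked for.

```python
import pandas as pd

# Hypothetical toy data; `pre` deliberately contains a duplicate pair.
pre = pd.DataFrame({
    'user_id': [45, 45, 2673],
    'retailer': ['retailer_1', 'retailer_1', 'retailer_1'],
})
post = pd.DataFrame({
    'user_id': [45, 45, 2673, 2673],
    'retailer': ['retailer_1', 'retailer_6', 'retailer_1', 'retailer_2'],
})
keys = ['user_id', 'retailer']

post_pairs = pd.MultiIndex.from_frame(post[keys])
pre_pairs = pd.MultiIndex.from_frame(pre[keys])

# Boolean array, aligned with the rows of `post`: True where the pair is not in `pre`.
is_new = ~post_pairs.isin(pre_pairs)

# Anti-join style filter: keep only the post-period rows with a new retailer.
new_only = post[is_new]

# Number of new retailers per user.
new_per_user = (
    post.assign(is_new_retailer=is_new.astype(int))
        .groupby('user_id')['is_new_retailer']
        .sum()
)

print(new_only)
print(new_per_user)
```

Because the mask is computed row by row against `post`, duplicates in either frame do not change the number of rows in the result.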
