如何有条件地将相同的随机ID分配给Pandas中的“相关”行?

wooyq4lh  于 2022-11-20  发布在  其他
关注(0)|答案(2)|浏览(102)

刚接触Python的时候,我正在努力解决为“相关”行分配一些随机ID的问题,其中的关系仅仅是它们在按用户分组的连续几天中的接近度(14天内)。在那个例子中,我选择了uuid,没有任何特定的意图。它可以是任何其他随机ID,唯一标识概念上相关的行。

import pandas as pd
        import uuid
        import numpy as np

下面是一个虚拟 Dataframe :

dummy_df = pd.DataFrame({"transactionid": [1, 2, 3, 4, 5, 6, 7, 8],
                                "user": ["michael", 
                                                    "michael", 
                                                    "michael", 
                                                    "tom", 
                                                    "tom", 
                                                    "tom",
                                                    "tom", 
                                                    "tom"],
                                "transactiontime": pd.to_datetime(["2022-01-01", 
                                                                "2022-01-02", 
                                                                "2022-01-03", 
                                                                "2022-09-01", 
                                                                "2022-09-13",
                                                                "2022-10-17",
                                                                "2022-10-20",
                                                                "2022-11-17"])})
        dummy_df.head(10)
        
        transactionid   user    transactiontime
                0   1   michael 2022-01-01
                1   2   michael 2022-01-02
                2   3   michael 2022-01-03
                3   4   tom     2022-09-01
                4   5   tom     2022-09-13
                5   6   tom     2022-10-17
                6   7   tom     2022-10-20
                7   8   tom     2022-11-17

在这里,我对事务进行排序,并计算它们之间的天数差:

dummy_df = dummy_df.assign(
            timediff = dummy_df
            .sort_values('transactiontime')
            .groupby(["user"])['transactiontime'].diff() / np.timedelta64(1, 'D')
            ).fillna(0)
        
        dummy_df.head(10)
        
        
        transactionid   user    transactiontime timediff
                0   1   michael 2022-01-01  0.0
                1   2   michael 2022-01-02  1.0
                2   3   michael 2022-01-03  1.0
                3   4   tom     2022-09-01  0.0
                4   5   tom     2022-09-13  12.0
                5   6   tom     2022-10-17  34.0
                6   7   tom     2022-10-20  3.0
                7   8   tom     2022-11-17  28.0

这里我为每个相关事务创建了一个带有随机ID的新列--尽管它没有按预期工作:

dummy_df.assign(related_transaction = np.where((dummy_df.timediff >= 0) & (dummy_df.timediff < 15), uuid.uuid4(), dummy_df.transactionid))
        
        
        transactionid   user    transactiontime timediff    related_transaction
                0   1   michael 2022-01-01  0.0  fd630f07-6564-4773-aff9-44ecb1e4211d
                1   2   michael 2022-01-02  1.0  fd630f07-6564-4773-aff9-44ecb1e4211d
                2   3   michael 2022-01-03  1.0  fd630f07-6564-4773-aff9-44ecb1e4211d
                3   4   tom     2022-09-01  0.0  fd630f07-6564-4773-aff9-44ecb1e4211d
                4   5   tom     2022-09-13  12.0 fd630f07-6564-4773-aff9-44ecb1e4211d
                5   6   tom     2022-10-17  34.0    6
                6   7   tom     2022-10-20  3.0  fd630f07-6564-4773-aff9-44ecb1e4211d
                7   8   tom     2022-11-17  28.0    8

假设交易之间的用户组差异在14天之内,我期望的结果如下:

transactionid   user    transactiontime timediff    related_transaction
                0   1   michael 2022-01-01  0.0  ad2a8f23-05a5-49b1-b45e-cbf3f0ba23ff
                1   2   michael 2022-01-02  1.0  ad2a8f23-05a5-49b1-b45e-cbf3f0ba23ff
                2   3   michael 2022-01-03  1.0  ad2a8f23-05a5-49b1-b45e-cbf3f0ba23ff
                3   4   tom     2022-09-01  0.0  b1da2251-7770-4756-8863-c82f90657542
                4   5   tom     2022-09-13  12.0 b1da2251-7770-4756-8863-c82f90657542
                5   6   tom     2022-10-17  34.0 485a8d97-80d1-4184-8fc8-99523f471527
                6   7   tom     2022-10-20  3.0  485a8d97-80d1-4184-8fc8-99523f471527
                7   8   tom     2022-11-17  28.0    8
wko9yo5t

wko9yo5t1#

借鉴Luise的想法,我们从related_transaction的空列开始。然后,我们遍历每一行。对于每个日期,我们检查它是否已经是事务的一部分。如果是,则继续。否则,为同一用户的该日期和随后15天内的所有其他日期分配一个新事务:

import datetime
df = dummy_df
df['related_transaction'] = None
for i, row in dummy_df.iterrows():
    if df.loc[i].related_transaction is not None:
        # We already assigned that row
        continue
    df.loc[  # Select where:
        (df.transactiontime <= row.transactiontime + datetime.timedelta(days=15)) & #  Current row + 15 days
        (df.user == row.user) &  # Same user
        (pd.isna(df.related_transaction)),  # Don't overwrite anything already assigned
        'related_transaction'  #  Set this column to:
    ] = uuid.uuid4()  # Assign new UUID

输出如下:

transactionid   user    transactiontime related_transaction
0               1   michael 2022-01-01      82d28e10-149b-481e-ba41-f5833662ba99
1               2   michael 2022-01-02      82d28e10-149b-481e-ba41-f5833662ba99
2               3   michael 2022-01-03      82d28e10-149b-481e-ba41-f5833662ba99
3               4   tom     2022-09-01      fa253663-8615-419a-afda-7646906024f0
4               5   tom     2022-09-13      fa253663-8615-419a-afda-7646906024f0
5               6   tom     2022-10-17      d6152d4b-1560-40e0-8589-bd8e3da363db
6               7   tom     2022-10-20      d6152d4b-1560-40e0-8589-bd8e3da363db
7               8   tom     2022-11-17      2a93d78d-b6f6-4f0f-bb09-1bc18361aa21

在您的示例中,日期已经排序,这是我在这里所做的一个重要假设!

bf1o4zei

bf1o4zei2#

您的代码与所需结果之间的不匹配是uuid.uuid4()只创建一次ID,并将其分配给np.where()定义的所有相关行。
请尝试以下方法:

df.loc[ROW_CONDITIONs, COLUMNS] = VECTORIZED_ID_GENERATOR

在你的例子中

dummy_df.loc[(dummy_df['timediff'] >= 0) & (dummy_df['timediff'] < 15), 'related_transaction'] = dummy_df.apply(lambda _: uuid.uuid4(), axis=1)

考虑到这只解决了如何在Pandas中有条件地使用uuid分配随机ID的问题。在我看来,您还需要为同一用户和每15天的交易生成相同的ID。我的建议是生成一个 Dataframe ,其中每行都是两个事务的组合,并添加一个条件,说明两个事务的用户都需要一样的。

相关问题