pandas 计算两列的组合差异和一列中的运行差异

3duebb1j  于 2023-03-06  发布在  其他
关注(0)|答案(1)|浏览(141)

如果存在重复的ID,则Diff为下一个End_Date减去上一个End_Date,Diff为最后一个重复ID的End_Date减去Start_Date,否则Diff也为End_Date减去Start_Date。我的数据集如下所示:

df = 

Index  ID   Start_Date  End_Date
0    118645 2021-01-04  2021-04-28
1    118985 2021-01-11  2022-01-24
2    119023 2021-01-07  2021-09-08
3    119225 2021-01-08  2021-04-11
4    119225 2021-01-08  2021-04-11
5    119276 2021-01-07  2021-03-16
6    119863 2021-01-11  2021-03-25
7    119924 2021-01-13  2021-09-06
8    119924 2021-01-13  2021-11-09
9    119924 2021-01-13  2022-05-23
10   119924 2021-01-13  2022-11-10
11   119987 2021-01-12  2021-02-23

我对这个问题的解决方案如下:

df['Diff'] = np.where(df.ID == df.ID.shift(), (pd.to_datetime(df["End_Date"]) - pd.to_datetime(df["End_Date"]).shift()) // np.timedelta64(1, 'D'), None)

df['Diff'] = np.where(df.ID != df.ID.shift(), (pd.to_datetime(df["End_Date"]) - pd.to_datetime(df["Start_Date"])) // np.timedelta64(1, 'D'), df['Diff'])

df_unique = df.drop_duplicates(subset="ID", keep="last")

df_unique['Diff'] = df_unique['End_Date'].sub(df_unique['Start_Date'], axis=0)

df_final = df_unique.combine_first(df)

df_final = 

Index  ID   Start_Date  End_Date    Diff
0    118645 2021-01-04  2021-04-28  114
1    118985 2021-01-11  2022-01-24  378
2    119023 2021-01-07  2021-09-08  244
3    119225 2021-01-08  2021-04-11  93
4    119225 2021-01-08  2021-04-11  93
5    119276 2021-01-07  2021-03-16  68
6    119863 2021-01-11  2021-03-25  73
7    119924 2021-01-13  2021-09-06  236
8    119924 2021-01-13  2021-11-09  64
9    119924 2021-01-13  2022-05-23  195
10   119924 2021-01-13  2022-11-10  666
11   119987 2021-01-12  2021-02-23  42

有没有更好的方法来解决这个问题?谢谢你的贡献:)

lsmepo6l

lsmepo6l1#

import pandas as pd

df = pd.DataFrame({'ID':[118645, 118985, 119023, 119225, 119225, 119276, 119863, 
                         119924, 119924, 119924, 119924, 119987],
                   'Start_Date':['2021-01-04', '2021-01-11', '2021-01-07', '2021-01-08', 
                                 '2021-01-08', '2021-01-07', '2021-01-11', '2021-01-13',
                                 '2021-01-13', '2021-01-13', '2021-01-13', '2021-01-12'],
                   'End_Date':['2021-04-28', '2022-01-24', '2021-09-08', '2021-04-11', 
                                 '2021-04-11', '2021-03-16', '2021-03-25', '2021-09-06',
                                 '2021-11-09', '2022-05-23', '2022-11-10', '2021-02-23']
                   })

def diff(g):
    g['diff'] = (pd.to_datetime(g['End_Date'], infer_datetime_format=True)
                 - pd.to_datetime(g['Start_Date'], infer_datetime_format=True)  
                 ).dt.days
    if len(g) > 1:
        g['diff'][1:-1] = ( g['diff'][:-1].diff()[1:] ).astype(int)
    return g

r = (df.groupby('ID')
       .apply(lambda g: diff(g))
       )

print(r)
ID  Start_Date    End_Date  diff
0   118645  2021-01-04  2021-04-28   114
1   118985  2021-01-11  2022-01-24   378
2   119023  2021-01-07  2021-09-08   244
3   119225  2021-01-08  2021-04-11    93
4   119225  2021-01-08  2021-04-11    93
5   119276  2021-01-07  2021-03-16    68
6   119863  2021-01-11  2021-03-25    73
7   119924  2021-01-13  2021-09-06   236
8   119924  2021-01-13  2021-11-09    64
9   119924  2021-01-13  2022-05-23   195
10  119924  2021-01-13  2022-11-10   666
11  119987  2021-01-12  2021-02-23    42

相关问题