如果存在重复的ID,则Diff为下一个End_Date减去上一个End_Date,Diff为最后一个重复ID的End_Date减去Start_Date,否则Diff也为End_Date减去Start_Date。我的数据集如下所示:
df =
Index ID Start_Date End_Date
0 118645 2021-01-04 2021-04-28
1 118985 2021-01-11 2022-01-24
2 119023 2021-01-07 2021-09-08
3 119225 2021-01-08 2021-04-11
4 119225 2021-01-08 2021-04-11
5 119276 2021-01-07 2021-03-16
6 119863 2021-01-11 2021-03-25
7 119924 2021-01-13 2021-09-06
8 119924 2021-01-13 2021-11-09
9 119924 2021-01-13 2022-05-23
10 119924 2021-01-13 2022-11-10
11 119987 2021-01-12 2021-02-23
我对这个问题的解决方案如下:
df['Diff'] = np.where(df.ID == df.ID.shift(), (pd.to_datetime(df["End_Date"]) - pd.to_datetime(df["End_Date"]).shift()) // np.timedelta64(1, 'D'), None)
df['Diff'] = np.where(df.ID != df.ID.shift(), (pd.to_datetime(df["End_Date"]) - pd.to_datetime(df["Start_Date"])) // np.timedelta64(1, 'D'), df['Diff'])
df_unique = df.drop_duplicates(subset="ID", keep="last")
df_unique['Diff'] = df_unique['End_Date'].sub(df_unique['Start_Date'], axis=0)
df_final = df_unique.combine_first(df)
df_final =
Index ID Start_Date End_Date Diff
0 118645 2021-01-04 2021-04-28 114
1 118985 2021-01-11 2022-01-24 378
2 119023 2021-01-07 2021-09-08 244
3 119225 2021-01-08 2021-04-11 93
4 119225 2021-01-08 2021-04-11 93
5 119276 2021-01-07 2021-03-16 68
6 119863 2021-01-11 2021-03-25 73
7 119924 2021-01-13 2021-09-06 236
8 119924 2021-01-13 2021-11-09 64
9 119924 2021-01-13 2022-05-23 195
10 119924 2021-01-13 2022-11-10 666
11 119987 2021-01-12 2021-02-23 42
有没有更好的方法来解决这个问题?谢谢你的贡献:)
1条答案
按热度按时间lsmepo6l1#