Pandas DataFrame: how to merge two DataFrames based on a time delta?

gpfsuwkq  posted on 2022-11-27  in Other

I have two DataFrames:

df1 = a1 a2 recorded_at
       1. 2. 2020-03-18 00:00:01
       8. 1. 2021-04-15 04:00:10
       9. 0. 2021-03-18 12:40:30

df2 = b1 b2 DateTime
       7. 8. 2020-03-18 00:00:01
       2. 4. 2020-03-18 00:00:04
       2. 6. 2021-04-15 04:00:12
       4. 2. 2021-03-18 12:40:40

I want to merge them by comparing recorded_at with DateTime and keep every pair of rows that lie within 4 seconds of each other. So I would get:

df_new = a1 a2 recorded_at DateTime b1 b2
       1. 2. 2020-03-18 00:00:01 2020-03-18 00:00:01 7 8 
       1. 2. 2020-03-18 00:00:01 2020-03-18 00:00:04 2 4 
       8. 1. 2021-04-15 04:00:10 2021-04-15 04:00:12 2 6

How can I do this?

q8l4jmvw #1

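If each row of df2 should match at most one row of df1, merge_asof is an efficient solution. Otherwise the merge is quadratic, so its cost depends heavily on the size of each input.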

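import pandas as pd

# Both key columns must be real datetimes for merge_asof.
df1['recorded_at'] = pd.to_datetime(df1['recorded_at'])
df2['DateTime'] = pd.to_datetime(df2['DateTime'])

# merge_asof needs both frames sorted by their key.
# For each df2 row, take the latest df1 row at or before it, at most 4 s earlier.
out = (pd
 .merge_asof(df2.sort_values(by='DateTime'), df1.sort_values(by='recorded_at'),
             left_on='DateTime', right_on='recorded_at',
             direction='backward', tolerance=pd.Timedelta('4s')
             )
 .dropna(subset=['recorded_at'])  # drop df2 rows that found no match
)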

Output:

    b1   b2            DateTime   a1   a2         recorded_at
0  7.0  8.0 2020-03-18 00:00:01  1.0  2.0 2020-03-18 00:00:01
1  2.0  4.0 2020-03-18 00:00:04  1.0  2.0 2020-03-18 00:00:01
3  2.0  6.0 2021-04-15 04:00:12  8.0  1.0 2021-04-15 04:00:10
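
For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same approach. The frame construction is my own reconstruction of the question's tables, and `asof_within` is a hypothetical helper name, not something from pandas. The `direction="nearest"` call is an assumption about what "within 4 seconds" could mean if df1 rows recorded shortly after the df2 timestamp should also match; the answer above only uses `direction="backward"`.

```python
import pandas as pd

# Sample data reconstructed from the question (an assumption about its exact dtypes).
df1 = pd.DataFrame({
    "a1": [1.0, 8.0, 9.0],
    "a2": [2.0, 1.0, 0.0],
    "recorded_at": pd.to_datetime([
        "2020-03-18 00:00:01", "2021-04-15 04:00:10", "2021-03-18 12:40:30",
    ]),
})
df2 = pd.DataFrame({
    "b1": [7.0, 2.0, 2.0, 4.0],
    "b2": [8.0, 4.0, 6.0, 2.0],
    "DateTime": pd.to_datetime([
        "2020-03-18 00:00:01", "2020-03-18 00:00:04",
        "2021-04-15 04:00:12", "2021-03-18 12:40:40",
    ]),
})


def asof_within(left, right, direction="backward", seconds=4):
    """Match each row of `left` to at most one row of `right` within `seconds`."""
    merged = pd.merge_asof(
        left.sort_values("DateTime"),
        right.sort_values("recorded_at"),
        left_on="DateTime",
        right_on="recorded_at",
        direction=direction,
        tolerance=pd.Timedelta(seconds=seconds),
    )
    # Rows of `left` with no partner inside the tolerance have NaT here.
    return merged.dropna(subset=["recorded_at"])


backward = asof_within(df2, df1)             # only recorded_at <= DateTime
nearest = asof_within(df2, df1, "nearest")   # closest recorded_at on either side
print(backward)
```

On this sample both variants keep the same three rows, because no df1 timestamp falls shortly after a df2 timestamp. Either way, each df2 row gets at most one df1 partner, so this is not a full many-to-many window join.
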

ecfdbz9o #2

Initialize the DataFrames

import pandas as pd

df1 = pd.DataFrame([
    [1.0, 2.0, "2020-03-18 00:00:01"],
    [8.0, 1.0, "2021-04-15 04:00:10"],
    [9.0, 0.0, "2021-03-18 12:40:30"],
], columns=["a1", "a2", "recorded_at"])

# df2 reuses df1's column names here (the question calls them b1, b2, DateTime),
# so the cross merge below distinguishes them with _x / _y suffixes.
df2 = pd.DataFrame([
    [7.0, 8.0, "2020-03-18 00:00:01"],
    [2.0, 4.0, "2020-03-18 00:00:04"],
    [2.0, 6.0, "2021-04-15 04:00:12"],
    [4.0, 2.0, "2021-03-18 12:40:40"],
], columns=["a1", "a2", "recorded_at"])

Convert to pandas datetimes

df1["recorded_at"] = pd.to_datetime(df1["recorded_at"])
df2["recorded_at"] = pd.to_datetime(df2["recorded_at"])

Cross-merge the DataFrames to create every combination of rows

result = df1.merge(df2, how="cross")

Compute the time delta

result["diff"] = abs(result["recorded_at_x"] - result["recorded_at_y"])

Extract the result

from datetime import timedelta
result[result["diff"] < timedelta(seconds=4)]

Result:

   a1_x  a2_x       recorded_at_x  a1_y  a2_y       recorded_at_y            diff
0   1.0   2.0 2020-03-18 00:00:01   7.0   8.0 2020-03-18 00:00:01 0 days 00:00:00
1   1.0   2.0 2020-03-18 00:00:01   2.0   4.0 2020-03-18 00:00:04 0 days 00:00:03
6   8.0   1.0 2021-04-15 04:00:10   2.0   6.0 2021-04-15 04:00:12 0 days 00:00:02

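This works for the sample input. But the cross merge builds len(df1) * len(df2) rows before filtering, so if the data is large you may need a better strategy.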
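
One possible "better strategy", sketched under my own assumptions rather than taken from either answer: sort df2 by time once, then use `numpy.searchsorted` to find, for every df1 row, the contiguous slice of df2 rows inside the window. This avoids materializing the full cross product and still returns many-to-many matches. `window_join` is a hypothetical helper name, the sample frames use the question's column names, and the window here is inclusive (`<= 4 s`), unlike the strict `<` filter above.

```python
import numpy as np
import pandas as pd


def window_join(df1, df2, key1, key2, seconds=4):
    """Return every (df1 row, df2 row) pair whose timestamps differ by <= `seconds`."""
    right = df2.sort_values(key2).reset_index(drop=True)
    t1 = df1[key1].to_numpy()
    t2 = right[key2].to_numpy()
    delta = np.timedelta64(seconds, "s")

    # For each df1 timestamp, locate the slice [lo, hi) of sorted df2 rows in the window.
    lo = np.searchsorted(t2, t1 - delta, side="left")
    hi = np.searchsorted(t2, t1 + delta, side="right")
    counts = hi - lo

    # Expand the slices into explicit positional index pairs.
    left_pos = np.repeat(np.arange(len(df1)), counts)
    starts = np.repeat(lo, counts)
    offsets = np.arange(counts.sum()) - np.repeat(np.cumsum(counts) - counts, counts)
    right_pos = starts + offsets

    return pd.concat(
        [
            df1.iloc[left_pos].reset_index(drop=True),
            right.iloc[right_pos].reset_index(drop=True),
        ],
        axis=1,
    )


# Sample data reconstructed from the question.
df1 = pd.DataFrame({
    "a1": [1.0, 8.0, 9.0],
    "a2": [2.0, 1.0, 0.0],
    "recorded_at": pd.to_datetime([
        "2020-03-18 00:00:01", "2021-04-15 04:00:10", "2021-03-18 12:40:30",
    ]),
})
df2 = pd.DataFrame({
    "b1": [7.0, 2.0, 2.0, 4.0],
    "b2": [8.0, 4.0, 6.0, 2.0],
    "DateTime": pd.to_datetime([
        "2020-03-18 00:00:01", "2020-03-18 00:00:04",
        "2021-04-15 04:00:12", "2021-03-18 12:40:40",
    ]),
})

out = window_join(df1, df2, "recorded_at", "DateTime")
print(out)
```

The cost should be roughly one sort of df2 plus a binary search per df1 row, plus the number of matching pairs. If almost every row matches almost every other row, the output itself is close to quadratic and nothing will save you.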
