pandas Python:两个时间序列之间的时间重叠/并发事件

hkmswyz6  于 2023-03-28  发布在  Python
关注(0)|答案(1)|浏览(176)

我正在做一个项目,要求我识别,计数和确定任何两个主题(例如机器)之间具有相同类型的并发事件的持续时间。数据以时间序列(datetime类型)组织,时间间隔不等(即不规则时间序列)。下面是示例数据。对于每台相同的机器,事件在时间上是连续的,没有间隙。

MachineID,Event_StartTime,Event_EndTime,Event_Type,Duration
A,11/10/17 1:08,11/10/17 1:47,MAINTENANCE,0 days 00:39:00
A,11/10/17 1:47,11/10/17 2:56,RUN,0 days 01:09:00
A,11/10/17 2:56,11/10/17 3:41,STOP,0 days 00:45:00
A,11/10/17 3:41,11/10/17 7:33,RUN,0 days 03:52:00
B,11/10/17 7:29,11/10/17 14:54,STOP,0 days 07:25:00
A,11/10/17 7:33,11/23/17 14:44,STOP,13 days 07:11:00
C,11/10/17 10:17,11/10/17 17:07,STOP,0 days 06:50:00
B,11/10/17 14:54,11/10/17 15:53,MAINTENANCE,0 days 00:59:00
D,11/10/17 15:16,11/10/17 15:18,MAINTENANCE,0 days 00:02:00
D,11/10/17 15:18,11/20/17 13:40,RUN,9 days 22:22:00
B,11/10/17 15:53,11/12/17 12:18,RUN,1 days 20:25:00
E,11/10/17 16:57,11/10/17 17:08,STOP,0 days 00:11:00
C,11/10/17 17:07,11/10/17 17:52,MAINTENANCE,0 days 00:45:00
E,11/10/17 17:08,11/10/17 19:50,RUN,0 days 02:42:00
C,11/10/17 17:52,11/18/17 13:31,RUN,7 days 19:39:00
E,11/10/17 19:50,11/10/17 20:04,STOP,0 days 00:14:00

How data would look like in Excel
预期输出:
例如,如果我想识别、计数和确定任何两台机器在“STOP”事件中并发的持续时间,它应该输出:

A&B: Count = 1, Total Duration = 0 days 07:21:00
A&C: Count = 1, Total Duration = 0 days 06:50:00
A&E: Count = 2, Total Duration = 0 days 00:25:00
B&C: Count = 1, Total Duration = 0 days 04:37:00
C&E: Count = 1, Total Duration = 0 days 00:10:00
Other combinations: no concurrent "STOP" events

我对Python有点陌生,还没有能够接近我想要的东西。但是,我希望有一些可以有效运行的东西,因为我正在处理大量的机器和事件。
先谢谢你了!

0yg35tkg

0yg35tkg1#

如果您也分享您的尝试,通常会很有帮助,因为我们不知道您遇到了什么问题。
Determine Whether Two Date Ranges Overlap解释了检查重叠的逻辑。
它还解释了持续时间长度是4 (end - start)计算的最小值。

>>> df.head()
   index MachineID     Event_StartTime       Event_EndTime   Event_Type         Duration
0      0         A 2017-11-10 01:08:00 2017-11-10 01:47:00  MAINTENANCE  0 days 00:39:00
1      1         A 2017-11-10 01:47:00 2017-11-10 02:56:00          RUN  0 days 01:09:00
2      2         A 2017-11-10 02:56:00 2017-11-10 03:41:00         STOP  0 days 00:45:00
3      3         A 2017-11-10 03:41:00 2017-11-10 07:33:00          RUN  0 days 03:52:00
4      4         B 2017-11-10 07:29:00 2017-11-10 14:54:00         STOP  0 days 07:25:00

我个人觉得使用SQL来完成这类任务更容易。
这里有一个使用duckdb的例子,它可以直接使用,也可以从 Dataframe 中读取。

注意:.reset_index()用于在 Dataframe 中创建index列,该列在查询中用于重复数据消除。

import duckdb

duckdb.sql("""
with 
stop_events as (from df where Event_Type = 'STOP'),
concurrent_stop_events as (
   from 
      stop_events t1, 
      stop_events t2
   select
      t1.MachineID id1,
      t2.MachineID id2,
      least( 
         t1.Event_EndTime - t1.Event_StartTime,
         t1.Event_EndTime - t2.Event_StartTime,
         t2.Event_EndTime - t1.Event_StartTime,
         t2.Event_EndTime - t2.Event_StartTime
      ) duration
      where
         t1.index < t2.index
         and
         t1.MachineID != t2.MachineID
         and 
         ((t2.Event_EndTime is NULL) or t1.Event_StartTime <= t2.Event_EndTime)
         and
         ((t1.Event_EndTime is NULL) or t1.Event_EndTime >= t2.Event_StartTime)
)
from 
   concurrent_stop_events        
select 
   id1,
   id2,
   count(*) count,
   to_seconds(
      sum(epoch(duration::interval))::bigint
   ) duration
group by all
""").df()

.df()将结果转换回pandas dataframe:

id1 id2  count        duration
0   B   A      1 0 days 07:21:00
1   B   C      1 0 days 04:37:00
2   A   C      1 0 days 06:50:00
3   A   E      2 0 days 00:25:00
4   C   E      1 0 days 00:10:00

相关问题