Group by and compare timestamps within each group in pandas

42fyovps  posted on 2023-09-29  in Other

I have the following pandas dataframe:

id |       start         |        end         |
---|---------------------|--------------------|
TA | 2022-05-20 06:30:36 | 2022-05-20 09:58:52|
TA | 2022-05-20 08:47:13 | 2022-05-20 08:57:47|
TA | 2022-05-20 08:44:11 | 2022-05-20 10:15:14|
TA | 2022-06-10 07:45:11 | 2022-06-10 10:15:14|
TA | 2022-06-10 07:55:11 | 2022-06-10 11:15:14|
BA | 2022-05-24 08:48:12 | 2022-05-24 10:57:27|
BA | 2022-05-24 10:48:29 | 2022-05-24 12:08:54|
RG | 2022-05-31 07:57:26 | 2022-05-31 08:09:46|
RG | 2022-05-31 08:06:50 | 2022-05-31 08:08:49|
RG | 2022-05-31 08:07:51 | 2022-05-31 08:18:37|

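For each id, I want to check whether a row's start timestamp falls between the start and end timestamps of another row. If it does, I take the lowest timestamp from the start column and the highest timestamp from the end column. The resulting dataframe would look like this: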

id |       start         |        end         |
---|---------------------|--------------------|
TA | 2022-05-20 06:30:36 | 2022-05-20 10:15:14|
TA | 2022-06-10 07:45:11 | 2022-06-10 11:15:14|
BA | 2022-05-24 08:48:12 | 2022-05-24 12:08:54|
RG | 2022-05-31 07:57:26 | 2022-05-31 08:18:37|

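There may be rows whose timestamps are not contained in, or do not overlap with, any other row of the same group; those rows stay as they are. Overlapping rows, like those in the example above, are reduced and grouped. Can anyone suggest the best way to do this in Python?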

Update

The data is sorted by id and start.


wfveoks0 #1

OK, I found a way and wanted to share it here for others with a similar problem:

import pandas as pd

# Assumes df is sorted by 'id' and 'start' (see the update in the question)

# The interval currently being built
current_group = None
current_start = None
current_end = None

# Finished intervals, stored as (group, start, end) tuples
merged = []

# Iterate through the DataFrame
for _, row in df.iterrows():
    group = row['id']
    start = row['start']
    end = row['end']

    if group == current_group and start <= current_end:
        # Same group and the start falls inside the current interval:
        # extend the end time if this row's end time is greater
        current_end = max(current_end, end)
    else:
        # New group or a gap: save the finished interval (if there is one)
        # before starting a new one, so the last interval of a group is kept
        if current_group is not None:
            merged.append((current_group, current_start, current_end))
        current_group = group
        current_start = start
        current_end = end

# Save the last interval (skipped when df is empty)
if current_group is not None:
    merged.append((current_group, current_start, current_end))

# Create a new DataFrame with the results
result_df = pd.DataFrame(merged, columns=['Group', 'Start_Time', 'End_Time'])
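
The same single-pass scan can be wrapped in a function so it is easier to reuse and test. This is only a sketch: the name `merge_overlaps_loop` is made up for illustration, and the `sort_values` call is an added safety step that is not part of the answer above. Because of that sort, the ids come out in alphabetical order rather than in their original order.

```python
import pandas as pd


def merge_overlaps_loop(df: pd.DataFrame) -> pd.DataFrame:
    """Merge overlapping [start, end] intervals per id in one pass.

    Hypothetical helper; expects columns 'id', 'start' and 'end'.
    """
    # Added assumption: sort here instead of trusting the input order
    ordered = df.sort_values(["id", "start"])
    merged = []  # each entry is a mutable [id, start, end] list
    for row in ordered.itertuples(index=False):
        last = merged[-1] if merged else None
        if last is not None and last[0] == row.id and row.start <= last[2]:
            # Overlaps the interval being built: extend its end if needed
            last[2] = max(last[2], row.end)
        else:
            # First row, new id, or a gap: start a new interval
            merged.append([row.id, row.start, row.end])
    return pd.DataFrame(merged, columns=["id", "start", "end"])


# Small demo: the first two rows overlap, the third does not
demo = pd.DataFrame({
    "id": ["TA", "TA", "TA"],
    "start": pd.to_datetime(["2022-05-20 06:30:36", "2022-05-20 08:44:11",
                             "2022-06-10 07:45:11"]),
    "end": pd.to_datetime(["2022-05-20 09:58:52", "2022-05-20 10:15:14",
                           "2022-06-10 10:15:14"]),
})
print(merge_overlaps_loop(demo))
```

Keeping the finished intervals in one list means the last interval of every id is retained without a separate "save" step at each group boundary.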

sh7euo9m #2

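If the data is sorted by "id" and "start", we can add a column holding the cumulative maximum of end within each id. We then split the data wherever a new id begins or a row's start is greater than the previous cumulative maximum end. Within each resulting piece, the first start and the last cumulative maximum end are the aggregated values: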

df['cummax_end'] = df.groupby('id', as_index=False, sort=False)['end'].cummax()
is_new_group = df['id'] != df['id'].shift()
is_new_interval = df['start'] > df['cummax_end'].shift()
grouper = (is_new_group | is_new_interval).cumsum()

answer = df.groupby(grouper).agg({
    'id': 'first', 
    'start': 'first', 
    'cummax_end': 'last'
})

On the test data

import pandas as pd
from io import StringIO

data = '''\
id |       start         |        end         |
---|---------------------|--------------------|
TA | 2022-05-20 06:30:36 | 2022-05-20 09:58:52|
TA | 2022-05-20 08:47:13 | 2022-05-20 08:57:47|
TA | 2022-05-20 08:44:11 | 2022-05-20 10:15:14|
TA | 2022-06-10 07:45:11 | 2022-06-10 10:15:14|
TA | 2022-06-10 07:55:11 | 2022-06-10 11:15:14|
BA | 2022-05-24 08:48:12 | 2022-05-24 10:57:27|
BA | 2022-05-24 10:48:29 | 2022-05-24 12:08:54|
RG | 2022-05-31 07:57:26 | 2022-05-31 08:09:46|
RG | 2022-05-31 08:06:50 | 2022-05-31 08:08:49|
RG | 2022-05-31 08:07:51 | 2022-05-31 08:18:37|
'''

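df = pd.read_table(
    StringIO(data),
    sep=r' *\| *',       # regex separator: a pipe with optional spaces around it
    skiprows=[1],        # skip the dashed line under the header
    engine='python',     # regex separators need the python engine
    usecols=[0, 1, 2],   # drop the empty column created by the trailing pipe
    parse_dates=[1, 2],
)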

we get the output

>>> print(answer.to_string(index=False))
id               start          cummax_end
TA 2022-05-20 06:30:36 2022-05-20 10:15:14
TA 2022-06-10 07:45:11 2022-06-10 11:15:14
BA 2022-05-24 08:48:12 2022-05-24 12:08:54
RG 2022-05-31 07:57:26 2022-05-31 08:18:37
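
For reuse, the cumulative-maximum approach can also be packaged as a function. The sketch below is a variation, not the answer's exact code: the name `merge_overlaps` is hypothetical, it sorts the input itself, it shifts the running maximum within each id instead of comparing ids with `shift()`, and it names the result column `end` rather than `cummax_end`. It assumes the `end` column has no missing values.

```python
import pandas as pd


def merge_overlaps(df: pd.DataFrame) -> pd.DataFrame:
    """Vectorized merge of overlapping [start, end] intervals per id.

    Hypothetical helper; expects columns 'id', 'start' and 'end'.
    """
    # Added assumption: sort here so rows of one id are adjacent and ordered
    ordered = df.sort_values(["id", "start"]).reset_index(drop=True)
    # Running maximum of end within each id
    running_end = ordered.groupby("id")["end"].cummax()
    # Running maximum as of the previous row of the same id (NaT on first rows)
    prev_end = running_end.groupby(ordered["id"]).shift()
    # A block starts on the first row of an id or when there is a gap
    starts_block = prev_end.isna() | (ordered["start"] > prev_end)
    block = starts_block.cumsum()
    return (
        ordered.assign(end=running_end)
        .groupby(block)
        .agg(id=("id", "first"), start=("start", "first"), end=("end", "last"))
        .reset_index(drop=True)
    )


# Small demo: the two BA rows overlap and collapse into one interval
demo = pd.DataFrame({
    "id": ["BA", "BA"],
    "start": pd.to_datetime(["2022-05-24 08:48:12", "2022-05-24 10:48:29"]),
    "end": pd.to_datetime(["2022-05-24 10:57:27", "2022-05-24 12:08:54"]),
})
print(merge_overlaps(demo))
```

Because of the sort, ids appear in alphabetical order in the result, which differs from the TA, BA, RG order shown above.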
