pandas DataFrame groupby timestamp interval, non-overlapping, and sum of column values

Posted by dm7nw8vv on 2023-04-28 in Other


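I'm trying to group entries that fall within a given timestamp interval, without overlap, provided the sum of another column's values is above some threshold.
Below is a simplified example.
I have a dataframe like this: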

    timestamp      col1  col2  col3  col4
0  2317614314  1.551823     1     4    44
1  2317614409  1.206112     3     3    25
2  2317614429  1.022747     2     3    48
3  2317614608  2.082569     3     3    59
4  2317622053  2.260681     1     2    15
5  2317622208  2.355770     2     4    46
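
For anyone who wants to reproduce this, here is a minimal sketch that rebuilds the table above as a pandas DataFrame. The values are copied from the question; keeping `timestamp` as a plain integer column is an assumption that matches the note about not using datetime64.

```python
import pandas as pd

# Sample data copied from the table above; timestamps stay plain integers.
df = pd.DataFrame(
    {
        "timestamp": [2317614314, 2317614409, 2317614429,
                      2317614608, 2317622053, 2317622208],
        "col1": [1.551823, 1.206112, 1.022747, 2.082569, 2.260681, 2.355770],
        "col2": [1, 3, 2, 3, 1, 2],
        "col3": [4, 3, 3, 3, 2, 4],
        "col4": [44, 25, 48, 59, 15, 46],
    }
)
print(df)
```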

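I want to group the data according to the following rules:

A row belongs to only one group
The rows of a group lie within an interval measured from the group's first timestamp
The sum of the col1 values in a group must be greater than a threshold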

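For example:
Suppose the interval is 200 and the threshold is 4:

Rows 0, 1, 2 lie within 200 of the first timestamp (2317614314 <= timestamp <= 2317614314 + 200)
The sum of col1 is below the threshold (1.551823 + 1.206112 + 1.022747 < 4)
So row 0 is dropped and the scan continues from row 1
Rows 1, 2, 3 lie within 200 of the first timestamp (2317614409 <= timestamp <= 2317614409 + 200)
The sum of col1 is above the threshold (1.206112 + 1.022747 + 2.082569 > 4)
Since a row cannot be in more than one group, the scan resumes at row 4
Rows 4, 5 lie within 200 of the first timestamp (2317622053 <= timestamp <= 2317622053 + 200)
The sum of col1 is above the threshold (2.260681 + 2.355770 > 4)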

In the end, I would have two groups:

Rows 1, 2, 3
Rows 4, 5
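
The walk-through above can be sketched in plain pandas and NumPy as a greedy scan. This is only one possible reading of the rules, not a definitive implementation: `group_windows` is a hypothetical helper name, the strict `>` comparison follows the wording "greater than a threshold", and the frame is assumed to fit in memory (so it is not a Dask solution).

```python
import numpy as np
import pandas as pd


def group_windows(df, interval, threshold):
    """Return a list of non-overlapping groups (DataFrames).

    Hypothetical helper. For each candidate start row, the window is every
    row with start <= timestamp <= start + interval. If the col1 sum of the
    window exceeds the threshold, the window becomes a group and the scan
    resumes after it; otherwise the start row is dropped and the scan moves
    on by one row.
    """
    df = df.sort_values("timestamp")
    ts = df["timestamp"].to_numpy()
    # Prefix sums, so any window sum is a difference of two entries.
    csum = np.concatenate(([0.0], np.cumsum(df["col1"].to_numpy())))

    groups = []
    i = 0
    while i < len(df):
        # First position whose timestamp is greater than ts[i] + interval.
        j = int(np.searchsorted(ts, ts[i] + interval, side="right"))
        if csum[j] - csum[i] > threshold:
            groups.append(df.iloc[i:j])
            i = j  # rows i..j-1 are used up, continue after the group
        else:
            i += 1  # drop the start row and try the next one
    return groups


df = pd.DataFrame(
    {
        "timestamp": [2317614314, 2317614409, 2317614429,
                      2317614608, 2317622053, 2317622208],
        "col1": [1.551823, 1.206112, 1.022747, 2.082569, 2.260681, 2.355770],
    }
)

groups = group_windows(df, interval=200, threshold=4)
print([g.index.tolist() for g in groups])  # [[1, 2, 3], [4, 5]]
```

The prefix sums and `searchsorted` avoid re-summing each window, which matters when many start rows get dropped one at a time.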

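A few notes:

I'm using Dask
A solution using pandas would be very welcome
The timestamps are in picoseconds, so I'm not using "datetime64", since pandas supports at most nanosecond precision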
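
As a small illustration of the last note, here is a sketch of working with picosecond timestamps as plain 64-bit integers. The names `ts_ps` and `interval_ps` are made up for this example. A signed 64-bit integer read as picoseconds covers a little over 106 days, so the range is only sufficient if the recording is shorter than that.

```python
import numpy as np

# Picosecond timestamps kept as plain 64-bit integers (no datetime64 involved).
ts_ps = np.array([2317614314, 2317614409, 2317614429, 2317614608], dtype=np.int64)

interval_ps = 200
# Rows inside [ts_ps[0], ts_ps[0] + interval_ps], using integer comparison only.
in_window = (ts_ps >= ts_ps[0]) & (ts_ps <= ts_ps[0] + interval_ps)
print(in_window.tolist())  # [True, True, True, False]

# Range check: the largest int64 value, read as picoseconds, converted to days.
max_days = np.iinfo(np.int64).max / 1e12 / 86400
print(round(max_days, 1))
```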


Source: https://stackoverflow.com/questions/76063050/dataframe-groupby-timestamp-interval-non-overlapping-and-sum-of-column-values

1 Answer


62o28rlo1#

OK, I have a working solution.

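import dask
import dask.bag
import pandas

@dask.delayed
def get_groups(df, interval, threshold):
    # Assumes df is sorted by 'timestamp'.
    groups = []
    start_time = df['timestamp'].min()
    end_time = start_time + interval
    current_group = []
    current_sum = 0

    for _, row in df.iterrows():
        # This row falls outside the current window: close the window.
        if row['timestamp'] > end_time:
            # Keep the finished window only if its col1 sum reaches the threshold.
            if current_sum >= threshold:
                groups.append(pandas.DataFrame(current_group))

            # Start a new window at the current row.
            current_group = []
            current_sum = 0
            start_time = row['timestamp']
            end_time = start_time + interval

        current_group.append(row)
        current_sum += row['col1']

    # Handle the last open window.
    if current_sum >= threshold:
        groups.append(pandas.DataFrame(current_group))

    # Return a list: dask.bag.from_delayed expects each partition
    # to evaluate to a list or other concrete sequence.
    return groups

interval = 100
threshold = 4

result = dask.bag.from_delayed(get_groups(df, interval, threshold))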

This seems to work well. If anyone has suggestions, don't hesitate.
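
To check the loop above locally without Dask, here is a sketch of the same logic as a plain generator, run on the sample data from the question. `get_groups_eager` is a hypothetical name and the column is assumed to be `timestamp` throughout. Note that this loop discards a whole window when its sum falls short, rather than dropping only the first row as in the question's walk-through, so the two can give different groups.

```python
import pandas as pd


def get_groups_eager(df, interval, threshold):
    """Hypothetical non-Dask copy of the loop above, for local testing."""
    end_time = df["timestamp"].min() + interval
    current_group, current_sum = [], 0

    for _, row in df.iterrows():
        if row["timestamp"] > end_time:
            # Window closed: keep it only if its col1 sum reaches the threshold.
            if current_sum >= threshold:
                yield pd.DataFrame(current_group)
            # Start a new window at the current row.
            current_group, current_sum = [], 0
            end_time = row["timestamp"] + interval
        current_group.append(row)
        current_sum += row["col1"]

    # Handle the last open window.
    if current_sum >= threshold:
        yield pd.DataFrame(current_group)


df = pd.DataFrame(
    {
        "timestamp": [2317614314, 2317614409, 2317614429,
                      2317614608, 2317622053, 2317622208],
        "col1": [1.551823, 1.206112, 1.022747, 2.082569, 2.260681, 2.355770],
    }
)

print([g.index.tolist() for g in get_groups_eager(df, 200, 4)])  # [[4, 5]]
print([g.index.tolist() for g in get_groups_eager(df, 100, 4)])  # []
```

With interval=200 the first window (rows 0, 1, 2) sums to less than 4 and is dropped as a whole, so rows 1, 2, 3 are never tried together; only rows 4 and 5 come out as a group.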

2023-04-28
