pandas: adding a column to a DataFrame based on a threshold and group size

Posted by 8ulbf1ek on 2023-08-01 in Other

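I have a DataFrame with x and y coordinates whose index holds timestamps. Assume it describes an object that moves at every time step, so the distance between consecutive timestamps is expected to exceed some threshold. If the distance between two consecutive rows stays below that threshold, I treat the row as a potential "waiting" position. I say "potential" because the data is very noisy, and a single "waiting" condition is not enough to conclude that the object did not move. I therefore need at least 3 consecutive "waiting" conditions before I can be sure that the object really was not moving.
I want to detect these waiting positions and mark them in a new column.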

Example:
                    x         y
timestamp                       
2023-07-01 00:00:00   1         5
2023-07-01 00:01:00   2         6
2023-07-01 00:02:00   3         7
2023-07-01 00:03:00   4         8
2023-07-01 00:04:00   4         8
2023-07-01 00:05:00   5         9
2023-07-01 00:06:00   6         9
2023-07-01 00:07:00   7        10
2023-07-01 00:08:00   7        10
2023-07-01 00:09:00   7        10
2023-07-01 00:10:00   7        10
2023-07-01 00:11:00   8        11
2023-07-01 00:12:00   9        11

To compute the distance, I shifted the DataFrame by 1 and calculated the distance between each row and the previous one:

                      x         y  distance
timestamp
2023-07-01 00:00:00   1         5       NaN   
2023-07-01 00:01:00   2         6  1.414214    
2023-07-01 00:02:00   3         7  1.414214   
2023-07-01 00:03:00   4         8  1.414214   
2023-07-01 00:04:00   4         8  0.000000   
2023-07-01 00:05:00   5         9  1.414214  
2023-07-01 00:06:00   6         9  1.000000   
2023-07-01 00:07:00   7        10  1.414214   
2023-07-01 00:08:00   7        10  0.000000   
2023-07-01 00:09:00   7        10  0.000000   
2023-07-01 00:10:00   7        10  0.000000   
2023-07-01 00:11:00   8        11  1.414214   
2023-07-01 00:12:00   9        11  1.000000
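
The question does not show the code for this step. A minimal sketch of one way to get the same distance column, assuming the Euclidean distance between consecutive rows (which matches the values in the table), could look like this:

```python
import numpy as np
import pandas as pd

# Sample data from the question: one position per minute
index = pd.date_range("2023-07-01 00:00:00", periods=13, freq="min", name="timestamp")
df = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 7, 8, 9],
        "y": [5, 6, 7, 8, 8, 9, 9, 10, 10, 10, 10, 11, 11],
    },
    index=index,
)

# diff() subtracts the previous row, which is equivalent to shifting by 1
# and subtracting; the first row has no predecessor, so its distance is NaN
df["distance"] = np.hypot(df["x"].diff(), df["y"].diff())

print(df)
```

Using diff() instead of an explicit shift(1) keeps the step on one line; both give the same result here.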


Now, assuming that a distance below 1 marks a potential waiting position:

                      x         y  distance  condition_fulfilled
timestamp
2023-07-01 00:00:00   1         5       NaN    NaN    
2023-07-01 00:01:00   2         6  1.414214    False    
2023-07-01 00:02:00   3         7  1.414214    False    
2023-07-01 00:03:00   4         8  1.414214    False    
2023-07-01 00:04:00   4         8  0.000000    True   
2023-07-01 00:05:00   5         9  1.414214    False   
2023-07-01 00:06:00   6         9  1.000000    False   
2023-07-01 00:07:00   7        10  1.414214    False   
2023-07-01 00:08:00   7        10  0.000000    True   
2023-07-01 00:09:00   7        10  0.000000    True   
2023-07-01 00:10:00   7        10  0.000000    True    
2023-07-01 00:11:00   8        11  1.414214    False    
2023-07-01 00:12:00   9        11  1.000000    False
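
How this column was built is not shown either. As a sketch under that caveat, a plain comparison gives False for the first row, while masking the undefined distance keeps it as NaN as in the table above. The names `THRESHOLD` and `condition` are only illustrative:

```python
import numpy as np
import pandas as pd

# Distances from the question's example (the first one is undefined)
distance = pd.Series([np.nan, 1.414214, 1.414214, 1.414214, 0.0, 1.414214, 1.0,
                      1.414214, 0.0, 0.0, 0.0, 1.414214, 1.0])

THRESHOLD = 1  # illustrative name; the question uses the value 1

# A plain comparison evaluates NaN < 1 as False
condition = distance < THRESHOLD

# Keep the first row as NaN by blanking out rows whose distance is missing
condition_fulfilled = condition.where(distance.notna())

print(condition_fulfilled)
```

Note that a distance of exactly 1.0 does not satisfy the condition, because the comparison is strict.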


Since I need at least 3 consecutive fulfilled conditions, the expected output is:

                      x         y  distance    status
timestamp
2023-07-01 00:00:00   1         5       NaN    moving    
2023-07-01 00:01:00   2         6  1.414214    moving    
2023-07-01 00:02:00   3         7  1.414214    moving    
2023-07-01 00:03:00   4         8  1.414214    moving    
2023-07-01 00:04:00   4         8  0.000000    moving   
2023-07-01 00:05:00   5         9  1.414214    moving   
2023-07-01 00:06:00   6         9  1.000000    moving   
2023-07-01 00:07:00   7        10  1.414214    moving   
2023-07-01 00:08:00   7        10  0.000000    waiting   
2023-07-01 00:09:00   7        10  0.000000    waiting   
2023-07-01 00:10:00   7        10  0.000000    waiting    
2023-07-01 00:11:00   8        11  1.414214    moving    
2023-07-01 00:12:00   9        11  1.000000    moving

6jjcrrmo #1

Try this:

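# Fill the leading NaN (the first row has no distance) with the next value
df['condition_fulfilled'] = df['condition_fulfilled'].bfill()

# Give every run of consecutive equal values its own group id
tmp = (df['condition_fulfilled'] != df['condition_fulfilled'].shift()).cumsum()
# A run is 'waiting' only if all its values are True and it spans at least 3 rows
df['status'] = df.groupby(tmp)['condition_fulfilled'].transform(lambda x: 'waiting' if x.all() and len(x) >= 3 else 'moving')

print(df)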

Prints:

                     x   y  distance  condition_fulfilled   status
timestamp
2023-07-01 00:00:00  1   5       NaN                False   moving
2023-07-01 00:01:00  2   6  1.414214                False   moving
2023-07-01 00:02:00  3   7  1.414214                False   moving
2023-07-01 00:03:00  4   8  1.414214                False   moving
2023-07-01 00:04:00  4   8  0.000000                 True   moving
2023-07-01 00:05:00  5   9  1.414214                False   moving
2023-07-01 00:06:00  6   9  1.000000                False   moving
2023-07-01 00:07:00  7  10  1.414214                False   moving
2023-07-01 00:08:00  7  10  0.000000                 True  waiting
2023-07-01 00:09:00  7  10  0.000000                 True  waiting
2023-07-01 00:10:00  7  10  0.000000                 True  waiting
2023-07-01 00:11:00  8  11  1.414214                False   moving
2023-07-01 00:12:00  9  11  1.000000                False   moving
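
The shift-and-cumsum step is the core of this approach, so here is a small sketch of what it produces on a toy boolean Series. The names `cond`, `run_id` and `run_len` are only illustrative:

```python
import pandas as pd

cond = pd.Series([False, False, True, False, True, True, True, False])

# A new run starts wherever the value differs from the previous row;
# the cumulative sum of those change points gives every run its own id
run_id = (cond != cond.shift()).cumsum()

# Length of the run each row belongs to
run_len = cond.groupby(run_id).transform("size")

print(run_id.tolist())   # [1, 1, 2, 3, 4, 4, 4, 5]
print(run_len.tolist())  # [2, 2, 1, 1, 3, 3, 3, 1]
```

Only the run with id 4 is both True and at least 3 rows long, so only those rows would be labelled 'waiting'.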

ccgok5k5 #2

Try this:

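import numpy as np
# Label each run of consecutive equal condition values, then measure its length
run_id = df['condition_fulfilled'].ne(df['condition_fulfilled'].shift()).cumsum()
run_len = df.groupby(run_id)['condition_fulfilled'].transform('size')
# A row is 'waiting' only when its condition holds and its run has 3 or more rows
df['status'] = np.where(df['condition_fulfilled'].eq(True) & run_len.ge(3), 'waiting', 'moving')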

Output:

                     x   y  distance  condition_fulfilled   status
timestamp
2023-07-01 00:00:00  1   5       NaN                False   moving
2023-07-01 00:01:00  2   6  1.414214                False   moving
2023-07-01 00:02:00  3   7  1.414214                False   moving
2023-07-01 00:03:00  4   8  1.414214                False   moving
2023-07-01 00:04:00  4   8  0.000000                 True   moving
2023-07-01 00:05:00  5   9  1.414214                False   moving
2023-07-01 00:06:00  6   9  1.000000                False   moving
2023-07-01 00:07:00  7  10  1.414214                False   moving
2023-07-01 00:08:00  7  10  0.000000                 True  waiting
2023-07-01 00:09:00  7  10  0.000000                 True  waiting
2023-07-01 00:10:00  7  10  0.000000                 True  waiting
2023-07-01 00:11:00  8  11  1.414214                False   moving
2023-07-01 00:12:00  9  11  1.000000                False   moving

ej83mcc0 #3

You can use:

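import numpy as np

N = 3  # minimum number of consecutive fulfilled conditions

# Treat the undefined first row as not fulfilled and force a boolean dtype
df['condition_fulfilled'] = df['condition_fulfilled'].fillna(False).astype(bool)

# Each False row starts a new group that also holds the True rows after it,
# so N consecutive True rows give a group of size N + 1
df['status'] = np.where(
                df.groupby((~df['condition_fulfilled']).cumsum())['condition_fulfilled']
                  .transform('size').ge(N+1)
                & df['condition_fulfilled'],
                  'waiting', 'moving'
                )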

Output:

                     x   y  distance  condition_fulfilled   status
timestamp
2023-07-01 00:00:00  1   5       NaN                False   moving
2023-07-01 00:01:00  2   6  1.414214                False   moving
2023-07-01 00:02:00  3   7  1.414214                False   moving
2023-07-01 00:03:00  4   8  1.414214                False   moving
2023-07-01 00:04:00  4   8  0.000000                 True   moving
2023-07-01 00:05:00  5   9  1.414214                False   moving
2023-07-01 00:06:00  6   9  1.000000                False   moving
2023-07-01 00:07:00  7  10  1.414214                False   moving
2023-07-01 00:08:00  7  10  0.000000                 True  waiting
2023-07-01 00:09:00  7  10  0.000000                 True  waiting
2023-07-01 00:10:00  7  10  0.000000                 True  waiting
2023-07-01 00:11:00  8  11  1.414214                False   moving
2023-07-01 00:12:00  9  11  1.000000                False   moving
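
For reuse, the steps from the question and the answers can be wrapped in one function. This is only a sketch: the function name `label_waiting` and its parameters `threshold` and `min_run` are made up for illustration, and it assumes Euclidean distance and a strict "less than" comparison as in the question.

```python
import numpy as np
import pandas as pd


def label_waiting(df, threshold=1.0, min_run=3):
    """Return a 'waiting'/'moving' Series for a frame with 'x' and 'y' columns."""
    # Step length between consecutive rows; NaN for the first row
    distance = np.hypot(df["x"].diff(), df["y"].diff())
    # NaN < threshold evaluates to False, so the first row counts as moving
    cond = distance < threshold
    # Number each run of equal consecutive values and measure its length
    run_id = cond.ne(cond.shift()).cumsum()
    run_len = cond.groupby(run_id).transform("size")
    status = np.where(cond & (run_len >= min_run), "waiting", "moving")
    return pd.Series(status, index=df.index, name="status")


df = pd.DataFrame(
    {
        "x": [1, 2, 3, 4, 4, 5, 6, 7, 7, 7, 7, 8, 9],
        "y": [5, 6, 7, 8, 8, 9, 9, 10, 10, 10, 10, 11, 11],
    },
    index=pd.date_range("2023-07-01", periods=13, freq="min", name="timestamp"),
)
df["status"] = label_waiting(df)
print(df)
```

On the sample data this marks only 00:08 through 00:10 as 'waiting'. The single stop at 00:04 stays 'moving' because its run is shorter than `min_run`.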
