pandas: Partitioned calculations in a pandas DataFrame

kzmpq1sx, posted on 2023-01-28 in Other

I have a table like this:

ID  Timestamp   Status
A   5/30/2022 2:29  Run Ended
A   5/30/2022 0:23  In Progress
A   5/30/2022 0:22  Prepared
B   5/30/2022 11:15 Run Ended
B   5/30/2022 9:18  In Progress
B   5/30/2022 0:55  Prepared

I want to calculate the duration between consecutive statuses within each ID group. The resulting output table would be:

ID  Duration(min)   Status change
A   0.40    In Progress-Prepared
A   125.82  Run Ended - In Progress
B   502.78  In Progress-Prepared
B   117.34  Run Ended - In Progress

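How can I sort by timestamp in descending order within each ID group, and then subtract each row from the row above it, working from the last row up to the top of each ID group?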


Answer #1 (a64a0gku)

You can use groupby.diff and groupby.shift:

import pandas as pd

# df is the table from the question (columns: ID, Timestamp, Status)
out = (df
 .assign(**{'Duration(min)': pd.to_datetime(df['Timestamp'], dayfirst=False)
            .groupby(df['ID'])
            .diff(-1).dt.total_seconds() # difference in seconds to the next row's time in the group
            .div(60), # convert to minutes
           'Status change': df.groupby('ID')['Status'].shift(-1)+'-'+df['Status']
           })
 .dropna(subset=['Duration(min)']) # drop the last row of each group, which has no next row
 [['ID', 'Duration(min)', 'Status change']]
 )

Output:

  ID  Duration(min)          Status change
0  A          126.0  In Progress-Run Ended
1  A            1.0   Prepared-In Progress
3  B          117.0  In Progress-Run Ended
4  B          503.0   Prepared-In Progress
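
For readers who want to run this end to end, here is a minimal self-contained sketch of the same groupby.diff / groupby.shift idea. It rebuilds the question's table by hand, assumes the timestamps follow the month/day/year format shown there, and sorts ascending within each ID (rather than descending) so that a plain diff() compares each row with the earlier one. The labels are written newest-first to match the order in the question's desired output; the names `grouped` and `result` are only illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "B"],
    "Timestamp": ["5/30/2022 2:29", "5/30/2022 0:23", "5/30/2022 0:22",
                  "5/30/2022 11:15", "5/30/2022 9:18", "5/30/2022 0:55"],
    "Status": ["Run Ended", "In Progress", "Prepared",
               "Run Ended", "In Progress", "Prepared"],
})
# Assumption: timestamps are month/day/year hour:minute, as in the question.
df["Timestamp"] = pd.to_datetime(df["Timestamp"], format="%m/%d/%Y %H:%M")

# Sort ascending within each ID so diff() subtracts the earlier row from the later one.
df = df.sort_values(["ID", "Timestamp"]).reset_index(drop=True)
grouped = df.groupby("ID")

result = pd.DataFrame({
    "ID": df["ID"],
    # Minutes elapsed since the previous status in the same ID group.
    "Duration(min)": grouped["Timestamp"].diff().dt.total_seconds().div(60),
    # Later status first, then the status it followed.
    "Status change": df["Status"] + " - " + grouped["Status"].shift(1),
}).dropna(subset=["Duration(min)"]).reset_index(drop=True)

print(result)
```

Because the sample timestamps carry no seconds, the durations come out as whole minutes rather than the fractional values in the question's desired output.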

Answer #2 (kzipqqlq)

You can use groupby('ID')[value].shift(1) to access the previous value within the same ID group:

import pandas as pd

df = pd.DataFrame({
    'ID': ['a','a','a','b','b','b'],
    'time': [1,2,3,1,4,5],
    'status': ['x','y','z','xx','yy','zz']
})
df['previous_time'] = df.groupby('ID')['time'].shift(1)
df['previous_status'] = df.groupby('ID')['status'].shift(1)
df = df.dropna()
df['duration'] = df['time'] - df['previous_time'] # change this line to calculate duration between time instead
df['status_change'] = df['previous_status'] + '-' + df['status']
print (df[['ID','duration','status_change']].to_markdown(index=False))

Output:

| ID | duration | status_change |
|----|----------|---------------|
| a  | 1        | x-y           |
| a  | 1        | y-z           |
| b  | 3        | xx-yy         |
| b  | 1        | yy-zz         |
P.S. To subtract previous_time from time when both are real timestamps, see the answers in this thread.
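
The linked thread is not reproduced here, so the following is only a sketch of one common way to do that subtraction, assuming the `time` column has already been parsed with pd.to_datetime. Subtracting two datetime columns yields a timedelta column, which can then be converted to minutes. The sample values and the `duration_min` column name are illustrative.

```python
import pandas as pd

# Illustrative data: two IDs with two timestamps each.
df = pd.DataFrame({
    "ID": ["a", "a", "b", "b"],
    "time": pd.to_datetime(["2022-05-30 00:22", "2022-05-30 00:23",
                            "2022-05-30 00:55", "2022-05-30 09:18"]),
})

# Previous timestamp within the same ID group (NaT for the first row of each group).
df["previous_time"] = df.groupby("ID")["time"].shift(1)

# datetime - datetime gives a timedelta; convert it to minutes.
delta = df["time"] - df["previous_time"]
df["duration_min"] = delta.dt.total_seconds() / 60

print(df[["ID", "duration_min"]])
```

The first row of each group has no predecessor, so its duration is NaN and can be dropped with dropna, as in the answer above.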


Answer #3 (cwdobuhd)

import pandas as pd

def function1(dd: pd.DataFrame):
    # Minutes from 'Prepared' to 'In Progress' within one ID group
    val1 = dd.query("Status=='In Progress'").Timestamp.squeeze() - dd.query("Status=='Prepared'").Timestamp.squeeze()
    dd1 = pd.DataFrame({'ID': dd.name, "Duration(min)": val1.total_seconds()/60, "Status change": "In Progress-Prepared"}, index=[0])

    # Minutes from 'In Progress' to 'Run Ended' within the same group
    val2 = dd.query("Status=='Run Ended'").Timestamp.squeeze() - dd.query("Status=='In Progress'").Timestamp.squeeze()
    dd2 = pd.DataFrame({'ID': dd.name, "Duration(min)": val2.total_seconds()/60, "Status change": "Run Ended - In Progress"}, index=[1])

    return pd.concat([dd1, dd2])

# df1 is the table from the question; parse Timestamp before subtracting
df1 = df1.assign(Timestamp=pd.to_datetime(df1.Timestamp))
df1.groupby('ID').apply(function1).reset_index(drop=True)

Output:

  ID  Duration(min)            Status change
0  A            1.0     In Progress-Prepared
1  A          126.0  Run Ended - In Progress
2  B          503.0     In Progress-Prepared
3  B          117.0  Run Ended - In Progress
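
Since this approach names the three statuses explicitly, the same subtraction can also be sketched without apply by pivoting to one column per status. This is an alternative technique, not the answer's own method, and it assumes each ID has exactly one row per status; the names `wide` and `durations` are illustrative.

```python
import pandas as pd

# Sample data copied from the question.
df1 = pd.DataFrame({
    "ID": ["A", "A", "A", "B", "B", "B"],
    "Timestamp": ["5/30/2022 2:29", "5/30/2022 0:23", "5/30/2022 0:22",
                  "5/30/2022 11:15", "5/30/2022 9:18", "5/30/2022 0:55"],
    "Status": ["Run Ended", "In Progress", "Prepared",
               "Run Ended", "In Progress", "Prepared"],
})
# Assumption: timestamps are month/day/year hour:minute, as in the question.
df1["Timestamp"] = pd.to_datetime(df1["Timestamp"], format="%m/%d/%Y %H:%M")

# One row per ID, one column per status, holding that status's timestamp.
wide = df1.pivot(index="ID", columns="Status", values="Timestamp")

# Column-wise subtraction, then convert each timedelta column to minutes.
durations = pd.DataFrame({
    "In Progress-Prepared": wide["In Progress"] - wide["Prepared"],
    "Run Ended - In Progress": wide["Run Ended"] - wide["In Progress"],
}).apply(lambda s: s.dt.total_seconds() / 60)

print(durations)
```

The result is in wide form (one row per ID); pivot raises an error if an ID has the same status twice, which makes the one-row-per-status assumption explicit.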
