pandas: breaking down a time span (duration) into months/years

Posted by bttbmeg0 on 2023-06-20 in Other

My initial DataFrame looks like this:

| ID | Start | End | DurationDays |
| --- | --- | --- | --- |
| 1 | 2023-05-20 12:00:00.000 | 2023-06-03 12:00:00.000 | 14 |
| 2 | 2023-05-20 12:00:00.000 | 2023-05-23 12:00:00.000 | 3 |
| 1 | 2023-06-01 12:00:00.000 | 2023-06-03 12:00:00.000 | 2 |

How can I get from this to a DataFrame like the one below? The catch is that there are quite a lot of rows (millions), so performance matters a lot.

| ID | Year | Month | DurationDays |
| --- | --- | --- | --- |
| 1 | 2023 | May | 11 |
| 2 | 2023 | May | 3 |
| 1 | 2023 | June | 5 |

**Update:** Note that Start and End can be more than one month apart. Example: 2023-02-20 (February) and 2023-12-18 (December).
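
For reproducibility, the sample rows above can be built as follows. This is only a sketch: the column names `Start`, `End` and `DurationDays` are taken from the code in the answers, not from the asker's actual loading code.

```python
import pandas as pd

# Sample rows from the question; column names follow the answers' code.
df = pd.DataFrame({
    "ID": [1, 2, 1],
    "Start": ["2023-05-20 12:00:00", "2023-05-20 12:00:00", "2023-06-01 12:00:00"],
    "End": ["2023-06-03 12:00:00", "2023-05-23 12:00:00", "2023-06-03 12:00:00"],
})
df[["Start", "End"]] = df[["Start", "End"]].apply(pd.to_datetime)

# Recompute the duration column (whole days) from the two timestamps.
df["DurationDays"] = (df["End"] - df["Start"]).dt.days
print(df)
```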

b5lpy0ml #1

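@jezrael's approach is correct, but it may be impractical for large DataFrames because it repeats rows per day/hour, creating an intermediate with tens or hundreds of millions of rows.
Using pandas Periods as the unit for repeating rows seems more reasonable instead.
In short, this uses repeat to split rows at period boundaries, groupby.cumcount to increment the period, start_time/end_time to get the boundaries, and finally groupby.sum to add up the durations per period.
I deliberately added several intermediate variables to make the code easier to follow; many of the steps can be merged if you want to avoid creating too many intermediate columns.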

import pandas as pd

df[['Start', 'End']] = df[['Start', 'End']].apply(pd.to_datetime)

period = 'M'

out = (df.assign(start=df['Start'].dt.to_period(period),
                 end=df['End'].dt.to_period(period),
                 # number of period boundaries crossed by each row
                 diff=lambda d: d['end'].sub(d['start']).apply(lambda x: x.n))
         # one copy of the row per period it touches
         .loc[lambda d: d.index.repeat(d['diff']+1)]
         .assign(n=lambda d: d.groupby(level=0).cumcount(),
                 # every copy except the first starts at the beginning of its period
                 Start=lambda d: d['Start'].mask(d.index.duplicated(), d['start'].add(d['n']).dt.start_time),
                 # every copy except the last ends at the beginning of the next period
                 End=lambda d: d['End'].mask(d.index.duplicated(keep='last'), d['start'].add(d['n']+1).dt.start_time),
                 DurationDays=lambda d: d['End'].sub(d['Start']),
                 Year=lambda d: d['Start'].dt.year,
                 Month=lambda d: d['Start'].dt.month_name(),
                )
         .groupby(['ID', 'Year', 'Month'], as_index=False, sort=False)
         ['DurationDays'].sum()
      )

Output:

   ID  Year Month     DurationDays
0   1  2023   May 11 days 12:00:00
1   1  2023  June  4 days 12:00:00
2   2  2023   May  3 days 00:00:00

Intermediate before groupby.sum:

   ID               Start                 End     DurationDays    start      end  diff  n  Year Month
0   1 2023-05-20 12:00:00 2023-06-01 00:00:00 11 days 12:00:00  2023-05  2023-06     1  0  2023   May
0   1 2023-06-01 00:00:00 2023-06-03 12:00:00  2 days 12:00:00  2023-05  2023-06     1  1  2023  June
1   2 2023-05-20 12:00:00 2023-05-23 12:00:00  3 days 00:00:00  2023-05  2023-05     0  0  2023   May
2   1 2023-06-01 12:00:00 2023-06-03 12:00:00  2 days 00:00:00  2023-06  2023-06     0  0  2023  June

Generalization

For a more generic approach, you can use any period (if you have enough resources):

df[['Start', 'End']] = df[['Start', 'End']].apply(pd.to_datetime)

period = 'Y'

out = (df.assign(Period=df['Start'].dt.to_period(period),
                 end=df['End'].dt.to_period(period),
                 diff=lambda d: d['end'].sub(d['Period']).apply(lambda x: x.n))
         .loc[lambda d: d.index.repeat(d['diff']+1)]
         .assign(n=lambda d: d.groupby(level=0).cumcount(),
                 # period that each repeated row falls in
                 Period=lambda d: d['Period'].add(d['n']),
                 Start=lambda d: d['Start'].mask(d.index.duplicated(), d['Period'].dt.start_time),
                 End=lambda d: d['End'].mask(d.index.duplicated(keep='last'), d['Period'].add(1).dt.start_time),
                 DurationDays=lambda d: d['End'].sub(d['Start']),
                )
         .groupby(['ID', 'Period'], as_index=False, sort=False)
         ['DurationDays'].sum()
      )

Output for period = 'Y':

   ID Period DurationDays
0   1   2023      16 days
1   2   2023       3 days

Output for period = 'M':

   ID   Period     DurationDays
0   1  2023-05 11 days 12:00:00
1   1  2023-06  4 days 12:00:00
2   2  2023-05  3 days 00:00:00

hk8txs48 #2

A first idea is to repeat each row once per day of the difference between the End and Start columns, then count the rows per ID/Year/Month with GroupBy.size:

df[['Start','End']] = df[['Start','End']].apply(pd.to_datetime)

df = df.loc[df.index.repeat(df['End'].sub(df['Start']).dt.days)]

s = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='d')
df['Date'] = df['Start'].add(s)

df = (df.groupby(['ID', df['Date'].dt.year.rename('Year'),
                       df['Date'].dt.month_name().rename('Month')], sort=False)
        .size()
        .reset_index(name='DurationDays'))
print (df)
   ID  Year Month  DurationDays
0   1  2023   May            12
1   1  2023  June             4
2   2  2023   May             3

For better precision, you can use hours instead of days:

df[['Start','End']] = df[['Start','End']].apply(pd.to_datetime)

df = df.loc[df.index.repeat(df['End'].sub(df['Start']).dt.total_seconds().div(3600))]

s = pd.to_timedelta(df.groupby(level=0).cumcount(), unit='h')
df['Start'] = df['Start'].add(s)

df = (df.assign(Year=df['Start'].dt.year,
                Month=df['Start'].dt.month_name())
          .groupby(['ID', 'Year','Month'], sort=False)
          .size()
          .div(24)
          .reset_index(name='DurationDays'))
print (df)
   ID  Year Month  DurationDays
0   1  2023   May          11.5
1   1  2023  June           4.5
2   2  2023   May           3.0
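
To get a feel for how large the intermediate DataFrame becomes with each repetition unit, the row counts can be estimated before repeating anything. This is a rough sketch on the three sample rows; the variable names are illustrative, and on millions of rows the same sums indicate whether per-day or per-hour repetition fits in memory.

```python
import pandas as pd

start = pd.Series(pd.to_datetime(["2023-05-20 12:00", "2023-05-20 12:00", "2023-06-01 12:00"]))
end = pd.Series(pd.to_datetime(["2023-06-03 12:00", "2023-05-23 12:00", "2023-06-03 12:00"]))
span = end - start

# Rows created when repeating once per whole day / per whole hour.
rows_per_day = int(span.dt.days.sum())
rows_per_hour = int((span.dt.total_seconds() // 3600).sum())

# Rows created when repeating once per calendar month touched.
months_touched = (end.dt.to_period("M") - start.dt.to_period("M")).apply(lambda off: off.n) + 1
rows_per_month = int(months_touched.sum())

print(rows_per_day, rows_per_hour, rows_per_month)  # 19 456 4
```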

csbfibhn #3

You can lreshape your DataFrame and then use split_months (*slightly adjusted*; its definition is not shown here):

tmp = (
    pd.lreshape(df, {'Datetime': ['Start', 'End']})
        .sort_values(by=['ID', 'Datetime'])
)

out = (
    tmp.groupby(['ID']).apply(
        lambda g: split_months(g['Datetime'].min(), g['Datetime'].max())
    ).reset_index('ID')
)

Output:

print(out)

   ID Month  Year  DurationDays
1   1   May  2023         11.00
2   1  June  2023          3.00 # <-- needs to be fixed (it is `2+3`)
1   2   May  2023          3.00
