How can I efficiently compute a per-group conditional cumulative sum in pandas that resets based on a threshold?

Posted by rqdpfwrv on 2023-11-15 in Other

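Given a DataFrame with a "group" column, a "value" column, and a "threshold" column, I need to compute a cumulative sum of "value" within each "group".
However, every time the running sum exceeds the "threshold" specified for the group, the cumulative sum should reset. The "threshold" may differ from group to group.
Here is an example DataFrame: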


import pandas as pd

data = {
    'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'value': [10, 20, 30, 40, 10, 20, 30, 40, 50],
    'threshold': [50, 50, 50, 50, 70, 70, 70, 100, 100]
}

df = pd.DataFrame(data)
  group  value  threshold
0     A     10         50
1     A     20         50
2     A     30         50
3     A     40         50
4     B     10         70
5     B     20         70
6     B     30         70
7     C     40        100
8     C     50        100




expected_df
  group  value  threshold  cumsum_reset
0     A     10         50            10
1     A     20         50            30
2     A     30         50            30
3     A     40         50            40
4     B     10         70            10
5     B     20         70            30
6     B     30         70            60
7     C     40        100            40
8     C     50        100            90

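The expected output adds a new column "cumsum_reset" to the DataFrame, showing the cumulative sum that resets after the threshold is exceeded.
Can anyone suggest an efficient way to do this in pandas without explicitly iterating over each row?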

hsgswve4 #1

Try groupby and apply with a custom function:

import pandas as pd

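def cumsum_reset(group: pd.DataFrame) -> pd.Series:
  # The threshold is constant within a group, so read it from the first row
  threshold = group['threshold'].iloc[0]
  cumsum = 0
  reset_values = []
  for value in group['value']:
    cumsum += value
    # Once the running sum exceeds the threshold, restart it from the current value
    if cumsum > threshold:
      cumsum = value
    reset_values.append(cumsum)
  # Keep the group's index so the result aligns with the original rows
  return pd.Series(reset_values, index=group.index)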

data = {
  'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
  'value': [10, 20, 30, 40, 10, 20, 30, 40, 50],
  'threshold': [50, 50, 50, 50, 70, 70, 70, 100, 100]
}
df = pd.DataFrame(data)
df['cumsum_reset'] = df.groupby('group', group_keys=False).apply(cumsum_reset)
print(df)

Output:

  group  value  threshold  cumsum_reset
0     A     10         50            10
1     A     20         50            30
2     A     30         50            30
3     A     40         50            40
4     B     10         70            10
5     B     20         70            30
6     B     30         70            60
7     C     40        100            40
8     C     50        100            90

irtuqstp #2

I suggest using numba to speed up the computation, for example:

import pandas as pd
from numba import njit

data = {
    "group": ["A", "A", "A", "A", "B", "B", "B", "C", "C"],
    "value": [10, 20, 30, 40, 10, 20, 30, 40, 50],
    "threshold": [50, 50, 50, 50, 70, 70, 70, 100, 100],
}
df = pd.DataFrame(data)

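@njit
def get_cumsum_reset_numba(values, threshold, out):
    current = 0
    for i, v in enumerate(values):
        if current + v > threshold:
            # Adding v would exceed the threshold: report the sum so far, then start over from zero
            out[i] = current
            current = 0
        else:
            current += v
            out[i] = current

def get_cumsum_reset(g):
    values = g["value"].to_numpy()
    threshold = g["threshold"].iat[0]
    # Fill a separate buffer rather than a view of a column, which may be read-only or a copy
    out = values.copy()
    get_cumsum_reset_numba(values, threshold, out)
    # Work on a copy so the group passed in by apply is not mutated
    g = g.copy()
    g["cumsum_reset"] = out
    return g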

df = df.groupby("group", group_keys=False).apply(get_cumsum_reset)
print(df)

Prints:

  group  value  threshold  cumsum_reset
0     A     10         50            10
1     A     20         50            30
2     A     30         50            30
3     A     40         50            40
4     B     10         70            10
5     B     20         70            30
6     B     30         70            60
7     C     40        100            40
8     C     50        100            90
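
Note that the kernel above and the loop in the first answer do not apply quite the same reset rule, even though both reproduce the expected output for the sample data. As a rough illustration, the sketch below re-implements both rules in plain Python (no numba required). The function names and the second input list are made up for this comparison and are not part of either answer.

```python
def restart_from_value(values, threshold):
    """Rule from the first answer: once the sum exceeds the threshold,
    restart the running sum from the current value."""
    out, running = [], 0
    for v in values:
        running += v
        if running > threshold:
            running = v
        out.append(running)
    return out


def restart_from_zero(values, threshold):
    """Rule from the numba kernel: if adding v would exceed the threshold,
    emit the sum so far and restart the running sum from zero."""
    out, running = [], 0
    for v in values:
        if running + v > threshold:
            out.append(running)
            running = 0
        else:
            running += v
            out.append(running)
    return out


sample = [10, 20, 30, 40]  # group A from the question, threshold 50
other = [10, 20, 30, 10]   # hypothetical input, threshold 50

print(restart_from_value(sample, 50))  # [10, 30, 30, 40]
print(restart_from_zero(sample, 50))   # [10, 30, 30, 40]
print(restart_from_value(other, 50))   # [10, 30, 30, 40]
print(restart_from_zero(other, 50))    # [10, 30, 30, 10]
```

Which rule is the right one depends on what "reset" should mean for your data; the sample in the question does not distinguish between them.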

vjhs03f7 #3

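Two things:

  • pandas apply is very slow and should be avoided whenever possible. Here it is possible.
  • Groupby naturally restarts the cumsum whenever it moves on to a new group. So this is really just a grouping problem: group not only by 'group', but also by how often the threshold has been passed.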
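
To make the second point concrete, here is a minimal sketch on a toy frame. The 'segment' column is a hypothetical second key invented for this illustration; it stands in for whatever marks a threshold crossing.

```python
import pandas as pd

demo = pd.DataFrame({
    'group':   ['A', 'A', 'B', 'B'],
    'segment': [0, 1, 0, 0],   # hypothetical second grouping key
    'value':   [1, 2, 3, 4],
})

# Plain cumsum runs across the whole column
plain = demo['value'].cumsum().tolist()

# Grouped cumsum restarts at the first row of each group
per_group = demo.groupby('group')['value'].cumsum().tolist()

# Adding a second key restarts the sum wherever that key changes too
per_segment = demo.groupby(['group', 'segment'])['value'].cumsum().tolist()

print(plain)        # [1, 3, 6, 10]
print(per_group)    # [1, 3, 3, 7]
print(per_segment)  # [1, 2, 3, 7]
```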

Here is an approach that does not use apply:

# 1. Temporary column: first cumsum, grouped by 'group' only
df['cumsum_group'] = df.groupby(by='group')['value'].cumsum()

# 2. Temporary column: number of whole multiples of the threshold the running sum has reached
df['dividend']     = df['cumsum_group'] // df['threshold']

# 3. Final cumsum by double groupby
df['cumsum_reset'] = df.groupby(by=['group','dividend'])['value'].cumsum()
df

  group  value  threshold  cumsum_group  dividend  cumsum_reset
0     A     10         50            10         0            10
1     A     20         50            30         0            30
2     A     30         50            60         1            30
3     A     40         50           100         2            40
4     B     10         70            10         0            10
5     B     20         70            30         0            30
6     B     30         70            60         0            60
7     C     40        100            40         0            40
8     C     50        100            90         0            90

Finally, drop the temporary columns:

df.drop(columns=['cumsum_group','dividend'], inplace=True)


All of this can be condensed into this equivalent one-liner:

df['cumsum_reset'] = df.groupby(by=[df['group'], 
                                    df.groupby(by='group')['value'].cumsum() // df['threshold']]
                               )['value'].cumsum()
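
Before relying on the floor-division trick, it may be worth checking it against an explicit loop on your own data. The sketch below does that under a few assumptions: the helper names and the group 'X' data are hypothetical, the loop uses the "restart from the current value" rule from the first answer, and rows of each group are assumed to be contiguous.

```python
import pandas as pd


def loop_cumsum_reset(values, threshold):
    """Reference loop: restart from the current value once the sum exceeds the threshold."""
    out, running = [], 0
    for v in values:
        running += v
        if running > threshold:
            running = v
        out.append(running)
    return out


def loop_by_group(df):
    """Apply the reference loop group by group (assumes each group's rows are contiguous)."""
    result = []
    for _, g in df.groupby('group', sort=False):
        result.extend(loop_cumsum_reset(g['value'].tolist(), g['threshold'].iloc[0]))
    return result


def floordiv_cumsum_reset(df):
    """The groupby / floor-division one-liner, wrapped in a function."""
    bucket = df.groupby('group')['value'].cumsum() // df['threshold']
    return df.groupby([df['group'], bucket])['value'].cumsum().tolist()


# Sample data from the question: both approaches agree
sample = pd.DataFrame({
    'group': ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C'],
    'value': [10, 20, 30, 40, 10, 20, 30, 40, 50],
    'threshold': [50, 50, 50, 50, 70, 70, 70, 100, 100],
})
print(floordiv_cumsum_reset(sample) == loop_by_group(sample))  # True

# Hypothetical data where the two approaches disagree
other = pd.DataFrame({
    'group': ['X'] * 4,
    'value': [30, 30, 30, 30],
    'threshold': [50] * 4,
})
print(loop_by_group(other))          # [30, 30, 30, 30]
print(floordiv_cumsum_reset(other))  # [30, 30, 60, 30]
```

The buckets are cut at multiples of the threshold on the plain per-group cumsum, not at the points where the reset running sum actually crosses the threshold, so the two can drift apart once a reset has happened.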
