带条件的Pandas滚动平均

8i9zcol2  于 2022-12-21  发布在  其他
关注(0)|答案(3)|浏览(108)

我有一个数据集,其中包含每个用户每月登录次数的信息。我还有一些使用外部逻辑选择的月份。我想计算需要测试的月份在接下来4个月的平均值,以避免计算来自不同用户的登录次数。
下面是一些示例代码:

import pandas as pd

data = [
{ 'user': 'Mick', 'Month' : 4, 'logins': 5, 'test': True },
{ 'user': 'Mick', 'Month' : 5, 'logins': 4, 'test': False },
{ 'user': 'Mick', 'Month' : 6, 'logins': 7, 'test': False },
{ 'user': 'Mick', 'Month' : 7, 'logins': 5, 'test': True },
{ 'user': 'Mick', 'Month' : 8, 'logins': 12, 'test': False },
{ 'user': 'Mick', 'Month' : 9, 'logins': 8, 'test': False },
{ 'user': 'Tom', 'Month' : 4, 'logins': 15, 'test': True },
{ 'user': 'Tom', 'Month' : 5, 'logins': 24, 'test': False },
{ 'user': 'Tom', 'Month' : 6, 'logins': 17, 'test': False },
{ 'user': 'Tom', 'Month' : 7, 'logins': 22, 'test': False },
{ 'user': 'Tom', 'Month' : 8, 'logins': 20, 'test': False },
{ 'user': 'Tom', 'Month' : 9, 'logins': 18, 'test': True }
]

df = pd.DataFrame(data)

我提出了这个解决方案,但它是错误的,因为它不计算平均值,也不限制用户更改时的滚动

df['mean'] = df[df['test'] == True]['logins'].rolling(4).mean()
print(df)

我的预期结果如下所示。请注意:在第3行中,由于用户变更,仅计算3行的平均值:

user  Month  logins   test  mean
0   Mick      4       5   True   5.25     <-- mean of this row and the next 3
1   Mick      5       4  False   NaN
2   Mick      6       7  False   NaN
3   Mick      7       5   True   8.3333   <-- mean of this row and the next 2, because the user is different
4   Mick      8       2  False   NaN
5   Mick      9       8  False   NaN
6    Tom      4      15   True   19.5     <-- mean of this row and the next 3
7    Tom      5      24  False   NaN
8    Tom      6      17  False   NaN
9    Tom      7      22  False   NaN
10   Tom      8      20  False   NaN
11   Tom      9      18   True   18       <-- mean of this row alone because is the last one
r8xiu3jd

r8xiu3jd1#

您可以通过反转所有数据来进行前向滚动,然后按用户分组并执行滚动平均值。您需要通过排列索引恢复数据的原始顺序,最后屏蔽test不为True的行。因此:

df['mean_logins'] = (
    df.iloc[::-1] # reverse order for forward rolling
      .groupby('user') # perform the operation per user
      .rolling(4, min_periods=1) #min_periods=1 to get a result even if not 3 rows after 
      ['logins'].mean()
      .reset_index(level=0, drop=True) # drop the group name
      .sort_index() # to get original order
      .where(df['test']) # mask the results for the False test
)
print(df)
#     user  Month  logins   test  mean_logins
# 0   Mick      4       5   True     5.250000
# 1   Mick      5       4  False          NaN
# 2   Mick      6       7  False          NaN
# 3   Mick      7       5   True     8.333333
# 4   Mick      8      12  False          NaN
# 5   Mick      9       8  False          NaN
# 6    Tom      4      15   True    19.500000
# 7    Tom      5      24  False          NaN
# 8    Tom      6      17  False          NaN
# 9    Tom      7      22  False          NaN
# 10   Tom      8      20  False          NaN
# 11   Tom      9      18   True    18.000000
goqiplq2

goqiplq22#

通过按DataFrame.iloc交换行顺序仅选择具有True的行以进行前滚,每组使用Series.rolling,并按Series.droplevel删除MultiIndex的第一级:

df.loc[df['test'], 'new'] = (df.iloc[::-1]    
                               .groupby('user')['logins']
                               .rolling(4, min_periods=1)
                               .mean()
                               .droplevel(0))
print(df)
    user  Month  logins   test        new
0   Mick      4       5   True   5.250000
1   Mick      5       4  False        NaN
2   Mick      6       7  False        NaN
3   Mick      7       5   True   8.333333
4   Mick      8      12  False        NaN
5   Mick      9       8  False        NaN
6    Tom      4      15   True  19.500000
7    Tom      5      24  False        NaN
8    Tom      6      17  False        NaN
9    Tom      7      22  False        NaN
10   Tom      8      20  False        NaN
11   Tom      9      18   True  18.000000
m3eecexj

m3eecexj3#

def function1(dd:pd.DataFrame):
    return dd.assign(mean=dd.logins[::-1].rolling(4,1).mean().loc[dd.index].where(dd.test))

df.groupby('user').apply(function1)

    user  Month  logins   test       mean
0   Mick      4       5   True   5.250000
1   Mick      5       4  False        NaN
2   Mick      6       7  False        NaN
3   Mick      7       5   True   8.333333
4   Mick      8      12  False        NaN
5   Mick      9       8  False        NaN
6    Tom      4      15   True  19.500000
7    Tom      5      24  False        NaN
8    Tom      6      17  False        NaN
9    Tom      7      22  False        NaN
10   Tom      8      20  False        NaN
11   Tom      9      18   True  18.000000

相关问题