Pandas groupby is very slow on a large DataFrame

Posted by nc1teljy on 2023-05-05 in Other

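Below is my code snippet. It is very slow on a large DataFrame, and I can't see any room for improvement. I can't use pandarallel because I am already using multiprocessing. Is there any way to speed this snippet up?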

def group_func(group):
    school_open = (group['school_open'] == True)
    exam = (group['exam_scheduled'] == True)
    attendance_required = (group['att_flag'] == True)

    score_mask = school_open & exam
    attendance_mask = attendance_required & school_open

    score = group.loc[score_mask, 'ind_score'].mean()
    attendance = group.loc[attendance_mask, 'att'].mean()
    active_day = group[attendance_mask]['dates'].nunique()
    median_score = group.loc[score_mask, 'ind_score'].median()

    return pd.Series({'score': score, 'attendance': attendance, 'active_day': active_day, 'median_score': median_score})

student_consolidated = student_df.groupby(['student_name', pd.Grouper(key='dates', freq='M')]).apply(group_func)

Edit: a DataFrame to reproduce the problem:

import pandas as pd
import numpy as np
from faker import Faker

fake = Faker()

date_rng = pd.date_range(start='1/1/2023', end='12/31/2023', freq='D')
data = {'student_name': [fake.name() for i in range(len(date_rng)*100)],
        'dates': np.tile(date_rng, 100),
        'school_open': np.random.choice([True, False], size=len(date_rng)*100),
        'att_flag': np.random.choice([True, False], size=len(date_rng)*100),
        'exam_scheduled': np.random.choice([0, 1], size=len(date_rng)*100),
        'ind_score': np.random.randint(1, 30, size=len(date_rng)*100),
        'att': np.random.choice([True, False], size=len(date_rng)*100)}

student_df = pd.DataFrame(data)

Answer #1 (wljmcqd8)

Edit: there is no need for multiprocessing. You can compute the masks outside the groupby and then use agg, which is faster than apply:

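df = student_df  # the frame from the question

# Build the row masks once on the whole frame; exam_scheduled holds 0/1, so cast it to bool
score_mask = df['school_open'] & df['exam_scheduled'].astype(bool)
attendance_mask = df['att_flag'] & df['school_open']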

out = (df.assign(ind_score=lambda x: x['ind_score'].where(score_mask),
                att=lambda x: x['att'].where(attendance_mask),
                dates2=lambda x: x['dates'].where(attendance_mask))
         .groupby(['student_name', pd.Grouper(key='dates', freq='M')])
         .agg(score=('ind_score', 'mean'), attendance=('att', 'mean'),
              active_day=('dates2', 'nunique'), median_score=('ind_score', 'median')))
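
Why this works: `Series.where(mask)` replaces the rows that fail the mask with NaN, and `mean`, `median` and `nunique` skip NaN by default. A minimal sketch on made-up data (the values, keys and variable names below are illustrative, not from the question):

```python
import pandas as pd

# Hypothetical toy data: group "c" has no row that passes the mask.
values = pd.Series([10, 20, 30, 40, 50])
mask = pd.Series([True, False, True, False, False])
key = pd.Series(["a", "a", "b", "b", "c"])

# Rows where the mask is False become NaN instead of being dropped.
masked = values.where(mask)

# NaN is ignored by the aggregations, so only the selected rows count.
per_group = masked.groupby(key).mean()
counts = masked.groupby(key).nunique()

# Filtering first gives the same numbers, but group "c" disappears entirely.
filtered = values[mask].groupby(key[mask]).mean()

print(per_group)
print(counts)
print(filtered)
```

Masking keeps every group in the result (with NaN or 0 where nothing was selected), which matches what the apply version returns for such groups.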

Output:

>>> out
                           score attendance  active_day  median_score
student_name   dates                                                 
Aaron Anderson 2023-01-31   16.0        0.0           1          16.0
               2023-09-30    1.0        NaN           0           1.0
Aaron Arias    2023-08-31   29.0        0.0           1          29.0
Aaron Atkins   2023-07-31    NaN        NaN           0           NaN
Aaron Baldwin  2023-03-31    NaN        NaN           0           NaN
...                          ...        ...         ...           ...
Zoe Riley      2023-02-28    NaN        NaN           0           NaN
Zoe Rodgers    2023-01-31    NaN        NaN           0           NaN
Zoe Swanson    2023-04-30    NaN        NaN           0           NaN
Zoe Vasquez    2023-03-31   22.0        0.0           1          22.0
Zoe White      2023-01-31    NaN        NaN           0           NaN

[35699 rows x 4 columns]

Check:

#     v-- agg             apply --v
>>> (out.fillna(-9999) == student_consolidated.fillna(-9999)).all().all()
True

Performance:

>>> %timeit -n 1 -r 1 groupby_agg(student_df)
3.47 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

>>> %timeit -n 1 -r 1 groupby_apply(student_df)
48.2 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
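
The timings above call `groupby_agg` and `groupby_apply`, which are not shown. A minimal sketch of what they might look like, assuming they simply wrap the two approaches. The bodies, `make_frame`, `month_key`, `_summarise` and `VALUE_COLS` are assumptions for illustration, the synthetic data uses fewer and larger groups than the question's frame, and the month key is built with `dt.to_period("M")` rather than `pd.Grouper(freq='M')`, so the second index level holds periods instead of month-end timestamps:

```python
import time

import numpy as np
import pandas as pd

VALUE_COLS = ["dates", "school_open", "att_flag", "exam_scheduled", "ind_score", "att"]


def make_frame(n_students=20, seed=0):
    """Synthetic data with the question's columns (student names are made up)."""
    rng = np.random.default_rng(seed)
    dates = pd.date_range("2023-01-01", "2023-12-31", freq="D")
    n = len(dates) * n_students
    names = [f"student_{i:03d}" for i in range(n_students)]
    return pd.DataFrame({
        "student_name": np.repeat(names, len(dates)),
        "dates": np.tile(dates, n_students),
        "school_open": rng.choice([True, False], size=n),
        "att_flag": rng.choice([True, False], size=n),
        "exam_scheduled": rng.choice([0, 1], size=n),
        "ind_score": rng.integers(1, 30, size=n),
        "att": rng.choice([True, False], size=n),
    })


def month_key(df):
    """Monthly grouping key, used here in place of pd.Grouper(freq='M')."""
    return df["dates"].dt.to_period("M").rename("month")


def _summarise(group):
    # Masks are rebuilt for every group, which is what makes apply slow.
    score_mask = group["school_open"] & group["exam_scheduled"].astype(bool)
    att_mask = group["att_flag"] & group["school_open"]
    return pd.Series({
        "score": group.loc[score_mask, "ind_score"].mean(),
        "attendance": group.loc[att_mask, "att"].mean(),
        "active_day": group.loc[att_mask, "dates"].nunique(),
        "median_score": group.loc[score_mask, "ind_score"].median(),
    })


def groupby_apply(df):
    keys = ["student_name", month_key(df)]
    return df.groupby(keys)[VALUE_COLS].apply(_summarise)


def groupby_agg(df):
    # Masks are built once on the whole frame, then built-in reducers are used.
    score_mask = df["school_open"] & df["exam_scheduled"].astype(bool)
    att_mask = df["att_flag"] & df["school_open"]
    masked = df.assign(
        ind_score=df["ind_score"].where(score_mask),
        att=df["att"].astype(float).where(att_mask),
        dates2=df["dates"].where(att_mask),
    )
    return masked.groupby(["student_name", month_key(df)]).agg(
        score=("ind_score", "mean"),
        attendance=("att", "mean"),
        active_day=("dates2", "nunique"),
        median_score=("ind_score", "median"),
    )


frame = make_frame()

start = time.perf_counter()
slow = groupby_apply(frame)
t_apply = time.perf_counter() - start

start = time.perf_counter()
fast = groupby_agg(frame)
t_agg = time.perf_counter() - start

# Both approaches should give the same numbers (dtypes may differ).
pd.testing.assert_frame_equal(fast, slow, check_dtype=False)
print(f"apply: {t_apply:.3f} s, agg: {t_agg:.3f} s")
```

The absolute times depend on the machine, the pandas version and above all the number of groups, so treat the printed values as a rough comparison only.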
