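Below is my code snippet. It is very slow on large DataFrames, and I can't see any room for improvement here. I can't use pandarallel because I'm already using multiprocessing. Is there any way to speed up this snippet?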
def group_func(group):
    school_open = (group['school_open'] == True)
    exam = (group['exam_scheduled'] == True)
    attendance_required = (group['att_flag'] == True)
    score_mask = school_open & exam
    attendance_mask = attendance_required & school_open
    score = group.loc[score_mask, 'ind_score'].mean()
    attendance = group.loc[attendance_mask, 'att'].mean()
    active_day = group[attendance_mask]['dates'].nunique()
    median_score = group.loc[score_mask, 'ind_score'].median()
    return pd.Series({'score': score, 'attendance': attendance,
                      'active_day': active_day, 'median_score': median_score})

student_consolidated = student_df.groupby(
    ['student_name', pd.Grouper(key='dates', freq='M')]
).apply(group_func)
Edit: a sample DataFrame to reproduce the problem:
import pandas as pd
import numpy as np
from faker import Faker
fake = Faker()
date_rng = pd.date_range(start='1/1/2023', end='12/31/2023', freq='D')
data = {'student_name': [fake.name() for i in range(len(date_rng)*100)],
        'dates': np.tile(date_rng, 100),
        'school_open': np.random.choice([True, False], size=len(date_rng)*100),
        'att_flag': np.random.choice([True, False], size=len(date_rng)*100),
        'exam_scheduled': np.random.choice([0, 1], size=len(date_rng)*100),
        'ind_score': np.random.randint(1, 30, size=len(date_rng)*100),
        'att': np.random.choice([True, False], size=len(date_rng)*100)}
student_df = pd.DataFrame(data)
1 Answer
Edit: there is no need for multiprocessing. You can compute the masks outside the `groupby` and then use `agg`, which is faster than `apply`.
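A minimal sketch of this idea, not the original answer's exact code: the masks are computed once on the whole frame, values outside a mask are blanked with `where` so that `mean`, `median` and `nunique` skip them, and a single named `agg` replaces `apply`. The helper name `consolidate`, the intermediate columns `month` and `active_date`, and the small demo frame are assumptions made for illustration. The sketch also groups on `dates.dt.to_period('M')` instead of `pd.Grouper(key='dates', freq='M')`, because the month-end alias is spelled differently across pandas versions; the resulting index therefore holds monthly periods rather than month-end timestamps.

```python
import numpy as np
import pandas as pd


def consolidate(student_df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: masks computed once, then a single agg."""
    # Row-level masks, computed on the whole frame instead of per group.
    school_open = student_df['school_open'].eq(True)
    score_mask = school_open & student_df['exam_scheduled'].eq(True)
    attendance_mask = school_open & student_df['att_flag'].eq(True)

    prepared = pd.DataFrame({
        'student_name': student_df['student_name'],
        # Monthly period used as the grouping key (assumption, see above).
        'month': student_df['dates'].dt.to_period('M'),
        # Rows outside a mask become NaN/NaT and are skipped by the reducers.
        'ind_score': student_df['ind_score'].where(score_mask),
        'att': student_df['att'].astype(float).where(attendance_mask),
        'active_date': student_df['dates'].where(attendance_mask),
    })

    return prepared.groupby(['student_name', 'month']).agg(
        score=('ind_score', 'mean'),
        attendance=('att', 'mean'),
        active_day=('active_date', 'nunique'),
        median_score=('ind_score', 'median'),
    )


# Small demo frame (made-up data, same columns as in the question, no Faker).
rng = np.random.default_rng(0)
date_rng = pd.date_range(start='2023-01-01', end='2023-03-31', freq='D')
n = len(date_rng) * 3
student_df = pd.DataFrame({
    'student_name': np.repeat(['Ann', 'Bob', 'Cy'], len(date_rng)),
    'dates': np.tile(date_rng, 3),
    'school_open': rng.choice([True, False], size=n),
    'att_flag': rng.choice([True, False], size=n),
    'exam_scheduled': rng.choice([0, 1], size=n),
    'ind_score': rng.integers(1, 30, size=n),
    'att': rng.choice([True, False], size=n),
})

student_consolidated = consolidate(student_df)
print(student_consolidated)
```

To confirm the speedup on your own data, time `consolidate(student_df)` against the `apply` version (for example with `%timeit`) and compare the two results before switching over.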