Optimizing apply and lambda functions with Pandas

vs3odd8k  posted on 2022-12-02  in Other

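I am trying to optimize a function that returns the value of a variable (wage) under a given condition (the largest enrollment within an MSA). I thought combining apply and lambda would be efficient, but my actual dataset is large (shape 321681x272), which makes the computation very slow. Is there a faster way? I think vectorizing the operation instead of iterating over df could be a solution, but I'm not sure what structure that would take as a replacement for df.apply with a lambda.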

import pandas as pd

df = pd.DataFrame({'year': [2000, 2000, 2001, 2001],
                   'msa': ['NYC-Newark', 'NYC-Newark', 'NYC-Newark', 'NYC-Newark'],
                   'leaname': ['NYC School District', 'Newark School District', 'NYC School District', 'Newark School District'],
                   'enroll': [100000, 50000, 110000, 60000],
                   'wage': [5, 2, 7, 3]})

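def function1(x, y, var):
    '''
    Returns the selected variable's value for the school district with the largest enrollment in a given MSA and year
    '''
    # Rows belonging to this MSA and year
    t = df[(df['msa'] == x) & (df['year'] == y)]
    # Mean enrollment per (msa, var) pair; selecting 'enroll' first avoids averaging non-numeric columns
    e = t.groupby(['msa', var])['enroll'].mean().to_frame()
    # Value of var at the largest enrollment, returned as a scalar rather than a one-element Series
    return e.loc[e.groupby(level=[0])['enroll'].idxmax()].reset_index()[var].iloc[0]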

df['main_wage'] = df.apply(lambda x: function1(x['msa'], x['year'], 'wage'), axis=1)

Example output

   year         msa                 leaname  enroll  wage  main_wage
0  2000  NYC-Newark     NYC School District  100000     5          5
1  2000  NYC-Newark  Newark School District   50000     2          5
2  2001  NYC-Newark     NYC School District  110000     7          7
3  2001  NYC-Newark  Newark School District   60000     3          7
8i9zcol2  #1

For example:

df['main_wage'] = df.set_index('wage').groupby(['year', 'msa'])['enroll'].transform('idxmax').values
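The one-liner above works because, with wage as the index, idxmax returns the wage label of the largest-enrollment row in each group. A variation that may be easier to reuse for other columns keeps the original index and looks the value up afterwards. This is only a minimal sketch: it assumes df has a unique index (such as the default RangeIndex), and top_rows is a name made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001],
    'msa': ['NYC-Newark'] * 4,
    'leaname': ['NYC School District', 'Newark School District',
                'NYC School District', 'Newark School District'],
    'enroll': [100000, 50000, 110000, 60000],
    'wage': [5, 2, 7, 3],
})

# Row label of the largest-enrollment district in each (year, msa) group,
# broadcast back to every row of that group. Assumes a unique index.
top_rows = df.groupby(['year', 'msa'])['enroll'].transform('idxmax')

# Look up wage at those row labels; .to_numpy() drops the index so the
# assignment is positional rather than label-aligned.
df['main_wage'] = df.loc[top_rows, 'wage'].to_numpy()

print(df)
```

Since top_rows only depends on the grouping and on enroll, the same lookup can be repeated for any other column by swapping 'wage' for that column name, without regrouping.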
