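I'm trying to optimize a function that returns the value of a variable (wage) for a given condition (the school district with the largest enrollment within an MSA). I thought combining apply and lambda would be efficient, but my actual dataset is large (shape 321681x272), which makes the computation extremely slow. Is there a faster way? I think vectorizing the operation instead of iterating over df might be a solution, but I'm not sure what structure it would take as an alternative to df.apply and lambda.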
import pandas as pd

df = pd.DataFrame({'year': [2000, 2000, 2001, 2001],
                   'msa': ['NYC-Newark', 'NYC-Newark', 'NYC-Newark', 'NYC-Newark'],
                   'leaname': ['NYC School District', 'Newark School District', 'NYC School District', 'Newark School District'],
                   'enroll': [100000, 50000, 110000, 60000],
                   'wage': [5, 2, 7, 3]})

def function1(x, y, var):
    '''
    Returns the selected variable's value for the school district with the largest enrollment in a given MSA and year
    '''
    # Keep only the rows for this MSA and year
    t = df[(df['msa'] == x) & (df['year'] == y)]
    # Select 'enroll' before averaging so non-numeric columns such as 'leaname' are not aggregated
    e = pd.DataFrame(t.groupby(['msa', var])['enroll'].mean())
    # Take the first (only) match so a scalar is returned rather than a one-element Series
    return e.loc[e.groupby(level=[0])['enroll'].idxmax()].reset_index()[var].iloc[0]

df['main_city_wage'] = df.apply(lambda x: function1(x['msa'], x['year'], 'wage'), axis=1)
Example output:
   year         msa                 leaname  enroll  wage  main_city_wage
0  2000  NYC-Newark     NYC School District  100000     5               5
1  2000  NYC-Newark  Newark School District   50000     2               5
2  2001  NYC-Newark     NYC School District  110000     7               7
3  2001  NYC-Newark  Newark School District   60000     3               7
1 Answer
For example:
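The answer's own code is not preserved here. As one possible vectorized sketch (not necessarily what the answerer wrote), the per-row apply could be replaced by a single groupby over msa and year. This assumes each row is one district, the DataFrame index is unique, and enroll has no missing values. Unlike function1, it picks the single row with the largest enrollment rather than first averaging enrollment over rows that share the same wage.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001],
    'msa': ['NYC-Newark'] * 4,
    'leaname': ['NYC School District', 'Newark School District',
                'NYC School District', 'Newark School District'],
    'enroll': [100000, 50000, 110000, 60000],
    'wage': [5, 2, 7, 3],
})

# For every row, the index label of the largest-enrollment row
# within that row's (msa, year) group.
idx = df.groupby(['msa', 'year'])['enroll'].transform('idxmax')

# Look up 'wage' at those labels. to_numpy() discards the lookup's index
# so the values are assigned positionally, one per original row.
df['main_city_wage'] = df.loc[idx, 'wage'].to_numpy()

print(df)
```

Because the grouping is done once instead of once per row, this should scale much better on a frame with several hundred thousand rows. Swapping 'wage' for another column name gives the same lookup for a different variable.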