我正在比较同一个dataframe的不同行的值,并使用前面匹配的行的数据更新score
列。下面的代码可以很好地满足我的需要,但是它不是一个有效的解决方案,并且需要几个小时才能运行20K DF。
什么是更快的矢量化方法来实现相同的结果?
我尝试了下面的迭代方法:
df = pd.DataFrame({'date': ['2022-10-01', '2022-10-02', '2022-10-03', '2022-10-01', '2022-10-02'],
'symbol': ['XYZ', 'XYZ', 'XYZ', 'ABC', 'DEF'],
'tenor': ['2022-10-31', '2022-11-30', '2022-12-31', '2022-10-31', '2022-11-30'],
'score': [2, 3, 4, 6, 7],
})
ref_symbol = 'XYZ'
df_xyz = df.loc[df['symbol'] == ref_symbol]
sym_to_compare = list(df.symbol.unique())
sym_to_compare.remove(ref_symbol)
score_columns = [f'score_{sym}' for sym in sym_to_compare]
df[score_columns] = [np.NAN] * len(sym_to_compare)
for idx, row in df_xyz.iterrows():
for sym in sym_to_compare:
match = df.loc[(df.symbol == sym) & (df.tenor == row.tenor) & (df.date == row.date)]
if len(match.index):
df.at[idx, f'score_{sym}'] = match.score
转换前后的 Dataframe :
# Original DF
date symbol tenor score
0 2022-10-01 XYZ 2022-10-31 2
1 2022-10-02 XYZ 2022-11-30 3
2 2022-10-03 XYZ 2022-12-31 4
3 2022-10-01 ABC 2022-10-31 6
4 2022-10-02 DEF 2022-11-30 7
# Transformed DF
date symbol tenor score score_ABC score_DEF
0 2022-10-01 XYZ 2022-10-31 2 6.0 NaN
1 2022-10-02 XYZ 2022-11-30 3 NaN 7.0
2 2022-10-03 XYZ 2022-12-31 4 NaN NaN
3 2022-10-01 ABC 2022-10-31 6 NaN NaN
4 2022-10-02 DEF 2022-11-30 7 NaN NaN
2条答案
按热度按时间ct2axkht1#
这应该以更快的方式完成工作:
其输出:
如果需要,可以再次追加
sym_to_compare
行。rdrgkggo2#
我将使用
pivot
和merge
:输出: