python 比较同一数据框中的行并更新/创建新列

yhqotfr8  于 2023-03-28  发布在  Python
关注(0)|答案(2)|浏览(102)

我正在比较同一个dataframe的不同行的值,并使用前面匹配的行的数据更新score列。下面的代码可以很好地满足我的需要,但是它不是一个有效的解决方案,并且需要几个小时才能运行20K DF。
什么是更快的矢量化方法来实现相同的结果?
我尝试了下面的迭代方法:

df = pd.DataFrame({'date': ['2022-10-01', '2022-10-02', '2022-10-03', '2022-10-01', '2022-10-02'],
                   'symbol': ['XYZ', 'XYZ', 'XYZ', 'ABC', 'DEF'],
                   'tenor': ['2022-10-31', '2022-11-30', '2022-12-31', '2022-10-31', '2022-11-30'],
                   'score': [2, 3, 4, 6, 7],
                   })
ref_symbol = 'XYZ'
df_xyz = df.loc[df['symbol'] == ref_symbol]
sym_to_compare = list(df.symbol.unique())
sym_to_compare.remove(ref_symbol)
score_columns = [f'score_{sym}' for sym in sym_to_compare]
df[score_columns] = [np.NAN] * len(sym_to_compare)

for idx, row in df_xyz.iterrows():
    for sym in sym_to_compare:
        match = df.loc[(df.symbol == sym) & (df.tenor == row.tenor) & (df.date == row.date)]
        if len(match.index):
            df.at[idx, f'score_{sym}'] = match.score

转换前后的 Dataframe :

# Original DF
         date symbol       tenor  score
0  2022-10-01    XYZ  2022-10-31      2
1  2022-10-02    XYZ  2022-11-30      3
2  2022-10-03    XYZ  2022-12-31      4
3  2022-10-01    ABC  2022-10-31      6
4  2022-10-02    DEF  2022-11-30      7

# Transformed DF
         date symbol       tenor  score  score_ABC  score_DEF
0  2022-10-01    XYZ  2022-10-31      2        6.0        NaN
1  2022-10-02    XYZ  2022-11-30      3        NaN        7.0
2  2022-10-03    XYZ  2022-12-31      4        NaN        NaN
3  2022-10-01    ABC  2022-10-31      6        NaN        NaN
4  2022-10-02    DEF  2022-11-30      7        NaN        NaN
ct2axkht

ct2axkht1#

这应该以更快的方式完成工作:

ref_symbol = "XYZ"
sym_to_compare = set(df["symbol"].unique()) - set([ref_symbol])
ref_df = df[df["symbol"] == ref_symbol].copy()
for s in sym_to_compare:
    curr_df = df[df["symbol"] == s]
    ref_df = ref_df.merge(
        curr_df[["tenor", "score"]]
        .rename(columns={"score": f"score_{s}"}), 
        on="tenor", 
        how="left"
     )

其输出:

date symbol       tenor  score  score_ABC  score_DEF
0  2022-10-01    XYZ  2022-10-31      2        6.0        NaN
1  2022-10-02    XYZ  2022-11-30      3        NaN        7.0
2  2022-10-03    XYZ  2022-12-31      4        NaN        NaN

如果需要,可以再次追加sym_to_compare行。

rdrgkggo

rdrgkggo2#

我将使用pivotmerge

# symbol to use as reference
ref_symbol = 'XYZ'
# columns to use a merging key
cols = ['tenor', 'date']

m = df['symbol'].eq(ref_symbol)

out = df.merge(
    pd.concat({ref_symbol: (
                df[~m].pivot(index=cols, columns='symbol',
                             values='score')
                      .add_prefix('score_')
            )}, names=['symbol']).reset_index(),
    how='left',
)

输出:

date symbol       tenor  score  score_ABC  score_DEF
0  2022-10-01    XYZ  2022-10-31      2        6.0        NaN
1  2022-10-02    XYZ  2022-11-30      3        NaN        7.0
2  2022-10-03    XYZ  2022-12-31      4        NaN        NaN
3  2022-10-01    ABC  2022-10-31      6        NaN        NaN
4  2022-10-02    DEF  2022-11-30      7        NaN        NaN

相关问题