pandas: create a column in a DataFrame for each row of another DataFrame and compute its values

Posted by a64a0gku on 2023-03-28 in Other

I have a DataFrame called player_df:

import numpy as np
import pandas as pd

player_df = pd.DataFrame(np.random.rand(10,3), columns=['x','y','more_cols'])

_____________________________________________________________________________
player_df:
          x         y  more_cols
0  0.352673  0.479360   0.638508
1  0.764669  0.326961   0.778483
2  0.805774  0.911662   0.316030
3  0.114446  0.185147   0.318742
4  0.714803  0.646525   0.084143
5  0.061614  0.837432   0.886669
6  0.179777  0.519559   0.446562
7  0.615326  0.886046   0.581127
8  0.597375  0.196619   0.310331
9  0.670061  0.471363   0.313047

The second DataFrame is called checkpoints_df:

checkpoints_df = pd.DataFrame(np.random.rand(4,2), columns=['x','y'])
checkpoints_df['checkpoint_name'] = ['Alpha', 'Hotel', 'Indigo', 'Papa']
_____________________________________________________________________________
checkpoints_df: 
          x         y  checkpoint_name
0  0.616945  0.804442            Alpha
1  0.402007  0.556274            Hotel
2  0.478351  0.443920           Indigo
3  0.075494  0.803561             Papa

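I also have a function distance(x1, y1, x2, y2) that computes the Euclidean distance between two points, or between a list of points and a single point (x1 and y1 can be lists).
Goal: for each checkpoint name in checkpoints_df['checkpoint_name'], I want to add a column to player_df and fill it with each player's distance to that checkpoint. (This is a toy version of the problem; the real one may have many checkpoints.)
I have tried many approaches so far and settled on a solution that uses .iterrows(), but this could be slow if there are many checkpoints. This is what I use now: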

for _, row in checkpoints_df.iterrows():
    player_df[row['checkpoint_name']] = distance(player_df['x'], player_df['y'], row['x'], row['y'])

I tried using .apply(), but could not create columns that way. Is there a way to do this without iterating over the second DataFrame?

#1 roejwanj

If your distance function is vectorized, you can do:

import numpy as np

def distance(x1, y1, x2, y2):
    return np.sqrt((x2-x1)**2+(y2-y1)**2)

tmp = checkpoints_df.set_index('checkpoint_name')

for c in tmp.index:  # iterate over checkpoint names, not the original integer index
    player_df[c] = distance(player_df['x'], player_df['y'],
                            tmp.loc[c, 'x'], tmp.loc[c, 'y'])

Or, fully vectorized:

player_df[checkpoints_df['checkpoint_name']] = distance(
    player_df['x'].to_numpy()[:,None], player_df['y'].to_numpy()[:,None],
    checkpoints_df['x'].to_numpy()[None], checkpoints_df['y'].to_numpy()[None]
)

Output:

          x         y  more_cols     Alpha     Hotel    Indigo      Papa
0  0.548814  0.715189   0.602763  0.290325  0.173562  0.538927  0.116871
1  0.544883  0.423655   0.645894  0.448875  0.169807  0.560716  0.204632
2  0.437587  0.891773   0.963663  0.209178  0.323871  0.500542  0.325561
3  0.383442  0.791725   0.528895  0.120166  0.234831  0.404077  0.287810
4  0.568045  0.925597   0.071036  0.339141  0.374280  0.629699  0.311790
5  0.087129  0.020218   0.832620  0.774609  0.660846  0.601313  0.794770
6  0.778157  0.870012   0.978618  0.522455  0.441177  0.800208  0.302696
7  0.799159  0.461479   0.780529  0.619367  0.359296  0.795839  0.243226
8  0.118274  0.639921   0.143353  0.198590  0.345356  0.101950  0.494356
9  0.944669  0.521848   0.414662  0.725433  0.490735  0.930821  0.345899
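
To make the broadcasting in the fully vectorized version easier to follow, here is a minimal, self-contained sketch that spells out the intermediate array shapes. The seed, the generated data, and the names px, py, cx, cy, dist_df and result are assumptions for illustration only; they are not the data or names from the question. It also builds the new columns with join instead of assigning into player_df, which is just one possible design choice.

```python
import numpy as np
import pandas as pd

def distance(x1, y1, x2, y2):
    # Euclidean distance; works element-wise on NumPy arrays and broadcasts.
    return np.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)

rng = np.random.default_rng(0)  # assumed seed, illustration only
player_df = pd.DataFrame(rng.random((10, 3)), columns=['x', 'y', 'more_cols'])
checkpoints_df = pd.DataFrame(rng.random((4, 2)), columns=['x', 'y'])
checkpoints_df['checkpoint_name'] = ['Alpha', 'Hotel', 'Indigo', 'Papa']

# Player coordinates as column vectors: shape (10, 1)
px = player_df['x'].to_numpy()[:, None]
py = player_df['y'].to_numpy()[:, None]
# Checkpoint coordinates as row vectors: shape (1, 4)
cx = checkpoints_df['x'].to_numpy()[None]
cy = checkpoints_df['y'].to_numpy()[None]

# (10, 1) against (1, 4) broadcasts to (10, 4):
# one row per player, one column per checkpoint.
dist = distance(px, py, cx, cy)

dist_df = pd.DataFrame(
    dist,
    index=player_df.index,
    columns=checkpoints_df['checkpoint_name'].to_list(),
)
result = player_df.join(dist_df)  # leaves player_df itself unchanged
print(result.shape)
```

Because the whole distance matrix is computed in one NumPy call, there is no Python-level loop over checkpoints; the trade-off is that the full players-by-checkpoints array is held in memory at once.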
#2 vyswwuz2

Here is a concise way to produce the desired output using NumPy broadcasting:

def distance(x1,y1,x2,y2):
    return np.sqrt((x1-x2)**2+(y1-y2)**2)

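df1, df2 = player_df, checkpoints_df
# Player coordinates become (n, 1) columns and checkpoint coordinates become
# (1, m) rows, so distance() broadcasts to an (n, m) array, one column per checkpoint.
df1[df2.checkpoint_name] = distance(*[df.to_numpy() for df in
    [df1[[c]] for c in 'xy'] +
    [df2[[c]].T for c in 'xy']])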

Sample input:

   x  y  more_cols
0  0  0        0.0
1  0  1        0.5
2  0  2        1.0
3  0  3        1.5
4  0  4        2.0
5  0  5        2.5
6  0  6        3.0
7  0  7        3.5
8  0  8        4.0
9  0  9        4.5
   x  y checkpoint_name
0  0  0           Alpha
1  0  1           Hotel
2  0  2          Indigo
3  0  3            Papa

Output:

   x  y  more_cols  Alpha  Hotel  Indigo  Papa
0  0  0        0.0    0.0    1.0     2.0   3.0
1  0  1        0.5    1.0    0.0     1.0   2.0
2  0  2        1.0    2.0    1.0     0.0   1.0
3  0  3        1.5    3.0    2.0     1.0   0.0
4  0  4        2.0    4.0    3.0     2.0   1.0
5  0  5        2.5    5.0    4.0     3.0   2.0
6  0  6        3.0    6.0    5.0     4.0   3.0
7  0  7        3.5    7.0    6.0     5.0   4.0
8  0  8        4.0    8.0    7.0     6.0   5.0
9  0  9        4.5    9.0    8.0     7.0   6.0
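
As a sanity check, the following sketch rebuilds the sample input above and compares the .iterrows() loop from the question with a broadcast computation. It is only an illustration: the names looped, broadcast, names and dist are hypothetical, and the frames are constructed by hand to match the sample tables.

```python
import numpy as np
import pandas as pd

def distance(x1, y1, x2, y2):
    return np.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2)

# Hand-built copies of the sample input shown above.
player_df = pd.DataFrame({'x': 0, 'y': np.arange(10), 'more_cols': np.arange(10) * 0.5})
checkpoints_df = pd.DataFrame({
    'x': 0,
    'y': np.arange(4),
    'checkpoint_name': ['Alpha', 'Hotel', 'Indigo', 'Papa'],
})
names = checkpoints_df['checkpoint_name'].to_list()

# Loop version, as in the question: one new column per checkpoint row.
looped = player_df.copy()
for _, row in checkpoints_df.iterrows():
    looped[row['checkpoint_name']] = distance(looped['x'], looped['y'], row['x'], row['y'])

# Broadcast version: (10, 1) player columns against (1, 4) checkpoint rows.
dist = distance(
    player_df[['x']].to_numpy(), player_df[['y']].to_numpy(),
    checkpoints_df[['x']].to_numpy().T, checkpoints_df[['y']].to_numpy().T,
)
broadcast = player_df.join(pd.DataFrame(dist, index=player_df.index, columns=names))

# Both approaches should produce the same distances.
assert np.allclose(looped[names].to_numpy(), broadcast[names].to_numpy())
print(broadcast)
```

On this input the last row's Papa distance is 6.0, which matches the output table above.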
