pandas: replacing iterrows() with .loc for large DataFrames

Posted by dauxcl2d on 2023-03-28 in Other

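I have two DataFrames, df1 and df2.
For rows in df1 where day_of_week == 7, we need to match two other columns (statWeek and statMonth) against df2. Where they match, as_cost_perf in df2 must be replaced with cost_eu from df1.
Below is my code, which uses iterrows().
It is slow on a large DataFrame. Can anyone help me optimize this snippet?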

import pandas as pd

# create df1
data1 = {'day_of_week': [7, 7, 6],
         'statWeek': [1, 2, 3],
         'statMonth': [1, 1, 1],
         'cost_eu': [957940.0, 942553.0, 1177088.0]}
df1 = pd.DataFrame(data1)

# create df2
data2 = {'statWeek': [1, 2, 3, 4, 1, 2, 3],
         'statMonth': [1, 1, 1, 1, 2, 2, 2],
         'as_cost_perf': [344560.0, 334580.0, 334523.0, 556760.0, 124660.0, 124660.0, 763660.0]}
df2 = pd.DataFrame(data2)

# identify rows in df1 where day_of_week == 7
mask = df1['day_of_week'] == 7

# update df2 with cost_eu from df1 where there is a match
for i, row in df1[mask].iterrows():
    matching_rows = df2[(df2['statWeek'] == row['statWeek']) & (df2['statMonth'] == row['statMonth'])]
    if not matching_rows.empty:
        df2.loc[matching_rows.index, 'as_cost_perf'] = row['cost_eu']

# print the updated df2
df2

Thanks in advance!

agxfikkp #1

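You can reshape df1 and concatenate it with df2, then drop the duplicates. Because upd comes first, drop_duplicates keeps its rows over the matching df2 rows: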

upd = df1[df1['day_of_week'].eq(7)].rename(columns={'cost_eu': 'as_cost_perf'}).drop(columns='day_of_week')
out = pd.concat([upd, df2], axis=0).drop_duplicates(['statWeek', 'statMonth'])

To avoid drop_duplicates, you can instead remove the matching rows from df2 before concatenating:

upd = df1[df1['day_of_week'].eq(7)].rename(columns={'cost_eu': 'as_cost_perf'}).drop(columns='day_of_week')

cols = ['statWeek', 'statMonth']
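# Compare (statWeek, statMonth) pairs by value; DataFrame.isin(DataFrame) would align on the index instead
m = ~pd.MultiIndex.from_frame(df2[cols]).isin(pd.MultiIndex.from_frame(upd[cols]))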
out = pd.concat([upd, df2.loc[m]], axis=0)

Output:

>>> out
   statWeek  statMonth  as_cost_perf
0         1          1      957940.0
1         2          1      942553.0
2         3          1      334523.0
3         4          1      556760.0
4         1          2      124660.0
5         2          2      124660.0
6         3          2      763660.0

>>> upd
   statWeek  statMonth  as_cost_perf
0         1          1      957940.0
1         2          1      942553.0
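
One thing to watch when removing rows by key: the rows have to be matched on their (statWeek, statMonth) values, not on their index labels. A minimal sketch of one index-independent way to do that is an anti-join using merge with indicator=True. The names df2_shuffled, flags and keep are illustrative and not part of the original answer, and df2 is reversed here only to show that row order does not matter.

```python
import pandas as pd

df1 = pd.DataFrame({'day_of_week': [7, 7, 6],
                    'statWeek': [1, 2, 3],
                    'statMonth': [1, 1, 1],
                    'cost_eu': [957940.0, 942553.0, 1177088.0]})
df2 = pd.DataFrame({'statWeek': [1, 2, 3, 4, 1, 2, 3],
                    'statMonth': [1, 1, 1, 1, 2, 2, 2],
                    'as_cost_perf': [344560.0, 334580.0, 334523.0, 556760.0,
                                     124660.0, 124660.0, 763660.0]})

cols = ['statWeek', 'statMonth']
upd = (df1[df1['day_of_week'].eq(7)]
       .rename(columns={'cost_eu': 'as_cost_perf'})
       .drop(columns='day_of_week'))

# Reverse df2 so its index no longer lines up with upd's index (illustration only).
df2_shuffled = df2.iloc[::-1].reset_index(drop=True)

# Left-merge on the keys only; '_merge' records whether each df2 row has a partner in upd.
flags = df2_shuffled[cols].merge(upd[cols].drop_duplicates(), on=cols,
                                 how='left', indicator=True)
# Convert to a plain boolean array so no index alignment is involved.
keep = (flags['_merge'] == 'left_only').to_numpy()

# Rows from upd replace the df2 rows that shared their key pair.
out = pd.concat([upd, df2_shuffled.loc[keep]], ignore_index=True)
print(out)
```

The right-hand side is deduplicated before merging so that the merged frame has exactly one row per row of df2, which keeps the boolean mask the same length as df2.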

m0rkklqb #2

Instead of the for loop, you can use df.merge followed by a single assignment:

mask = df1['day_of_week'] == 7
df2 = df2.merge(df1[mask], on=['statWeek', 'statMonth'], how='left')
matched = ~df2['cost_eu'].isna()
df2.loc[matched, 'as_cost_perf'] = df2.loc[matched, 'cost_eu']
df2.drop(['day_of_week', 'cost_eu'], axis=1, inplace=True)

   statWeek  statMonth  as_cost_perf
0         1          1      957940.0
1         2          1      942553.0
2         3          1      334523.0
3         4          1      556760.0
4         1          2      124660.0
5         2          2      124660.0
6         3          2      763660.0
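
The merge approach can also be wrapped in a small function that leaves its inputs untouched. This is only a sketch under a few assumptions: overwrite_from is a hypothetical helper name, the column names are those from the question, and validate='m:1' is added as a guard that is not in the original answer. Without that guard, a key pair that appears twice among the day_of_week == 7 rows of df1 would silently duplicate rows of df2.

```python
import pandas as pd


def overwrite_from(df2, df1, keys=('statWeek', 'statMonth')):
    """Hypothetical helper: return a copy of df2 whose as_cost_perf is taken
    from df1.cost_eu wherever df1 has a day_of_week == 7 row with the same keys."""
    keys = list(keys)
    src = df1.loc[df1['day_of_week'].eq(7), keys + ['cost_eu']]
    # validate='m:1' raises pandas.errors.MergeError if src repeats a key pair.
    merged = df2.merge(src, on=keys, how='left', validate='m:1')
    # Prefer cost_eu where a match exists, otherwise keep the existing value.
    merged['as_cost_perf'] = merged['cost_eu'].fillna(merged['as_cost_perf'])
    return merged.drop(columns='cost_eu')


df1 = pd.DataFrame({'day_of_week': [7, 7, 6],
                    'statWeek': [1, 2, 3],
                    'statMonth': [1, 1, 1],
                    'cost_eu': [957940.0, 942553.0, 1177088.0]})
df2 = pd.DataFrame({'statWeek': [1, 2, 3, 4, 1, 2, 3],
                    'statMonth': [1, 1, 1, 1, 2, 2, 2],
                    'as_cost_perf': [344560.0, 334580.0, 334523.0, 556760.0,
                                     124660.0, 124660.0, 763660.0]})

result = overwrite_from(df2, df1)
print(result)
```

A left merge keeps the rows of df2 in their original order, but the result gets a fresh 0..n-1 index, so a non-default index on df2 would need to be restored separately.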

qpgpyjmq #3

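You can use merge or update, but first df1 needs to be filtered, since only rows with day_of_week == 7 matter. That is done with df1.loc[df1['day_of_week'].eq(7), 'statWeek':], which also keeps only the columns from statWeek onward.

Using merge: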

df2.merge(df1.loc[df1['day_of_week'].eq(7), 'statWeek':],
          on=['statWeek', 'statMonth'], how='left')

   statWeek  statMonth  as_cost_perf   cost_eu
0         1          1      344560.0  957940.0
1         2          1      334580.0  942553.0
2         3          1      334523.0       NaN
3         4          1      556760.0       NaN
4         1          2      124660.0       NaN
5         2          2      124660.0       NaN
6         3          2      763660.0       NaN

Using update:

# we need to set the index if we use update
df2 = df2.set_index(['statWeek', 'statMonth'])
# we set the index for df1.loc[...] and rename the cost_eu column to match df2
df2.update(df1.loc[df1['day_of_week'].eq(7), 'statWeek':]\
           .set_index(['statWeek', 'statMonth']).rename(columns={'cost_eu': 'as_cost_perf'}))

print(df2.reset_index())

   statWeek  statMonth  as_cost_perf
0         1          1      957940.0
1         2          1      942553.0
2         3          1      334523.0
3         4          1      556760.0
4         1          2      124660.0
5         2          2      124660.0
6         3          2      763660.0
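
Since the question is about speed on large frames, a rough way to check that a vectorized version gives the same result as the loop, and to time both, might look like the sketch below. The synthetic data, the sizes, and the function names with_iterrows and with_merge are all assumptions made for illustration. Timings depend on the machine and on the data, so none are quoted here.

```python
import time

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
keys = ['statWeek', 'statMonth']

# Synthetic df1: one row per (week, month) pair, so key pairs are unique.
weeks, months = np.meshgrid(np.arange(1, 53), np.arange(1, 13))
df1 = pd.DataFrame({'day_of_week': rng.integers(1, 8, weeks.size),
                    'statWeek': weeks.ravel(),
                    'statMonth': months.ravel(),
                    'cost_eu': rng.random(weeks.size) * 1e6})

# Synthetic df2: many rows with repeated key pairs.
n = 50_000
df2 = pd.DataFrame({'statWeek': rng.integers(1, 53, n),
                    'statMonth': rng.integers(1, 13, n),
                    'as_cost_perf': rng.random(n) * 1e6})


def with_iterrows(df1, df2):
    """The loop from the question, applied to a copy of df2."""
    out = df2.copy()
    for _, row in df1[df1['day_of_week'] == 7].iterrows():
        hit = (out['statWeek'] == row['statWeek']) & (out['statMonth'] == row['statMonth'])
        out.loc[hit, 'as_cost_perf'] = row['cost_eu']
    return out


def with_merge(df1, df2):
    """Left merge on the keys, then take cost_eu where a match exists."""
    src = df1.loc[df1['day_of_week'].eq(7), keys + ['cost_eu']]
    merged = df2.merge(src, on=keys, how='left')
    merged['as_cost_perf'] = merged['cost_eu'].fillna(merged['as_cost_perf'])
    return merged.drop(columns='cost_eu')


t0 = time.perf_counter()
slow = with_iterrows(df1, df2)
t1 = time.perf_counter()
fast = with_merge(df1, df2)
t2 = time.perf_counter()

print(f"iterrows: {t1 - t0:.3f}s  merge: {t2 - t1:.3f}s")
print("same result:", slow.equals(fast))
```

The loop scans all of df2 once per matching row of df1, so its cost grows with the product of the two sizes, while the merge does the key matching in a single pass.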
