pandas: Sum rows from two different data frames based on the value of columns

Posted by pgx2nnw8 on 2023-01-11 in Other


I have two DataFrames:
df1

            ID  Year Primary_Location Secondary_Location  Sales
0           11  2023          NewYork            Chicago    100
1           11  2023             Lyon      Chicago,Paris    200
2           11  2023           Berlin              Paris    300
3           12  2022          Newyork            Chicago    150
4           12  2022             Lyon      Chicago,Paris    250
5           12  2022           Berlin              Paris    400

df2

            ID  Year Primary_Location  Sales
0           11  2023          Chicago    150
1           11  2023            Paris    200
2           12  2022          Chicago    300
3           12  2022            Paris    350

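For each group of rows with the same ID and Year, I want to add Sales from df2 to Sales in df1 wherever the Primary_Location in df2 appears in (is contained in) the Secondary_Location in df1.
For example, for ID=11 and Year=2023, the Lyon row lists Chicago and Paris as its Secondary_Location, so the Sales of both Chicago and Paris from df2 are added to Lyon's Sales.
The new Sales for that row would be 200 + 150 + 200 = 550.
The expected output is: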

df_primary_output


            ID  Year Primary_Location Secondary_Location  Sales
0           11  2023          NewYork            Chicago    250
1           11  2023             Lyon      Chicago,Paris    550
2           11  2023           Berlin              Paris    500
3           12  2022          Newyork            Chicago    400
4           12  2022             Lyon      Chicago,Paris    900
5           12  2022           Berlin              Paris    750
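
As a quick check of the rule described above, here is a minimal sketch for the single Lyon row (ID=11, Year=2023). It assumes that "contained in" means an exact match against the comma-separated items of Secondary_Location, not a substring match; the names `row`, `locations`, `mask` and `new_sales` are only illustrative.

```python
import pandas as pd

# Only the df2 rows for ID=11, Year=2023 (values copied from the question).
df2 = pd.DataFrame({
    "ID": [11, 11],
    "Year": [2023, 2023],
    "Primary_Location": ["Chicago", "Paris"],
    "Sales": [150, 200],
})

# The Lyon row of df1 for the same ID and Year.
row = {"ID": 11, "Year": 2023, "Secondary_Location": "Chicago,Paris", "Sales": 200}

# Split the comma-separated string so each location is matched exactly
# (assumption: no stray whitespace around the commas).
locations = row["Secondary_Location"].split(",")

# Keep the df2 rows with the same ID and Year whose location is listed.
mask = (
    (df2["ID"] == row["ID"])
    & (df2["Year"] == row["Year"])
    & df2["Primary_Location"].isin(locations)
)

new_sales = row["Sales"] + int(df2.loc[mask, "Sales"].sum())
print(new_sales)  # 200 + 150 + 200 = 550
```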

pandas

Source: https://stackoverflow.com/questions/75073470/sum-rows-from-two-different-data-frames-based-on-the-value-of-columns

2 Answers


83qze16e1#

Your problem is not that simple...

Proposed code:

import pandas as pd

# Your dummy dataset
df1 = pd.DataFrame({'ID': [11, 11, 11, 12, 12, 12],
                   'Year': [2023, 2023, 2023, 2022, 2022, 2022],
                   'Primary_Location': ['NewYork', 'Lyon', 'Berlin', 'Newyork', 'Lyon', 'Berlin'],
                   'Secondary_Location': ['Chicago', 'Chicago,Paris', 'Paris', 'Chicago', 'Chicago,Paris', 'Paris'],
                   'Sales': [100, 200, 300, 150, 250, 400]
                   })

df2 = pd.DataFrame({'ID': [11, 11, 12, 12],
                   'Year': [2023, 2023, 2022, 2022],
                   'Primary_Location': ['Chicago', 'Paris', 'Chicago', 'Paris'],
                   'Sales': [150, 200, 300, 350]
                   })

def get_dict(g):
    # Map each Primary_Location in the group to its Sales value
    return {g['Primary_Location'].iloc[i]:g['Sales'].iloc[i] for i in range(len(g))}
# Returns these dictionaries, one per (ID, Year) group:
# {'Chicago': 150, 'Paris': 200}
# {'Chicago': 300, 'Paris': 350}

tot = []
def func(g):
    global df1, tot
    i, somme = [0]*2
    iterdf = df1.iterrows() # Dataframe iterator
    kv = get_dict(g) # Get dicts

    while i < len(df1):
        row = next(iterdf)[1]
        # Select specific df1 rows to modify by ID and Year criteria
        # (iloc[0] rather than iloc[1], so single-row groups also work)
        if g['ID'].iloc[0]==row['ID'] and g['Year'].iloc[0]==row['Year']:
            somme = row['Sales']
            for town in row['Secondary_Location'].split(','):
                if town in kv:
                    somme+=kv[town]
            tot.append(somme)
        i+=1

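# func is called only for its side effect on tot; this relies on df1's rows
# appearing in the same (ID, Year) order as the groups
df2.groupby(['ID', 'Year']).apply(lambda g: func(g))
df1['Sales'] = tot
print(df1)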

Result:

   ID  Year Primary_Location Secondary_Location  Sales
0  11  2023          NewYork            Chicago    250
1  11  2023             Lyon      Chicago,Paris    550
2  11  2023           Berlin              Paris    500
3  12  2022          Newyork            Chicago    450
4  12  2022             Lyon      Chicago,Paris    900
5  12  2022           Berlin              Paris    750

Are you sure the expected result for row 3 is 400 and not 450? (150 + 300 = 450.)
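
The same lookup idea can probably be written without global state. The following is only a sketch under the same assumptions as the answer above (exact, comma-separated location names; at most one df2 row per ID, Year and location); `lookup`, `total_sales` and `new_sales` are illustrative names, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [11, 11, 11, 12, 12, 12],
    "Year": [2023, 2023, 2023, 2022, 2022, 2022],
    "Primary_Location": ["NewYork", "Lyon", "Berlin", "Newyork", "Lyon", "Berlin"],
    "Secondary_Location": ["Chicago", "Chicago,Paris", "Paris",
                           "Chicago", "Chicago,Paris", "Paris"],
    "Sales": [100, 200, 300, 150, 250, 400],
})

df2 = pd.DataFrame({
    "ID": [11, 11, 12, 12],
    "Year": [2023, 2023, 2022, 2022],
    "Primary_Location": ["Chicago", "Paris", "Chicago", "Paris"],
    "Sales": [150, 200, 300, 350],
})

# One dictionary keyed by (ID, Year, location) instead of one dict per group.
lookup = {
    (id_, year, loc): sales
    for id_, year, loc, sales in zip(
        df2["ID"], df2["Year"], df2["Primary_Location"], df2["Sales"]
    )
}

def total_sales(row):
    # Add the df2 Sales of every listed secondary location; unknown ones add 0.
    extra = sum(
        lookup.get((row["ID"], row["Year"], town), 0)
        for town in row["Secondary_Location"].split(",")
    )
    return row["Sales"] + extra

new_sales = df1.apply(total_sales, axis=1)
print(new_sales.tolist())  # [250, 550, 500, 450, 900, 750]
```

Because each row is computed independently, this version does not depend on the row order of df1, and row 3 again comes out as 150 + 300 = 450.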

Answered 2023-01-11

dauxcl2d2#

This should work:

s = 'Secondary_Location'
(df1.assign(Secondary_Location = lambda x: x[s].str.split(','))
.explode(s)
.join(df2.set_index(['ID','Year','Primary_Location'])['Sales'].rename('Sales_2'),on = ['ID','Year',s])
.groupby(level=0)['Sales_2'].sum()
.add(df1['Sales']))

Or:

df3 = (df1.assign(Secondary_Location = df1['Secondary_Location'].str.split(','))
.explode('Secondary_Location'))

(df3[['ID','Year','Secondary_Location']].apply(tuple,axis=1)
.map(df2.set_index(['ID','Year','Primary_Location'])['Sales'])
.groupby(level=0).sum()
.add(df1['Sales']))
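
Both expressions above evaluate to a Series of new Sales values indexed like df1. To get the full frame the question asks for, that Series can be written back to a copy of df1. This is a self-contained sketch of the explode-and-join approach using the data from the question; the names `extra` and `result`, and the `fill_value=0` / `astype(int)` safeguards, are additions for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({
    "ID": [11, 11, 11, 12, 12, 12],
    "Year": [2023, 2023, 2023, 2022, 2022, 2022],
    "Primary_Location": ["NewYork", "Lyon", "Berlin", "Newyork", "Lyon", "Berlin"],
    "Secondary_Location": ["Chicago", "Chicago,Paris", "Paris",
                           "Chicago", "Chicago,Paris", "Paris"],
    "Sales": [100, 200, 300, 150, 250, 400],
})

df2 = pd.DataFrame({
    "ID": [11, 11, 12, 12],
    "Year": [2023, 2023, 2022, 2022],
    "Primary_Location": ["Chicago", "Paris", "Chicago", "Paris"],
    "Sales": [150, 200, 300, 350],
})

s = "Secondary_Location"

# One row per (df1 row, secondary location), then look up the df2 Sales.
extra = (
    df1.assign(**{s: df1[s].str.split(",")})
    .explode(s)
    .join(
        df2.set_index(["ID", "Year", "Primary_Location"])["Sales"].rename("Sales_2"),
        on=["ID", "Year", s],
    )
    .groupby(level=0)["Sales_2"]  # level 0 is the original df1 row index
    .sum()
)

# Write the totals back without modifying df1 in place.
result = df1.assign(Sales=df1["Sales"].add(extra, fill_value=0).astype(int))
print(result)
```

Secondary locations with no matching row in df2 contribute nothing here, since the grouped sum skips the missing values produced by the join.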

Original answer:

s = 'Secondary_Location'
(df1.assign(Secondary_Location = lambda x: x[s].str.split(','))
.explode(s)
.join(df2.set_index(['ID','Year','Primary_Location'])['Sales'].rename('Sales_2'),on = ['ID','Year',s])
.groupby(level=0)
.agg({**dict.fromkeys(df1,'first'),**{s:','.join,'Sales_2':'sum'}})
.assign(Sales = lambda x: x['Sales'] + x['Sales_2'])
.drop('Sales_2',axis=1))

Output:

   ID  Year Primary_Location Secondary_Location  Sales
0  11  2023          NewYork            Chicago    250
1  11  2023             Lyon      Chicago,Paris    550
2  11  2023           Berlin              Paris    500
3  12  2022          Newyork            Chicago    450
4  12  2022             Lyon      Chicago,Paris    900
5  12  2022           Berlin              Paris    750

Answered 2023-01-11
