python-3.x: Joining two DataFrames on a string.find() condition

Posted by ygya80vv on 2023-03-04 in Python
import pandas as pd
  
df1 = pd.DataFrame({'col_name':['12','13','14','15','16','17','18','19','20','21','22','23']})

df2 = pd.DataFrame({'col_name_aggr':['12|13|14', '10|21', '12|15|23'],
                    'color':['Blue', 'Red', 'Green']})

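How can I merge these two DataFrames so that the result contains every row of df1 plus an extra color column whose value is "Blue" when col_name is 12, 13, or 14, "Red" when col_name is 10 or 21, "Green" when col_name is 12, 15, or 23, and "None" otherwise?
As far as I know, when I join DataFrames with merge(), I can only specify the columns to merge on.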

pd.merge(df1, df2, left_on='col_name', right_on='col_name_aggr', how="left")

Is there a way to specify a condition for the merge?
In SQL, I would do this:

SELECT
 df1.*,
 df2.color
FROM df1
left join df2 on CHARINDEX(df1.col_name,df2.col_name_aggr)<>0

Answer #1 (t40tm48m)

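You can reshape df2 from "wide" format into "long" format, so that each value in col_name_aggr gets its own row, and then merge on col_name.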

import pandas as pd

df1 = pd.DataFrame({'col_name':['12','13','14','15','16','17','18','19','20','21','22','23']})

df2 = pd.DataFrame({'col_name_aggr':['12|13|14', '10|21', '12|15|23'],
                    'color':['Blue', 'Red', 'Green']})

df1.merge(
    df2.set_index('color')['col_name_aggr'].str.split('|').apply(pd.Series).stack().reset_index().rename(columns={0: 'col_name'})[['col_name', 'color']]
)
Out[1]: 
  col_name  color
0       12   Blue
1       12  Green
2       13   Blue
3       14   Blue
4       15  Green
5       21    Red
6       23  Green
Update:

Actually, the explode method pointed out by @jezrael is probably better:

df1.merge(
    df2.set_index('color')['col_name_aggr'].str.split('|').explode().reset_index().rename(columns={'col_name_aggr': 'col_name'})
)

Out[2]: 
  col_name  color
0       12   Blue
1       12  Green
2       13   Blue
3       14   Blue
4       15  Green
5       21    Red
6       23  Green
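Both merges above are inner joins, so df1 rows without a match (16 to 20 and 22) are dropped. A minimal sketch of a left-join variant follows, assuming the goal is to keep every df1 row as the question asks. The name long_df2 and the choice to fill unmatched rows with the string 'None' are illustrative assumptions, not part of the original answer.

```python
import pandas as pd

df1 = pd.DataFrame({'col_name': ['12', '13', '14', '15', '16', '17',
                                 '18', '19', '20', '21', '22', '23']})
df2 = pd.DataFrame({'col_name_aggr': ['12|13|14', '10|21', '12|15|23'],
                    'color': ['Blue', 'Red', 'Green']})

# Long format: one row per (col_name, color) pair
long_df2 = (
    df2.set_index('color')['col_name_aggr']
       .str.split('|')
       .explode()
       .reset_index()
       .rename(columns={'col_name_aggr': 'col_name'})
)

# how='left' keeps every df1 row; unmatched rows get NaN in color
out = df1.merge(long_df2, on='col_name', how='left')

# Assumption: the asker wants the literal string 'None' for unmatched rows
out['color'] = out['color'].fillna('None')
print(out)
```

Note that '12' still appears twice in the result (once for Blue, once for Green), because it is listed in two rows of df2.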

Answer #2 (xvw2m8pv)

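You can split the values on |, explode them into a new col_name column, and drop duplicates if you want one row per matching (col_name, color) pair: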

df =  (df2.assign(col_name = df2['col_name_aggr'].str.split('|'))
          .explode('col_name')
          .drop_duplicates(['col_name','color']))
print (df)
  col_name  color
0       12   Blue
0       13   Blue
0       14   Blue
1       10    Red
1       21    Red
2       12  Green
2       15  Green
2       23  Green

out = df1.merge(df, how='left')
print (out)
   col_name  color
0        12   Blue
1        12  Green
2        13   Blue
3        14   Blue
4        15  Green
5        16    NaN
6        17    NaN
7        18    NaN
8        19    NaN
9        20    NaN
10       21    Red
11       22    NaN
12       23  Green

Or, if you need all matching colors combined into a single value per col_name, e.g. joined with ", ":

s =  (df2.assign(col_name_aggr = df2['col_name_aggr'].str.split('|'))
          .explode('col_name_aggr')
          .groupby('col_name_aggr')['color'].agg(', '.join)
          )
print (s)
col_name_aggr
10            Red
12    Blue, Green
13           Blue
14           Blue
15          Green
21            Red
23          Green
Name: color, dtype: object

out = df1.join(s.rename('colors'), how='left', on='col_name')
print (out)
   col_name       colors
0        12  Blue, Green
1        13         Blue
2        14         Blue
3        15        Green
4        16          NaN
5        17          NaN
6        18          NaN
7        19          NaN
8        20          NaN
9        21          Red
10       22          NaN
11       23        Green
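Since the aggregated Series is indexed by the individual values, it can also serve as a lookup table for Series.map instead of a join. The sketch below assumes that one row per df1 entry is wanted and that unmatched entries should show the string 'None'. The name lookup is hypothetical.

```python
import pandas as pd

df1 = pd.DataFrame({'col_name': ['12', '13', '14', '15', '16', '17',
                                 '18', '19', '20', '21', '22', '23']})
df2 = pd.DataFrame({'col_name_aggr': ['12|13|14', '10|21', '12|15|23'],
                    'color': ['Blue', 'Red', 'Green']})

# One comma-joined string of colors per individual value
lookup = (
    df2.assign(col_name=df2['col_name_aggr'].str.split('|'))
       .explode('col_name')
       .groupby('col_name')['color']
       .agg(', '.join)
)

# Series.map looks each col_name up in the index of lookup;
# values not found become NaN, which is replaced by 'None' here (an assumption)
out = df1.assign(colors=df1['col_name'].map(lookup).fillna('None'))
print(out)
```

This keeps df1 at its original 12 rows, which may be preferable when duplicated keys would distort later aggregations.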

Answer #3 (4dc9hkyq)

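Use map to check, for each entry in col_name, which rows of df2["col_name_aggr"] contain it, and collect the matching values from df2["color"]:

import numpy as np

def contain(x):
    # Boolean mask of df2 rows whose aggregated string contains x (plain substring, no regex)
    contained = df2["col_name_aggr"].str.contains(x, regex=False)
    # Join all matching colors, or return NaN when nothing matches
    return ",".join(df2.loc[contained, "color"]) if contained.any() else np.nan

df1["color"] = df1["col_name"].map(contain)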
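For comparison, the SQL join from the question can be approximated with a cross join followed by a filter. This is a sketch, not the answerer's method. It assumes pandas 1.2 or later (for how='cross') and matches whole "|"-separated tokens rather than substrings, so it is stricter than CHARINDEX and str.contains, which would also match '1' inside '12'. The names pairs, mask, and matched are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({'col_name': ['12', '13', '14', '15', '16', '17',
                                 '18', '19', '20', '21', '22', '23']})
df2 = pd.DataFrame({'col_name_aggr': ['12|13|14', '10|21', '12|15|23'],
                    'color': ['Blue', 'Red', 'Green']})

# Every df1 row paired with every df2 row (12 x 3 = 36 rows)
pairs = df1.merge(df2, how='cross')

# Keep pairs where col_name is one of the '|'-separated tokens
mask = [name in aggr.split('|')
        for name, aggr in zip(pairs['col_name'], pairs['col_name_aggr'])]
matched = pairs.loc[mask, ['col_name', 'color']]

# Left join back onto df1 so that unmatched rows are kept with NaN
out = df1.merge(matched, on='col_name', how='left')
print(out)
```

The cross join materializes len(df1) * len(df2) rows, so this is only reasonable for small inputs; the split-and-explode approaches above scale better.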
