pandas AND条件和严格相等

uemypmqf  于 2023-04-18  发布在  其他
关注(0)|答案(1)|浏览(140)

我需要将df 1的DocN列中的发票付款与df 2的TXT列中的数据进行匹配。根据两个数据集中的匹配文章打印文档(DocN)+金额(DocSum)以及相应付款的详细信息(DocP,Date

import numpy as np
import re
data1 = {
"DocN": ['140111038-001', '7314560', '169233301-001','ЕКТ01886853','ЕКТ02126350','30262-19',
'27283-19','746'],
"DocSum": ['358,80', '1487,45', '7458,78','2478,12','9624,95','3247,32',
'3224,25','32587,22'],
"DocArt1" : ['85647', '85475', '21457', '12746', '25472', '58123', '74185', '82274']
}
df1 = pd.DataFrame(data1)

data2 = {
"TXT": ['payment by document 30262-19, 30317-19, 30329-19, 31270-19, 32038-19, 26713-19,26715-19, ЕКТ01886853 ',
'payment by document 26721-19, 26748-19, 29835-19, 31112-19, 26746-19, 30041-19, 23150-19, ',
'payment by document 23525-19, 25050-19, 26244-19, 27997-19, 28032-19,30278-19, ЕКТ01886853',
'payment by document 29227-19, 29713-19, 27283-19, 32003-19, 29235-19, 29888-19, 7314560',
'payment by document 175634096-001, 175634109-001, 175623281-001,175638863-001, 140111038-001, 7314560'],
"DocP": [112, 113, 114, 115, 116],
"Date": ["25.01.2022", "26.01.2022", "27.01.2022", "28.01.2022", "29.01.2022"],
"DocArt2" : ['12746','74585','25489','85475','85875']
}

df2 = pd.DataFrame(data2)

print(df1)
print(df2)

我正在申请:
df1.join(df1.DocN.apply(lambda x: pd.Series(df2.loc[df2['TXT'].str.contains(fr'\b{x}\b')& (df1['DocArt1'] == df2['DocArt2']),['DocP','Date']].to_dict('list'))))
我期待:
| 指数|DocN|DocSum|DocArt1|DocP|日期|
| --------------|--------------|--------------|--------------|--------------|--------------|
| 0|140111038-001|358,80|85647|一百一十六|2022年1月29日|
| 1|七三一四五六零|1487.45|85475|一百一十五|2022年1月28日|
| 二|169233301-001|七四五八,七十八|二一四五七|||
| 三|沪公网安备31010502000112号|2478.12|12746|一百一十二|25.01.2022|
| 四|沪公网安备31010502000112号|9624,95|25472|||
| 五|30262-19|3247,32|五八一二三|一百一十二|25.01.2022|
| 六|27283-19|3224.25|七四一八五|一百一十五|2022年1月28日|
| 七|七四六|32587,22|82274|||

mzillmmw

mzillmmw1#

第一个解决方案是Series.str.findallDataFrame.explode,如果需要,则通过DocArtDocN两列进行匹配:

pat = r"\b({})\b".format("|".join(re.escape(x) for x in df1.DocN))

need = ['DocArt2','DocN','DocP','Date']
df22 = df2.assign(DocN= df2['TXT'].str.findall(pat)).explode('DocN')[need]
#alternative solution
#df22 = df2.join(df2['TXT'].str.extractall(pat)[0].droplevel(1).rename('DocN'))[need]
print (df22)
  DocArt2           DocN  DocP        Date
0   12746       30262-19   112  25.01.2022
0   12746    ЕКТ01886853   112  25.01.2022
1   74585            NaN   113  26.01.2022
2   25489    ЕКТ01886853   114  27.01.2022
3   85475       27283-19   115  28.01.2022
3   85475        7314560   115  28.01.2022
4   85875  140111038-001   116  29.01.2022
4   85875        7314560   116  29.01.2022

df = df1.merge(df22, left_on=['DocArt1','DocN'], right_on=['DocArt2','DocN'], how='left')
print (df)

            DocN    DocSum DocArt1 DocArt2   DocP        Date
0  140111038-001    358,80   85647     NaN    NaN         NaN
1        7314560   1487,45   85475   85475  115.0  28.01.2022
2  169233301-001   7458,78   21457     NaN    NaN         NaN
3    ЕКТ01886853   2478,12   12746   12746  112.0  25.01.2022
4    ЕКТ02126350   9624,95   25472     NaN    NaN         NaN
5       30262-19   3247,32   58123     NaN    NaN         NaN
6       27283-19   3224,25   74185     NaN    NaN         NaN
7            746  32587,22   82274     NaN    NaN         NaN

如果需要仅通过DocN匹配,并通过相同的DocN聚合:

pat = r"\b({})\b".format("|".join(re.escape(x) for x in df1.DocN))

f = lambda x: ', '.join(x.astype(str))
df22 = (df2.assign(DocN= df2['TXT'].str.findall(pat)).explode('DocN')
           .groupby('DocN')[['DocP','Date']].agg(f))
print (df22)
                   DocP                    Date
DocN                                           
140111038-001       116              29.01.2022
27283-19            115              28.01.2022
30262-19            112              25.01.2022
7314560        115, 116  28.01.2022, 29.01.2022
ЕКТ01886853    112, 114  25.01.2022, 27.01.2022

df = df1.merge(df22, on='DocN', how='left')
print (df)
            DocN    DocSum DocArt1      DocP                    Date
0  140111038-001    358,80   85647       116              29.01.2022
1        7314560   1487,45   85475  115, 116  28.01.2022, 29.01.2022
2  169233301-001   7458,78   21457       NaN                     NaN
3    ЕКТ01886853   2478,12   12746  112, 114  25.01.2022, 27.01.2022
4    ЕКТ02126350   9624,95   25472       NaN                     NaN
5       30262-19   3247,32   58123       112              25.01.2022
6       27283-19   3224,25   74185       115              28.01.2022
7            746  32587,22   82274       NaN                     NaN

因为否则会得到重复的新行:

pat = r"\b({})\b".format("|".join(re.escape(x) for x in df1.DocN))

need = ['DocN','DocP','Date']
df22 = df2.assign(DocN= df2['TXT'].str.findall(pat)).explode('DocN')[need]
#alternative solution
#df22 = df2.join(df2['TXT'].str.extractall(pat)[0].droplevel(1).rename('DocN'))[need]
print (df22)
            DocN  DocP        Date
0       30262-19   112  25.01.2022
0    ЕКТ01886853   112  25.01.2022
1            NaN   113  26.01.2022
2    ЕКТ01886853   114  27.01.2022
3       27283-19   115  28.01.2022
3        7314560   115  28.01.2022
4  140111038-001   116  29.01.2022
4        7314560   116  29.01.2022

df = df1.merge(df22, on='DocN', how='left')
print (df)
            DocN    DocSum DocArt1   DocP        Date
0  140111038-001    358,80   85647  116.0  29.01.2022
1        7314560   1487,45   85475  115.0  28.01.2022
2        7314560   1487,45   85475  116.0  29.01.2022
3  169233301-001   7458,78   21457    NaN         NaN
4    ЕКТ01886853   2478,12   12746  112.0  25.01.2022
5    ЕКТ01886853   2478,12   12746  114.0  27.01.2022
6    ЕКТ02126350   9624,95   25472    NaN         NaN
7       30262-19   3247,32   58123  112.0  25.01.2022
8       27283-19   3224,25   74185  115.0  28.01.2022
9            746  32587,22   82274    NaN         NaN

相关问题