基于正则表达式追加Pandas Dataframe 的列

ss2ws0br  于 2023-01-28  发布在  其他
关注(0)|答案(2)|浏览(135)

我有两个 Dataframe ,我想根据正则表达式追加。如果df1的"code"列中的值(例如R93)与df2的"ICD_CODE"(例如R93)匹配,则将"code"列值追加到df2。

df1
code
R93.2
S03

df2
ICD_CODE    ICD_term                        MDR_code    MDR_term    
R93.1       Acute abdomen                   10000647    Acute abdomen   
K62.4       Stenosis of anus and rectum     10002581    Anorectal stenosis
S03.1       Hand-Schüller-Christian disease 10053135    Hand-Schueller-Christian disease

预期输出为:

code    ICD_CODE    ICD_term                        MDR_code    MDR_term    
R93.2   R93.1       Acute abdomen                   10000647    Acute abdomen   
S03     S03.1       Hand-Schüller-Christian disease 10053135    Hand-Schueller-Christian disease

任何帮助都是高度赞赏!

yhqotfr8

yhqotfr81#

保留每个代码列的左侧部分(点之前)作为merge键:

out = (df1.merge(df2, left_on=df1['code'].str.split('.').str[0], 
                right_on=df2['ICD_CODE'].str.split('.').str[0])
          .drop(columns='key_0'))
print(out)

# Output
    code ICD_CODE                         ICD_term  MDR_code                          MDR_term
0  R93.2    R93.1                    Acute abdomen  10000647                     Acute abdomen
1    S03    S03.1  Hand-Schüller-Christian disease  10053135  Hand-Schueller-Christian disease
bqf10yzr

bqf10yzr2#

一个可能的 * 解决方案 * 是从fuzzywuzzy使用process.extractOne

#pip install fuzzywuzzy
from fuzzywuzzy import process
​
out = (df1.assign(matched_code=df1["code"].apply(lambda x: process.extractOne(x, df2["ICD_CODE"])[0]))
          .merge(df2, left_on="matched_code", right_on="ICD_CODE")
          .drop(columns="matched_code")
       )

输出:

print(out)
​
    code ICD_CODE                         ICD_term  MDR_code                          MDR_term
0  R93.2    R93.1                    Acute abdomen  10000647                     Acute abdomen
1    S03    S03.1  Hand-Schüller-Christian disease  10053135  Hand-Schueller-Christian disease

相关问题