pandas: How do I find the best string match out of multiple possibilities in a DataFrame?

Posted by oxf4rvwz on 2023-01-15 in Other


I have a DataFrame that looks like this:

Row  Master                    Option1               Option2
1    00150042 plc              WAGON PLC             wegin llp
2    01 telecom, ltd.          01 TELECOM LTD        telecom 1
3    0404 investments limited  0404 Investments Ltd  404 Limited Investments

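What I am trying to do is compare the Option1 and Option2 columns against the Master column, one at a time, and get a similarity score for each.
I already have code that produces the score: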

from difflib import SequenceMatcher


def similar(a, b):
    # Similarity ratio between two strings, from 0.0 (nothing in common) to 1.0 (identical)
    return SequenceMatcher(None, a, b).ratio()
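For reference, ratio() returns 2*M/T, where M is the number of matched characters and T is the combined length of both strings. The comparison is case-sensitive, which matters for data like the sample above, where the casing differs between columns. A small sketch of both points (the sample strings here are illustrative only, not taken from the DataFrame):

```python
from difflib import SequenceMatcher


def similar(a, b):
    # Same helper as above: similarity ratio in the range [0, 1].
    return SequenceMatcher(None, a, b).ratio()


# Three characters line up ("bcd"), so the ratio is 2 * 3 / 8.
print(similar("abcd", "bcde"))  # 0.75

# Matching is case-sensitive: these two strings share no characters.
print(similar("PLC", "plc"))  # 0.0

# Lowercasing both sides first makes the comparison case-insensitive.
print(similar("PLC".lower(), "plc".lower()))  # 1.0
```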

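What I need help with is how to apply this logic to the DataFrame.
Should it be a for loop that iterates over the Option1 and Master columns, saves the scores in a new column named Option1_score, and then does the same for the Option2 column?
Any help is highly appreciated!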

pandas

Source: https://stackoverflow.com/questions/75110664/how-to-find-best-string-match-out-of-multiple-possibilities-in-a-dataframe

1 answer


wsewodh21#

Using the DataFrame you provided:

import pandas as pd

df = pd.DataFrame(
    {
        "Row": [1, 2, 3],
        "Master": ["00150042 plc", "01 telecom, ltd.", "0404 investments limited"],
        "Option1": ["WAGON PLC", "01 TELECOM LTD", "0404 Investments Ltd"],
        "Option2": ["wegin llp", "telecom 1", "404 Limited Investments"],
    }
)

Here is one way to do it, using Python f-strings and pandas apply (this reuses the similar function from the question):

for col in ["Option1", "Option2"]:
    df[f"{col}_score(%)"] = df.apply(
        lambda x: round(similar(x["Master"], x[col]) * 100, 1), axis=1
    )

Then:

print(df)
# Output
   Row                    Master               Option1  \
0    1              00150042 plc             WAGON PLC   
1    2          01 telecom, ltd.        01 TELECOM LTD   
2    3  0404 investments limited  0404 Investments Ltd   

                   Option2  Option1_score(%)  Option2_score(%)  
0                wegin llp               9.5              19.0  
1                telecom 1              26.7              64.0  
2  404 Limited Investments              81.8              63.8
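The low Option1 scores in the first two rows come from SequenceMatcher being case-sensitive ("WAGON PLC" and "00150042 plc" share only the space). If casing should not count, and if the goal is to pick the best option for each row rather than only score them, something along these lines might work. This is a minimal sketch, not part of the original answer: lowercasing inside similar is an assumption about what "best" should mean here, and the best_option, best_match and best_score(%) column names are made up for illustration.

```python
from difflib import SequenceMatcher

import pandas as pd


def similar(a, b):
    # Assumption: casing is irrelevant, so lowercase both strings first.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


df = pd.DataFrame(
    {
        "Row": [1, 2, 3],
        "Master": ["00150042 plc", "01 telecom, ltd.", "0404 investments limited"],
        "Option1": ["WAGON PLC", "01 TELECOM LTD", "0404 Investments Ltd"],
        "Option2": ["wegin llp", "telecom 1", "404 Limited Investments"],
    }
)

option_cols = ["Option1", "Option2"]

# One score column per option, as a fraction between 0 and 1.
scores = pd.DataFrame(
    {
        col: [similar(m, o) for m, o in zip(df["Master"], df[col])]
        for col in option_cols
    },
    index=df.index,
)

# Name of the highest-scoring option column for each row (hypothetical column names).
df["best_option"] = scores.idxmax(axis=1)
# The matching string itself, looked up in that column.
df["best_match"] = [df.at[i, col] for i, col in df["best_option"].items()]
df["best_score(%)"] = (scores.max(axis=1) * 100).round(1)

print(df[["Master", "best_option", "best_match", "best_score(%)"]])
```

With more candidate columns, only option_cols needs to change. When two options tie, idxmax keeps the first one listed.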

Answered 2023-01-15
