使用fuzzy wuzzy比较来自不同 Dataframe 的2列的部分匹配

w46czmvw  于 2021-09-29  发布在  Java
关注(0)|答案(1)|浏览(556)

我要比较此 Dataframe df1:

Product  Price
0               Waterproof Liner     40
1                   Phone Tripod     50
2               Waterproof Pants      0
3             baby Kids play Mat    985
4               Hiking BACKPACKS     34
5                security Camera    160

使用df2,如下所示:

Product      Id
0                    Home Security IP Camera  508760
1         Hiking Backpacks – Spring Products  287950
2                   Waterproof Eyebrow Liner  678897
3          Waterproof Pants – Winter Product  987340
4  Baby Kids Water Play Mat – Summer Product  111500

我想比较df1和df2中的产品列。以便找到产品的良好id。如果相似度<80,则会在id字段中输入'remove':df1和df2中产品列的文本不是100%匹配的。有人能帮我吗?或者我如何使用fuzzy wazzy获得良好的id?
这是我的密码

import pandas as pd
from fuzzywuzzy import process

data1 = {'Product1': ['Waterproof Liner','Phone Tripod','Waterproof Pants','baby Kids play Mat','Hiking BACKPACKS','security Camera'],
'Price':[40,50,0,985,34,160]}

data2 = {'Product2': ['Home Security IP Camera','Hiking Backpacks – Spring Products','Waterproof Eyebrow Liner',
        'Waterproof Pants – Winter Product','Baby Kids Water Play Mat – Summer Product'],
        'Id': [508760,287950,678897,987340,111500],}

df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)

dfm = pd.DataFrame(df1["Product1"].apply(lambda x: process.extractOne(x, df2["Product2"]))
                                   .tolist(), columns=['Product1',"match_comp", "Id"])

我得到的是:

Product1  match_comp  Id
0                   Waterproof Eyebrow Liner          86   2
1                   Waterproof Eyebrow Liner          50   2
2          Waterproof Pants – Winter Product          90   3
3  Baby Kids Water Play Mat – Summer Product          86   4
4         Hiking Backpacks – Spring Products          90   1
5                    Home Security IP Camera          86   0

预计会是什么:

Product  Price      ID
0    Waterproof Liner     40  678897
1        Phone Tripod     50  Remove
2    Waterproof Pants      0  987340
3  baby Kids play Mat    985  111500
4    Hiking BACKPACKS     34  287950
5     security Camera    160  508760
lp0sw83n

lp0sw83n1#

您可以创建一个 Package 器函数:

def extract(s):
    name,score,_ = process.extractOne(s, df2["Product2"], score_cutoff=0)
    if score < 80:
        return 'Remove'
    return df2.set_index('Product2').loc[name, 'Id']

df1['ID'] = df1["Product1"].apply(extract)

输出:

Product1  Price      ID
0    Waterproof Liner     40  678897
1        Phone Tripod     50  Remove
2    Waterproof Pants      0  987340
3  baby Kids play Mat    985  111500
4    Hiking BACKPACKS     34  287950
5     security Camera    160  508760

注意。输出不完全是您所期望的,您必须解释为什么应该删除第4/5行

相关问题