pandas: I need a way to compare two strings in Python without using sets, in a pandas DataFrame

muk1a3rh, posted on 2022-11-20 in Python

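I am currently working with a large CSV file, and I need to find and print the similarity between a selected row and the other rows. For example, if the first string is "Card" and the second string is "Credit Card Debit Card", it should return 2; if the first string is "Credit Card" and the second is "Credit Card Debit Card", it should return 3, because 3 words match the first string. I tried solving this with sets, but since a set holds only unique elements and drops duplicates, the first example returns 1 instead of 2, because "Credit Card Debit Card" as a set is {"Credit", "Card", "Debit"}. Is there a way to compute this? The similarity formula is ((numberOfSameWords) / whichStringisLonger) * 100, as explained in the picture attached to the original post.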

I tried many approaches such as Jaccard similarity, but they all use sets and return the wrong answer. Thanks for your help. The code I tried to run is:

import numpy as np

def test(row1, row2):
    return str(round(len(np.intersect1d(row1.split(), row2.split())) / max(len(row1.split()), len(row2.split()))*100, 2))

data = int(input("Which index should be tested:"))
for j in range(0,10):
    print(test(dff['Product'].iloc[data], dff['Product'].iloc[j]))
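
For context, here is a minimal sketch of why the set-based attempts undercount. `np.intersect1d` returns the sorted unique values common to both inputs, so it behaves like a set intersection here. The sketch uses plain Python sets to show the same effect; the variable names are illustrative and not from the post.

```python
first = "Card"
second = "Credit Card Debit Card"

# A set keeps one copy of each word, so the repeated "Card" collapses.
second_as_set = set(second.split())
set_overlap = len(set(first.split()) & second_as_set)

# Walking the word list instead keeps every occurrence of a matching word.
first_words = set(first.split())
list_overlap = sum(1 for word in second.split() if word in first_words)

print(set_overlap, list_overlap)
```

The first count is what a set intersection reports; the second is the count the question asks for.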

My DataFrame currently looks like this:

print(df.sample(10).to_dict("list")) returns:

{'Product': ['Bank account or service', 'Credit card', 'Credit reporting', 'Credit reporting credit repair services or other personal consumer reports', 'Credit reporting', 'Mortgage', 'Debt collection', 'Mortgage', 'Mortgage', 'Credit reporting'], 'Issue': ['Deposits and withdrawals', 'Billing disputes', 'Incorrect information on credit report', "Problem with a credit reporting company's investigation into an existing problem", 'Incorrect information on credit report', 'Applying for a mortgage or refinancing an existing mortgage', 'Disclosure verification of debt', 'Loan servicing payments escrow account', 'Loan servicing payments escrow account', 'Incorrect information on credit report'], 'Company': ['CITIBANK NA', 'FIRST NATIONAL BANK OF OMAHA', 'EQUIFAX INC', 'Experian Information Solutions Inc', 'Experian Information Solutions Inc', 'BANK OF AMERICA NATIONAL ASSOCIATION', 'AllianceOne Recievables Management', 'SELECT PORTFOLIO SERVICING INC', 'OCWEN LOAN SERVICING LLC', 'Experian Information Solutions Inc'], 'State': ['CA', 'WA', 'FL', 'UT', 'MI', 'CA', 'WA', 'IL', 'TX', 'CA'], 'ZIP_code': ['92606', '98272', '329XX', '84321', '486XX', '94537', '984XX', '60473', '76247', '91401'], 'Complaint_ID': [90452, 2334443, 1347696, 2914771, 1788024, 2871939, 1236424, 1619712, 2421373, 1803691]}

uqcuzwp8 1#

You can try the following:

import pandas as pd

l1 = ["Debt collection", "Debt collection", "Managing loan lease", "Managing loan lease",
      "Credit reporting credit repair services personal consumer reports", "Credit reporting credit repair services personal consumer report"]
l2 = ["Debt collection", "Mortgage", "Problems end loan lease", "Struggling pay loan",
      "Payday loan title loan personal loan", "Credit card prepaid card"]

df = pd.DataFrame(l1, columns=["col1"])
df["col2"] = l2

def similarity(row1, row2):
    # compare case-insensitively by upper-casing every word
    wordsRow1 = [x.upper() for x in row1.split()]
    wordsRow2 = [x.upper() for x in row2.split()]
    # distinct words that appear in both sentences
    common = set(wordsRow1).intersection(wordsRow2)
    # pick the longer sentence (the second one on a tie)
    longestRow = wordsRow1 if len(wordsRow1) > len(wordsRow2) else wordsRow2
    # count every occurrence of a common word in the longer sentence
    commonWords = calculate(common, longestRow)
    return (commonWords / len(longestRow)) * 100

def calculate(common, longestRow):
    # sum the occurrences of each common word, so duplicates are counted
    total = 0
    for word in common:
        total += longestRow.count(word)
    return total

df['similarity'] = df.apply(lambda x: similarity(x.col1, x.col2), axis=1)

print(df)
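
As a quick check against the two examples from the question, the same idea can be written as a compact standalone function. This is a sketch, not the answer's code: the name `similarity_pct` and the empty-input guard are assumptions added for illustration.

```python
def similarity_pct(text_a, text_b):
    """Percentage of words in the longer text that also occur in the shorter one."""
    words_a = text_a.upper().split()
    words_b = text_b.upper().split()
    # Same tie rule as the answer: the second text counts as the longer one.
    if len(words_a) > len(words_b):
        longer, shorter = words_a, words_b
    else:
        longer, shorter = words_b, words_a
    if not longer:
        # Assumption: two empty strings score 0 instead of dividing by zero.
        return 0.0
    shorter_vocab = set(shorter)
    # Iterate over the list, not a set, so repeated words are counted each time.
    matches = sum(1 for word in longer if word in shorter_vocab)
    return matches / len(longer) * 100

print(similarity_pct("Card", "Credit Card Debit Card"))
print(similarity_pct("Credit Card", "Credit Card Debit Card"))
```

With the question's formula, 2 matching words out of 4 gives 50 and 3 out of 4 gives 75.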

8e2ybdfx 2#

You can use numpy.intersect1d to get the common words, but the percentage for the third row comes out different.

import numpy as np

df["Similarity_%"] = (
                        df.apply(lambda x: "%" + str(round(len(np.intersect1d(x['Col1'].split(), x['Col2'].split()))
                                                          / max(len(x["Col1"].split()), len(x["Col2"].split()))
                                                          *100, 2)), axis=1)
                     )
# Output:
print(df)
                                                                Col1                                  Col2 Similarity_%
0                                                    Debt collection                       Debt collection       %100.0
1                                                    Debt collection                              Mortgage         %0.0
2                                                Managing loan lease               Problems end loan lease        %50.0
3                                                Managing loan lease                   Struggling pay loan       %33.33
4  Credit reporting credit repair services personal consumer reports  Payday loan title loan personal loan        %12.5
# Input used:
import pandas as pd

df= pd.DataFrame({'Col1': ['Debt collection', 'Debt collection', 'Managing loan lease', 'Managing loan lease', 
                           'Credit reporting credit repair services personal consumer reports'],
                  'Col2': ['Debt collection', 'Mortgage', 'Problems end loan lease', 'Struggling pay loan',
                           'Payday loan title loan personal loan']})
# Update:

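Based on the second DataFrame given in the question, you can use a cross join (via pandas.DataFrame.merge) to compare each row of the Product column with every row of that same column.
Try this: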

out = df[["Product"]].merge(df[["Product"]], how="cross", suffixes=("", "_cross"))

out["Similarity_%"] = (
                        out.apply(lambda x: "%" + str(round(len(np.intersect1d(x['Product'].split(), x['Product_cross'].split()))
                                                          / max(len(x["Product"].split()), len(x["Product_cross"].split()))
                                                          *100, 2)), axis=1)
                     )

For a DataFrame/column with 10 rows, the result will have 100 rows plus a similarity column.
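
To see where the row count comes from without pandas, the pairing can be sketched with `itertools.product` from the standard library. This is a stand-in for the cross join, not the merge itself; `shared_word_pct` is a hypothetical helper that mirrors the unique-word ratio used above, and the three product names are taken from the question's sample.

```python
from itertools import product

products = ["Credit card", "Credit reporting", "Mortgage"]

def shared_word_pct(left, right):
    # Unique common words over the longer word count, as in the expression above.
    left_words, right_words = left.split(), right.split()
    shared = set(left_words) & set(right_words)
    return round(len(shared) / max(len(left_words), len(right_words)) * 100, 2)

# Every value is paired with every value, itself included: n * n pairs.
pairs = [
    (left, right, shared_word_pct(left, right))
    for left, right in product(products, repeat=2)
]

for left, right, pct in pairs:
    print(f"{left!r} vs {right!r}: {pct}")
```

Three input values yield nine pairs, just as ten rows yield one hundred in the cross join.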

bf1o4zei 3#

You can try:

def countCommonWords( string1, string2):
  words1 = string1.lower().split()
  words2 = string2.lower().split()
  n = 0
  for word1 in words1:
    if word1 in words2: n+=1
  return n

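Note that the function counts the words of string1 that appear in string2, so the argument order matters:
countCommonWords('a B c a','c b a') returns 4,
but:
countCommonWords('c B a','a b c a') returns 3. This may be the solution you need.
We don't know whether your search string contains repeated words.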
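
If the order dependence is unwanted, one option is to always iterate over the string with more words. The sketch below restates the function in snake_case and adds a hypothetical wrapper, `count_common_words_longer_first`; neither name is from the answer.

```python
def count_common_words(string1, string2):
    # Restated from the answer: words of string1 that also occur in string2.
    words2 = string2.lower().split()
    return sum(1 for word in string1.lower().split() if word in words2)

def count_common_words_longer_first(string1, string2):
    # Hypothetical wrapper: swap the arguments so the longer string is iterated.
    if len(string1.split()) < len(string2.split()):
        string1, string2 = string2, string1
    return count_common_words(string1, string2)

print(count_common_words_longer_first("Card", "Credit Card Debit Card"))
print(count_common_words_longer_first("Credit Card Debit Card", "Card"))
```

With the wrapper, both call orders give the same count, which matches the behavior the question describes.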
