I have a Python list of string names, and I would like to remove a common substring from all of the names.
After reading this similar answer, I could almost achieve the desired result using SequenceMatcher, but only when all items share a common substring:
From List:
string 1 = myKey_apples
string 2 = myKey_appleses
string 3 = myKey_oranges
common substring = "myKey_"
To List:
string 1 = apples
string 2 = appleses
string 3 = oranges
However I have a slightly noisy list that contains a few scattered items that don't fit the same naming convention.
I would like to remove the "most common" substring from the majority:
From List:
string 1 = myKey_apples
string 2 = myKey_appleses
string 3 = myKey_oranges
string 4 = foo
string 5 = myKey_Banannas
common substring = ""
To List:
string 1 = apples
string 2 = appleses
string 3 = oranges
string 4 = foo
string 5 = Banannas
I need a way to match the "myKey_" substring so I can remove it from all names.
But when I use SequenceMatcher, the item "foo" causes the "longest match" to be the empty string "".
I think the only way to solve this is to find the "most common substring". But how could that be accomplished?
Basic example code:
from difflib import SequenceMatcher

names = ["myKey_apples",
         "myKey_appleses",
         "myKey_oranges",
         #"foo",
         "myKey_Banannas"]

# Compare each name with the one that follows it in the list
string2 = names[0]
for i in range(1, len(names)):
    string1 = string2
    string2 = names[i]
    match = SequenceMatcher(None, string1, string2).find_longest_match(0, len(string1), 0, len(string2))
    print(string1[match.a: match.a + match.size])  # -> myKey_
4 Answers
p1iqtdky1#
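Given
names = ["myKey_apples", "myKey_appleses", "myKey_oranges", "foo", "myKey_Banannas"]
one O(n^2) solution I can think of is to find all possible substrings and store them in a dictionary together with the number of times they occur, then pick the substring with the most occurrences.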
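A minimal sketch of one way to do this counting, not necessarily the answerer's exact code. It assumes the candidate substrings are the longest matches between every pair of names (found with SequenceMatcher.find_longest_match) and that empty matches are skipped so they cannot win the count. The helper name most_common_substring is hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher
from itertools import combinations

names = ["myKey_apples", "myKey_appleses", "myKey_oranges", "foo", "myKey_Banannas"]

def most_common_substring(strings):
    """Hypothetical helper: return the pairwise longest match seen most often."""
    counts = Counter()
    # Compare every pair of names, not just neighbours in the list
    for a, b in combinations(strings, 2):
        m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
        if m.size:  # assumption: ignore empty matches caused by outliers such as "foo"
            counts[a[m.a:m.a + m.size]] += 1
    return counts.most_common(1)[0][0] if counts else ""

common = most_common_substring(names)
# Remove only the first occurrence; names without the substring stay unchanged
cleaned = [name.replace(common, "", 1) for name in names]
print(common)   # myKey_
print(cleaned)  # ['apples', 'appleses', 'oranges', 'foo', 'Banannas']
```

The pairwise loop performs a number of comparisons that grows quadratically with the number of names, which may matter for long lists.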
cyvaqqii2#
Here is a detailed solution to your problem:
odopli943#
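I would first find the most frequent starting letter, then take every word that starts with that letter and keep collecting letters for as long as all of those words have the same letter at each position, and finally strip the resulting prefix from each of those words: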
Out: ["apples", "apples", "oranges", "berries"]
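The steps described in this answer could be sketched roughly as follows. This is an illustration under stated assumptions rather than the answerer's code: it uses the names list from the question, assumes every name is a non-empty string, and the function name strip_majority_prefix is hypothetical.

```python
from collections import Counter

names = ["myKey_apples", "myKey_appleses", "myKey_oranges", "foo", "myKey_Banannas"]

def strip_majority_prefix(words):
    """Hypothetical helper: strip the prefix shared by the largest first-letter group."""
    if not words:
        return []
    # 1. Find the most frequent starting letter
    first_letter, _ = Counter(w[0] for w in words).most_common(1)[0]
    # 2. Keep only the words that start with that letter
    group = [w for w in words if w.startswith(first_letter)]
    # 3. Collect letters while every word in the group agrees at that position
    prefix_chars = []
    for chars in zip(*group):  # zip stops at the shortest word
        if len(set(chars)) != 1:
            break
        prefix_chars.append(chars[0])
    prefix = "".join(prefix_chars)
    # 4. Strip the prefix only from the words that actually have it
    return [w[len(prefix):] if w.startswith(prefix) else w for w in words]

print(strip_majority_prefix(names))
# ['apples', 'appleses', 'oranges', 'foo', 'Banannas']
```

Note that this only handles a shared prefix; a common substring in the middle of the names would not be found this way.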
5rgfhyps4#
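Find the most-common-substring from a list of python-string values
I have tested this on python-3.10.5 and I hope it works for you. I had the same use case but a different task: I only needed to find one common-pattern-string from a list of hundreds of files, to use as a regular-expression.
Your basic example code did not work in my case, because it checks the first item against the second, the second against the third, the third against the fourth, and so on. So I changed it to look for the most common substring by checking every item.
The drawback of this code is that if any item does not share the most common substring, the final most common substring will be empty.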
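To illustrate the drawback described above, here is a small sketch. It is an assumption about how such a check-every-item approach might behave, not the answerer's code: a single running candidate is narrowed against each name in turn, and the helper longest_common is hypothetical.

```python
from difflib import SequenceMatcher
from functools import reduce

def longest_common(a, b):
    """Hypothetical helper: longest contiguous substring shared by a and b."""
    m = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[m.a:m.a + m.size]

clean = ["myKey_apples", "myKey_appleses", "myKey_oranges", "myKey_Banannas"]
noisy = clean + ["foo"]

# Narrow one running candidate against every name in the list
print(repr(reduce(longest_common, clean)))  # 'myKey_'
print(repr(reduce(longest_common, noisy)))  # ''
```

A single outlier such as "foo" is enough to collapse the running candidate to the empty string, which is why counting matches across pairs tends to be more robust for noisy lists.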
python python-3 python-difflib