regex findall with overlapped=True gives no match if one pattern is a prefix of the other

a14dhokn  posted on 2023-03-04  in  Other
import regex

product_detail = "yyy target1 target2 xxx".lower()
p1 = r"\btarget1\b|\btarget1 target2\b"
p2 = r"\btarget2\b|\btarget1 target2\b"
for pattern in [p1, p2]:
    matches = regex.findall(pattern, product_detail, overlapped=True)
    print(matches)

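Why does the match from p1 give only ['target1'] as output, without 'target1 target2',
while the match from p2 successfully gives ['target1 target2', 'target2']?
Also, if you can provide a fix, how do I generalize it? I have a list of 10,000 target words, so hard-coding the patterns is not feasible.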

x4shl7ld1#

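Here is an example of what I have in mind: build a list of patterns so that keywords sharing a common prefix end up in separate patterns: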

import regex  # I'm actually using re (don't have regex)

product_detail = "yyy target1 target2 xxx".lower()

keywords = ["target1","target2","target1 target2","target3"]

from itertools import accumulate, groupby, zip_longest

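# Sort so that every keyword is directly followed by the keywords that extend it.
keywords.sort()
# For each keyword, yield the root of its prefix group: the root is kept
# while the next keyword starts with it, otherwise that keyword becomes the new root.
groups   = accumulate(keywords,lambda g,k:g if k.startswith(g) else k)
# Collect the keywords of each prefix group into a list.
patterns = ( g for _,(*g,) in groupby(keywords,lambda _:next(groups)) )
# Take the i-th keyword of every group, so no pattern holds both a keyword and its prefix.
patterns = ( filter(None,g) for g in zip_longest(*patterns) )
patterns = [r"\b" + r"\b|\b".join(g) + r"\b" for g in patterns]

# [r'\btarget1\b|\btarget2\b|\btarget3\b', r'\btarget1 target2\b']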

for pattern in patterns:
    matches = regex.findall(pattern, product_detail)
    print(matches)

Output:

['target1', 'target2']
['target1 target2']
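
The reason p1 behaves as it does appears to be that an overlapped search still reports at most one match per starting position, and an alternation takes the first alternative that succeeds there. In p1, `\btarget1\b` wins at the position where `target1 target2` also starts, so the longer keyword is never reported; in p2 the two matches start at different positions. The sketch below is one way the grouping idea could be generalized to a large keyword list. It is an illustration under assumptions, not a definitive implementation: `build_patterns` and `find_keywords` are hypothetical helper names, it uses the standard-library `re` with a lookahead to emulate overlapped matching instead of `regex`, and it assumes every keyword is non-empty and begins and ends with a word character so that `\b` behaves as intended.

```python
import re
from itertools import zip_longest


def build_patterns(keywords):
    """Split keywords into patterns in which no keyword is a prefix of another.

    Hypothetical helper: it mirrors the grouping idea above with a plain loop.
    """
    groups = []  # each group: a root keyword, then the keywords that start with it
    for kw in sorted(set(keywords)):
        if groups and kw.startswith(groups[-1][0]):
            groups[-1].append(kw)
        else:
            groups.append([kw])

    patterns = []
    # The i-th pattern takes the i-th keyword of every group.
    for rank in zip_longest(*groups):
        alternatives = "|".join(re.escape(kw) for kw in rank if kw is not None)
        # The lookahead consumes no text, so matches that start at different
        # positions can overlap (assumption: this stands in for overlapped=True).
        patterns.append(re.compile(r"(?=\b(" + alternatives + r")\b)"))
    return patterns


def find_keywords(keywords, text):
    """Return every keyword occurrence in text, one pattern pass at a time."""
    found = []
    for pattern in build_patterns(keywords):
        found.extend(pattern.findall(text))
    return found


keywords = ["target1", "target2", "target1 target2", "target3"]
print(find_keywords(keywords, "yyy target1 target2 xxx"))
# ['target1', 'target2', 'target1 target2']
```

The results come back grouped by pattern rather than by position in the text. The number of patterns equals the size of the largest prefix group, so for 10,000 keywords the text is scanned once per level of nesting rather than once per keyword.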
