Fastest way to parse combinations of words in a string in Python

rxztt3cl  posted on 2023-01-04 in Python

I'm new to Python. I need to parse a body of text and check whether the elements of a dict occur in that body. So far I have been using itertools:

from string import punctuation
from itertools import combinations

def retrieve_text_from_body(body, dict):
    a_list = []
    stripped_body = [i.strip(punctuation) for i in body.split()]
    # Try every combination of words, from single words up to the full body.
    for i in range(1, len(stripped_body) + 1):
        for combination in combinations(stripped_body, i):
            # combinations() yields tuples, so the keys of dict must be tuples of words.
            if combination in dict:
                a_list.append(dict[combination])
    return a_list

Example input and output:

Body = "The big fat cat sits" (Could be up to 1000 words)
Dict = ["Big Fat", "Little Mouse", "Cat Sits"] (Could be any length)

Part of combinations that form:
['the', 'big']
['the', 'big', 'fat']
['the', 'big', 'fat', 'cat']
['the', 'big', 'fat', 'cat', 'sits']
['big', 'fat']
['big', 'fat', 'cat']
['big', 'fat', 'cat', 'sits']

Output: ["Big Fat", "Cat Sits"]

The code above is very slow for my use case, because I have to run it on millions of rows in a table. Is there a faster way?
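
For a sense of scale, the number of tuples the nested loops generate can be counted directly. The sketch below is only illustrative, and `count_combinations` is a hypothetical helper name, not part of the code above. It relies on the fact that the non-empty combinations of n words number 2**n - 1, so the work doubles with every extra word.

```python
from itertools import combinations
from math import comb

def count_combinations(n_words):
    # Sum of C(n, i) for i = 1..n, which equals 2**n - 1.
    return sum(comb(n_words, i) for i in range(1, n_words + 1))

words = "The big fat cat sits".split()
# Count what the nested loops actually generate for the 5-word example.
generated = sum(
    1 for i in range(1, len(words) + 1) for _ in combinations(words, i)
)
print(generated)               # 31
print(count_combinations(20))  # 1048575
```

At 1000 words the count has more than 300 decimal digits, so enumerating every combination is not feasible regardless of how fast each lookup is.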

xytpbqjk1#

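With regular expressions you can solve this in a simple way, although they are fairly slow here: each phrase triggers its own scan of the body, so the string's positions are visited once per phrase.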

import re

def search_in_body(body, structures):
    # Keep every phrase that occurs in the body, ignoring case.
    # re.escape makes each phrase match literally, even if it
    # contains regex metacharacters.
    found_structures = [
        f for f in structures if re.search(re.escape(f), body, re.IGNORECASE)
    ]
    return found_structures

if __name__ == "__main__":
    Body = "The big fat cat sits"
    Dict = ["Big Fat", "Little Mouse", "Cat Sits"]
    print(search_in_body(Body, Dict))
    # prints ['Big Fat', 'Cat Sits']
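
If the phrases only need to match runs of consecutive words, one alternative that avoids a separate scan per phrase is a lookup over word n-grams. The following is a minimal sketch under that assumption, not a benchmarked solution. The name `find_phrases` and the normalisation choices (lowercasing, stripping punctuation from each word) are illustrative assumptions rather than something taken from the question.

```python
from string import punctuation

def find_phrases(body, phrases):
    # Map each normalised phrase (tuple of lowercase words) to its original
    # spelling; phrases that contain no words are skipped.
    lookup = {tuple(p.lower().split()): p for p in phrases if p.split()}
    if not lookup:
        return []
    # Only window sizes that some phrase actually has need to be checked.
    lengths = sorted({len(key) for key in lookup})
    words = [w.strip(punctuation).lower() for w in body.split()]
    words = [w for w in words if w]  # drop tokens that were only punctuation
    found, seen = [], set()
    for start in range(len(words)):
        for n in lengths:
            if start + n > len(words):
                break  # lengths is sorted, so longer windows will not fit either
            key = tuple(words[start:start + n])
            if key in lookup and key not in seen:
                seen.add(key)
                found.append(lookup[key])
    return found

print(find_phrases("The big fat cat sits", ["Big Fat", "Little Mouse", "Cat Sits"]))
# prints ['Big Fat', 'Cat Sits']
```

The work per body is roughly the number of words times the number of distinct phrase lengths, independent of how many phrases there are. Unlike the regex version, this matches whole words only, so "big fatal" would not count as a hit for "Big Fat".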
