Fastest way to parse combinations of words in a string in Python

Posted by rxztt3cl on 2023-01-04 in Python

I'm new to Python. I need to parse a body of text and check whether the elements of a dict exist in that text. So far I have been using itertools:

from string import punctuation
from itertools import combinations

def retrieve_text_from_body(body, dict):
    a_list = []
    stripped_body = [i.strip(punctuation) for i in body.split()]
    for i in range(1, len(stripped_body)+1):
        for combination in combinations(stripped_body, i):
            if combination in dict:
                a_list.append(dict[combination])
    return a_list

Example input and output:

Body = "The big fat cat sits" (Could be up to 1000 words)
Dict = ["Big Fat", "Little Mouse", "Cat Sits"] (Could be any length)

Part of combinations that form:
['the', 'big']
['the', 'big', 'fat']
['the', 'big', 'fat', 'cat']
['the', 'big', 'fat', 'cat', 'sits']
['big', 'fat']
['big', 'fat', 'cat']
['big', 'fat', 'cat', 'sits']

Output: ["Big Fat", "Cat Sits"]

The code above is very slow for my use case, because I have to run it on millions of rows in a table. Is there a faster way?
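
As a rough illustration of why this gets slow: itertools.combinations also yields non-contiguous tuples such as ('the', 'fat'), so over all sizes it produces 2^n - 1 tuples for n words, whereas the list above only shows contiguous runs, of which there are n(n+1)/2. For the 5-word example that is 31 versus 15, and the gap grows exponentially. The helper names below are hypothetical and exist only for this sketch:

```python
from math import comb


def count_combinations(n):
    """Non-empty tuples yielded by itertools.combinations over sizes 1..n."""
    return sum(comb(n, k) for k in range(1, n + 1))  # equals 2**n - 1


def count_contiguous_runs(n):
    """Contiguous word runs (one per start/end pair) in a list of n words."""
    return n * (n + 1) // 2


# Compare the two counts for a few body lengths.
for n in (5, 20, 40):
    print(n, count_contiguous_runs(n), count_combinations(n))
```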

Source: https://stackoverflow.com/questions/74991822/fastest-way-to-parse-combinations-of-words-in-a-string-in-python

1 Answer

xytpbqjk1#

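You can solve this in a simple way with regular expressions, although regular expressions can be quite slow here because they have to visit every position of the string multiple times.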

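import re  # the standard-library module is enough for the calls used here

def search_in_body(body, structures):
    # Keep each phrase that occurs anywhere in body, ignoring case.
    # re.escape makes phrases containing regex metacharacters match literally.
    found_structures = [
        f for f in structures if re.search(re.escape(f), body, re.IGNORECASE)
    ]
    return found_structures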

if __name__ == "__main__":
    Body = "The big fat cat sits"
    Dict = ["Big Fat", "Little Mouse", "Cat Sits"] 
    print(search_in_body(Body, Dict))

>>> ['Big Fat', 'Cat Sits']
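
If scanning the body once per phrase turns out to be too slow, one possible alternative (not part of the answer above, only a sketch) is to normalize the phrases into a dict keyed by word tuples and slide over contiguous word windows of the body, bounded by the longest phrase. The function name find_phrases is hypothetical, and the sketch assumes phrases are whitespace-separated words and that matching should be case-insensitive:

```python
from string import punctuation


def find_phrases(body, phrases):
    """Return the phrases that occur in body as contiguous word runs."""
    # Map each normalized phrase (tuple of lowercase words) to its original spelling.
    lookup = {tuple(p.lower().split()): p for p in phrases}
    if not lookup:
        return []
    max_len = max(len(key) for key in lookup)

    # Normalize the body the same way: strip punctuation, lowercase, drop empties.
    words = [w.strip(punctuation).lower() for w in body.split()]
    words = [w for w in words if w]

    found = []
    seen = set()
    for start in range(len(words)):
        # Windows longer than the longest phrase cannot match, so stop at max_len.
        for length in range(1, min(max_len, len(words) - start) + 1):
            window = tuple(words[start:start + length])
            if window in lookup and window not in seen:
                seen.add(window)
                found.append(lookup[window])
    return found


print(find_phrases("The big fat cat sits", ["Big Fat", "Little Mouse", "Cat Sits"]))
```

Each window check is a hash lookup, so the work per row is roughly the number of words times the longest phrase length, independent of how many phrases there are. Unlike the regex search, this matches whole words only, so "cat sits" would not be found inside "bobcat sits".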

Answered 2023-01-04