regex: Split text by sentences

Posted by ih99xse1 on 2023-10-22 in Other


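I'm having trouble finding a convenient way to split text by a list of predefined sentences. The sentences can contain any special characters and can be completely arbitrary.
Example: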

text = "My name. is A. His name is B. Her name is C. That's why..."
delims = ["My name. is", "His name is", "Her name is"]

What I want:

def custom_sentence_split(text, delims):
     # stuff
     return result

custom_sentence_split(text, delims)
# ["My name. is", "  A. ", "His name is", "  B. ", "Her name is", " C. That's why..."]

UPD. Well, there is a clumsy solution like the following, but I would prefer something more convenient:

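import re


def collect_output(text, finds):
    # Walk through the matches in order, cutting the text at each one.
    remaining = text
    retn = []
    for found in finds:
        part1, part2 = remaining.split(found, 1)
        retn += [part1, found]
        remaining = part2
    # Keep whatever follows the last match.
    retn.append(remaining)
    return retn


def custom_sentence_split(text, splitters):
    # Escape the splitters so characters such as "." match literally.
    pattern = "(" + "|".join(map(re.escape, splitters)) + ")"
    finds = re.findall(pattern, text)
    return collect_output(text, finds)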

UPD. Well, it seems a solution has been found:

pattern = "(" + "|".join(map(re.escape, delims)) + ")"
re.split(pattern, text)
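
For reference, the two lines above can be wrapped into a complete function. This is only a minimal sketch: it reuses the `text` and `delims` from the question, and dropping the empty strings is an assumption made here so that the result matches the shape the question asks for (`re.split` returns a leading `''` when the text starts with a delimiter).

```python
import re


def custom_sentence_split(text, delims):
    # Escape every delimiter so that "." and other special characters
    # are matched literally rather than as regex syntax.
    pattern = "(" + "|".join(map(re.escape, delims)) + ")"
    # The capturing group makes re.split keep the delimiters in the result.
    # Empty pieces are dropped (an assumption, see the note above).
    return [part for part in re.split(pattern, text) if part]


text = "My name. is A. His name is B. Her name is C. That's why..."
delims = ["My name. is", "His name is", "Her name is"]

print(custom_sentence_split(text, delims))
# ['My name. is', ' A. ', 'His name is', ' B. ', 'Her name is', " C. That's why..."]
```

The whitespace around the non-delimiter pieces is preserved exactly as it appears in the input.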

regex

Source: https://stackoverflow.com/questions/77133470/split-text-by-sentences

2 answers


v1l68za41#

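You want to use the re.split method.
You need a regex string like (My\sname\sis|His\sname\sis|Her\sname\sis); the capturing parentheses make re.split keep the delimiters in the result.
You can build that regex string with "("+"|".join(map(re.escape, delims))+")".
Edit: you can do it like this: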

text = "My name is A. His name is B. Her name is C. That's why..."
delims = ["My name is", "His name is", "Her name is"]

import re

def custom_sentence_split(text,delims):
    pattern = "("+"|".join(map(re.escape, delims))+")"
    return re.split(pattern,text)

print(custom_sentence_split(text,delims))
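
One caveat that may matter when delimiters overlap: regex alternation tries the alternatives from left to right, so a shorter delimiter listed first wins over a longer one that starts the same way. The sketch below sorts the delimiters longest-first before joining them. The function name `split_keep_delims` and the sample inputs are hypothetical and not part of the answer.

```python
import re


def split_keep_delims(text, delims):
    # Hypothetical helper: put longer delimiters first so that
    # "His name is" is tried before its prefix "His name".
    ordered = sorted(delims, key=len, reverse=True)
    pattern = "(" + "|".join(map(re.escape, ordered)) + ")"
    return re.split(pattern, text)


print(split_keep_delims("His name is B.", ["His name", "His name is"]))
# ['', 'His name is', ' B.']
```

The leading empty string appears because the text starts with a delimiter; filter it out if it is not wanted.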

Posted 2023-10-22

gmxoilav2#

import re

text = "My name is A. His name is B. Her name is C. That's why..."

# Split after each capital letter followed by a period; the delims list is not used.
print([x.strip() for x in re.split(r'(.+?[A-Z]\.)', text) if x])

['My name is A.', 'His name is B.', 'Her name is C.', "That's why..."]
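
Note that this pattern does not use the `delims` list at all: it relies on each sentence ending with a capital letter followed by a period. The sketch below wraps it in a function (the name `split_after_capital_period` is made up here) and shows that a text without that shape comes back unsplit.

```python
import re


def split_after_capital_period(text):
    # Same regex as in the answer: lazily match up to "<capital letter>."
    pieces = re.split(r'(.+?[A-Z]\.)', text)
    # Strip whitespace and drop pieces that are empty after stripping.
    return [p.strip() for p in pieces if p.strip()]


print(split_after_capital_period("My name. is A. His name is B. Her name is C. That's why..."))
# ['My name. is A.', 'His name is B.', 'Her name is C.', "That's why..."]

print(split_after_capital_period("I am a. You are b."))
# ['I am a. You are b.']
```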

Posted 2023-10-22
