regex: split text by a list of sentences

ih99xse1  posted on 2023-10-22  in Other

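I am looking for a convenient way to split a text by a list of predefined sentences. The sentences can contain any special characters and can be completely arbitrary.
Example: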

text = "My name. is A. His name is B. Her name is C. That's why..."
delims = ["My name. is", "His name is", "Her name is"]

What I want is:

def custom_sentence_split(text, delims):
    # stuff
    return result

custom_sentence_split(text, delims)
# ["My name. is", " A. ", "His name is", " B. ", "Her name is", " C. That's why..."]

UPD. Well, there is a solution like the one below, but I would prefer something more convenient:

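import re


def collect_output(text, finds):
    # Walk through the matches in order, cutting the text at each one.
    remaining = text
    parts = []
    for found in finds:
        before, remaining = remaining.split(found, 1)
        parts += [before, found]
    parts.append(remaining)  # keep the text after the last delimiter
    return parts


def custom_sentence_split(text, splitters):
    # Escape each delimiter so characters such as "." are matched literally.
    pattern = "(" + "|".join(map(re.escape, splitters)) + ")"
    finds = re.findall(pattern, text)
    return collect_output(text, finds)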

UPD. It looks like I have found a solution:

pattern = "(" + "|".join(map(re.escape, delims)) + ")"
re.split(pattern, text)
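
The call to re.escape matters because delimiters such as "My name. is" contain regex metacharacters. The following is a minimal sketch of the difference, using a made-up input string (not from the question) in which "My nameX is" should not count as a delimiter:

```python
import re

text = "My nameX is A. My name. is B."
delims = ["My name. is"]

# Unescaped: "." is a wildcard, so "My nameX is" is also treated as a delimiter.
raw_pattern = "(" + "|".join(delims) + ")"
raw_parts = re.split(raw_pattern, text)

# Escaped: only the literal text "My name. is" is treated as a delimiter.
safe_pattern = "(" + "|".join(map(re.escape, delims)) + ")"
safe_parts = re.split(safe_pattern, text)

print(raw_parts)   # ['', 'My nameX is', ' A. ', 'My name. is', ' B.']
print(safe_parts)  # ['My nameX is A. ', 'My name. is', ' B.']
```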

v1l68za4  #1

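You want to use the re.split method.
You need a regular expression string like (My\sname\sis|His\sname\sis|Her\sname\sis).
You can build that string with "("+"|".join(map(re.escape, delims))+")". Because the alternation is wrapped in a capturing group, re.split keeps the delimiters in the result.
Edit: you can do it like this: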

import re

text = "My name is A. His name is B. Her name is C. That's why..."
delims = ["My name is", "His name is", "Her name is"]


def custom_sentence_split(text, delims):
    # Escape each delimiter and wrap the alternation in a capturing group
    # so that re.split keeps the delimiters in the result.
    pattern = "(" + "|".join(map(re.escape, delims)) + ")"
    return re.split(pattern, text)


print(custom_sentence_split(text, delims))
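
Note that re.split returns an empty string as the first element when the text starts with a delimiter. The sketch below is one possible way to get exactly the list asked for in the question. The helper name split_keep_delims is hypothetical, and sorting the delimiters longest-first is an added assumption that only matters when one delimiter is a prefix of another:

```python
import re


def split_keep_delims(text, delims):
    # Hypothetical helper: longest delimiters first, so that a delimiter which
    # is a prefix of another one does not shadow the longer match.
    ordered = sorted(delims, key=len, reverse=True)
    pattern = "(" + "|".join(map(re.escape, ordered)) + ")"
    # re.split emits "" when the text starts or ends with a delimiter,
    # or when two delimiters are adjacent; drop those empty pieces.
    return [part for part in re.split(pattern, text) if part]


text = "My name. is A. His name is B. Her name is C. That's why..."
delims = ["My name. is", "His name is", "Her name is"]
result = split_keep_delims(text, delims)
print(result)
# ['My name. is', ' A. ', 'His name is', ' B. ', 'Her name is', " C. That's why..."]
```

Dropping empty strings also removes pieces that are empty for legitimate reasons (for example, two adjacent delimiters), so this fits only if those gaps do not need to be preserved.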

gmxoilav  #2

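import re

text = "My name is A. His name is B. Her name is C. That's why..."

# Split after each run of text that ends in a capital letter followed by a period.
print([x.strip() for x in re.split(r'(.+?[A-Z]\.)', text) if x])

# Output:
# ['My name is A.', 'His name is B.', 'Her name is C.', "That's why..."]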
