Regex for splitting a paragraph into sentences

Posted by irtuqstp on 2023-04-13 in Other

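I wrote the regex below to match the end of a sentence while ignoring special cases such as U.S., U.S.A, Ser. No., and Pat. No. (the text should not be split there, because the sentence does not end there), but it does not give the correct output. Here is my regex: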

((?:[A-Z]+\sU\.S\.)|(?:\w\.(\s?\w{1,12}\.)+)|(?:[A-Z]+\s+[A-Z]+\.\s+[A-Z]+)|(?:°\s?[cCfF]\.)|(?:\s+[\(]*[a-h0-9]{1}\))|(?:\.\s*Fig.{1,7}\.)|(?:([.?!;])\s*(?=[\`\’A-Za-z\(])))

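Here is the sample paragraph:
This Application is a Continuation of U.S. patent application Ser. No. 15/731,069, filed on Apr. 14, 2017, which was a Continuation of U.S. application Ser. No. 14/998,574, filed on Jan. 21, 2016 which was a Continuation of U.S. application Ser. No. 14/198,695 (now U.S. Pat. No. 9,286,457) filed on Mar. 6, 2014, which was a Continuation in Part of U.S. patent application Ser. No. 12/931,340, filed on Jan. 31, 2011, (now U.S. Pat. No. 8,842,887) which was a Continuation in Part of U.S. patent application Ser. No. 12/627,413 filed on Nov. 30, 2009, (now U.S. Pat. No. 7,916,907) which was a continuation of application Ser. No. 11/151,412, filed on Jun. 14, 2005, now abandoned. application Ser. No. 14/198,695 claims the benefit of Provisional Application No. 61/851,884, filed on Mar. 15, 2013. Application Ser. No. 12/931,340 claims the benefit of Provisional Application No. 61/456,901, filed on Nov. 15, 2010 and application Ser. No. 11/151,412 claims the benefit of Provisional Application No. 60/579,422 filed on Jun. 14, 2004. Testing some more example of U.S.A and U.S in my paragraph. Checking Fig. 3. in my paragraph.
I tried my regex, but it does not give the expected result.
regex101 demo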

Source: https://stackoverflow.com/questions/75975912/regex-for-splitting-paragraph-into-sentences

3 Answers

b4qexyjb1#

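My approach would be to use a third-party package that supports variable-width negative lookbehind, so that all the special cases can be excluded.
The regex package may help here.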

(?<!U\.S|Ser|No|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|Pat|Fig)\.(?= [A-Z\d])

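Demo: regex101 (note that I use the .NET engine here instead of Python, because it supports variable-width lookbehind for the demo)

Explanation:

(?<!U\.S|Ser|No|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|Pat|Fig) negative lookbehind: skips any . that is immediately preceded by one of these special words.
\. matches a literal dot.
(?= [A-Z\d]) lookahead: ensures the dot is followed by a space and then an uppercase letter or a digit.

Install the regex package:

pip install regex

Code: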

import regex
s = "This Application is a Continuation of U.S. patent application Ser. No. 15/731,069, filed on Apr. 14, 2017, which was a Continuation of U.S. application Ser. No. 14/998,574, filed on Jan. 21, 2016 which was a Continuation of U.S. application Ser. No. 14/198,695 (now U.S. Pat. No. 9,286,457) filed on Mar. 6, 2014, which was a Continuation in Part of U.S. patent application Ser. No. 12/931,340, filed on Jan. 31, 2011, (now U.S. Pat. No. 8,842,887) which was a Continuation in Part of U.S. patent application Ser. No. 12/627,413 filed on Nov. 30, 2009, (now U.S. Pat. No. 7,916,907) which was a continuation of application Ser. No. 11/151,412, filed on Jun. 14, 2005, now abandoned. application Ser. No. 14/198,695 claims the benefit of Provisional Application No. 61/851,884, filed on Mar. 15, 2013. Application Ser. No. 12/931,340 claims the benefit of Provisional Application No. 61/456,901, filed on Nov. 15, 2010 and application Ser. No. 11/151,412 claims the benefit of Provisional Application No. 60/579,422 filed on Jun. 14, 2004. Testing some more example of U.S.A and U.S in my paragraph. Checking Fig. 3. in my paragraph. 1 new sentences added to this text block."
lines = regex.split(r"(?<!U\.S|Ser|No|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|Pat|Fig)\.(?= [A-Z\d])", s)
for line in lines:
  print(line)

Output:

This Application is a Continuation of U.S. patent application Ser. No. 15/731,069, filed on Apr. 14, 2017, which was a Continuation of U.S. application Ser. No. 14/998,574, filed on Jan. 21, 2016 which was a Continuation of U.S. application Ser. No. 14/198,695 (now U.S. Pat. No. 9,286,457) filed on Mar. 6, 2014, which was a Continuation in Part of U.S. patent application Ser. No. 12/931,340, filed on Jan. 31, 2011, (now U.S. Pat. No. 8,842,887) which was a Continuation in Part of U.S. patent application Ser. No. 12/627,413 filed on Nov. 30, 2009, (now U.S. Pat. No. 7,916,907) which was a continuation of application Ser. No. 11/151,412, filed on Jun. 14, 2005, now abandoned. application Ser. No. 14/198,695 claims the benefit of Provisional Application No. 61/851,884, filed on Mar. 15, 2013
 Application Ser. No. 12/931,340 claims the benefit of Provisional Application No. 61/456,901, filed on Nov. 15, 2010 and application Ser. No. 11/151,412 claims the benefit of Provisional Application No. 60/579,422 filed on Jun. 14, 2004
 Testing some more example of U.S.A and U.S in my paragraph
 Checking Fig. 3. in my paragraph
 1 new sentences added to this text block.

**Note:** If no sentence starts with a digit, unlike the sentence "1 new sentences added to this text block." that I appended at the end, you can use a simpler version:

(?<!U\.S|Ser|No|Pat|Fig)\.(?= [A-Z])
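
If installing a third-party package is not an option, the same idea can be sketched with the standard library. Python's built-in re module only accepts fixed-width lookbehinds, so the sketch below regroups the abbreviation list by width and chains two lookbehinds (one for the 3-character entries, one for "No"). The sample string and the name SENTENCE_END are assumptions made for illustration; this is not part of the original answer.

```python
import re

# Assumption: same abbreviation list as above, regrouped by width so that
# each lookbehind is fixed-width (3 characters, then 2 characters).
SENTENCE_END = re.compile(
    r"(?<!U\.S|Ser|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|Pat|Fig)"
    r"(?<!No)"
    r"\.(?= [A-Z\d])"
)

# Hypothetical short sample in the style of the question's paragraph.
sample = (
    "See U.S. Pat. No. 9,286,457 filed on Mar. 6, 2014. "
    "Application Ser. No. 12 was filed. 1 more sentence."
)

# The matched dot is consumed by the split; strip the leading space
# that remains at the start of each following piece.
parts = [part.strip() for part in SENTENCE_END.split(sample)]
for part in parts:
    print(part)
```

Both lookbehinds are checked at the same position, so a dot is treated as a sentence end only if neither list matches the text right before it.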

50pmv0ei2#

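Edit: since the regex can get very complicated in this case, I would choose to use a library.
Using the stanza library

If regex is not a requirement, one option is to use stanza: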

import stanza

stanza.download('en')
# define pipeline
nlp = stanza.Pipeline(lang="en", processors='tokenize')
# create document ("<your text>" stands for the paragraph to split)
text = "<your text>"
doc = nlp(text)
# extract sentences
doc_sentences = [sentence.text for sentence in doc.sentences]

for s in doc_sentences:
    print(s)

This gives the following result, which works well even when an abbreviation ends a sentence (a case the original regex cannot handle):

This Application is a Continuation of U.S. patent application Ser. No. 15/731,069, filed on Apr. 14, 2017, which was a Continuation of U.S. application Ser. No. 14/998,574, filed on Jan. 21, 2016 which was a Continuation of U.S. application Ser. No. 14/198,695 (now U.S. Pat. No. 9,286,457) filed on Mar. 6, 2014, which was a Continuation in Part of U.S. patent application Ser. No. 12/931,340, filed on Jan. 31, 2011, (now U.S. Pat. No. 8,842,887) which was a Continuation in Part of U.S. patent application Ser. No. 12/627,413 filed on Nov. 30, 2009, (now U.S. Pat. No. 7,916,907) which was a continuation of application Ser. No. 11/151,412, filed on Jun. 14, 2005, now abandoned.
application Ser. No. 14/198,695 claims the benefit of Provisional Application No. 61/851,884, filed on Mar. 15, 2013.
Application Ser. No. 12/931,340 claims the benefit of Provisional Application No. 61/456,901, filed on Nov. 15, 2010 and application Ser. No. 11/151,412 claims the benefit of Provisional Application No. 60/579,422 filed on Jun. 14, 2004.
Testing some more example of U.S.A and U.S in my paragraph.
Checking Fig. 3. in my paragraph about the U.S.
The check looks good.

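Note: to test this, I slightly modified your input text so that a sentence ends with an abbreviation. I added: Checking Fig. 3. in my paragraph about the U.S. The check looks good.

If regex is not mandatory for you, I suggest you give this a try, because covering all edge cases with a regex can be cumbersome.

Original regex approach

One approach is to better define/simplify what "the end of a sentence" is, to avoid having to define all the special cases. For your example, this seems like a reasonable simplification:

A sentence ends only when a dot is followed by a whitespace character, which is followed by a word that starts with an uppercase letter and is at least 4 characters long.

With this simplification, you can use a lookahead assertion to match every "end of a sentence", \.\s(?=[A-Z][a-zA-Z]{3,}), and split the text with this expression using re.split, as follows: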

import re

text = "<your text>"

sentences = re.split(r"\.\s(?=[A-Z][a-zA-Z]{3,})", text)

print(sentences)

With your data, these are the sentences obtained with this approach:

This Application is a Continuation of U.S. patent application Ser. No. 15/731,069, filed on Apr. 14, 2017, which was a Continuation of U.S. application Ser. No. 14/998,574, filed on Jan. 21, 2016 which was a Continuation of U.S. application Ser. No. 14/198,695 (now U.S. Pat. No. 9,286,457) filed on Mar. 6, 2014, which was a Continuation in Part of U.S. patent application Ser. No. 12/931,340, filed on Jan. 31, 2011, (now U.S. Pat. No. 8,842,887) which was a Continuation in Part of U.S. patent application Ser. No. 12/627,413 filed on Nov. 30, 2009, (now U.S. Pat. No. 7,916,907) which was a continuation of application Ser. No. 11/151,412, filed on Jun. 14, 2005, now abandoned. application Ser. No. 14/198,695 claims the benefit of Provisional Application No. 61/851,884, filed on Mar. 15, 2013
Application Ser. No. 12/931,340 claims the benefit of Provisional Application No. 61/456,901, filed on Nov. 15, 2010 and application Ser. No. 11/151,412 claims the benefit of Provisional Application No. 60/579,422 filed on Jun. 14, 2004
Testing some more example of U.S.A and U.S in my paragraph
Checking Fig. 3. in my paragraph.
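
The four-character rule is a heuristic, and a short example may help show where it stops working. The sketch below applies the same pattern to a hypothetical sample in which one sentence starts with the three-letter word "The"; the sample string is an assumption made for illustration.

```python
import re

# Same pattern as in the answer: a dot, one whitespace character, then a
# capitalized word of at least four letters (checked by the lookahead).
pattern = re.compile(r"\.\s(?=[A-Z][a-zA-Z]{3,})")

# Hypothetical sample: the second sentence starts with the short word "The".
sample = "It ended in the U.S. The check looks good. Another sentence follows."

sentences = pattern.split(sample)
print(sentences)
# → ['It ended in the U.S. The check looks good', 'Another sentence follows.']
```

The boundary before "The" is missed because "The" has only three letters, while the boundary before "Another" is found. Whether that trade-off is acceptable depends on the text being processed.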

kqqjbcuj3#

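Depending on how specific you want to be about the split positions, you might use an alternation with re.sub.
Then, in the callback, check whether the capture group took part in the match. If it did, use it in the replacement followed by 2 newlines (or whatever you want to put after it).
Otherwise there is only the full match x.group(), which should stay as it is, so you can return it unchanged as the replacement.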

(?<!\S)[A-Z](?:[a-z]+\.(?:\s+\d+\.)?|(?:\.[A-Z]+)*\.?)(?!\S)|([!.?])\s(?=[A-Z])

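The pattern matches:

(?<!\S) asserts a whitespace boundary to the left
[A-Z] matches a single character A-Z
(?: non-capturing group for the alternatives
[a-z]+\. matches 1+ characters a-z and a .
(?:\s+\d+\.)? optionally matches 1+ whitespace characters, 1+ digits and a .
| or
(?:\.[A-Z]+)*\.? matches optional repetitions of a . followed by 1+ characters A-Z, then an optional dot
) closes the non-capturing group
(?!\S) asserts a whitespace boundary to the right
| or
([!.?])\s captures one of ! . ? in group 1 and matches a whitespace character
(?=[A-Z]) positive lookahead, asserts a character A-Z directly to the right

See the regex demo and the Python demo.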

import re

pattern = r"(?<!\S)[A-Z](?:[a-z]+\.(?:\s+\d+\.)?|(?:\.[A-Z]+)*\.?)(?!\S)|([!.?])\s(?=[A-Z])"
s = "This Application is a Continuation of U.S. patent application Ser. No. 15/731,069, filed on Apr. 14, 2017, which was a Continuation of U.S. application Ser. No. 14/998,574, filed on Jan. 21, 2016 which was a Continuation of U.S. application Ser. No. 14/198,695 (now U.S. Pat. No. 9,286,457) filed on Mar. 6, 2014, which was a Continuation in Part of U.S. patent application Ser. No. 12/931,340, filed on Jan. 31, 2011, (now U.S. Pat. No. 8,842,887) which was a Continuation in Part of U.S. patent application Ser. No. 12/627,413 filed on Nov. 30, 2009, (now U.S. Pat. No. 7,916,907) which was a continuation of application Ser. No. 11/151,412, filed on Jun. 14, 2005, now abandoned. application Ser. No. 14/198,695 claims the benefit of Provisional Application No. 61/851,884, filed on Mar. 15, 2013. Application Ser. No. 12/931,340 claims the benefit of Provisional Application No. 61/456,901, filed on Nov. 15, 2010 and application Ser. No. 11/151,412 claims the benefit of Provisional Application No. 60/579,422 filed on Jun. 14, 2004. Testing some more example of U.S.A and U.S in my paragraph. Checking Fig. 3. in my paragraph."

result = re.sub(pattern, lambda x: x.group(1) + "\n\n" if x.group(1) else x.group(), s)
print(result)

Output

This Application is a Continuation of U.S. patent application Ser. No. 15/731,069, filed on Apr. 14, 2017, which was a Continuation of U.S. application Ser. No. 14/998,574, filed on Jan. 21, 2016 which was a Continuation of U.S. application Ser. No. 14/198,695 (now U.S. Pat. No. 9,286,457) filed on Mar. 6, 2014, which was a Continuation in Part of U.S. patent application Ser. No. 12/931,340, filed on Jan. 31, 2011, (now U.S. Pat. No. 8,842,887) which was a Continuation in Part of U.S. patent application Ser. No. 12/627,413 filed on Nov. 30, 2009, (now U.S. Pat. No. 7,916,907) which was a continuation of application Ser. No. 11/151,412, filed on Jun. 14, 2005, now abandoned. application Ser. No. 14/198,695 claims the benefit of Provisional Application No. 61/851,884, filed on Mar. 15, 2013.

Application Ser. No. 12/931,340 claims the benefit of Provisional Application No. 61/456,901, filed on Nov. 15, 2010 and application Ser. No. 11/151,412 claims the benefit of Provisional Application No. 60/579,422 filed on Jun. 14, 2004.

Testing some more example of U.S.A and U.S in my paragraph.

Checking Fig. 3. in my paragraph.
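
If a list of sentences is needed rather than one string with blank lines, the same callback idea can be adapted. The sketch below is one possible variant, not part of the original answer: the helper name split_sentences and both sample strings are hypothetical, and it assumes the input contains no newline characters, because a newline is used as the temporary marker.

```python
import re

# Same pattern as above, written as two adjacent string literals.
pattern = re.compile(
    r"(?<!\S)[A-Z](?:[a-z]+\.(?:\s+\d+\.)?|(?:\.[A-Z]+)*\.?)(?!\S)"
    r"|([!.?])\s(?=[A-Z])"
)

def split_sentences(text):
    """Hypothetical helper: mark sentence ends with a newline, then split.

    Assumes `text` itself contains no newline characters.
    """
    marked = pattern.sub(
        # Group 1 is set only for a real sentence end; otherwise the match
        # is an abbreviation-like token and is put back unchanged.
        lambda m: m.group(1) + "\n" if m.group(1) else m.group(),
        text,
    )
    return marked.split("\n")

sample = "Filed under U.S. Pat. No. 7,916,907 on Jun. 14, 2005. See Fig. 3. in the text! Done."
result = split_sentences(sample)
for sentence in result:
    print(sentence)

# Limitation: any capitalized word directly followed by a dot is treated
# like an abbreviation, so no split happens after "Paris." here.
unsplit = split_sentences("We met in Paris. Then we left.")
```

The second call shows the price of the first alternative: it protects "Pat." and "Jun.", but it also protects ordinary capitalized words that happen to end a sentence.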
