Regex to split a paragraph into sentences

Posted by irtuqstp on 2023-04-13 in Other

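I wrote the regex below to match the end of a sentence while ignoring special cases such as U.S., U.S.A, Ser. No., and Pat. No. (it should not split there, because the sentence does not end there), but it does not give the correct output. Here is my regex: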

((?:[A-Z]+\sU\.S\.)|(?:\w\.(\s?\w{1,12}\.)+)|(?:[A-Z]+\s+[A-Z]+\.\s+[A-Z]+)|(?:°\s?[cCfF]\.)|(?:\s+[\(]*[a-h0-9]{1}\))|(?:\.\s*Fig.{1,7}\.)|(?:([.?!;])\s*(?=[\`\’A-Za-z\(])))

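Here is the sample paragraph:
This Application is a Continuation of U.S. patent application Ser. No. 15/731,069, filed on Apr. 14, 2017, which was a Continuation of U.S. application Ser. No. 14/998,574, filed on Jan. 21, 2016 which was a Continuation of U.S. application Ser. No. 14/198,695 (now U.S. Pat. No. 9,286,457) filed on Mar. 6, 2014, which was a Continuation in Part of U.S. patent application Ser. No. 12/931,340, filed on Jan. 31, 2011, (now U.S. Pat. No. 8,842,887) which was a Continuation in Part of U.S. patent application Ser. No. 12/627,413 filed on Nov. 30, 2009, (now U.S. Pat. No. 7,916,907) which was a continuation of application Ser. No. 11/151,412, filed on Jun. 14, 2005, now abandoned. application Ser. No. 14/198,695 claims the benefit of Provisional Application No. 61/851,884, filed on Mar. 15, 2013. Application Ser. No. 12/931,340 claims the benefit of Provisional Application No. 61/456,901, filed on Nov. 15, 2010 and application Ser. No. 11/151,412 claims the benefit of Provisional Application No. 60/579,422 filed on Jun. 14, 2004. Testing some more example of U.S.A and U.S in my paragraph. Checking Fig. 3. in my paragraph.
I tried my regex, but it does not give the expected result.
regex101 demo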


Answer 1 (b4qexyjb)

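My approach is to use a third-party package that supports variable-width negative lookbehind, so that all the special cases can be ignored in a single assertion.
The regex package can help here.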

(?<!U\.S|Ser|No|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|Pat|Fig)\.(?= [A-Z\d])

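Demo: regex101 (note that the demo uses the .NET engine rather than Python, because it supports variable-width lookbehind)

Explanation

  • (?<!U\.S|Ser|No|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|Pat|Fig) Negative lookbehind: skip any . that is directly preceded by one of these special words
  • \. Match a literal dot
  • (?= [A-Z\d]) Lookahead: make sure the dot is followed by a space and then an uppercase letter or a digit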

Install regex

pip install regex

Code:

import regex
s = "This Application is a Continuation of U.S. patent application Ser. No. 15/731,069, filed on Apr. 14, 2017, which was a Continuation of U.S. application Ser. No. 14/998,574, filed on Jan. 21, 2016 which was a Continuation of U.S. application Ser. No. 14/198,695 (now U.S. Pat. No. 9,286,457) filed on Mar. 6, 2014, which was a Continuation in Part of U.S. patent application Ser. No. 12/931,340, filed on Jan. 31, 2011, (now U.S. Pat. No. 8,842,887) which was a Continuation in Part of U.S. patent application Ser. No. 12/627,413 filed on Nov. 30, 2009, (now U.S. Pat. No. 7,916,907) which was a continuation of application Ser. No. 11/151,412, filed on Jun. 14, 2005, now abandoned. application Ser. No. 14/198,695 claims the benefit of Provisional Application No. 61/851,884, filed on Mar. 15, 2013. Application Ser. No. 12/931,340 claims the benefit of Provisional Application No. 61/456,901, filed on Nov. 15, 2010 and application Ser. No. 11/151,412 claims the benefit of Provisional Application No. 60/579,422 filed on Jun. 14, 2004. Testing some more example of U.S.A and U.S in my paragraph. Checking Fig. 3. in my paragraph. 1 new sentences added to this text block."
lines = regex.split(r"(?<!U\.S|Ser|No|Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|Pat|Fig)\.(?= [A-Z\d])", s)
for line in lines:
  print(line)

Output:

This Application is a Continuation of U.S. patent application Ser. No. 15/731,069, filed on Apr. 14, 2017, which was a Continuation of U.S. application Ser. No. 14/998,574, filed on Jan. 21, 2016 which was a Continuation of U.S. application Ser. No. 14/198,695 (now U.S. Pat. No. 9,286,457) filed on Mar. 6, 2014, which was a Continuation in Part of U.S. patent application Ser. No. 12/931,340, filed on Jan. 31, 2011, (now U.S. Pat. No. 8,842,887) which was a Continuation in Part of U.S. patent application Ser. No. 12/627,413 filed on Nov. 30, 2009, (now U.S. Pat. No. 7,916,907) which was a continuation of application Ser. No. 11/151,412, filed on Jun. 14, 2005, now abandoned. application Ser. No. 14/198,695 claims the benefit of Provisional Application No. 61/851,884, filed on Mar. 15, 2013
 Application Ser. No. 12/931,340 claims the benefit of Provisional Application No. 61/456,901, filed on Nov. 15, 2010 and application Ser. No. 11/151,412 claims the benefit of Provisional Application No. 60/579,422 filed on Jun. 14, 2004
 Testing some more example of U.S.A and U.S in my paragraph
 Checking Fig. 3. in my paragraph
 1 new sentences added to this text block.

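**Note:** If no sentence starts with a digit (unlike the sentence "1 new sentences added to this text block." that I appended at the end), you can use a simpler version. The month abbreviations can be dropped as well, because they are followed by a digit, which the lookahead no longer accepts: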

(?<!U\.S|Ser|No|Pat|Fig)\.(?= [A-Z])
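
If installing a third-party package is not an option, a similar effect can be sketched with the built-in re module. re rejects the lookbehind above because its alternatives have different widths, but several fixed-width lookbehinds can be chained instead. The names ABBREVIATIONS, SENTENCE_END, and split_sentences, as well as the shortened sample, are assumptions made for illustration and are not part of the original answer:

```python
import re

# Abbreviations whose trailing period must not end a sentence
# (same list as in the pattern above; each entry is a regex fragment).
ABBREVIATIONS = [
    r"U\.S", "Ser", "No", "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Pat", "Fig",
]

# One negative lookbehind per abbreviation: each has a fixed width on its
# own, which is all the built-in re module requires.
lookbehinds = "".join(f"(?<!{abbr})" for abbr in ABBREVIATIONS)
SENTENCE_END = re.compile(lookbehinds + r"\.(?= [A-Z\d])")


def split_sentences(text):
    """Split text at periods that are not preceded by a known abbreviation."""
    return SENTENCE_END.split(text)


# Shortened, hypothetical sample in the style of the question's paragraph.
sample = (
    "See U.S. Pat. No. 9,286,457 filed on Mar. 6, 2014. "
    "Another sentence here. 1 more sentence."
)
for sentence in split_sentences(sample):
    print(sentence)
```

As with the regex.split call above, the matched period is consumed and each following sentence keeps its leading space.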

Answer 2 (50pmv0ei)

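Edit: Because the regex can become very complex in this case, I would opt for a library instead.

Using the Stanza library

If regex is not a requirement, one option is to use stanza: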

import stanza

stanza.download('en')
# define pipeline
nlp = stanza.Pipeline(lang="en", processors='tokenize')
# create document (replace the placeholder with the paragraph to split)
text = "<your text>"
doc = nlp(text)
# extract sentences
doc_sentences = [sentence.text for sentence in doc.sentences]

for s in doc_sentences:
    print(s)

This gives the following result, and it works well even when a sentence ends with an abbreviation (which the original regex cannot handle):

This Application is a Continuation of U.S. patent application Ser. No. 15/731,069, filed on Apr. 14, 2017, which was a Continuation of U.S. application Ser. No. 14/998,574, filed on Jan. 21, 2016 which was a Continuation of U.S. application Ser. No. 14/198,695 (now U.S. Pat. No. 9,286,457) filed on Mar. 6, 2014, which was a Continuation in Part of U.S. patent application Ser. No. 12/931,340, filed on Jan. 31, 2011, (now U.S. Pat. No. 8,842,887) which was a Continuation in Part of U.S. patent application Ser. No. 12/627,413 filed on Nov. 30, 2009, (now U.S. Pat. No. 7,916,907) which was a continuation of application Ser. No. 11/151,412, filed on Jun. 14, 2005, now abandoned.
application Ser. No. 14/198,695 claims the benefit of Provisional Application No. 61/851,884, filed on Mar. 15, 2013.
Application Ser. No. 12/931,340 claims the benefit of Provisional Application No. 61/456,901, filed on Nov. 15, 2010 and application Ser. No. 11/151,412 claims the benefit of Provisional Application No. 60/579,422 filed on Jun. 14, 2004.
Testing some more example of U.S.A and U.S in my paragraph.
Checking Fig. 3. in my paragraph about the U.S.
The check looks good.
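Note: to test this, I slightly modified your input text so that one sentence ends with an abbreviation, by adding: Checking Fig. 3. in my paragraph about the U.S. The check looks good.

If regex is not mandatory for you, I suggest you try this, because covering every edge case with a regex can be cumbersome.

Original regex approach

One approach is to define (and simplify) what "the end of a sentence" is, so that you do not need to list every special case. For your example, this seems to be a reasonable simplification:

  • A sentence ends only at a period followed by a whitespace character, followed by a word that starts with an uppercase letter and is at least 4 characters long.

With this simplification, you can use a lookahead assertion to match every occurrence of "the end of a sentence", \.\s(?=[A-Z][a-zA-Z]{3,}), and use this expression to split the text with re.split, as follows: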

import re

text = "<your text>"

sentences = re.split(r"\.\s(?=[A-Z][a-zA-Z]{3,})", text)

print(sentences)

With your data, these are the sentences obtained with this approach:

This Application is a Continuation of U.S. patent application Ser. No. 15/731,069, filed on Apr. 14, 2017, which was a Continuation of U.S. application Ser. No. 14/998,574, filed on Jan. 21, 2016 which was a Continuation of U.S. application Ser. No. 14/198,695 (now U.S. Pat. No. 9,286,457) filed on Mar. 6, 2014, which was a Continuation in Part of U.S. patent application Ser. No. 12/931,340, filed on Jan. 31, 2011, (now U.S. Pat. No. 8,842,887) which was a Continuation in Part of U.S. patent application Ser. No. 12/627,413 filed on Nov. 30, 2009, (now U.S. Pat. No. 7,916,907) which was a continuation of application Ser. No. 11/151,412, filed on Jun. 14, 2005, now abandoned. application Ser. No. 14/198,695 claims the benefit of Provisional Application No. 61/851,884, filed on Mar. 15, 2013
Application Ser. No. 12/931,340 claims the benefit of Provisional Application No. 61/456,901, filed on Nov. 15, 2010 and application Ser. No. 11/151,412 claims the benefit of Provisional Application No. 60/579,422 filed on Jun. 14, 2004
Testing some more example of U.S.A and U.S in my paragraph
Checking Fig. 3. in my paragraph.
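
The split above consumes the period that ends each sentence. A possible variant, sketched here as an assumption rather than as part of the original answer, moves the period into a lookbehind so that it stays attached to its sentence. The name BOUNDARY and the sample text are hypothetical; the sample is a shortened stand-in for the question's paragraph:

```python
import re

# Split on the whitespace that follows a period, but only when the next word
# starts with an uppercase letter and has at least 4 letters.
# The period sits in a lookbehind, so re.split does not remove it.
BOUNDARY = re.compile(r"(?<=\.)\s(?=[A-Z][a-zA-Z]{3,})")

# Hypothetical sample, not the full paragraph from the question.
text = (
    "Filed on Jun. 14, 2004. Testing U.S. patent rules. "
    "Checking Fig. 3. in my paragraph."
)
sentences = BOUNDARY.split(text)
for sentence in sentences:
    print(sentence)
```

The heuristic still has limits: an abbreviation followed by a long capitalized word (for example "U.S. Patent") would be treated as a sentence end.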

Answer 3 (kqqjbcuj)

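Depending on how specific you want the split positions to be, you could use an alternation with re.sub.
Then, in the callback, check whether the capture group participated in the match. If it did, use it in the replacement, followed by 2 newlines (or whatever you want to put after it).
Otherwise there is only the full match x.group(), which should stay where it is, so you return it unchanged as the replacement.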

(?<!\S)[A-Z](?:[a-z]+\.(?:\s+\d+\.)?|(?:\.[A-Z]+)*\.?)(?!\S)|([!.?])\s(?=[A-Z])

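The pattern matches

  • (?<!\S) Assert a whitespace boundary to the left
  • [A-Z] Match a single char A-Z
  • (?: Non-capturing group for the alternatives
  • [a-z]+\. Match 1+ chars a-z and a .
  • (?:\s+\d+\.)? Optionally match 1+ whitespace chars, 1+ digits and a .
  • | Or
  • (?:\.[A-Z]+)*\.? Match optional repetitions of . followed by 1+ chars A-Z, then an optional dot
  • ) Close the non-capturing group
  • (?!\S) Assert a whitespace boundary to the right
  • | Or
  • ([!.?])\s Capture group 1: capture one of ! . ? and then match a whitespace char
  • (?=[A-Z]) Positive lookahead: assert a char A-Z directly to the right

See the regex demo and the Python demo.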

import re

pattern = r"(?<!\S)[A-Z](?:[a-z]+\.(?:\s+\d+\.)?|(?:\.[A-Z]+)*\.?)(?!\S)|([!.?])\s(?=[A-Z])"
s = "This Application is a Continuation of U.S. patent application Ser. No. 15/731,069, filed on Apr. 14, 2017, which was a Continuation of U.S. application Ser. No. 14/998,574, filed on Jan. 21, 2016 which was a Continuation of U.S. application Ser. No. 14/198,695 (now U.S. Pat. No. 9,286,457) filed on Mar. 6, 2014, which was a Continuation in Part of U.S. patent application Ser. No. 12/931,340, filed on Jan. 31, 2011, (now U.S. Pat. No. 8,842,887) which was a Continuation in Part of U.S. patent application Ser. No. 12/627,413 filed on Nov. 30, 2009, (now U.S. Pat. No. 7,916,907) which was a continuation of application Ser. No. 11/151,412, filed on Jun. 14, 2005, now abandoned. application Ser. No. 14/198,695 claims the benefit of Provisional Application No. 61/851,884, filed on Mar. 15, 2013. Application Ser. No. 12/931,340 claims the benefit of Provisional Application No. 61/456,901, filed on Nov. 15, 2010 and application Ser. No. 11/151,412 claims the benefit of Provisional Application No. 60/579,422 filed on Jun. 14, 2004. Testing some more example of U.S.A and U.S in my paragraph. Checking Fig. 3. in my paragraph."

result = re.sub(pattern, lambda x: x.group(1) + "\n\n" if x.group(1) else x.group(), s)
print(result)

Output

This Application is a Continuation of U.S. patent application Ser. No. 15/731,069, filed on Apr. 14, 2017, which was a Continuation of U.S. application Ser. No. 14/998,574, filed on Jan. 21, 2016 which was a Continuation of U.S. application Ser. No. 14/198,695 (now U.S. Pat. No. 9,286,457) filed on Mar. 6, 2014, which was a Continuation in Part of U.S. patent application Ser. No. 12/931,340, filed on Jan. 31, 2011, (now U.S. Pat. No. 8,842,887) which was a Continuation in Part of U.S. patent application Ser. No. 12/627,413 filed on Nov. 30, 2009, (now U.S. Pat. No. 7,916,907) which was a continuation of application Ser. No. 11/151,412, filed on Jun. 14, 2005, now abandoned. application Ser. No. 14/198,695 claims the benefit of Provisional Application No. 61/851,884, filed on Mar. 15, 2013.

Application Ser. No. 12/931,340 claims the benefit of Provisional Application No. 61/456,901, filed on Nov. 15, 2010 and application Ser. No. 11/151,412 claims the benefit of Provisional Application No. 60/579,422 filed on Jun. 14, 2004.

Testing some more example of U.S.A and U.S in my paragraph.

Checking Fig. 3. in my paragraph.
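
If a list of sentences is more convenient than a single string with blank lines, the same pattern can drive a small splitter. This is a sketch under assumptions: the function name split_sentences and the shortened sample are hypothetical, and only matches in which group 1 participated are treated as sentence boundaries:

```python
import re

# Same pattern as above: the first alternative swallows abbreviations,
# the second captures real sentence-ending punctuation in group 1.
PATTERN = re.compile(
    r"(?<!\S)[A-Z](?:[a-z]+\.(?:\s+\d+\.)?|(?:\.[A-Z]+)*\.?)(?!\S)"
    r"|([!.?])\s(?=[A-Z])"
)


def split_sentences(text):
    """Return the sentences of text as a list, keeping end punctuation."""
    sentences = []
    start = 0
    for match in PATTERN.finditer(text):
        if match.group(1):
            # Real boundary: keep the punctuation, drop the whitespace.
            sentences.append(text[start:match.end(1)])
            start = match.end()
    tail = text[start:]
    if tail:
        sentences.append(tail)
    return sentences


# Hypothetical sample, not the full paragraph from the question.
sample = (
    "Filed on Mar. 15, 2013. Application Ser. No. 12/931,340 is pending. "
    "Checking Fig. 3. in my paragraph."
)
result = split_sentences(sample)
for sentence in result:
    print(sentence)
```

Because the first alternative also consumes any capitalized word followed by a period (for example "Home."), a sentence that ends in such a word is not split by this pattern.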
