Identifying sentences in text with regex

dgenwo3n  posted on 2023-10-22  in  Other

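I am having some trouble correctly identifying sentences in text for a few specific corner cases:
1. If three dots (...) are involved, they are not preserved.
2. If a " is involved.
3. If a sentence accidentally starts with a lowercase letter.
This is how I identify sentences in text so far (source: Subtitles Reformat to end with complete sentence):
The re.findall part basically looks for a chunk of str that starts with a capital letter [A-Z], followed by anything except punctuation, and ends with a punctuation mark [\.?!].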

import re
text = "We were able to respond to the first research question. Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.

Next, we also determined the size of the population.

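Corner case 1: three dots (...)
The dots are not preserved, because nothing in the pattern says what to do when three dots appear in a row. How can this be changed?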

text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.

Next, we also determined the size of the population.

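Corner case 2: "
The " symbol is kept when it appears inside a sentence, but, like the extra dots after the punctuation mark, it is dropped when it comes at the end.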

text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first "research" question: "What is this?

Next, we also determined the size of the population.

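Corner case 3: sentence starts with a lowercase letter
If a sentence accidentally starts with a lowercase letter, that sentence is ignored. The goal is to recognize that the previous sentence has ended (or that the text has just begun), so a new sentence must be starting.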

text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")

We were able to respond to the first research question.

Edit

I tested this:

import spacy
from spacy.lang.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]

But this is what I get:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-157-4fd093d3402b> in <module>()
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

<ipython-input-157-4fd093d3402b> in <listcomp>(.0)
      6 nlp = English()
      7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]

doc.pyx in sents()

ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with:
nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the
dependency parser, or set sentence boundaries by setting
doc[i].is_sent_start.

Answer #1 (zu0ti5jz)

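You can modify the regex to match your corner cases.
First, you do not need to escape . inside [].
For the first corner case, you can use [.!?]* to match the sentence-ending token.
For the second case, you can match a " after [.!?].
For the last one, you can let the sentence start with either an uppercase or a lowercase letter: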

import re

# A letter of either case, then non-terminators, then any run of
# terminators, then an optional closing double quote.
regex = r'([A-Za-z][^.!?]*[.!?]*"?)'

text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
print()

text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)
print()

text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(regex, text):
    print(sentence)

Explanation

  • [A-Za-z]: every match must start with a letter, either uppercase or lowercase.
  • [^.?!]*: greedily matches any characters other than . ? ! (the sentence-ending characters).
  • [.?!]*: matches the ending characters, so a run such as ...??!!??? is matched as part of the sentence.
  • "?: finally matches an optional quotation mark at the end of the sentence.
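Corner case 1:
We were able to respond to the first research question...
Next, we also determined the size of the population.
Corner case 2:
We were able to respond to the first "research" question: "What is this?"
Next, we also determined the size of the population.
Corner case 3:
We were able to respond to the first research question.
next, we also determined the size of the population.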

Answer #2 (omqzjyyz)

You can use an industrial-strength package. For example, spaCy has a very good sentence tokenizer.

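# Written against an early spaCy release: the spacy.en import path and
# Span.string are not available in spaCy 3 (see the spaCy 3.0 answer below).
from __future__ import unicode_literals, print_function
from spacy.en import English

raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]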

Your scenarios:
1. Case result -> ['We were able to respond to the first research question...', 'Next, we also determined the size of the population.']
2. Case result -> ['We were able to respond to the first "research" question: "What is this?"', 'Next, we also determined the size of the population.']
3. Case result -> ['We were able to respond to the first research question.', 'next, we also determined the size of the population.']

Answer #3 (ogq8wdun)

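Try this regex: ([A-Z][^.!?]*[.!?]+["]?)
'+' means one or more
'?' means zero or one
This should handle corner cases 1 and 2 from your question. For corner case 3, the leading [A-Z] would also have to allow lowercase letters, for example [A-Za-z].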
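The ideas from the answers so far can be combined and checked against all three corner cases from the question. The following is only a sketch: the helper name split_sentences and the constant SENTENCE_RE are made up for illustration, and the pattern assumes plain ASCII punctuation and straight double quotes.

```python
import re

# Letter (either case), then non-terminators, then one or more
# terminators, then an optional closing double quote.
SENTENCE_RE = re.compile(r'[A-Za-z][^.!?]*[.!?]+"?')


def split_sentences(text):
    """Return the sentence-like chunks found in text (hypothetical helper)."""
    return SENTENCE_RE.findall(text)


tail = "Next, we also determined the size of the population."
cases = [
    # Corner case 1: three dots in a row.
    "We were able to respond to the first research question... " + tail,
    # Corner case 2: closing quote after the terminator.
    'We were able to respond to the first "research" question: "What is this?" ' + tail,
    # Corner case 3: second sentence starts in lowercase.
    "We were able to respond to the first research question. " + tail.lower(),
]
for text in cases:
    for sentence in split_sentences(text):
        print(sentence)
    print()
```

A pattern like this still splits inside abbreviations such as "e.g." and inside decimals such as "3.5", so it is best suited to fairly clean prose.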

Answer #4 (nx7onnlm)

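Answering the edited question:
I think the code you are using was written for an older version of spaCy. For spaCy 3.0, you first need to download the en_core_web_sm model: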

python -m spacy download en_core_web_sm

Then the following solution should work:

import spacy

raw_text = 'Hello, world. Here are two sentences.'
nlp = spacy.load("en_core_web_sm")
doc = nlp(raw_text)
# doc.sents yields Span objects; use sent.text if plain strings are needed.
sentences = [sent for sent in doc.sents]
print(sentences)

Output:
[Hello, world., Here are two sentences.]
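
The error message in the question points to another option that needs no model download: a rule-based sentencizer on a blank English pipeline. The sketch below assumes spaCy 3.x, where the component is added by its string name. It sets boundaries from punctuation alone, so its results may differ from those of a trained model.

```python
import spacy

raw_text = "Hello, world. Here are two sentences."

# Blank English pipeline: tokenizer only, no trained components.
nlp = spacy.blank("en")
# Rule-based component that sets sentence boundaries at punctuation.
nlp.add_pipe("sentencizer")

doc = nlp(raw_text)
# sent.text gives the plain string for each sentence span.
sentences = [sent.text.strip() for sent in doc.sents]
print(sentences)
```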
