I'm having a bit of trouble correctly identifying sentences in text for certain corner cases:
1. If an ellipsis ("...") is involved, it is not preserved.
2. If a quotation mark (") is involved, it is cut off at the end of the sentence.
3. If a sentence accidentally starts with a lowercase letter.
This is how I identify sentences in text so far (source: Reformat subtitles to end with complete sentence): the re.findall part basically looks for a block of the string that starts with an uppercase letter [A-Z], followed by anything that is not sentence-ending punctuation, and ends with a punctuation mark [\.?!].
import re
text = "We were able to respond to the first research question. Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.
Next, we also determined the size of the population.
Corner case 1: dots
The dots are not preserved, because nothing specifies what to do when three dots appear in a row. How can I change that?
text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.
Next, we also determined the size of the population.
Corner case 2: quotation marks
The " symbol is successfully kept inside the sentence, but just like the dots after the closing punctuation, it gets dropped at the end.
text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first "research" question: "What is this?
Next, we also determined the size of the population.
Corner case 3: sentence starting with a lowercase letter
If a sentence accidentally starts with a lowercase letter, that sentence is ignored. The intent is that the previous sentence has clearly ended (or the text has just begun), so a new sentence must have started.
text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.
Edit
I tested this:
import spacy
from spacy.lang.en import English
raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]
but I get
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-157-4fd093d3402b> in <module>()
6 nlp = English()
7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]
<ipython-input-157-4fd093d3402b> in <listcomp>(.0)
6 nlp = English()
7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]
doc.pyx in sents()
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with:
nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the
dependency parser, or set sentence boundaries by setting
doc[i].is_sent_start.
4 Answers
zu0ti5jz1#
You can adapt the regex to match your corner cases.
First, you don't need to escape . inside [].
For the first corner case, you can use [.!?]* to match the sentence-ending tokens. For the second one, you can match " after [.!?]. For the last one, you can let the sentence start with either an uppercase or a lowercase letter.
Explanation
[A-Za-z], so every match has to start with a letter, either uppercase or lowercase.
[^.?!]*, which matches any character that is not ., ? or ! (the sentence-ending characters).
[.?!]*, which matches the ending characters, so ...??!!??? will be matched as part of the sentence.
"?, which finally matches a quote at the end of the sentence.
Corner case 1:
We were able to respond to the first research question...
Next, we also determined the size of the population.
Corner case 2:
We were able to respond to the first "research" question: "What is this?"
Next, we also determined the size of the population.
Corner case 3:
We were able to respond to the first research question.
next, we also determined the size of the population.
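Putting the pieces above together gives something like the following sketch (using [A-Za-z] for the upper-or-lowercase start rather than the error-prone [A-z] range, which also matches [, ], ^, _ and the backtick):

```python
import re

# Assembled from the parts explained above: a letter (either case),
# then non-terminators, then any run of terminators, then an
# optional closing quote.
pattern = r'([A-Za-z][^.?!]*[.?!]*"?)'

texts = [
    'We were able to respond to the first research question... '
    'Next, we also determined the size of the population.',
    'We were able to respond to the first "research" question: '
    '"What is this?" Next, we also determined the size of the population.',
    'We were able to respond to the first research question. '
    'next, we also determined the size of the population.',
]
for text in texts:
    for sentence in re.findall(pattern, text):
        print(sentence)
```

Because every trailing part is optional, the pattern degrades gracefully: a sentence with no ellipsis or quote still ends after its single terminator.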
omqzjyyz2#
You can use an industrial-strength package. For example, spacy has a very good sentence tokenizer.
Your scenarios:
Case 1 result ->
['We were able to respond to the first research question...', 'Next, we also determined the size of the population.']
Case 2 result ->
['We were able to respond to the first "research" question: "What is this?"', 'Next, we also determined the size of the population.']
Case 3 result ->
['We were able to respond to the first research question.', 'next, we also determined the size of the population.']
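The answer does not show its code; a minimal sketch of the usual spaCy pattern, demonstrated on corner case 3. The results above were presumably produced with a trained pipeline such as en_core_web_sm; the rule-based sentencizer used here so the sketch runs without a model download may split the ellipsis case differently:

```python
import spacy

# Blank English pipeline plus the rule-based sentencizer, which
# splits purely on sentence-final punctuation (so a lowercase
# start does not prevent a split).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = ("We were able to respond to the first research question. "
        "next, we also determined the size of the population.")
doc = nlp(text)
sentences = [sent.text for sent in doc.sents]
print(sentences)
```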
ogq8wdun3#
Try this regex: ([A-Z][^.!?]*[.!?]+["]?)
"+" means one or more
"?" means zero or one
This should pass all 3 corner cases you mentioned above.
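A quick check of this pattern on the first two corner cases (note that the leading [A-Z] still requires an uppercase start, so corner case 3 would additionally need [A-Za-z]):

```python
import re

pattern = r'([A-Z][^.!?]*[.!?]+["]?)'

# [.!?]+ keeps the whole run of terminators ("..."), and ["]?
# keeps a closing quote right after them.
ellipsis_text = ('We were able to respond to the first research question... '
                 'Next, we also determined the size of the population.')
quote_text = ('We were able to respond to the first "research" question: '
              '"What is this?" Next, we also determined the size of the population.')

print(re.findall(pattern, ellipsis_text))
print(re.findall(pattern, quote_text))
```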
nx7onnlm4#
To answer the edited question:
I think the code you are using is from an older version of spacy. For spaCy 3.0, you need to download the en_core_web_sm model first:
Then the following solution should work:
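The answer's snippet is not reproduced above; a minimal sketch of the spaCy 3.x version (the model route the answer names is shown in comments, the runnable lines use the rule-based sentencizer that the E030 message itself suggests, and sent.text replaces the removed sent.string):

```python
import spacy

# Route named in the answer (requires a one-time download):
#   $ python -m spacy download en_core_web_sm
#   nlp = spacy.load("en_core_web_sm")
# Lighter route, as suggested by the E030 error message: a blank
# pipeline plus the rule-based sentencizer.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

raw_text = 'Hello, world. Here are two sentences.'
doc = nlp(raw_text)
# spaCy 3.x removed Span.string; use Span.text instead.
sentences = [sent.text.strip() for sent in doc.sents]
print(sentences)
```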
Output -
['Hello, world.', 'Here are two sentences.']