I'm having a bit of trouble correctly identifying sentences in text for certain corner cases:
1. If an ellipsis ("...") is involved, it is not preserved.
2. If a quotation mark (") is involved, it is cut off at the end of the sentence.
3. If a sentence accidentally starts with a lowercase letter.
This is how I identify sentences in text so far (source: Reformat subtitles to end with complete sentence): the re.findall part basically looks for a block of the string that starts with an uppercase letter [A-Z], followed by anything that is not sentence-ending punctuation, and ends with a punctuation mark [\.?!].
import re
text = "We were able to respond to the first research question. Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.
Next, we also determined the size of the population.
Corner case 1: dots
The dots are not preserved, because nothing specifies what to do when three dots appear in a row. How can I change that?
text = "We were able to respond to the first research question... Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.
Next, we also determined the size of the population.
Corner case 2: quotation marks
The " symbol is successfully kept inside the sentence, but just like the dots after the closing punctuation, it gets dropped at the end.
text = "We were able to respond to the first \"research\" question: \"What is this?\" Next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first "research" question: "What is this?
Next, we also determined the size of the population.
Corner case 3: sentence starting with a lowercase letter
If a sentence accidentally starts with a lowercase letter, that sentence is ignored. The intent is that the previous sentence has clearly ended (or the text has just begun), so a new sentence must have started.
text = "We were able to respond to the first research question. next, we also determined the size of the population."
for sentence in re.findall(r'([A-Z][^\.!?]*[\.!?])', text):
    print(sentence + "\n")
We were able to respond to the first research question.
Edit
I tested this:
import spacy
from spacy.lang.en import English
raw_text = 'Hello, world. Here are two sentences.'
nlp = English()
doc = nlp(raw_text)
sentences = [sent.string.strip() for sent in doc.sents]
but I get
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-157-4fd093d3402b> in <module>()
6 nlp = English()
7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]
<ipython-input-157-4fd093d3402b> in <listcomp>(.0)
6 nlp = English()
7 doc = nlp(raw_text)
----> 8 sentences = [sent.string.strip() for sent in doc.sents]
doc.pyx in sents()
ValueError: [E030] Sentence boundaries unset. You can add the 'sentencizer' component to the pipeline with:
nlp.add_pipe(nlp.create_pipe('sentencizer')) Alternatively, add the
dependency parser, or set sentence boundaries by setting
doc[i].is_sent_start.
4 Answers
zu0ti5jz1#
You can adapt the regex to match your corner cases.
First, you don't need to escape . inside [].
For the first corner case, you can use [.!?]* to match the sentence-ending tokens. For the second one, you can match " after [.!?]. For the last one, you can let the sentence start with either an uppercase or a lowercase letter.
Explanation
[A-Za-z], so every match has to start with a letter, either uppercase or lowercase.
[^.?!]*, which matches any character that is not ., ? or ! (the sentence-ending characters).
[.?!]*, which matches the ending characters, so ...??!!??? will be matched as part of the sentence.
"?, which finally matches a quote at the end of the sentence.
Corner case 1:
We were able to respond to the first research question...
Next, we also determined the size of the population.
Corner case 2:
We were able to respond to the first "research" question: "What is this?"
Next, we also determined the size of the population.
Corner case 3:
We were able to respond to the first research question.
next, we also determined the size of the population.
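Putting the pieces above together gives something like the following sketch (using [A-Za-z] for the upper-or-lowercase start rather than the error-prone [A-z] range, which also matches [, ], ^, _ and the backtick):

```python
import re

# Assembled from the parts explained above: a letter (either case),
# then non-terminators, then any run of terminators, then an
# optional closing quote.
pattern = r'([A-Za-z][^.?!]*[.?!]*"?)'

texts = [
    'We were able to respond to the first research question... '
    'Next, we also determined the size of the population.',
    'We were able to respond to the first "research" question: '
    '"What is this?" Next, we also determined the size of the population.',
    'We were able to respond to the first research question. '
    'next, we also determined the size of the population.',
]
for text in texts:
    for sentence in re.findall(pattern, text):
        print(sentence)
```

Because every trailing part is optional, the pattern degrades gracefully: a sentence with no ellipsis or quote still ends after its single terminator.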
omqzjyyz2#
You can use an industrial-strength package. For example, spacy has a very good sentence tokenizer.
Your scenarios:
Case 1 result ->
['We were able to respond to the first research question...', 'Next, we also determined the size of the population.']
Case 2 result ->
['We were able to respond to the first "research" question: "What is this?"', 'Next, we also determined the size of the population.']
Case 3 result ->
['We were able to respond to the first research question.', 'next, we also determined the size of the population.']
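The answer does not show its code; a minimal sketch of the usual spaCy pattern, demonstrated on corner case 3. The results above were presumably produced with a trained pipeline such as en_core_web_sm; the rule-based sentencizer used here so the sketch runs without a model download may split the ellipsis case differently:

```python
import spacy

# Blank English pipeline plus the rule-based sentencizer, which
# splits purely on sentence-final punctuation (so a lowercase
# start does not prevent a split).
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

text = ("We were able to respond to the first research question. "
        "next, we also determined the size of the population.")
doc = nlp(text)
sentences = [sent.text for sent in doc.sents]
print(sentences)
```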
ogq8wdun3#
Try this regex: ([A-Z][^.!?]*[.!?]+["]?)
"+" means one or more
"?" means zero or one
This should pass all 3 corner cases you mentioned above.
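A quick check of this pattern on the first two corner cases (note that the leading [A-Z] still requires an uppercase start, so corner case 3 would additionally need [A-Za-z]):

```python
import re

pattern = r'([A-Z][^.!?]*[.!?]+["]?)'

# [.!?]+ keeps the whole run of terminators ("..."), and ["]?
# keeps a closing quote right after them.
ellipsis_text = ('We were able to respond to the first research question... '
                 'Next, we also determined the size of the population.')
quote_text = ('We were able to respond to the first "research" question: '
              '"What is this?" Next, we also determined the size of the population.')

print(re.findall(pattern, ellipsis_text))
print(re.findall(pattern, quote_text))
```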
nx7onnlm4#
To answer the edited question:
I think the code you are using is from an older version of spacy. For spaCy 3.0, you need to download the en_core_web_sm model first:
Then the following solution should work:
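The answer's snippet is not reproduced above; a minimal sketch of the spaCy 3.x version (the model route the answer names is shown in comments, the runnable lines use the rule-based sentencizer that the E030 message itself suggests, and sent.text replaces the removed sent.string):

```python
import spacy

# Route named in the answer (requires a one-time download):
#   $ python -m spacy download en_core_web_sm
#   nlp = spacy.load("en_core_web_sm")
# Lighter route, as suggested by the E030 error message: a blank
# pipeline plus the rule-based sentencizer.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")

raw_text = 'Hello, world. Here are two sentences.'
doc = nlp(raw_text)
# spaCy 3.x removed Span.string; use Span.text instead.
sentences = [sent.text.strip() for sent in doc.sents]
print(sentences)
```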
Output -
['Hello, world.', 'Here are two sentences.']