spaCy is inconsistent when splitting sentences

Posted by h43kikqp 5 months ago in Other


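Hi,
I am using spaCy to split sentences in text built by joining a list of words with spaces. Frustratingly, the process behaves in ways that are unpredictable and hard to explain. I have a custom segmentation function in which I try to set custom sentence boundaries (i.e. is_sent_start).
Custom function: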

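import spacy
from spacy.language import Language

@Language.component("segm")
def set_custom_segmentation(doc):
    # Walk over every token except the last one.
    i = 0
    while i < len(doc) - 1:
        if doc[i].text.lower() in ["eq", "fig", "al", "table", "fig."]:
            # Abbreviation: the next token must not start a sentence.
            doc[i + 1].is_sent_start = False
            i += 1  # together with the step below, skips the next token
        elif doc[i].text in ["(", "'s"]:
            # An opening bracket or possessive never starts a sentence.
            doc[i].is_sent_start = False
            i += 1  # together with the step below, skips the next token
        elif doc[i].text in [".", ")."]:
            # A sentence-final token: the next token starts a sentence.
            doc[i + 1].is_sent_start = True
        else:
            doc[i + 1].is_sent_start = False
        i += 1
    return doc

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("segm", before="parser")
nlp.pipeline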

This is my nlp.pipeline:

[('tok2vec', <spacy.pipeline.tok2vec.Tok2Vec at 0x29f4c3ee0>),
 ('tagger', <spacy.pipeline.tagger.Tagger at 0x29f4c3f40>),
 ('segm', <function __main__.set_custom_segmentation(doc)>),
 ('parser', <spacy.pipeline.dep_parser.DependencyParser at 0x29f8380b0>),
 ('attribute_ruler',
  <spacy.pipeline.attributeruler.AttributeRuler at 0x29e3ee4c0>),
 ('lemmatizer', <spacy.lang.en.lemmatizer.EnglishLemmatizer at 0x29dd1f100>),
 ('ner', <spacy.pipeline.ner.EntityRecognizer at 0x29f7f7f40>)]
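
To see which boundaries the component above actually presets, its loop can be traced without spaCy. The sketch below is a plain-Python re-implementation of the same rules; `preset_boundaries` is a hypothetical helper (not part of spaCy), and the token list is an assumed tokenization that may differ from what the spaCy tokenizer produces. `None` marks tokens the loop never assigns.

```python
def preset_boundaries(tokens):
    """Mirror the loop in set_custom_segmentation on a plain list of strings.

    Returns one entry per token: True, False, or None (never assigned).
    """
    starts = [None] * len(tokens)
    i = 0
    while i < len(tokens) - 1:
        if tokens[i].lower() in ["eq", "fig", "al", "table", "fig."]:
            starts[i + 1] = False
            i += 1  # extra step: the next token is never examined
        elif tokens[i] in ["(", "'s"]:
            starts[i] = False
            i += 1  # extra step here as well
        elif tokens[i] in [".", ")."]:
            starts[i + 1] = True
        else:
            starts[i + 1] = False
        i += 1
    return starts


# Assumed tokenization of a fragment of the example text.
tokens = ["in", "Fig.", "2", ".", "We", "refer", "to", "fig.", "1", "of"]
for tok, start in zip(tokens, preset_boundaries(tokens)):
    print(f"{tok!r:10} {start}")
```

In this trace the token right after the figure number is never assigned, because the loop steps over the number itself. Under these assumptions, the boundary for that token is not fixed by the component and is left for later pipeline components to decide.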

How to reproduce the behaviour

doc = nlp("Massive ETGs are summarized in a schematic way in Fig. 2 . ##(this is the sentence to consider)## We refer the reader to fig. 1 of Forbes et al. ( 2011 ) and fig. 10 of Faifer et al. ( 2011 ) for real-world examples of our schematic plot, which show not only the mean gradients but also the individual GC data points. Figure 2.")

for sent in doc.sents:
    print(sent)

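This is the current output:

Different forms of the tokens Fig. 2 . produce different sentence splits. See the following examples.

If we change the input (replace 2. with 21.)

If we change the input (remove the space between 21 and the period)

If we change the input (remove the space between 2 and the period)

If we change the input (replace 2 with 1)

If we change the input (replace 2 with 3)

If we change the input (replace 2 with 4)

If we change the input (replace 2 with 4 and remove the space)

If we change the input (replace 2 with 400 and remove the space)

If we change the input (replace 2 with 400)

Sentence boundaries are being assigned inconsistently here. I have many other examples and can share them here if needed.
Any help in understanding this would be greatly appreciated.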
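
Comparisons like the ones above are easier to repeat when the variants are generated from one template. The sketch below assumes a hypothetical helper, `compare_variants`, that accepts any function mapping a text to a list of sentence strings. `naive_split` is only a stand-in so the block runs without a model; it is not how spaCy segments text.

```python
import re


def naive_split(text):
    """Stand-in segmenter: split after a period followed by whitespace."""
    return [s for s in re.split(r"(?<=\.)\s+", text.strip()) if s]


def compare_variants(segment, template, variants):
    """Return {variant: sentences} for each value substituted into template."""
    return {v: segment(template.format(ref=v)) for v in variants}


template = (
    "Massive ETGs are summarized in a schematic way in Fig. {ref} "
    "We refer the reader to fig. 1 of Forbes et al. ( 2011 )"
)
variants = ["2 .", "2.", "21 .", "21.", "400 .", "400."]

for variant, sents in compare_variants(naive_split, template, variants).items():
    print(f"{variant!r:8} -> {len(sents)} sentence(s)")
    for s in sents:
        print("    ", s)
```

With the pipeline from the question loaded, the `segment` argument could instead be `lambda text: [s.text for s in nlp(text).sents]`, so every variant goes through the same code path.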

Your environment

spaCy version: 3.6.0
Platform: macOS-14.3.1-arm64-arm-64bit
Python version: 3.10.12
Pipelines: en_core_web_lg (3.6.0), en_core_web_sm (3.6.0)


Source: https://github.com/explosion/spaCy/issues/13346

5 answers


wz1wpwve1#

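The problem you are probably running into is that the dependency parser is responsible for finding and setting sentence boundaries in the pretrained spaCy pipelines:
https://spacy.io/api/dependencyparser#assigned-attributes
For that reason, if you have your own pipe that sets boundaries, you probably want to run it after the parser. Could you try that and see whether it helps?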


1u4esq0p2#

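Hi @danieldk,

Thanks for your reply on this issue. I tried your suggestion and moved the custom segmentation function after the parser in nlp.pipeline, but I ran into an error.

I don't think this will work, because the parsing step interferes with the custom boundaries I need to set, which exist because of certain edge cases (such as Fig., eg., etc.).

I also found another issue here: #3569. It describes a similar problem.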


uelo1irk3#

I am running into a slightly different but similar issue.
spacy == 3.7.4, mac

In [69]: len(list(spacy.load("en_core_web_trf")("The first sentence. The second sentence. The last one").sents))
Out[69]: 1   <<<<<<<<<<<<<<<<<<<<<< WRONG

In [70]: len(list(spacy.load("en_core_web_trf")("The first sentence. The second sentence. The last one.").sents))
Out[70]: 3

In [71]: len(list(spacy.load("en_core_web_sm")("The first sentence. The second sentence. The last one").sents))
Out[71]: 3

In [72]: len(list(spacy.load("en_core_web_sm")("The first sentence. The second sentence. The last one.").sents))
Out[72]: 3


lvjbypge4#

> I am running into a slightly different but similar issue.
This is a different issue. Could you open a thread on the discussion forum?


nnt7mjpx5#

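> Thanks for your reply on this issue. I tried your suggestion and moved the custom segmentation function after the parser in nlp.pipeline, but I ran into an error.
Ah, right, sorry, I overlooked that. The problem with changing boundaries after parsing is that it can lead to dependencies that cross sentence boundaries, which is one of the reasons we do not allow it. We will have to look into this more deeply, since the parser should in principle respect boundaries that were set earlier. In the meantime, see
#11107
#7716
for more background.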
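
The concern about dependencies crossing sentence boundaries can be illustrated without spaCy. The sketch below uses a hypothetical helper, `crossing_arcs`, and a made-up head assignment (not real parser output): each token stores the index of its head, the root points at itself, and an arc crosses a boundary when a token and its head fall in different sentences.

```python
def sentence_ids(sent_starts):
    """Number the sentences: the id increases at every True after token 0."""
    ids, current = [], 0
    for i, start in enumerate(sent_starts):
        if i > 0 and start:
            current += 1
        ids.append(current)
    return ids


def crossing_arcs(heads, sent_starts):
    """Return (child, head) index pairs whose ends lie in different sentences."""
    ids = sentence_ids(sent_starts)
    return [(child, head) for child, head in enumerate(heads) if ids[child] != ids[head]]


words = ["See", "Fig.", "2", "for", "details"]
heads = [0, 0, 1, 0, 3]  # assumed parse: "2" attaches to "Fig.", "for" to "See"

# One sentence, as the assumed parse implies: no arc crosses a boundary.
print(crossing_arcs(heads, [True, False, False, False, False]))

# Forcing a new sentence at "2" after parsing leaves arcs that span the boundary.
for child, head in crossing_arcs(heads, [True, False, True, False, False]):
    print(f"{words[child]!r} -> {words[head]!r} crosses the boundary")
```

Setting the boundaries before the parser runs avoids this situation by construction, provided the parser then builds its arcs within the preset sentences.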

