CoreNLP boundaryMultiTokenRegex: regular expression picks the wrong group

Posted by bq9c1y66 6 months ago in Other

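Hello, I am trying to split a list of sentences that lacks proper punctuation into separate sentences. It looks like the expression written for boundaryMultiTokenRegex does not work as expected. The catch is that I cannot simply split the whole text on newlines in every case, because the text may contain sentences that span multiple lines. The list items, however, do start with a keyword indicator.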

    @Test
    public void testTokenizeNLsInList() {
        String text = "This is a list:\n" +
                "- String one\n" +
                "- String two;\n" +
                "- String three";

        Properties props = PropertiesUtils.asProperties(
                "annotators", "tokenize, ssplit",
                "tokenize.options", "tokenizeNLs",
                "ssplit.boundaryMultiTokenRegex", "(/\\n|\\*NL\\*/) /[^[\\p{Alnum}'\"`!?.,]]/ /\\p{Lu}\\p{L}+/"
        );

        StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

        Annotation document1 = new Annotation(text);
        pipeline.annotate(document1);
        List<CoreMap> sentences = document1.get(CoreAnnotations.SentencesAnnotation.class);
        assertEquals(4, sentences.size());

        // make sure that there are the correct # of tokens
        // (does NOT contain NL tokens)
        List<CoreLabel> tokens = document1.get(CoreAnnotations.TokensAnnotation.class);
        assertEquals(4, tokens.size());
    }

Expected behavior: four sentences:

This is a list:
- String one
- String two;
- String three

Actual behavior:

This is a list:\n- String,
one\n- String
two;\n- String
three

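Additional findings:
I checked what happens at https://github.com/stanfordnlp/CoreNLP/blob/master/src/edu/stanford/nlp/process/WordToSentenceProcessor.java#L282. The result is that all three matched tokens come back, rather than only the first one named in the pattern.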

List<? super IN> nodes = matcher.groupNodes();
        if (nodes != null && ! nodes.isEmpty()) {
          if (DEBUG) { log.info("    found match at: " + nodes); }
          isSentenceBoundary.put(nodes.get(nodes.size() - 1), true);
        }

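Also, in the debugger I can see that the matcher found two groups: one containing all the tokens in the pattern, and one containing only the first token, which is the correct one. Because the correct group is the second one, the final result is wrong.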
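To make the effect concrete, here is a minimal, CoreNLP-free sketch of the boundary choice. It assumes a match over the three tokens `*NL*`, `-`, `String` and compares marking the last token of the whole match (what the snippet above does) with marking the last token of the first group (what the question expects). The class and method names are hypothetical and not part of CoreNLP, and dropping `*NL*` tokens from the output is a simplification for readability.

```java
import java.util.ArrayList;
import java.util.List;

final class BoundaryChoiceDemo {

    // Splits tokens into sentences, closing a sentence after each index in boundaryIdx.
    // "*NL*" tokens are left out of the output (a simplification, not CoreNLP behavior).
    static List<String> split(List<String> tokens, List<Integer> boundaryIdx) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < tokens.size(); i++) {
            String tok = tokens.get(i);
            if (!tok.equals("*NL*")) {
                if (current.length() > 0) current.append(' ');
                current.append(tok);
            }
            if (boundaryIdx.contains(i) && current.length() > 0) {
                sentences.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) sentences.add(current.toString());
        return sentences;
    }

    public static void main(String[] args) {
        List<String> tokens =
                List.of("This", "is", "a", "list", ":", "*NL*", "-", "String", "one");
        // The whole match covers indices 5..7; the first group covers only index 5.
        System.out.println(split(tokens, List.of(7))); // boundary after the whole match
        System.out.println(split(tokens, List.of(5))); // boundary after the first group
    }
}
```

With the boundary at index 7, the bullet and the first word of the item stay attached to the preceding sentence, which matches the actual behavior reported above.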

Source: https://github.com/stanfordnlp/CoreNLP/issues/1078

4 answers

zzzyeukh #1

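From what I can tell, what you are trying to do is not the intended use of boundaryMultiTokenRegex. According to the documentation, the matched tokens are treated as part of the preceding sentence, so I understand this to mean that all matched tokens end up in the first sentence. In other words, the behavior you are seeing is the expected behavior. You would probably need to add functionality to WordToSentenceProcessor, such as a lookahead token regex. Alternatively, you could add a new annotator between ssplit and the other annotators that rearranges the sentences as needed. I may be missing something, but I currently do not see any existing feature that does exactly what you need.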
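The second suggestion, a rearranging step between ssplit and later annotators, could look roughly like the sketch below. It is only an illustration: it works on plain lists of token strings rather than CoreNLP's `CoreMap` sentences, the class and method names are hypothetical, and it assumes that a list item starts with a `-` token followed by a capitalized word.

```java
import java.util.ArrayList;
import java.util.List;

final class SentenceRearranger {

    // Hypothetical check: does the sentence end with a bullet token plus a capitalized word?
    static boolean endsWithItemStart(List<String> s) {
        int n = s.size();
        return n >= 2 && s.get(n - 2).equals("-") && s.get(n - 1).matches("\\p{Lu}\\p{L}+");
    }

    // Moves a trailing "- Word" pair from the end of each sentence to the start of the next.
    static List<List<String>> rearrange(List<List<String>> sentences) {
        List<List<String>> result = new ArrayList<>();
        List<String> carry = new ArrayList<>();
        for (List<String> sentence : sentences) {
            List<String> current = new ArrayList<>(carry);
            current.addAll(sentence);
            carry = new ArrayList<>();
            if (endsWithItemStart(current)) {
                int n = current.size();
                carry.addAll(current.subList(n - 2, n));
                current = new ArrayList<>(current.subList(0, n - 2));
            }
            if (!current.isEmpty()) result.add(current);
        }
        if (!carry.isEmpty()) result.add(carry); // leftover item start with no following text
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> split = List.of(
                List.of("This", "is", "a", "list", ":", "-", "String"),
                List.of("one", "-", "String"),
                List.of("two", ";", "-", "String"),
                List.of("three"));
        rearrange(split).forEach(System.out::println);
    }
}
```

A sentence that legitimately ends with a dash and a capitalized word would be rearranged incorrectly, so a real annotator would need a stricter signal, such as the newline token itself.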

6 months ago

oiopk7p5 #2

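OK, thanks. In that case I will definitely think about how to solve this.

I did consider a lookahead feature, but my concern is the snippet mentioned earlier: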

List<? super IN> nodes = matcher.groupNodes();
        if (nodes != null && ! nodes.isEmpty()) {
          if (DEBUG) { log.info("    found match at: " + nodes); }
          isSentenceBoundary.put(nodes.get(nodes.size() - 1), true);
        }

I am not sure that using matcher.groupNodes() is correct here. https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#cg says: "Group zero always stands for the entire expression."
The common understanding of groups in regular expressions looks like this:

// Group 1 wraps only "b"; group 0 would be the whole match "b c d".
java.util.regex.Pattern pattern = java.util.regex.Pattern.compile("(b) c d");
java.util.regex.Matcher matcher = pattern.matcher("a b c d e");
if (matcher.find()) {
    System.out.println(matcher.group(1)); // prints "b"
}

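Here b is the only result, because (b) is the only part declared as a group.
See also: https://rubular.com/r/PGDnmY3IcAua0j
So, with this understanding of match groups, no lookahead token regex feature would be needed.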

6 months ago

cedebl8k #3

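A simple way to stay fully compatible with existing expressions: add a specially named group at the place where the break should happen. For example, "ssplit.boundaryMultiTokenRegex", "(?$LN_BREAK /\\n|\\*NL\\*/) /[^[\\p{Alnum}'\"!?.,]]/ /\\p{Lu}\\p{L}+/", where $LN_BREAK names the group after whose last token the break occurs. If the pattern contains no group with that name, the existing logic stays unchanged.
What do you think of this proposal?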
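The fallback rule in this proposal can be illustrated with `java.util.regex` named groups standing in for TokensRegex groups. This is an analogy on plain strings, not CoreNLP code: the class name, the method, and the group name `LNBREAK` are assumptions made for the example (`java.util.regex` group names allow only letters and digits, so the underscore is dropped).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class NamedBreakDemo {
    static final String BREAK_GROUP = "LNBREAK";

    // Returns the character offset where the break would go for the first match,
    // or -1 if the pattern does not match at all.
    static int breakOffset(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        if (!m.find()) return -1;
        try {
            int end = m.end(BREAK_GROUP);
            if (end >= 0) return end; // the named group took part in the match
        } catch (IllegalArgumentException e) {
            // The pattern has no group with that name: keep the existing logic.
        }
        return m.end(); // existing logic: break after the whole match
    }

    public static void main(String[] args) {
        String text = "This is a list:\n- String one";
        Pattern named = Pattern.compile("(?<LNBREAK>\\n)- \\p{Lu}\\p{L}+");
        Pattern plain = Pattern.compile("\\n- \\p{Lu}\\p{L}+");
        System.out.println(text.substring(breakOffset(named, text))); // text after the break
        System.out.println(text.substring(breakOffset(plain, text)));
    }
}
```

Patterns without the named group behave exactly as before, which is the compatibility property the proposal is after.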

6 months ago

kdfy810k #4

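I am thinking that adding a new field would be much easier than trying to add more functionality to the existing boundaryMultiTokenRegex. Possibly two new fields, so that we can specify which pieces are assigned to the previous sentence and which to the next.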
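One way to picture the two-field idea is the sketch below. It is purely hypothetical: no such fields exist in the thread or in the code quoted above, the names `prevPart` and `nextPart` are invented for the example, and each part is simplified to a list of per-token regular expressions instead of a TokensRegex pattern.

```java
import java.util.ArrayList;
import java.util.List;

final class TwoPartBoundaryDemo {

    // True if the tokens starting at index start match the per-token regexes in order.
    static boolean matchesAt(List<String> tokens, int start, List<String> regexes) {
        if (start + regexes.size() > tokens.size()) return false;
        for (int k = 0; k < regexes.size(); k++) {
            if (!tokens.get(start + k).matches(regexes.get(k))) return false;
        }
        return true;
    }

    // Returns the indices of the tokens that end a sentence: the last token of prevPart
    // wherever prevPart is immediately followed by nextPart. Assumes prevPart is non-empty.
    static List<Integer> boundaries(List<String> tokens,
                                    List<String> prevPart, List<String> nextPart) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            if (matchesAt(tokens, i, prevPart)
                    && matchesAt(tokens, i + prevPart.size(), nextPart)) {
                result.add(i + prevPart.size() - 1);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("This", "is", "a", "list", ":", "*NL*",
                "-", "String", "one", "*NL*", "-", "String", "two");
        List<String> prevPart = List.of("\\*NL\\*");           // stays with the previous sentence
        List<String> nextPart = List.of("-", "\\p{Lu}\\p{L}+"); // starts the next sentence
        System.out.println(boundaries(tokens, prevPart, nextPart));
    }
}
```

Keeping the two parts separate avoids overloading group semantics in the existing pattern, at the cost of a second configuration value.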

6 months ago