CoreNLP Tokenizer splitHyphenated regression

pgpifvop posted 6 months ago in Other


The following snippet appears to split "year-end" correctly at the hyphen in version 3.9.2, but no longer does so in version 4.4.0. Is this the expected behavior?

public static void main(String[] args) {
    String text = "year-end";
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit");
    props.setProperty("tokenize.language", "en");
    props.setProperty("tokenize.options", "splitHyphenated=true,invertible,ptb3Escaping=true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation ann = new Annotation(text);
    pipeline.annotate(ann);
    List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);
    System.out.println(tokens.stream().map(CoreLabel::originalText).collect(Collectors.toList()));
}

Old output: [year, -, end]
New output: [year-end]

CoreNLP

Source: https://github.com/stanfordnlp/CoreNLP/issues/1289

9 answers


nzrxty8p1#

I don't see any problem here.

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;

import java.util.*;
import java.util.stream.*;

public class foo {
  public static void main(String[] args) {
    String text = "year-end";
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit");
    props.setProperty("tokenize.language", "en");
    props.setProperty("tokenize.options", "splitHyphenated=true,invertible,ptb3Escaping=true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation ann = new Annotation(text);
    pipeline.annotate(ann);
    List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);
    System.out.println(tokens.stream().map(CoreLabel::originalText).collect(Collectors.toList()));
  }
}

java foo
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[year, -, end]

6 months ago

i34xakig2#

This is what happens when using v4.4.0 via git checkout v4.4.0, or v4.5.0 from a git clone.

6 months ago

vql8enpb3#

Hmm, that's strange. Could it be interference from another library? I've isolated the error as best I can, but I still get it:

# lib/main has all of our classpath entries
$ find lib/main -name "*.jar" | grep stanford 
lib/main/edu.stanford.nlp_stanford-corenlp_4.4.0.jar

$ unzip -p lib/main/edu.stanford.nlp_stanford-corenlp_4.4.0.jar META-INF/MANIFEST.MF
Manifest-Version: 1.0
Implementation-Version: 4.4.0
Built-Date: 2022-01-20
Created-By: Stanford JavaNLP (jebolton)
Main-class: edu.stanford.nlp.pipeline.StanfordCoreNLP

$ cat foo.java                                                                      
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreLabel;

import java.util.*;
import java.util.stream.*;

public class foo {
  public static void main(String[] args) {
    String text = "year-end";
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit");
    props.setProperty("tokenize.language", "en");
    props.setProperty("tokenize.options", "splitHyphenated=true,invertible,ptb3Escaping=true");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    Annotation ann = new Annotation(text);
    pipeline.annotate(ann);
    List<CoreLabel> tokens = ann.get(CoreAnnotations.TokensAnnotation.class);
    System.out.println(tokens.stream().map(CoreLabel::originalText).collect(Collectors.toList()));
  }
}

$ "$JAVA_HOME/bin/javac" foo.java
OpenJDK 64-Bit Server VM warning: .hotspot_compiler file is present but has been ignored.  Run with -XX:CompileCommandFile=.hotspot_compiler to load the file.

$ "$JAVA_HOME/bin/java" foo      
OpenJDK 64-Bit Server VM warning: .hotspot_compiler file is present but has been ignored.  Run with -XX:CompileCommandFile=.hotspot_compiler to load the file.
SLF4J: No SLF4J providers were found.
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#noProviders for further details.
[year-end]

Maybe this is an Antlr version issue? We have Antlr Runtime 4.7.2.

6 months ago

oiopk7p54#

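Going further: apparently, if I remove ptb3Escaping=true from the options, it works as expected. I'll dig into the Lexer, but it looks like ptb3Escaping has its own opinion about hyphens, and there is some ordering nondeterminism about whose opinion wins.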
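As a rough sketch of the workaround described above, the properties from the original snippet can be built without ptb3Escaping=true. The class name WorkaroundProps and the helper buildProps are made up for illustration; only the property keys and option names come from this thread. Dropping ptb3Escaping presumably also turns off whatever other escaping that option enables, so this may not suit every pipeline.

```java
import java.util.Properties;

public class WorkaroundProps {

  // Hypothetical helper: same settings as the original snippet,
  // except that ptb3Escaping=true is left out of tokenize.options.
  static Properties buildProps() {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit");
    props.setProperty("tokenize.language", "en");
    props.setProperty("tokenize.options", "splitHyphenated=true,invertible");
    return props;
  }

  public static void main(String[] args) {
    // The resulting Properties would be passed to new StanfordCoreNLP(props)
    // exactly as in the original snippet.
    System.out.println(buildProps().getProperty("tokenize.options"));
  }
}
```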

6 months ago

ymzxtsji5#

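Blarg, the decompiler chokes on PTBLexer and won't let me set breakpoints, but I have good evidence that this is indeed what is happening.
Consider the following block, which sits at the start of the constructor in PTBLexer.flex: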

Properties prop = StringUtils.stringToProperties(options);
Set<Map.Entry<Object,Object>> props = prop.entrySet();
for (Map.Entry<Object,Object> item : props) {
  String key = (String) item.getKey();
  String value = (String) item.getValue();
  boolean val = Boolean.parseBoolean(value);
  if ("".equals(key)) {
    // allow an empty item
    //...
  } else if ("ptb3Escaping".equals(key)) {
    //...
    splitHyphenated = ! val;
    //...
  } else if ("ud".equals(key)) {
    //...
    splitHyphenated = val;
    //...
  } else if ("splitHyphenated".equals(key)) {
    splitHyphenated = val;
  }

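If I inspect props via props.entrySet().iterator().next() (fortunately StringUtils still decompiles), I get splitHyphenated -> true, which suggests that ptb3Escaping comes later in the property set and therefore overrides the value of splitHyphenated.
Are ptb3Escaping and splitHyphenated really incompatible, or is this accidental?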
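To make the suspected mechanism concrete, here is a minimal standalone sketch that does not use CoreNLP at all. It copies only the two relevant branches of the loop quoted above and forces the iteration order with a LinkedHashMap. The class OptionOrderDemo, the helpers resolveSplitHyphenated and ordered, and the initial value false are all assumptions made for illustration, not the real lexer code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OptionOrderDemo {

  // Simplified stand-in for the quoted option loop: both keys assign
  // splitHyphenated, so whichever key is visited last decides the result.
  static boolean resolveSplitHyphenated(Map<String, String> options) {
    boolean splitHyphenated = false; // assumed default, for illustration only
    for (Map.Entry<String, String> item : options.entrySet()) {
      String key = item.getKey();
      boolean val = Boolean.parseBoolean(item.getValue());
      if ("ptb3Escaping".equals(key)) {
        splitHyphenated = !val;
      } else if ("splitHyphenated".equals(key)) {
        splitHyphenated = val;
      }
    }
    return splitHyphenated;
  }

  // Builds a map that iterates in exactly the given key order, every value "true".
  static Map<String, String> ordered(String... keys) {
    Map<String, String> m = new LinkedHashMap<>();
    for (String k : keys) {
      m.put(k, "true");
    }
    return m;
  }

  public static void main(String[] args) {
    // ptb3Escaping visited first, splitHyphenated last: prints true
    System.out.println(resolveSplitHyphenated(ordered("ptb3Escaping", "splitHyphenated")));
    // splitHyphenated visited first, ptb3Escaping last: prints false
    System.out.println(resolveSplitHyphenated(ordered("splitHyphenated", "ptb3Escaping")));
  }
}
```

The same two options yield opposite results depending only on visit order, which would match the observation that splitHyphenated is iterated first and then overridden.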

6 months ago

yrefmtwq6#

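Hmm, this could get ugly. I tried a couple of different Java 8 installs and got the expected behavior on both, but on Java 11 and Java 14 installs I get the same error as you. Which Java version are you running? Perhaps the string hash function changed between versions, so the keys are iterated in a different order? In that case, I suppose the simplest fix is to have later keys override earlier ones in a deterministic order.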
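One way to get a deterministic order, sketched here under assumptions rather than taken from the actual fix, is to parse the option string into a LinkedHashMap so that keys are visited in the order they were written. The class OrderedOptions and the helper parseOptionsInOrder are hypothetical, and treating a bare key such as invertible as "true" is also an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OrderedOptions {

  // Hypothetical helper: parses "a=1,b,c=2" into a map that iterates in written order.
  static Map<String, String> parseOptionsInOrder(String options) {
    Map<String, String> result = new LinkedHashMap<>();
    for (String item : options.split(",")) {
      String trimmed = item.trim();
      if (trimmed.isEmpty()) {
        continue; // allow an empty item
      }
      int eq = trimmed.indexOf('=');
      String key = eq < 0 ? trimmed : trimmed.substring(0, eq).trim();
      String value = eq < 0 ? "true" : trimmed.substring(eq + 1).trim();
      // Remove before put so a repeated key moves to the end of the iteration order.
      result.remove(key);
      result.put(key, value);
    }
    return result;
  }

  public static void main(String[] args) {
    // prints [splitHyphenated, invertible, ptb3Escaping]
    System.out.println(
        parseOptionsInOrder("splitHyphenated=true,invertible,ptb3Escaping=true").keySet());
  }
}
```

With an order-preserving parse like this, the option string from the original snippet would still visit ptb3Escaping last, so splitHyphenated=true would have to be written after ptb3Escaping=true for it to take effect. The benefit is only that the result no longer depends on the Java version.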

6 months ago

3df52oht7#

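I'm now sure that the order of the keys in the Properties object is what causes this.
While we work out some kind of fix, you can always set the lexer's splitHyphenated property to whatever value you need...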

6 months ago

ybzsozfc8#

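So, to what extent is this something that needs a quick fix, as opposed to something you can work around (e.g., by setting the appropriate option on the lexer after creating it) until the next release?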

6 months ago

i7uaboj49#

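The fix for the tokenizer is now in the dev branch. I'd like to fix this in the parser as well, but that requires reserializing all of the models. Please keep this open in the meantime!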

6 months ago