nltk: edge cases in the MT evaluation metrics

9vw9lbht · posted 4 months ago in Other

There are still some issues with the MT evaluation metrics in nltk.translate. Most of the BLEU-related ones were resolved in #1330, but similar problems also show up in RIBES and CHRF:

  • ribes_score.py
  • https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L290 and https://github.com/nltk/nltk/blob/develop/nltk/translate/ribes_score.py#L320 raise a ZeroDivisionError when the number of possible n-gram pairs is 0 (see the reproduction sketch after this list)
  • chrf_score.py
  • The reference interface of the other scorers supports multiple references by default, whereas the ChRF scorer only supports a single reference. It should be standardized to handle multiple references as well.
  • However, for multi-reference scoring, ChRF gives no indication of which reference to pick; we may need to contact the authors to find out how this should be handled.
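
For the RIBES case, here is a minimal reproduction sketch. It assumes that a hypothesis which aligns only a single word against the reference leaves fewer than two items in the word-order alignment, so the number of possible pairs used by kendall_tau / spearman_rho is 0:

from nltk.translate.ribes_score import sentence_ribes

# Hypothetical reproduction of the ZeroDivisionError linked above:
# with only one aligned word there are 0 possible pairs to rank.
reference = ['John', 'loves', 'Mary']
hypothesis = ['John']

try:
    print(sentence_ribes([reference], hypothesis))
except ZeroDivisionError as err:
    print('ZeroDivisionError:', err)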

4jb9z9bj1#

In Maja Popovic's implementation, when multiple references are provided, the reference that yields the highest f-score is used; see: https://github.com/m-popovic/chrF/blob/master/chrF%2B%2B.py#L155
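
Following that convention, a possible workaround on the NLTK side is to score the hypothesis against every reference and keep the best one. A minimal sketch, where multi_ref_chrf is a hypothetical helper (not part of NLTK's API) built on NLTK's single-reference sentence_chrf:

from nltk.translate.chrf_score import sentence_chrf

def multi_ref_chrf(references, hypothesis):
    # Keep the best-scoring reference, mirroring the highest-f-score
    # policy of Popovic's chrF++ script.
    return max(sentence_chrf(ref, hypothesis) for ref in references)

references = ['John loves Mary'.split(), 'John still loves Mary'.split()]
hypothesis = 'John loves Mary'.split()
print(multi_ref_chrf(references, hypothesis))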


q43xntqr2#

There seems to be a bug in the BLEU computation when not every sentence is at least as long as the maximum n-gram order. For example, the following test case should give a BLEU of 1.0, but does not:

from nltk.translate.bleu_score import corpus_bleu

references = [['John loves Mary'.split()], ['John still loves Mary'.split()]]
hypothesis = ['John loves Mary'.split(), 'John still loves Mary'.split()]
n = 4  # Maximum n-gram order.
weights = [1.0 / n] * n  # Uniform weights.
print(corpus_bleu(references, hypothesis, weights))

The length-3 sentence, which is identical to its reference, is scored as having 0 correct 4-grams out of 1 possible, rather than 0 out of 0.
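
A small check of the behaviour being described (the expected output assumes the current denominator clamping in bleu_score.py):

from nltk.translate.bleu_score import modified_precision

# The length-3 segment matches its reference exactly, yet its 4-gram
# precision comes back as 0/1 (denominator clamped to 1) instead of 0/0,
# which drags the corpus-level 4-gram precision below 1.
reference = ['John', 'loves', 'Mary']
hypothesis = ['John', 'loves', 'Mary']
p4 = modified_precision([reference], hypothesis, 4)
print(p4.numerator, p4.denominator)  # expected with the current code: 0 1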
Suggested patch:

--- a/nltk/translate/bleu_score.py
+++ b/nltk/translate/bleu_score.py
@@ -183,6 +183,8 @@ def corpus_bleu(
         # denominator for the corpus-level modified precision.
         for i, _ in enumerate(weights, start=1):
             p_i = modified_precision(references, hypothesis, i)
+            if p_i is None:
+                continue  # no n-grams because the reference was shorter than i
             p_numerators[i] += p_i.numerator
             p_denominators[i] += p_i.denominator
 
@@ -240,6 +242,7 @@ def modified_precision(references, hypothesis, n):
     and denominator necessary to calculate the corpus-level precision.
     To calculate the modified precision for a single pair of hypothesis and
     references, cast the Fraction object into a float.
+    Returns None if references are shorter than n.
 
     The famous "the the the ... " example shows that you can get BLEU precision
     by duplicating high frequency words.
@@ -332,9 +335,10 @@ def modified_precision(references, hypothesis, n):
     }
 
     numerator = sum(clipped_counts.values())
-    # Ensures that denominator is minimum 1 to avoid ZeroDivisionError.
-    # Usually this happens when the ngram order is > len(reference).
-    denominator = max(1, sum(counts.values()))
+    denominator = sum(counts.values())
+    if denominator == 0:
+        # avoid div by zero when the ngram order is > len(reference)
+        return None
 
     return Fraction(numerator, denominator, _normalize=False)

eivnm1vs3#

Before @bmaland's and @bamattsson's contributions in #1844, NLTK's BLEU had some hacks to ensure that an exact string match gave a score of 1.0, but since #1844 NLTK's BLEU score is similar to the one computed by https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/multi-bleu.perl

Note that the BLEU score is a corpus-level metric, not a sentence-level one. This is a good paper describing the related problems: https://arxiv.org/pdf/1804.08771.pdf
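
A small sketch of that distinction: corpus_bleu pools the n-gram numerators and denominators over all segments before combining them, so it generally does not equal the average of per-sentence scores (the segments below are the ones from the example above):

from nltk.translate.bleu_score import corpus_bleu, sentence_bleu

list_of_references = [['John loves Mary'.split()],
                      ['John still loves Mary'.split()]]
hypotheses = ['John loves Mary'.split(),
              'John still loves Mary'.split()]

# Corpus-level score: n-gram counts are summed across segments first.
corpus = corpus_bleu(list_of_references, hypotheses)

# Averaging per-sentence scores afterwards gives a different number
# (and the short segment triggers the 4-gram warning on its own).
avg_sentence = sum(sentence_bleu(refs, hyp)
                   for refs, hyp in zip(list_of_references, hypotheses)) / len(hypotheses)

print(corpus, avg_sentence)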


wixjitnu4#

If you are not sure whether your list contains sentences shorter than the maximum n-gram order, you can use the auto-reweighting feature, e.g.:

>>> from nltk.translate import bleu
>>> references = ['John loves Mary'.split(), 'John still loves Mary'.split()]
>>> hypothesis = 'John loves Mary'.split()
>>> bleu(references, hypothesis)
/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/nltk/translate/bleu_score.py:523: UserWarning: 
The hypothesis contains 0 counts of 4-gram overlaps.
Therefore the BLEU score evaluates to 0, independently of
how many N-gram overlaps of lower order it contains.
Consider using lower n-gram order or use SmoothingFunction()
  warnings.warn(_msg)
1.2213386697554703e-77
>>> bleu(references, hypothesis, auto_reweigh=True)
1.0
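
The warning above also points at SmoothingFunction(); a minimal sketch of that alternative (method1 replaces zero n-gram counts with a small epsilon, so the exact score depends on the chosen method):

from nltk.translate import bleu
from nltk.translate.bleu_score import SmoothingFunction

references = ['John loves Mary'.split(), 'John still loves Mary'.split()]
hypothesis = 'John loves Mary'.split()

# Smooth the zero 4-gram count instead of letting it collapse the score.
print(bleu(references, hypothesis, smoothing_function=SmoothingFunction().method1))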

dxxyhpgq5#

In my example above, multi-bleu.pl gives a score of 100.0, but nltk gives 0.84. That is a case where the hypothesis does have some matching 4-grams, but not every reference sentence is 4 tokens or longer.


nkcskrwz6#

Is this still open?
