NLTK: possible problems with EarleyChartParser and InsideChartParser when parsing a specific sentence

Posted by uujelgoq 6 months ago in Other

While trying to parse a specific sentence from the Treebank corpus, these two parsers behave very strangely.
This is the sentence from treebank:
['A', 'form', 'of', 'asbestos', 'once', 'used', '*', '*', 'to', 'make', 'Kent', 'cigarette', 'filters', 'has', 'caused', 'a', 'high', 'percentage', 'of', 'cancer', 'deaths', 'among', 'a', 'group', 'of', 'workers', 'exposed', '*', 'to', 'it', 'more', 'than', '30', 'years', 'ago', ',', 'researchers', 'reported', '0', '*T*-1', '.']
With a simple CFG grammar, EarleyChartParser produces a huge forest of 24284 trees in which the correct tree has to be found.
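A forest of this size can come from ordinary attachment ambiguity. The sketch below is only an illustration under stated assumptions: the toy grammar, the example sentences and the helper `count_parses` are hypothetical and are not the treebank grammar used in this question. It counts how many trees EarleyChartParser returns as prepositional phrases are appended to a short sentence.

```python
from nltk import CFG
from nltk.parse import EarleyChartParser

# Hypothetical toy grammar (an assumption for illustration only).
# A PP can attach either to a VP or to an NP, which creates ambiguity.
toy_grammar = CFG.fromstring("""
S -> NP VP
NP -> NP PP | 'workers' | 'asbestos' | 'filters' | 'factories'
VP -> V NP | VP PP
PP -> P NP
V -> 'used'
P -> 'in' | 'near'
""")


def count_parses(tokens):
    """Return the number of trees EarleyChartParser finds for the tokens."""
    parser = EarleyChartParser(toy_grammar, trace=0)
    return sum(1 for _ in parser.parse(tokens))


base = "workers used asbestos".split()
one_pp = base + "in filters".split()
two_pp = one_pp + "near factories".split()
three_pp = two_pp + "in filters".split()

# One count per sentence, from zero to three trailing PPs.
counts = [count_parses(s) for s in (base, one_pp, two_pp, three_pp)]
print(counts)
```

With only two ambiguous rules the count already grows with every extra phrase, so a 41-token sentence parsed with a grammar read off the treebank can plausibly reach thousands of trees.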
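InsideChartParser, however, seems to get stuck in an infinite loop while sorting a queue, which is one of its internal data structures.
For the PCFG parser I use this function: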

def pchart(parser, sentence, gold_tree):  # here the parser is an InsideChartParser
    test_trees = list(parser.parse(sentence))
    print("PCHART PARSER - TREES FOUND: ", len(test_trees))
    best_tree = None  # stays None if the parser returns no trees
    best_prob = 0.0
    for idx, test_tree in enumerate(test_trees):
        print("TREE: %d" % idx)
        print(test_tree)
        curr_prob = test_tree.prob()
        if curr_prob > best_prob:
            best_prob = curr_prob
            best_tree = test_tree
        if test_tree.productions() == gold_tree.productions():  # check whether the current tree is the correct one
            print("CORRECT TREE")
        else:
            print("WRONG TREE")
    return best_tree

And for the Earley parser I use this:

def earley(parser, sentence, gold_tree):  # here the parser is an EarleyChartParser
    test_trees = list(parser.parse(sentence))  # creates a forest of trees, each of which parses the sentence
    print("EARLEY PARSER - TREES FOUND: ", len(test_trees))
    for idx, test_tree in enumerate(test_trees):
        print("TREE: %d" % idx)
        print(test_tree)
        if test_tree.productions() == gold_tree.productions():
            print("CORRECT TREE")
        else:
            print("WRONG TREE")
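Both functions decide "CORRECT TREE" by comparing production lists. A minimal sketch of that check on hand-built trees follows; the trees and the helper name `same_structure` are assumptions made for illustration. As far as I can tell, comparing productions also works when the candidate is a ProbabilisticTree and the gold tree is a plain Tree, whereas `==` between the two also takes the tree class into account.

```python
from nltk.tree import ProbabilisticTree, Tree


def same_structure(candidate, gold_tree):
    """Compare two trees by their production lists (hypothetical helper)."""
    return candidate.productions() == gold_tree.productions()


# Hand-built example trees, not taken from the treebank.
gold = Tree.fromstring("(S (NP workers) (VP (V used) (NP asbestos)))")

# Same shape as the gold tree, but built as a ProbabilisticTree.
# The probabilities are arbitrary placeholder values.
candidate = ProbabilisticTree(
    "S",
    [
        ProbabilisticTree("NP", ["workers"], prob=0.5),
        ProbabilisticTree(
            "VP",
            [
                ProbabilisticTree("V", ["used"], prob=1.0),
                ProbabilisticTree("NP", ["asbestos"], prob=0.5),
            ],
            prob=0.25,
        ),
    ],
    prob=0.0625,
)

# A tree with an extra VP layer, so its productions differ.
other = Tree.fromstring("(S (NP workers) (VP (VP (V used)) (NP asbestos)))")

print(same_structure(candidate, gold))
print(same_structure(other, gold))
```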

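After enumerating that many trees, Earley stops and InsideChartParser starts, but it gets stuck in a very long loop that does not terminate any time soon. I think this is due to the huge number of trees or productions at this point: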

<ipython-input-9-641b936c1306> in <module>
     24 
     25 if __name__ == '__main__':
---> 26     main()

<ipython-input-9-641b936c1306> in main()
     18     gold_tree = treebank.parsed_sents()[3]
     19     earley(cfg_earley_parser, sentence, gold_tree) #run earley parser and compare it with gold tree and shows the various trees
---> 20     tree = pchart(pcfg_pchart_parser, sentence, gold_tree) #shows the possible trees and returns the one with maximum probability
     21     print("BEST TREE WITH PROBABILITY: %.12e" %tree.prob())
     22     tree.draw()

<ipython-input-8-5e94e1748d29> in pchart(parser, sentence, gold_tree)
      1 def pchart(parser, sentence, gold_tree):
----> 2     test_trees = list(parser.parse(sentence))
      3     print("PCHART PARSER - TREES FOUND: ", len(test_trees))
      4 
      5     best_prob = 0.0

F:\Development\anaconda3\lib\site-packages\nltk\parse\pchart.py in parse(self, tokens)
    244         while len(queue) > 0:
    245             # Re-sort the queue.
--> 246             self.sort_queue(queue, chart)
    247 
    248             # Prune the queue to the correct size if a beam was defined

F:\Development\anaconda3\lib\site-packages\nltk\parse\pchart.py in sort_queue(self, queue, chart)
    359         :rtype: None
    360         """
--> 361         queue.sort(key=lambda edge: edge.prob())
    362 
    363 

F:\Development\anaconda3\lib\site-packages\nltk\parse\pchart.py in <lambda>(edge)
    359         :rtype: None
    360         """
--> 361         queue.sort(key=lambda edge: edge.prob())
    362 
    363

Is this the correct behaviour for such a simple grammar?
It does not happen with the other sentences I used from the treebank corpus.
Thank you for your time!

nuypyhwy1#

Without knowing the grammar passed to EarleyChartParser and InsideChartParser, I am unable to reproduce this properly.

pdsfdshx2#

"Without knowing the grammar passed to EarleyChartParser and InsideChartParser, I am unable to reproduce this properly."
Sorry, I missed that; here you go. I used this function to build the grammar and the two parsers directly.

from nltk.grammar import CFG, Nonterminal, induce_pcfg
from nltk.parse import EarleyChartParser, InsideChartParser

def generate_grammar_and_parsers(parsed_sents):
    # collect every production of each parsed sentence in a list (with repetitions)
    tbank_productions_with_repeat = [production for parsed_sent in parsed_sents for production in parsed_sent.productions()]
    # the same productions, without repetitions
    tbank_productions = set(tbank_productions_with_repeat)
    
    print("Number of unique productions: ", len(tbank_productions))
    
    print("Building CFG")
    cfg = CFG(Nonterminal('S'), tbank_productions)  # build a CFG for parsing, with S as the start symbol
    print(cfg, end="\n\n")
    
    cfg_earley_parser = EarleyChartParser(cfg, trace=0)  # parser using the Earley algorithm on this CFG; trace=0 means no verbosity
    
    print("Building PCFG")
    pcfg = induce_pcfg(Nonterminal('S'), tbank_productions_with_repeat)  # build a PCFG from the list with repetitions
    print(pcfg, end="\n\n")
    
    pcfg_pchart_parser = InsideChartParser(pcfg)  # bottom-up parsing algorithm on this PCFG
    
    return cfg_earley_parser, pcfg_pchart_parser

I just passed the first 10 sentences of treebank.parsed_sents() to this function as its argument.
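The function passes the production list with repetitions to induce_pcfg because the rule probabilities are estimated from how often each production occurs. Here is a small sketch of that step on two hand-written trees; the trees are hypothetical and only stand in for treebank.parsed_sents().

```python
from nltk.grammar import Nonterminal, induce_pcfg
from nltk.tree import Tree

# Two hypothetical parsed sentences standing in for treebank trees.
trees = [
    Tree.fromstring("(S (NP workers) (VP (V used) (NP asbestos)))"),
    Tree.fromstring("(S (NP researchers) (VP (V reported)))"),
]

# Keep repetitions: the counts are what the probabilities are based on.
productions = [p for tree in trees for p in tree.productions()]
pcfg = induce_pcfg(Nonterminal("S"), productions)

for production in pcfg.productions():
    print(production)

# Probabilities of the two VP rules, each seen once out of two VP expansions.
vp_probs = sorted(p.prob() for p in pcfg.productions(lhs=Nonterminal("VP")))
# Probabilities of the three NP rules, each seen once out of three.
np_probs = [p.prob() for p in pcfg.productions(lhs=Nonterminal("NP"))]
```

Rules seen only once among many expansions of the same nonterminal get small probabilities, and a long derivation multiplies many of them together, which is consistent with very small edge probabilities showing up in the queue.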

qxgroojn3#

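I have not had much time recently, but I spent some time looking into this today. I am not very familiar with these chart parsers, but InsideChartParser hangs on a while len(queue) > 0: loop whose body only extends the queue (for a while, at least). After the queue has grown from 41 to about 38000 edges, the body of this loop is considering ProbabilisticLeafEdge objects with probabilities approaching 8.106e-33. I would expect some threshold below which edges are no longer expanded, but that does not seem to be the case.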
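However, InsideChartParser does accept a beam_size parameter that limits the size of the queue to that number. I set it to 1000 and it found 16 trees (although your program considered all of them wrong). With 1100 I got 44 (wrong) trees, and with 1500 I got 54 (wrong) trees.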
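For reference, a minimal sketch of passing beam_size, assuming a hypothetical toy PCFG rather than the grammar induced from the treebank. The beam value of 20 is arbitrary and chosen only for this tiny example.

```python
from nltk import PCFG
from nltk.parse import InsideChartParser

# Hypothetical toy PCFG (an assumption for illustration only).
toy_pcfg = PCFG.fromstring("""
S -> NP VP [1.0]
NP -> NP PP [0.2] | 'workers' [0.3] | 'asbestos' [0.3] | 'filters' [0.2]
VP -> V NP [0.7] | VP PP [0.3]
PP -> P NP [1.0]
V -> 'used' [1.0]
P -> 'in' [1.0]
""")

tokens = "workers used asbestos in filters".split()

# Default: the edge queue is never pruned.
unbounded_parser = InsideChartParser(toy_pcfg)
# With beam_size set, the queue is cut back to that many edges.
beamed_parser = InsideChartParser(toy_pcfg, beam_size=20)

full = list(unbounded_parser.parse(tokens))
pruned = list(beamed_parser.parse(tokens))

print("without beam:", len(full), "trees")
print("with beam:   ", len(pruned), "trees")

# Probabilities of the trees in the order the parser returned them.
full_probs = [tree.prob() for tree in full]
print(full_probs)
```

A smaller beam trades completeness for speed: the lowest-probability edges are discarded first, so some parses can be lost, which fits the tree counts changing with the beam size above.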
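EarleyChartParser just seems to be faster at producing trees. It does produce a lot of them, however. Whether this is an actual bug or expected behaviour, I cannot say for sure.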
