spaCy: NER predictions for the same input are inconsistent when using ThreadPoolExecutor

ar7v8xwq  posted 5 months ago  in  Other

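When I run data through the en_core_web_trf model in parallel, I get different results from one run to the next.
I could not find this behavior explained anywhere in the documentation or in other GitHub issues.
The code below reproduces it. If I do not run the data through the pipeline in parallel (for example by setting max_workers=1), the results are always consistent.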

import spacy
from concurrent.futures import ThreadPoolExecutor

nlp = spacy.load("en_core_web_trf")

def extract_entities(sentences):
    with ThreadPoolExecutor(max_workers=4) as e:
        submitted = [e.submit(call_spacy, sent) for sent in sentences]
        resolved = [item.result() for item in submitted]

        return resolved

def call_spacy(sent):
    result = nlp(sent)
    return result.ents

# Named `texts` rather than `input` to avoid shadowing the Python builtin.
texts = [
    "CoCo Town also known as the Collective Commerce District or more simply as the Coco District was a dilapidated industrial area of the planet Coruscant.",
    "It was also the site of Dexs Diner a local eatery owned by Dexter Jettster during the Republic Era.",
    "Hard working laborers visited CoCo Town to congregate at the diner.",
    "During the Galactic Civil War the Galactic Empire and the New Republic fought for control of the region.",
    "Many orphans from the area formed the Anklebiter Brigade and fought alongside the rebels sabotaging the Empire wherever possible."
]

# Process the same batch ten times and print the total entity count of each run.
for i in range(10):
    result = extract_entities(texts)
    print(sum(len(x) for x in result))
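
Comparing only the total entity count can hide differences: two runs can return the same number of entities with different spans or labels. Below is a minimal sketch of a stricter check that compares the full outputs across runs. The helper names (run_parallel, count_distinct_outputs) and the fake_entities stand-in are hypothetical and not part of the original report; the stand-in only keeps the sketch runnable without the model.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_parallel(fn, items, max_workers=4):
    """Apply fn to every item in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fn, items))


def count_distinct_outputs(fn, items, runs=10, max_workers=4):
    """Run the same batch several times and count each distinct output."""
    seen = Counter()
    for _ in range(runs):
        # fn must return hashable values (e.g. tuples) so that a whole
        # run can be used as a Counter key.
        seen[tuple(run_parallel(fn, items, max_workers))] += 1
    return seen


def fake_entities(text):
    # Hypothetical stand-in for the model: treats capitalized words as entities.
    return tuple(word for word in text.split() if word[:1].isupper())


if __name__ == "__main__":
    outcomes = count_distinct_outputs(fake_entities, ["CoCo Town is on Coruscant."])
    # A single distinct outcome means every run agreed.
    print(len(outcomes))
```

With the real pipeline, fn could be something like lambda s: tuple((e.text, e.label_) for e in nlp(s).ents), so that more than one key in the returned Counter shows exactly which outputs differ between runs.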

Your Environment

  • Operating System: Amazon Linux 2, kernel: Linux 4.14.294-220.533.amzn2.x86_64
  • Python Version Used: python 3.7.10
  • spaCy Version Used: 3.1.3
  • Environment Information: en-core-web-trf==3.1.0

zpgglvta1#

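I can reproduce this, but it is probably related to torch rather than directly to spaCy, and I'm not sure what might be happening in torch to cause it. We'll take a look!

The first alternative we would suggest trying is the built-in multiprocessing in nlp.pipe: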

import spacy
import torch

# Restrict torch to a single thread before the worker processes are started.
torch.set_num_threads(1)

nlp = spacy.load("en_core_web_trf")

# Named `texts` rather than `input` to avoid shadowing the Python builtin.
texts = [
    "CoCo Town also known as the Collective Commerce District or more simply as the Coco District was a dilapidated industrial area of the planet Coruscant.",
    "It was also the site of Dexs Diner a local eatery owned by Dexter Jettster during the Republic Era.",
    "Hard working laborers visited CoCo Town to congregate at the diner.",
    "During the Galactic Civil War the Galactic Empire and the New Republic fought for control of the region.",
    "Many orphans from the area formed the Anklebiter Brigade and fought alongside the rebels sabotaging the Empire wherever possible."
]

# nlp.pipe distributes the texts over 4 worker processes instead of threads.
for i in range(10):
    print(sum(len(doc.ents) for doc in nlp.pipe(texts, n_process=4)))
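
If the thread pool has to stay (for example because the call sits inside a threaded service), one thing that could be tried is serializing the calls into the shared pipeline with a lock. This is only a sketch under the assumption that concurrent calls into the shared model cause the inconsistency, which this thread does not confirm. The serialized wrapper and the fake_model stand-in are hypothetical names used for illustration.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def serialized(fn):
    """Wrap fn so that only one thread at a time can be inside it."""
    lock = threading.Lock()

    def wrapper(*args, **kwargs):
        with lock:
            return fn(*args, **kwargs)

    return wrapper


def fake_model(text):
    # Hypothetical stand-in for nlp(text): returns the capitalized words.
    return tuple(word for word in text.split() if word[:1].isupper())


# In the original example this would wrap call_spacy instead.
safe_model = serialized(fake_model)

if __name__ == "__main__":
    texts = ["CoCo Town is on Coruscant.", "Dexter Jettster owned the diner."]
    with ThreadPoolExecutor(max_workers=4) as executor:
        # executor.map preserves input order even though threads are used.
        print(list(executor.map(safe_model, texts)))
```

The lock removes any parallelism from the model call itself, so this only helps when the threads also do other work, such as I/O, around it.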

Notes:
