pytorch: Huggingface token classification pipeline gives different outputs than calling model() directly

Posted by t5zmwmid on 2023-03-12 in Other

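I am trying to mask named entities in text using a RoBERTa-based model. The suggested way to use the model is via a Huggingface pipeline, but I find it rather slow to use that way. Using a pipeline on text data also prevents me from using my GPU for the computation, because the text cannot be put onto the GPU.
Because of this, I decided to put the model on the GPU, tokenize the text myself (using the same tokenizer I pass to the pipeline), put the tokens on the GPU, and then pass them to the model. This works, but the outputs I get from calling the model directly differ significantly from the outputs of the pipeline. I can't find a reason for this, nor a way to fix it.
I tried reading through the token classification pipeline source code, but I could not find any difference between my usage and what the pipeline does.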

Code examples that produce different results:

1. Suggested usage from the model card:

from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

ner_tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
model = AutoModelForTokenClassification.from_pretrained("xlm-roberta-large-finetuned-conll03-english")
classifier = pipeline("ner", model=model, tokenizer=ner_tokenizer, framework='pt')
out = classifier(dataset['text'])

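'out' is now a list of lists of dictionaries; each dictionary holds information about one named entity found in the corresponding string of the list of strings 'dataset['text']'.
2. My custom usage: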

text_batch = dataset['text']
encodings_batch = ner_tokenizer(text_batch,padding="max_length", truncation=True, max_length=128, return_tensors="pt")
input_ids = encodings_batch['input_ids']
input_ids = input_ids.to(TORCH_DEVICE)
outputs = model(input_ids)[0]
outputs = outputs.to('cpu')
label_ner_ids = outputs.argmax(dim=2).to('cpu')

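'label_ner_ids' is now a 2-dimensional tensor whose elements are the labels of each token in a given line of text, so label_ner_ids[i, j] is the label of the j-th token in the i-th string of the list of strings 'text_batch'. The token labels here differ from the output produced by the pipeline.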
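
One plausible cause of the mismatch, offered here as an assumption rather than a confirmed diagnosis: the custom code pads every text to 128 tokens but passes only input_ids to the model, so the attention mask is never used and the padding tokens can influence the predictions. The sketch below passes the full tokenizer output (input_ids and attention_mask) to the model and then drops the padded positions. The function names predict_label_ids and labels_for_real_tokens are hypothetical helpers, not part of transformers.

```python
def predict_label_ids(texts, model, tokenizer, device="cpu", max_length=128):
    """Run the token classifier directly, passing the attention mask too."""
    import torch  # imported here so the pure-Python helper below works without torch

    enc = tokenizer(
        texts,
        padding=True,          # pad only up to the longest text in the batch
        truncation=True,
        max_length=max_length,
        return_tensors="pt",
    ).to(device)               # moves input_ids and attention_mask together

    model = model.to(device).eval()
    with torch.no_grad():
        logits = model(**enc).logits  # shape: (batch, seq_len, num_labels)

    pred_ids = logits.argmax(dim=-1).cpu().tolist()
    attention_mask = enc["attention_mask"].cpu().tolist()
    return pred_ids, attention_mask


def labels_for_real_tokens(pred_ids, attention_mask, id2label):
    """Map label ids to label names, keeping only non-padding positions."""
    return [
        [id2label[p] for p, m in zip(row_ids, row_mask) if m == 1]
        for row_ids, row_mask in zip(pred_ids, attention_mask)
    ]
```

With the objects from the question, this could be called as pred_ids, mask = predict_label_ids(text_batch, model, ner_tokenizer, TORCH_DEVICE), followed by labels_for_real_tokens(pred_ids, mask, model.config.id2label). Even then the result will not look identical to the pipeline output: by default the pipeline leaves out special tokens and tokens labeled "O", while this sketch keeps them.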

pytorch

Source: https://stackoverflow.com/questions/75240203/huggingface-token-classification-pipeline-giving-different-outputs-than-just-cal

1 Answer

2g32fytz1#

The pipeline supports processing on the GPU; all you need to do is pass a device:

from transformers import pipeline

model_id = "xlm-roberta-large-finetuned-conll03-english"

classifier = pipeline("ner", model=model_id, device=TORCH_DEVICE, framework='pt')
out = classifier(dataset['text'])
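
Since the goal in the question is to mask named entities, the per-entity dictionaries in out can be turned into masked text. The following is a minimal sketch, assuming each dictionary carries integer "start" and "end" character offsets (which the pipeline reports when a fast tokenizer is used). mask_entities is a hypothetical helper name and the "<mask>" replacement string is an arbitrary choice.

```python
def mask_entities(text, entities, mask_token="<mask>"):
    """Replace every detected entity span in `text` with `mask_token`."""
    # Collect (start, end) character spans and merge those that touch or
    # overlap, so an entity split into several sub-word tokens gets one mask.
    spans = sorted((e["start"], e["end"]) for e in entities)
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    # Replace from right to left so the earlier offsets stay valid.
    for start, end in reversed(merged):
        text = text[:start] + mask_token + text[end:]
    return text
```

For a list of input strings this would be applied per text, for example masked = [mask_entities(t, ents) for t, ents in zip(dataset['text'], out)]. Spans separated by whitespace (such as the two words of a two-word name) are not merged by this sketch and would each receive their own mask.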

Posted 2023-03-12
