PyTorch: Tokenizer can add padding without error, but data collator cannot

Posted by b91juud3 on 2022-11-09 in Other

I am trying to fine-tune a GPT2-based model on my data using the run_clm.py example script from HuggingFace.
I have a .json data file that looks like this:

...
{"text": "some text"}
{"text": "more text"}
...

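I had to change the script's default behavior of concatenating the input texts, because all of my examples are separate demonstrations that should not be concatenated: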

def add_labels(example):
    example['labels'] = example['input_ids'].copy()
    return example

with training_args.main_process_first(desc="grouping texts together"):
    lm_datasets = tokenized_datasets.map(
        add_labels,
        batched=False,
        # batch_size=1,
        num_proc=data_args.preprocessing_num_workers,
        load_from_cache_file=not data_args.overwrite_cache,
        desc=f"Grouping texts in chunks of {block_size}",
    )

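This essentially just adds the corresponding 'labels' field required for CLM.
However, since GPT2 has a context window of size 1024, the examples should be padded to that length.
I can achieve this by modifying the tokenization procedure as follows: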

def tokenize_function(examples):
    with CaptureLogger(tok_logger) as cl:
        output = tokenizer(
            examples[text_column_name], padding='max_length') # added: padding='max_length'
        # ...
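
A side note on the padding length: tensors in a batch only need to share one common length, so padding every example to 1024 is not strictly required. The rough sketch below compares the two strategies for a hypothetical batch of two examples; the lengths 9 and 33 are the ones reported in the error message quoted in this question, and the function names are made up for illustration.

```python
# Hypothetical batch: token counts of two examples (9 and 33, as in the error message).
lengths = [9, 33]
max_length = 1024  # GPT-2 context window

def padded_tokens_fixed(lengths, max_length):
    """Total tokens when every example is padded to max_length."""
    return len(lengths) * max_length

def padded_tokens_dynamic(lengths):
    """Total tokens when examples are padded to the longest one in the batch."""
    return len(lengths) * max(lengths)

fixed = padded_tokens_fixed(lengths, max_length)
dynamic = padded_tokens_dynamic(lengths)
print(fixed, dynamic)  # 2048 66
```

Padding per batch (in a collator) would process far fewer pad tokens here than padding everything to the full context window in the tokenizer.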

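Training runs as expected.
However, I believe this should *not* be done by the tokenizer, but by the data collator instead. When I remove padding='max_length' from the tokenizer, I get the following error: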

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`labels` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

And in addition to this:

Traceback (most recent call last):
  File "/home/jan/repos/text2task/.venv/lib/python3.10/site-packages/transformers/tokenization_utils_base.py", line 716, in convert_to_tensors
    tensor = as_tensor(value)
ValueError: expected sequence of length 9 at dim 1 (got 33)

During handling of the above exception, another exception occurred:

To fix this, I created a data collator that should do the padding:

data_collator = DataCollatorWithPadding(tokenizer, padding='max_length')

This is what gets passed to the trainer. However, the error above remains.
What is going on?

pytorch

Source: https://stackoverflow.com/questions/74228361/tokenizer-can-add-padding-without-error-but-data-collator-cannot

1 Answer

wfveoks01#

I managed to fix the error, but I am really not sure about my solution; details are below. I will accept a better answer.
This seems to resolve it:

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, padding=True)

It can be found in the documentation: https://huggingface.co/docs/transformers/main_classes/data_collator#transformers.DataCollatorForSeq2Seq
It looks like DataCollatorWithPadding does not pad the labels?
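
That suspicion matches the error message, which names `labels` as the ragged feature. As far as I understand it, DataCollatorWithPadding hands the batch to the tokenizer's padding routine, which pads the model inputs (input_ids, attention_mask) but passes other keys through as they are. The sketch below imitates that in plain Python; it is not the library's code, and `pad_model_inputs`, `pad_labels` and `is_rectangular` are hypothetical helpers written for this illustration.

```python
def pad_model_inputs(features, pad_token_id):
    """Pad only input_ids / attention_mask; other keys (e.g. labels) pass through unchanged."""
    longest = max(len(f["input_ids"]) for f in features)
    padded = []
    for f in features:
        n_pad = longest - len(f["input_ids"])
        out = dict(f)
        out["input_ids"] = f["input_ids"] + [pad_token_id] * n_pad
        out["attention_mask"] = [1] * len(f["input_ids"]) + [0] * n_pad
        padded.append(out)
    return padded

def pad_labels(features, label_pad_id=-100):
    """Pad labels to the longest label sequence; -100 is ignored by PyTorch's cross-entropy by default."""
    longest = max(len(f["labels"]) for f in features)
    return [
        dict(f, labels=f["labels"] + [label_pad_id] * (longest - len(f["labels"])))
        for f in features
    ]

def is_rectangular(rows):
    """True if all rows have the same length, i.e. they could be stacked into one tensor."""
    return len({len(r) for r in rows}) == 1

features = [
    {"input_ids": [10, 11, 12], "labels": [10, 11, 12]},
    {"input_ids": [20, 21], "labels": [20, 21]},
]

inputs_only = pad_model_inputs(features, pad_token_id=0)
inputs_ok = is_rectangular([f["input_ids"] for f in inputs_only])
labels_ok = is_rectangular([f["labels"] for f in inputs_only])  # labels are still ragged

fixed = pad_model_inputs(pad_labels(features), pad_token_id=0)
fixed_labels_ok = is_rectangular([f["labels"] for f in fixed])
```

With only the model inputs padded, the labels keep their original lengths and cannot be stacked, which is the situation the ValueError describes. Padding the labels as well, which DataCollatorForSeq2Seq does with a label pad value of -100, removes the mismatch.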
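My problem is about generating an output sequence from an input sequence, so I *guess* that using DataCollatorForSeq2Seq is what I actually want to do. However, my data does not have separate input and target columns, but a single text column (containing the string input => target). I am not really sure this collator is intended to be used with GPT2...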
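
For a single text column, one alternative would be to skip the separate labels column and let the collator derive the labels from the padded input_ids. The following is a minimal plain-Python sketch of that idea, not the transformers implementation; `collate_causal_lm` is a hypothetical name, and using 50256 (GPT-2's end-of-text id) as the pad id is an assumption, since GPT-2 ships without a dedicated pad token.

```python
def collate_causal_lm(features, pad_token_id, ignore_index=-100):
    """Pad input_ids to the longest example and derive labels from them.

    Padded positions get ignore_index in the labels so the loss skips them.
    """
    longest = max(len(f["input_ids"]) for f in features)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for f in features:
        ids = list(f["input_ids"])
        n_pad = longest - len(ids)
        batch["input_ids"].append(ids + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(ids) + [0] * n_pad)
        # Labels copy the real tokens; only the padding is masked out.
        batch["labels"].append(ids + [ignore_index] * n_pad)
    return batch

EOS_ID = 50256  # GPT-2 end-of-text token id, reused here as the pad id (assumption)

features = [
    {"input_ids": [5, 6, EOS_ID]},  # example that ends with a real end-of-text token
    {"input_ids": [7]},
]
batch = collate_causal_lm(features, pad_token_id=EOS_ID)
```

Masking by position rather than by token value keeps a genuine end-of-text token as a training target even when the same id is reused for padding. To my knowledge, DataCollatorForLanguageModeling with mlm=False also builds the labels from input_ids, but it masks every position equal to the pad token id, so that difference is worth checking before relying on it.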

2022-11-09
