While training XLMRobertaForSequenceClassification with:
xlm_r_model(input_ids=X_train_batch_input_ids,
            attention_mask=X_train_batch_attention_mask,
            return_dict=False)
I ran into the following error:
Traceback (most recent call last):
File "<string>", line 3, in <module>
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 1218, in forward
return_dict=return_dict,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 849, in forward
past_key_values_length=past_key_values_length,
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/transformers/models/roberta/modeling_roberta.py", line 132, in forward
inputs_embeds = self.word_embeddings(input_ids)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/sparse.py", line 160, in forward
self.norm_type, self.scale_grad_by_freq, self.sparse)
File "/usr/local/lib/python3.7/dist-packages/torch/nn/functional.py", line 2044, in embedding
return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
IndexError: index out of range in self
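To see why this exact error appears: the model's word embedding is a fixed-size lookup table with `config.vocab_size` rows, and any input id at or beyond that size overruns the table. A minimal, torch-free sketch of the failure mode (the numbers 30522 and 250001 are assumptions: the default `XLMRobertaConfig` vocab size and a high id the `xlm-roberta-large` tokenizer can emit, respectively):

```python
# The embedding layer is a lookup table with config.vocab_size rows.
# XLMRobertaConfig() defaults to a small vocab (assumed 30522 here), while the
# 'xlm-roberta-large' tokenizer can emit ids up to ~250001, overrunning the table.
default_vocab_size = 30522                      # assumed default config value
embedding_table = [[0.0] * 4 for _ in range(default_vocab_size)]  # toy 4-dim rows
token_id = 250001                               # an id the large tokenizer can produce
try:
    row = embedding_table[token_id]             # same kind of out-of-range lookup
except IndexError as exc:
    print("IndexError:", exc)                   # mirrors "index out of range in self"
```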
Here are the details:
1. Creating the model
config = XLMRobertaConfig()
config.output_hidden_states = False
xlm_r_model = XLMRobertaForSequenceClassification(config=config)
xlm_r_model.to(device) # device is device(type='cpu')
2. The tokenizer
xlmr_tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-large')
MAX_TWEET_LEN = 402
>>> df_1000.info() # describing a data frame I have pre-populated
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 29639 to 44633
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 text 1000 non-null object
1 class 1000 non-null int64
dtypes: int64(1), object(1)
memory usage: 55.7+ KB
X_train = xlmr_tokenizer(list(df_1000[:800].text), padding=True, max_length=MAX_TWEET_LEN+5, truncation=True) # +5: headroom for special tokens / separators
>>> list(map(len,X_train['input_ids'])) # why is it 105? Shouldn't it be MAX_TWEET_LEN+5 = 407?
[105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, 105, ...]
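(Aside on the 105: with `padding=True` the tokenizer pads only to the longest sequence in the batch, while `max_length` merely caps truncation; to pad every row to 407 you would pass `padding='max_length'`. A plain-Python sketch of the two behaviours, with a hypothetical `pad` helper and `pad_id=1` standing in for the tokenizer's pad token id:)

```python
def pad(seqs, pad_id=1, max_length=None):
    # padding=True behaviour: pad to the longest sequence in the batch;
    # padding='max_length' behaviour: pad to the fixed max_length instead.
    target = max_length if max_length is not None else max(map(len, seqs))
    return [s + [pad_id] * (target - len(s)) for s in seqs]

batch = [[0, 5, 2], [0, 7, 8, 9, 2]]
print(pad(batch))                # → [[0, 5, 2, 1, 1], [0, 7, 8, 9, 2]]  (longest wins)
print(pad(batch, max_length=8))  # every row padded out to length 8
```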
>>> type(train_index) # describing (for clarity) training fold indices I pre-populated
<class 'numpy.ndarray'>
>>> train_index.size
640
X_train_fold_input_ids = np.array(X_train['input_ids'])[train_index]
X_train_fold_attention_mask = np.array(X_train['attention_mask'])[train_index]
>>> i # batch id
0
>>> batch_size
16
X_train_batch_input_ids = X_train_fold_input_ids[i:i+batch_size]
X_train_batch_input_ids = torch.tensor(X_train_batch_input_ids,dtype=torch.long).to(device)
X_train_batch_attention_mask = X_train_fold_attention_mask[i:i+batch_size]
X_train_batch_attention_mask = torch.tensor(X_train_batch_attention_mask,dtype=torch.long).to(device)
>>> X_train_batch_input_ids.size()
torch.Size([16, 105]) # why 105? Shouldn't this be MAX_TWEET_LEN+5 = 407?
>>> X_train_batch_attention_mask.size()
torch.Size([16, 105]) # why 105? Shouldn't this be MAX_TWEET_LEN+5 = 407?
After this, I call xlm_r_model(...) as shown at the start of this question, and it fails with the error above.
Even with all these details laid out, I still cannot see why I get this error. Where am I going wrong?
1 Answer
According to this post on GitHub, there can be many reasons for this error. Below is the list of causes summarized in that post (as of April 24, 2022; note that the second and third causes were not tested):
1. A vocabulary-size mismatch between the tokenizer and the BERT model. This makes the tokenizer generate ids the model cannot understand. (ref)
2. The model and the data living on different devices (CPU, GPU, TPU). (ref)
3. Sequences longer than 512 tokens (the maximum for BERT-like models). (ref)
In my case it was the first cause, a mismatched vocab size. Here is how I fixed it:
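A sketch of one fix consistent with cause 1 (not necessarily the original author's exact code): load the checkpoint's own config rather than constructing `XLMRobertaConfig()` with library defaults, so that `vocab_size` matches the ids produced by the `xlm-roberta-large` tokenizer:

```python
from transformers import XLMRobertaConfig

# XLMRobertaConfig() uses library defaults, whose vocab_size does not match
# the 'xlm-roberta-large' tokenizer; load the checkpoint's own config instead.
config = XLMRobertaConfig.from_pretrained('xlm-roberta-large')
config.output_hidden_states = False
print(config.vocab_size)  # the id range the model will now accept

# then, as in the question:
# xlm_r_model = XLMRobertaForSequenceClassification(config=config)
```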