Hi team,
I am running into an issue while fine-tuning an LLM with Ludwig on an NVIDIA A100 instance. I get the warning "Encountered nan values in tensor. Will be removed.", and my loss and perplexity come out as NaN.

Here is my config:
model_type: llm
base_model: elyza/ELYZA-japanese-Llama-2-7b-instruct

input_features:
  - name: instruction
    type: text

output_features:
  - name: output
    type: text

preprocessing:
  split_probabilities: [0.8, 0.1, 0.1]

prompt:
  template: >-
    Below is an instruction that describes a task, paired with an input
    that provides further context. Write a response that appropriately
    completes the request.

    ### Instruction: {instruction}

    ### Input: {input}

    ### Response:

generation:
  temperature: 0.01
  max_new_tokens: 512

adapter:
  type: lora

quantization:
  bits: 4

trainer:
  type: finetune
  use_gpu: True
  epochs: 1
  batch_size: 8
  eval_batch_size: 8
  gradient_accumulation_steps: 1
  learning_rate: 0.001
  optimizer:
    type: adam
    params:
      eps: 1.e-8
      betas:
        - 0.9
        - 0.999
      weight_decay: 0
  learning_rate_scheduler:
    warmup_fraction: 0.03
    reduce_on_plateau: 0
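For reference, a minimal sketch of how a fine-tune with this config can be launched through Ludwig's Python API; it assumes the config above is saved as model.yaml, and train.csv is a placeholder for a dataset with instruction, input, and output columns (the actual launch command is not part of this post):

from ludwig.api import LudwigModel

# Config shown above, saved to model.yaml; "train.csv" is a placeholder
# dataset containing "instruction", "input", and "output" columns.
model = LudwigModel(config="model.yaml")
train_stats, _, output_dir = model.train(dataset="train.csv")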
"""`
{ "evaluation_frequency": { "frequency": 1, "period": "epoch" }, "test": { "combined": { "loss": [ NaN ] }, "output": { "char_error_rate": [ 1.0 ], "loss": [ NaN ], "next_token_perplexity": [ NaN ], "perplexity": [ NaN ], "sequence_accuracy": [ 0.0 ], "token_accuracy": [ 0.0 ] } }, "training": { "combined": { "loss": [ 1.7828550338745117 ] }, "output": { "char_error_rate": [ 0.9905372858047485 ], "loss": [ 1.7828550338745117 ], "next_token_perplexity": [ 16787.67578125 ], "perplexity": [ NaN ], "sequence_accuracy": [ 0.0 ], "token_accuracy": [ 3.948421363020316e-05 ] } }, "validation": { "combined": { "loss": [ NaN ] }, "output": { "char_error_rate": [ 1.0 ], "loss": [ NaN ], "next_token_perplexity": [ NaN ], "perplexity": [ NaN ], "sequence_accuracy": [ 0.0 ], "token_accuracy": [ 0.0 ] } } }
3 answers

2izufjch1#
Hi @msmmpts,
NaN values are much more likely if you use a very high learning rate. I would suggest trying a learning rate one order of magnitude smaller, for example 0.0001.
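A minimal sketch of retrying with the smaller learning rate while keeping the rest of the config unchanged (the dataset path is a placeholder):

import yaml
from ludwig.api import LudwigModel

# Load the original config and only lower the learning rate from 0.001 to 0.0001.
with open("model.yaml") as f:
    config = yaml.safe_load(f)
config["trainer"]["learning_rate"] = 0.0001

model = LudwigModel(config=config)
model.train(dataset="train.csv")  # placeholder dataset path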
6rqinv9w2#
Hi @justinxzhao,
I tried a learning rate of 0.0001. The same problem still persists.
dsekswqp3#
Agreed. Every time at the end of the first epoch I see this warning, and then I get the following error:

Here is my model.yaml file: