Hello! I used auto-gptq to quantize the llama-2-7b-instruct model into llama-2-7b-instruct-4bit-128g. I then tried to compare the generation speed of the two, but the result is very strange: the quantized model's storage footprint did shrink, yet the improvement in token generation speed is marginal, as shown below:
Original model:
INFO [__main__] generated 5120 tokens using 109.5500602722168 seconds, generation speed: 46.736624217983135 tokens/s
Quantized model:
2024-08-21 03:12:55 INFO [__main__] generated 5120 tokens using 107.57917785644531 seconds, generation speed: 47.592853022470365 tokens/s
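For reference, the speed figures above are simply the number of generated tokens divided by the wall-clock time. The snippet below is only a sketch of that measurement, not the actual benchmark script; the helper name, prompt handling, and max_new_tokens value are illustrative assumptions:

    import time
    import torch

    def measure_generation_speed(model, tokenizer, prompt, max_new_tokens=512):
        """Rough throughput measurement: newly generated tokens per wall-clock second."""
        inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

        torch.cuda.synchronize()
        start = time.time()
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
        torch.cuda.synchronize()
        elapsed = time.time() - start

        # Count only newly generated tokens, excluding the prompt.
        num_new_tokens = output_ids.shape[1] - inputs["input_ids"].shape[1]
        print(f"generated {num_new_tokens} tokens using {elapsed} seconds, "
              f"generation speed: {num_new_tokens / elapsed} tokens/s")
        return num_new_tokens / elapsed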
So I tried adding some parameters to address this, but then I ran into an error.
The script I used:
if from_pretrained:
    model = AutoGPTQForCausalLM.from_pretrained(
        pretrained_model_name_or_path=model_name_or_path,
        quantize_config=BaseQuantizeConfig(),
        max_memory=max_memory,
        trust_remote_code=trust_remote_code,
    ).to("cuda:0")
else:
    model = AutoGPTQForCausalLM.from_quantized(
        model_name_or_path,
        max_memory=max_memory,
        low_cpu_mem_usage=True,
        use_triton=use_triton,
        inject_fused_attention=inject_fused_attention,
        inject_fused_mlp=inject_fused_mlp,
        use_cuda_fp16=False,
        quantize_config=quantize_config,
        model_basename=model_basename,
        use_safetensors=use_safetensors,
        trust_remote_code=trust_remote_code,
        warmup_triton=False,
        disable_exllama=disable_exllama,
        use_marlin=True,
    )
I got the error below. How can I fix it?
Output:
(base) root@b505318137a4:/AutoGPTQ/liwenyuan/examples/benchmark# bash generate_speed.sh
/root/miniconda3/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:410: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd
/root/miniconda3/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:418: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@custom_bwd
/root/miniconda3/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:461: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@custom_fwd(cast_inputs=torch.float16)
2024-08-21 03:38:43 INFO [__main__] max_memory: None
2024-08-21 03:38:43 INFO [__main__] loading model and tokenizer
WARNING - You have activated both exllama and exllamav2 kernel. Setting disable_exllama to True and keeping disable_exllamav2 to False
INFO - The layer lm_head is not quantized.
2024-08-21 03:38:44 INFO [accelerate.utils.modeling] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Repacking weights to be compatible with Marlin kernel...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 454/454 [00:20<00:00, 21.82it/s]
Traceback (most recent call last):
  File "/AutoGPTQ/liwenyuan/examples/benchmark/generation_speed.py", line 327, in <module>
    main()
  File "/AutoGPTQ/liwenyuan/examples/benchmark/generation_speed.py", line 273, in main
    model, tokenizer = load_model_tokenizer(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/AutoGPTQ/liwenyuan/examples/benchmark/generation_speed.py", line 170, in load_model_tokenizer
    model = AutoGPTQForCausalLM.from_quantized(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/auto_gptq/modeling/auto.py", line 146, in from_quantized
    return quant_func(
           ^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/auto_gptq/modeling/_base.py", line 1096, in from_quantized
    model, model_save_name = prepare_model_for_marlin_load(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/auto_gptq/utils/marlin_utils.py", line 84, in prepare_model_for_marlin_load
    safe_save(model.state_dict(), model_save_name)
  File "/root/miniconda3/lib/python3.12/site-packages/safetensors/torch.py", line 286, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
                   ^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/safetensors/torch.py", line 500, in _flatten
    "data": _tobytes(v, k),
            ^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/safetensors/torch.py", line 422, in _tobytes
    tensor = tensor.to("cpu")
             ^^^^^^^^^^^^^^^^
NotImplementedError: Cannot copy out of meta tensor; no data!
1 Answer
dzjeubhm1:
If you set use_triton=True