AutoGPTQ: error message when I set use_marlin=True

11dmarpk · posted 5 months ago in Other

Hi! I used auto-gptq to quantize the llama-2-7b-instruct model into llama-2-7b-instruct-4bit-128g. I tried to compare the generation speed of the two models, but the result is very strange: the quantized model's storage footprint did shrink, yet the token generation speed barely improved, as shown below:

Original model:

INFO [__main__] generated 5120 tokens using 109.5500602722168 seconds, generation speed: 46.736624217983135tokens/s

Quantized model:

2024-08-21 03:12:55 INFO [__main__] generated 5120 tokens using 107.57917785644531 seconds, generation speed: 47.592853022470365tokens/s
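
For context, the quantization itself followed the standard auto-gptq flow. Roughly this sketch (the model paths and the single calibration example are placeholders, not my exact script):

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_model_dir = "llama-2-7b-instruct"           # placeholder: local base model path
quantized_model_dir = "llama-2-7b-instruct-4bit-128g"  # placeholder: output path

tokenizer = AutoTokenizer.from_pretrained(pretrained_model_dir, use_fast=True)
# Placeholder calibration data; in practice a list of tokenized samples.
examples = [tokenizer("auto-gptq is an easy-to-use model quantization library.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)
model = AutoGPTQForCausalLM.from_pretrained(pretrained_model_dir, quantize_config)
model.quantize(examples)  # run GPTQ calibration and quantize the weights
model.save_quantized(quantized_model_dir, use_safetensors=True)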

Because of this, I tried adding some extra arguments to the loading call, but then I ran into errors.

The script I'm using:

if from_pretrained:
    model = AutoGPTQForCausalLM.from_pretrained(
        pretrained_model_name_or_path=model_name_or_path,
        quantize_config=BaseQuantizeConfig(),
        max_memory=max_memory,
        trust_remote_code=trust_remote_code
    ).to("cuda:0")
else:
    model = AutoGPTQForCausalLM.from_quantized(
        model_name_or_path,
        max_memory=max_memory,
        low_cpu_mem_usage=True,
        use_triton=use_triton,
        inject_fused_attention=inject_fused_attention,
        inject_fused_mlp=inject_fused_mlp,
        use_cuda_fp16=False,
        quantize_config=quantize_config,
        model_basename=model_basename,
        use_safetensors=use_safetensors,
        trust_remote_code=trust_remote_code,
        warmup_triton=False,
        disable_exllama=disable_exllama,
        use_marlin=True
    )
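
For reference, the AutoGPTQ docs load a Marlin-backed quantized checkpoint in a much more minimal way, roughly like this (a sketch; quantized_model_dir is a placeholder for my llama-2-7b-instruct-4bit-128g path):

from auto_gptq import AutoGPTQForCausalLM

# Minimal Marlin load as shown in the AutoGPTQ examples. As far as I understand,
# Marlin needs a 4-bit, sym=True, desc_act=False GPTQ checkpoint (which matches
# my quantize config) and an Ampere-or-newer NVIDIA GPU.
quantized_model_dir = "llama-2-7b-instruct-4bit-128g"  # placeholder path
model = AutoGPTQForCausalLM.from_quantized(
    quantized_model_dir,
    device="cuda:0",
    use_marlin=True,
)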

I hit the error below. How can I fix it?

output:
(base) root@b505318137a4:/AutoGPTQ/liwenyuan/examples/benchmark# bash generate_speed.sh 
/root/miniconda3/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:410: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/root/miniconda3/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:418: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
/root/miniconda3/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:461: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd(cast_inputs=torch.float16)
2024-08-21 03:38:43 INFO [__main__] max_memory: None
2024-08-21 03:38:43 INFO [__main__] loading model and tokenizer
WARNING - You have activated both exllama and exllamav2 kernel. Setting disable_exllama to True and keeping disable_exllamav2 to False
INFO - The layer lm_head is not quantized.
2024-08-21 03:38:44 INFO [accelerate.utils.modeling] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
Repacking weights to be compatible with Marlin kernel...: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 454/454 [00:20<00:00, 21.82it/s]
Traceback (most recent call last):
  File "/AutoGPTQ/liwenyuan/examples/benchmark/generation_speed.py", line 327, in <module>
    main()
  File "/AutoGPTQ/liwenyuan/examples/benchmark/generation_speed.py", line 273, in main
    model, tokenizer = load_model_tokenizer(
                       ^^^^^^^^^^^^^^^^^^^^^
  File "/AutoGPTQ/liwenyuan/examples/benchmark/generation_speed.py", line 170, in load_model_tokenizer
    model = AutoGPTQForCausalLM.from_quantized(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/auto_gptq/modeling/auto.py", line 146, in from_quantized
    return quant_func(
           ^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/auto_gptq/modeling/_base.py", line 1096, in from_quantized
    model, model_save_name = prepare_model_for_marlin_load(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/auto_gptq/utils/marlin_utils.py", line 84, in prepare_model_for_marlin_load
    safe_save(model.state_dict(), model_save_name)
  File "/root/miniconda3/lib/python3.12/site-packages/safetensors/torch.py", line 286, in save_file
    serialize_file(_flatten(tensors), filename, metadata=metadata)
                   ^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/safetensors/torch.py", line 500, in _flatten
    "data": _tobytes(v, k),
            ^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/safetensors/torch.py", line 422, in _tobytes
    tensor = tensor.to("cpu")
             ^^^^^^^^^^^^^^^^
NotImplementedError: Cannot copy out of meta tensor; no data!

dzjeubhm1#

If use_triton=True is set:

/root/miniconda3/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:410: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd
/root/miniconda3/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:418: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
  @custom_bwd
/root/miniconda3/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py:461: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
  @custom_fwd(cast_inputs=torch.float16)
2024-08-21 04:43:20 INFO [__main__] max_memory: None
2024-08-21 04:43:20 INFO [__main__] loading model and tokenizer
WARNING - You have activated both exllama and exllamav2 kernel. Setting disable_exllama to True and keeping disable_exllamav2 to False
INFO - You passed a model that is compatible with the Marlin int4*fp16 GPTQ kernel but use_marlin is False. We recommend using `use_marlin=True` to use the optimized Marlin kernels for inference. Example: `model = AutoGPTQForCausalLM.from_quantized(..., use_marlin=True)`.
INFO - The layer lm_head is not quantized.
2024-08-21 04:43:20 INFO [accelerate.utils.modeling] We will use 90% of the memory on device 0 for storing the model, and 10% for the buffer to avoid OOM. You can set `max_memory` in to a higher value to use more memory (at your own risk).
2024-08-21 04:43:22 INFO [__main__] model and tokenizer loading time: 2.1789s
2024-08-21 04:43:22 INFO [__main__] model quantized: True
2024-08-21 04:43:22 INFO [__main__] quantize config: {'bits': 4, 'group_size': 128, 'damp_percent': 0.01, 'desc_act': False, 'static_groups': False, 'sym': True, 'true_sequential': True, 'model_name_or_path': '/AutoGPTQ/liwenyuan/examples/quantization/llama-2-7b-instruct-4bit-128g', 'model_file_base_name': 'gptq_model-4bit-128g', 'quant_method': 'gptq', 'checkpoint_format': 'gptq'}
2024-08-21 04:43:22 INFO [__main__] model device map: OrderedDict({'': 0})
2024-08-21 04:43:22 INFO [__main__] warmup triton, this may take a while.
2024-08-21 04:43:22 INFO [auto_gptq.nn_modules.qlinear.qlinear_triton] Found 4 unique KN Linear values.
2024-08-21 04:43:22 INFO [auto_gptq.nn_modules.qlinear.qlinear_triton] Warming up autotune cache ...
  0%|                                                                                                                             | 0/13 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "/AutoGPTQ/liwenyuan/examples/benchmark/generation_speed.py", line 326, in <module>
    main()
  File "/AutoGPTQ/liwenyuan/examples/benchmark/generation_speed.py", line 295, in main
    model.warmup_triton()
  File "/root/miniconda3/lib/python3.12/site-packages/auto_gptq/modeling/_base.py", line 1256, in warmup_triton
    QuantLinear.warmup(self.model, seqlen=self.model.seqlen)
  File "/root/miniconda3/lib/python3.12/site-packages/auto_gptq/nn_modules/qlinear/qlinear_triton.py", line 214, in warmup
    quant_matmul_inference_only_248(a, qweight, scales, qzeros, g_idx, bits, maxq)
  File "/root/miniconda3/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/kernels.py", line 435, in quant_matmul_inference_only_248
    quant_matmul_248_kernel[grid](
  File "/root/miniconda3/lib/python3.12/site-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/miniconda3/lib/python3.12/site-packages/auto_gptq/nn_modules/triton_utils/custom_autotune.py", line 105, in run
    self.cache[key] = builtins.min(timings, key=timings.get)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: '<' not supported between instances of 'list' and 'tuple'
