vllm [speculative decoding]: AttributeError: 'NoneType' object has no attribute 'numel' when the draft model's context length is exceeded

vuktfyat · asked 2 months ago

Current environment: vllm-0.4.3

Problem description: when I run in speculative mode and prompt_length + output_length > 2048, an error occurs. It reproduces with the following parameters:

Parameters:
Base model: llama2-70B
Speculative (draft) model: llama1.1b

import torch
from vllm import EngineArgs

# base_path / draft_path are the local model directories (paths elided here)
engine_args = EngineArgs(
    model=base_path,
    tokenizer=base_path,
    trust_remote_code=True,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,
    enforce_eager=True,
    speculative_model=draft_path,
    num_speculative_tokens=4,
    dtype=torch.float16,
    use_v2_block_manager=True,
)
The prompt is 2040 tokens and the output is 50 tokens; the error appears once prompt_length + output_length > 2048.
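
For reference, a minimal driver for these engine args might look like the sketch below (hedged: long_prompt is a hypothetical ~2040-token string not taken from the original report, and the add_request call assumes the vllm 0.4.x LLMEngine API):

from vllm import LLMEngine, SamplingParams

# Build the engine from the args above and submit one long request.
engine = LLMEngine.from_engine_args(engine_args)
# long_prompt is a hypothetical ~2040-token prompt string.
engine.add_request("req-0", long_prompt, SamplingParams(max_tokens=50))
# Stepping past the draft model's 2048-token context triggers the error.
while engine.has_unfinished_requests():
    engine.step()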

Error message:
(RayWorkerWrapper pid=142935) ERROR 06-07 16:53:17 worker_base.py:148] Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution. [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] Traceback (most recent call last): [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] File "/vllm-main/vllm/worker/worker_base.py", line 140, in execute_method [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] return executor(*args, **kwargs) [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context [repeated 4x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] return func(*args, **kwargs) [repeated 4x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context [repeated 4x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] return func(*args, **kwargs) [repeated 4x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] File "/vllm-main/vllm/spec_decode/spec_decode_worker.py", line 297, in start_worker_execution_loop [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] while self._run_non_driver_rank(): [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] File "/vllm-main/vllm/spec_decode/spec_decode_worker.py", line 366, in _run_non_driver_rank [repeated 2x across cluster]
(RayWorkerWrapper pid=143

xkrw2x1b 2#

Hi @zhangxy1234, can you help confirm: what is the maximum context length supported by your draft model?

agyaoht7 3#

Hi @zhangxy1234 ,

The maximum context length supported by your draft model is 2048. With tensor parallelism tp=1 generation simply stops at 2048, but with tp > 1 an error is raised. This is because the _execute_model_non_driver method in the worker module tries to swap out tensors of size larger than 2048, which is not allowed in the current implementation.

To resolve this, you can try raising max_context_len in the config.yaml file to a value that allows larger tensor swaps, e.g. 4096 or higher. Note, however, that a very high value may cause memory issues and hurt the performance of your model.
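
As an aside, recent vLLM builds also expose a speculative_max_model_len engine argument that caps how long a sequence can grow before speculation is skipped, instead of erroring; treat the following as a hedged sketch, since availability depends on your vLLM version:

# Hedged sketch: assumes this vLLM build supports speculative_max_model_len.
# Sequences longer than the cap fall back to normal (non-speculative)
# decoding rather than being sent to the draft model.
engine_args = EngineArgs(
    model=base_path,
    tokenizer=base_path,
    tensor_parallel_size=4,
    enforce_eager=True,
    speculative_model=draft_path,
    num_speculative_tokens=4,
    speculative_max_model_len=2048,  # the draft model's context limit
    use_v2_block_manager=True,
)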

Another option is to modify the _execute_model_non_driver method in the worker module to handle tensor swaps of size larger than 2048. You can do this by checking the size of the tensor being swapped and only swapping if it is within the allowed range.
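
If you take the patching route, the traceback pasted below shows the exact failure site: Worker.cache_swap calls .numel() on blocks_to_swap_in even though the non-driver rank received None for the swap tensors. A minimal sketch of a None-safe guard, assuming the vllm 0.4.3 worker code shown in the traceback:

# Sketch of a defensive guard in vllm/worker/worker.py (cache_swap, around
# line 223 per the traceback); None means "no blocks to move", so skip it.
def cache_swap(self, blocks_to_swap_in, blocks_to_swap_out, blocks_to_copy):
    if blocks_to_swap_in is not None and blocks_to_swap_in.numel() > 0:
        self.cache_engine.swap_in(blocks_to_swap_in)
    if blocks_to_swap_out is not None and blocks_to_swap_out.numel() > 0:
        self.cache_engine.swap_out(blocks_to_swap_out)
    if blocks_to_copy is not None and blocks_to_copy.numel() > 0:
        self.cache_engine.copy(blocks_to_copy)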

I hope this helps! Let me know if you have any further questions.
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=7582) return func(*args, **kwargs)
(RayWorkerWrapper pid=7582) │ │ └ {}
(RayWorkerWrapper pid=7582) │ └ (<vllm.spec_decode.multi_step_worker.MultiStepWorker object at 0x7fa4096df610>,)
(RayWorkerWrapper pid=7582) └ <function Worker.execute_model at 0x7fa408386310>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) File "vllm-main/vllm/worker/worker.py", line 236, in execute_model
(RayWorkerWrapper pid=7582) self._execute_model_non_driver()
(RayWorkerWrapper pid=7582) │ └ <function Worker._execute_model_non_driver at 0x7fa408386550>
(RayWorkerWrapper pid=7582) └ <vllm.spec_decode.multi_step_worker.MultiStepWorker object at 0x7fa4096df610>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) File "vllm-main/vllm/worker/worker.py", line 311, in _execute_model_non_driver
(RayWorkerWrapper pid=7582) self.cache_swap(blocks_to_swap_in, blocks_to_swap_out, blocks_to_copy)
(RayWorkerWrapper pid=7582) │ │ │ │ └ None
(RayWorkerWrapper pid=7582) │ │ │ └ None
(RayWorkerWrapper pid=7582) │ │ └ None
(RayWorkerWrapper pid=7582) │ └ <function Worker.cache_swap at 0x7fa408386280>
(RayWorkerWrapper pid=7582) └ <vllm.spec_decode.multi_step_worker.MultiStepWorker object at 0x7fa4096df610>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) File "vllm-main/vllm/worker/worker.py", line 223, in cache_swap
(RayWorkerWrapper pid=7582) if blocks_to_swap_in.numel() > 0:
(RayWorkerWrapper pid=7582) └ None
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) AttributeError: 'NoneType' object has no attribute 'numel'
