Environment: vllm-0.4.3
Problem description: when I use speculative decoding and prompt_length + output_length > 2048, an error occurs. It is reproducible with the following parameters:
Parameters:
Base_model: llama2-70B
Speculative_model: llama1.1b
engine_args = EngineArgs(
    model=base_path,
    tokenizer=base_path,
    trust_remote_code=True,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,
    enforce_eager=True,
    speculative_model=draft_path,
    num_speculative_tokens=4,
    dtype=torch.float16,
    use_v2_block_manager=True,
)
The prompt is 2040 tokens and the output is 50 tokens (the error appears whenever prompt_length + output_length > 2048).
Error message:
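A quick back-of-the-envelope check of when this request overflows a 2048-token window (illustrative arithmetic only; `exceeds_draft_window` is a hypothetical helper, and vLLM's internal accounting may differ):

```python
# Hypothetical helper: does prompt + output + speculative lookahead exceed
# the draft model's context window? (Illustrative arithmetic only.)
def exceeds_draft_window(prompt_len: int, output_len: int,
                         num_spec_tokens: int, window: int = 2048) -> bool:
    # Worst case, the draft model must attend over the full prompt, all
    # generated tokens, and one batch of speculative lookahead tokens.
    return prompt_len + output_len + num_spec_tokens > window

print(exceeds_draft_window(2040, 50, 4))  # 2094 > 2048 -> True
```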
(RayWorkerWrapper pid=142935) ERROR 06-07 16:53:17 worker_base.py:148] Error executing method start_worker_execution_loop. This might cause deadlock in distributed execution. [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] Traceback (most recent call last): [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] File "/vllm-main/vllm/worker/worker_base.py", line 140, in execute_method [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] return executor(*args, **kwargs) [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context [repeated 4x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] return func(*args, **kwargs) [repeated 4x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context [repeated 4x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] return func(*args, **kwargs) [repeated 4x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] File "/vllm-main/vllm/spec_decode/spec_decode_worker.py", line 297, in start_worker_execution_loop [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] while self._run_non_driver_rank(): [repeated 2x across cluster]
(RayWorkerWrapper pid=143111) ERROR 06-07 16:53:17 worker_base.py:148] File "/vllm-main/vllm/spec_decode/spec_decode_worker.py", line 366, in _run_non_driver_rank [repeated 2x across cluster]
(RayWorkerWrapper pid=143
3 answers

Answer 1:
@cadedaniel
Answer 2:
Hi 张小伟, could you help confirm: what is the maximum context length supported by your draft model?
Answer 3:
Hi @zhangxy1234,

The maximum context length supported by your draft model is 2048. When the `tp` parameter is set to 1, generation stops at 2048, but it raises an error when `tp > 1`. This is because the `_execute_model_non_driver` method in the worker module tries to swap out tensors of size larger than 2048, which is not allowed in the current implementation.

To resolve this issue, you can try increasing `max_context_len` in the `config.yaml` file to a value that allows larger tensor swaps, for example 4096 or higher. However, note that a very high value may cause memory issues and impact the performance of your model.

Another option is to modify the `_execute_model_non_driver` method in the worker module to handle tensor swaps larger than 2048, by checking the size of the tensor being swapped and only swapping when it is within the allowed range.

I hope this helps! Let me know if you have any further questions.
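A possible engine-level workaround is to cap the engine's maximum sequence length at the draft model's window, so the scheduler never builds a sequence the draft model cannot handle. This is a sketch, not verified on 0.4.3: `max_model_len` is a real `EngineArgs` field, but whether it also bounds the draft model's sequences should be checked against your vLLM version.

```python
engine_args = EngineArgs(
    model=base_path,
    tokenizer=base_path,
    trust_remote_code=True,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,
    enforce_eager=True,
    speculative_model=draft_path,
    num_speculative_tokens=4,
    dtype=torch.float16,
    use_v2_block_manager=True,
    max_model_len=2048,  # cap at the draft model's context window (assumption)
)
```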
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) File "/home/ma-user/anaconda3/envs/PyTorch-2.0.0/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
(RayWorkerWrapper pid=7582) return func(*args, **kwargs)
(RayWorkerWrapper pid=7582) │ │ └ {}
(RayWorkerWrapper pid=7582) │ └ (<vllm.spec_decode.multi_step_worker.MultiStepWorker object at 0x7fa4096df610>,)
(RayWorkerWrapper pid=7582) └ <function Worker.execute_model at 0x7fa408386310>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) File "vllm-main/vllm/worker/worker.py", line 236, in execute_model
(RayWorkerWrapper pid=7582) self._execute_model_non_driver()
(RayWorkerWrapper pid=7582) │ └ <function Worker._execute_model_non_driver at 0x7fa408386550>
(RayWorkerWrapper pid=7582) └ <vllm.spec_decode.multi_step_worker.MultiStepWorker object at 0x7fa4096df610>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) File "vllm-main/vllm/worker/worker.py", line 311, in _execute_model_non_driver
(RayWorkerWrapper pid=7582) self.cache_swap(blocks_to_swap_in, blocks_to_swap_out, blocks_to_copy)
(RayWorkerWrapper pid=7582) │ │ │ │ └ None
(RayWorkerWrapper pid=7582) │ │ │ └ None
(RayWorkerWrapper pid=7582) │ │ └ None
(RayWorkerWrapper pid=7582) │ └ <function Worker.cache_swap at 0x7fa408386280>
(RayWorkerWrapper pid=7582) └ <vllm.spec_decode.multi_step_worker.MultiStepWorker object at 0x7fa4096df610>
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) File "vllm-main/vllm/worker/worker.py", line 223, in cache_swap
(RayWorkerWrapper pid=7582) if blocks_to_swap_in.numel() > 0:
(RayWorkerWrapper pid=7582) └ None
(RayWorkerWrapper pid=7582)
(RayWorkerWrapper pid=7582) AttributeError: 'NoneType' object has no attribute 'numel'
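The traceback bottoms out in `cache_swap` calling `.numel()` on a `None` swap map, which non-driver ranks can receive in this code path. One way to sketch a defensive fix (a hypothetical standalone version; the real vLLM method operates on torch tensors and the worker's cache engine):

```python
# Hypothetical standalone sketch of a None-tolerant cache_swap. The real
# vLLM method issues CUDA copies on tensors; here we just collect the
# operations that would be issued, skipping None/empty inputs instead of
# calling .numel() on None, which raises AttributeError.
def safe_cache_swap(blocks_to_swap_in, blocks_to_swap_out, blocks_to_copy):
    ops = []
    if blocks_to_swap_in:      # None and empty mappings both mean "nothing to do"
        ops.append(("swap_in", blocks_to_swap_in))
    if blocks_to_swap_out:
        ops.append(("swap_out", blocks_to_swap_out))
    if blocks_to_copy:
        ops.append(("copy", blocks_to_copy))
    return ops

print(safe_cache_swap(None, None, None))  # [] -- no AttributeError
```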