Your current environment
docker pull vllm/vllm-openai:latest
docker run -d --restart=always \
--runtime=nvidia \
--gpus '"device=1"' \
--shm-size=10.24gb \
-p 5001:5001 \
-e NCCL_IGNORE_DISABLED_P2P=1 \
-e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
-e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
-v /etc/passwd:/etc/passwd:ro \
-v /etc/group:/etc/group:ro \
-u `id -u`:`id -g` \
-v "${HOME}"/.cache:$HOME/.cache/ -v "${HOME}"/.config:$HOME/.config/ -v "${HOME}"/.triton:$HOME/.triton/ \
--network host \
--name phi3mini \
vllm/vllm-openai:latest \
--port=5001 \
--host=0.0.0.0 \
--model=microsoft/Phi-3-mini-128k-instruct \
--seed 1234 \
--trust-remote-code \
--tensor-parallel-size=1 \
--max-num-batched-tokens=131072 --max-log-len=100 \
--max-model-len=131072 \
--max-num-seqs=17 \
--use-v2-block-manager \
--num-speculative-tokens=5 \
--ngram-prompt-lookup-max=4 \
--speculative-model="[ngram]" \
--download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.phi3.txt
Describe the bug
ERROR 08-01 21:27:03 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/spec_decode/spec_decode_worker.py", line 375, in execute_model
ERROR 08-01 21:27:03 async_llm_engine.py:56] return self._run_speculative_decoding_step(execute_model_req,
ERROR 08-01 21:27:03 async_llm_engine.py:56] File "/usr/lib/python3.10/contextlib.py", line 79, in inner
ERROR 08-01 21:27:03 async_llm_engine.py:56] return func(*args, **kwds)
ERROR 08-01 21:27:03 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/spec_decode/spec_decode_worker.py", line 538, in _run_speculative_decoding_step
ERROR 08-01 21:27:03 async_llm_engine.py:56] accepted_token_ids, target_logprobs = self._verify_tokens(
ERROR 08-01 21:27:03 async_llm_engine.py:56] File "/usr/lib/python3.10/contextlib.py", line 79, in inner
ERROR 08-01 21:27:03 async_llm_engine.py:56] return func(*args, **kwds)
ERROR 08-01 21:27:03 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/spec_decode/spec_decode_worker.py", line 609, in _verify_tokens
ERROR 08-01 21:27:03 async_llm_engine.py:56] accepted_token_ids = self.spec_decode_sampler(
ERROR 08-01 21:27:03 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-01 21:27:03 async_llm_engine.py:56] return self._call_impl(*args, **kwargs)
ERROR 08-01 21:27:03 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-01 21:27:03 async_llm_engine.py:56] return forward_call(*args, **kwargs)
ERROR 08-01 21:27:03 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/rejection_sampler.py", line 82, in forward
ERROR 08-01 21:27:03 async_llm_engine.py:56] self._batch_modified_rejection_sampling(
ERROR 08-01 21:27:03 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/rejection_sampler.py", line 119, in _batch_modified_rejection_sampling
ERROR 08-01 21:27:03 async_llm_engine.py:56] accepted = self._get_accepted(target_probs, draft_probs,
ERROR 08-01 21:27:03 async_llm_engine.py:56] File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/rejection_sampler.py", line 190, in _get_accepted
ERROR 08-01 21:27:03 async_llm_engine.py:56] uniform_rand[idx, :] = torch.rand(1,
ERROR 08-01 21:27:03 async_llm_engine.py:56] IndexError: index 0 is out of bounds for dimension 0 with size 0
When I send the model "Who are you?" as the first message, I get "I" back and then it dies.
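For reference, the request that triggers this has the following shape (a minimal sketch, assuming the container's OpenAI-compatible endpoint on port 5001; max_tokens and stream are illustrative):

curl http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Phi-3-mini-128k-instruct",
    "messages": [{"role": "user", "content": "Who are you?"}],
    "max_tokens": 64,
    "stream": true
  }'
# The server streams the first token ("I") and then the engine dies with the IndexError above.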
5 answers
q35jwt9p1#
Maybe you could change your speculative model, or set spec_decoding_acceptance_method to typical_acceptance_sampler. When '[ngram]' is used, there is a bug in the RejectionSampler source code: it cannot handle a draft_probs with shape (0, k).
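As a sketch of what that suggestion looks like with the docker invocation from the question (trimmed to the options relevant here; the mounts and environment variables from the original command would carry over unchanged):

# Assumption: only the last flag is new, everything else is copied from the original command.
docker run -d --restart=always \
  --runtime=nvidia --gpus '"device=1"' \
  --shm-size=10.24gb \
  --network host \
  --name phi3mini \
  vllm/vllm-openai:latest \
  --port=5001 --host=0.0.0.0 \
  --model=microsoft/Phi-3-mini-128k-instruct \
  --trust-remote-code \
  --max-model-len=131072 \
  --use-v2-block-manager \
  --num-speculative-tokens=5 \
  --ngram-prompt-lookup-max=4 \
  --speculative-model="[ngram]" \
  --spec-decoding-acceptance-method=typical_acceptance_sampler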
3htmauhk2#
Has anyone fixed this bug?
cc @cadedaniel
toe950273#
I'm happy to try other options. It was working well for someone else, but not for me on the phi-3-mini-128k model. Failed instantly. I'll probably wait until this bug is fixed before trying again.
The hope is that for structured output, others are getting quite good speed-up. i.e. for guided_json and JSON output, about 5x improvement for a 7b model. Sounds great, but just crashes for me.
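For context, the guided_json usage being referred to is a request along these lines (a sketch, assuming vLLM's guided_json extra parameter on the OpenAI-compatible endpoint; the schema and prompt are made up for illustration):

curl http://localhost:5001/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Phi-3-mini-128k-instruct",
    "messages": [{"role": "user", "content": "Describe yourself as JSON."}],
    "max_tokens": 128,
    "guided_json": {
      "type": "object",
      "properties": {"name": {"type": "string"}, "purpose": {"type": "string"}},
      "required": ["name", "purpose"]
    }
  }'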
2guxujil4#
I'm happy to try other options. It was working well for someone else, but not for me on the phi-3-mini-128k model. Failed instantly. I'll probably wait until this bug is fixed before trying again.
The hope is that for structured output, others are getting quite good speed-up. i.e. for guided_json and JSON output, about 5x improvement for a 7b model. Sounds great, but just crashes for me.
Did you try adding
--spec-decoding-acceptance-method='typical_acceptance_sampler' \
? It works for me to avoid the crash.
gtlvzcf85#
I'm happy to try other options. It was working well for someone else, but not for me on the phi-3-mini-128k model. Failed instantly. I'll probably wait until this bug is fixed before trying again.
The hope is that for structured output, others are getting quite good speed-up. i.e. for guided_json and JSON output, about 5x improvement for a 7b model. Sounds great, but just crashes for me.
By the way, you can build directly from source on the main branch. I guess the container you are using was built with vllm v0.5.3 or v0.5.3.post1. ( #6698 ) has already fixed this bug. Alternatively, you can wait for the v0.5.4 release, which should no longer crash.
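For reference, building from the main branch instead of using the prebuilt image looks roughly like this (a sketch; it assumes a CUDA build environment compatible with your GPU driver, and a source build can take a while):

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .   # builds vLLM from source off main, which includes the fix referenced above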