vllm [Bug]: Speculative decoding dies: IndexError: index 0 is out of bounds for dimension 0 with size 0

vsdwdz23 posted 4 months ago in Other

Your current environment

docker pull vllm/vllm-openai:latest
docker run -d --restart=always \
    --runtime=nvidia \
    --gpus '"device=1"' \
    --shm-size=10.24gb \
    -p 5001:5001 \
    -e NCCL_IGNORE_DISABLED_P2P=1 \
    -e VLLM_NCCL_SO_PATH=/usr/local/lib/python3.10/dist-packages/nvidia/nccl/lib/libnccl.so.2 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGING_FACE_HUB_TOKEN \
    -v /etc/passwd:/etc/passwd:ro \
    -v /etc/group:/etc/group:ro \
    -u `id -u`:`id -g` \
    -v "${HOME}"/.cache:$HOME/.cache/ \
    -v "${HOME}"/.config:$HOME/.config/ \
    -v "${HOME}"/.triton:$HOME/.triton/ \
    --network host \
    --name phi3mini \
    vllm/vllm-openai:latest \
        --port=5001 \
        --host=0.0.0.0 \
        --model=microsoft/Phi-3-mini-128k-instruct \
        --seed 1234 \
        --trust-remote-code \
        --tensor-parallel-size=1 \
        --max-num-batched-tokens=131072 --max-log-len=100 \
        --max-model-len=131072 \
        --max-num-seqs=17 \
        --use-v2-block-manager \
        --num-speculative-tokens=5 \
        --ngram-prompt-lookup-max=4 \
        --speculative-model="[ngram]" \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.phi3.txt

Describe the bug

ERROR 08-01 21:27:03 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/spec_decode/spec_decode_worker.py", line 375, in execute_model
ERROR 08-01 21:27:03 async_llm_engine.py:56]     return self._run_speculative_decoding_step(execute_model_req,
ERROR 08-01 21:27:03 async_llm_engine.py:56]   File "/usr/lib/python3.10/contextlib.py", line 79, in inner
ERROR 08-01 21:27:03 async_llm_engine.py:56]     return func(*args, **kwds)
ERROR 08-01 21:27:03 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/spec_decode/spec_decode_worker.py", line 538, in _run_speculative_decoding_step
ERROR 08-01 21:27:03 async_llm_engine.py:56]     accepted_token_ids, target_logprobs = self._verify_tokens(
ERROR 08-01 21:27:03 async_llm_engine.py:56]   File "/usr/lib/python3.10/contextlib.py", line 79, in inner
ERROR 08-01 21:27:03 async_llm_engine.py:56]     return func(*args, **kwds)
ERROR 08-01 21:27:03 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/spec_decode/spec_decode_worker.py", line 609, in _verify_tokens
ERROR 08-01 21:27:03 async_llm_engine.py:56]     accepted_token_ids = self.spec_decode_sampler(
ERROR 08-01 21:27:03 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
ERROR 08-01 21:27:03 async_llm_engine.py:56]     return self._call_impl(*args, **kwargs)
ERROR 08-01 21:27:03 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1541, in _call_impl
ERROR 08-01 21:27:03 async_llm_engine.py:56]     return forward_call(*args, **kwargs)
ERROR 08-01 21:27:03 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/rejection_sampler.py", line 82, in forward
ERROR 08-01 21:27:03 async_llm_engine.py:56]     self._batch_modified_rejection_sampling(
ERROR 08-01 21:27:03 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/rejection_sampler.py", line 119, in _batch_modified_rejection_sampling
ERROR 08-01 21:27:03 async_llm_engine.py:56]     accepted = self._get_accepted(target_probs, draft_probs,
ERROR 08-01 21:27:03 async_llm_engine.py:56]   File "/usr/local/lib/python3.10/dist-packages/vllm/model_executor/layers/rejection_sampler.py", line 190, in _get_accepted
ERROR 08-01 21:27:03 async_llm_engine.py:56]     uniform_rand[idx, :] = torch.rand(1,
ERROR 08-01 21:27:03 async_llm_engine.py:56] IndexError: index 0 is out of bounds for dimension 0 with size 0

When I sent the model its first message, "Who are you?", I got back "I" and then it died.
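
For reference, this is the kind of request I mean, sent against the OpenAI-compatible endpoint started by the docker command above (a minimal sketch; the exact sampling parameters shown here are illustrative, not the precise request my client sent):

curl http://localhost:5001/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "microsoft/Phi-3-mini-128k-instruct",
        "messages": [{"role": "user", "content": "Who are you?"}],
        "max_tokens": 64,
        "stream": true
    }'

The stream produces the first token and then the engine dies with the traceback shown above.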

q35jwt9p1#

Maybe you could change your speculative model, or set spec_decoding_acceptance_method to typical_acceptance_sampler. With '[ngram]' there is a bug in the RejectionSampler source code: it cannot handle a draft_probs tensor of shape (0, k).

3htmauhk2#

Has anyone fixed this bug yet?
cc @cadedaniel

toe950273#

I'm happy to try other options. It was working well for someone else, but not for me on the phi-3-mini-128k model; it failed instantly. I'll probably wait until this bug is fixed before trying again.
The hope is that for structured output, others are getting quite a good speed-up, i.e. about a 5x improvement for guided_json and JSON output with a 7b model. Sounds great, but for me it just crashes.

2guxujil4#

I'm happy to try other options. It was working well for someone else, but not for me on the phi-3-mini-128k model. Failed instantly. I'll probably wait until this bug is fixed before trying again.
The hope is that for structured output, others are getting quite good speed-up. i.e. for guided_json and JSON output, about 5x improvement for a 7b model. Sounds great, but just crashes for me.
Did you try adding --spec-decoding-acceptance-method='typical_acceptance_sampler'? That avoids the crash for me.
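
Concretely, only the engine arguments at the end of the docker run command from the bug report need to change; everything else stays the same (a sketch, trailing arguments only):

        --num-speculative-tokens=5 \
        --ngram-prompt-lookup-max=4 \
        --speculative-model="[ngram]" \
        --spec-decoding-acceptance-method=typical_acceptance_sampler \
        --download-dir=$HOME/.cache/huggingface/hub &>> logs.vllm_server.phi3.txt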

gtlvzcf85#

I'm happy to try other options. It was working well for someone else, but not for me on the phi-3-mini-128k model; it failed instantly. I'll probably wait until this bug is fixed before trying again.
The hope is that for structured output, others are getting quite a good speed-up, i.e. about a 5x improvement for guided_json and JSON output with a 7b model. Sounds great, but for me it just crashes.
By the way, you could just build from source on the main branch. I'm guessing the container you are using was built from vllm v0.5.3 or v0.5.3.post1; ( #6698 ) already fixed this bug. Alternatively, you can wait for the v0.5.4 release, which should no longer crash.
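
For example (the image tag and build steps below follow the usual vLLM release and build conventions; a sketch, not something verified in this thread):

# Option 1: once v0.5.4 is published, pull the tagged image instead of :latest
docker pull vllm/vllm-openai:v0.5.4

# Option 2: install vLLM from the current main branch
git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .    # compiles the CUDA kernels; requires a matching CUDA toolchain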
