Your current environment
See below for the detailed setup and run scripts that I use.
🐛 Describe the bug
Hi, I'm trying to deploy llama-8b with vLLM on an AWS Inferentia (inf2.8xlarge) instance. After many hacky/tiring attempts, I have managed to get the vLLM server to start correctly. However, when I run inference on a simple "hi" input prompt, an error warning is printed to the console and I get nothing back from the LLM in the Gradio UI I set up. Please see the thread for code-related details. I would appreciate it if someone could help me with the issues below! I am deploying with SkyPilot:
(task, pid=33413) INFO 06-21 09:15:21 metrics.py:341] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
(task, pid=33413) INFO: 127.0.0.1:60198 - "POST /v1/chat/completions HTTP/1.1" 200 OK
(task, pid=33413) INFO 06-21 09:15:27 async_llm_engine.py:582] Received request cmpl-410ee0fe3db44e05a79d0112fb3ec571: prompt: '<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a great ai assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nhi<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n', params: SamplingParams(n=1, best_of=1, presence_penalty=0.0, frequency_penalty=0.0, repetition_penalty=1.0, temperature=0.8, top_p=1.0, top_k=-1, min_p=0.0, seed=None, use_beam_search=False, length_penalty=1.0, early_stopping=False, stop=[], stop_token_ids=[128009, 128001], include_stop_str_in_output=False, ignore_eos=False, max_tokens=2025, min_tokens=0, logprobs=None, prompt_logprobs=None, skip_special_tokens=True, spaces_between_special_tokens=True, truncate_prompt_tokens=None), prompt_token_ids: [128000, 128006, 9125, 128007, 271, 2675, 527, 264, 2294, 16796, 18328, 13, 128009, 128006, 882, 128007, 271, 6151, 128009, 128006, 78191, 128007, 271], lora_request: None.
(task, pid=33413) WARNING 06-21 09:15:27 scheduler.py:683] Input prompt (23 tokens) is too long and exceeds the capacity of block_manager
Here is the vLLM-specific setup I use on the instance:
. /etc/os-release
sudo tee /etc/apt/sources.list.d/neuron.list > /dev/null <<EOF
deb https://apt.repos.neuron.amazonaws.com ${VERSION_CODENAME} main
EOF
wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | sudo apt-key add -
sudo apt-get update -y
# Install OS headers
sudo apt-get install linux-headers-$(uname -r) -y
# Install git
sudo apt-get install git -y
# Install Neuron Driver
sudo apt-get install aws-neuronx-dkms=2.* -y
# Install Neuron Runtime
sudo apt-get install aws-neuronx-collectives=2.* -y
sudo apt-get install aws-neuronx-runtime-lib=2.* -y
# Install Neuron Tools
sudo apt-get install aws-neuronx-tools=2.* -y
# Add PATH
export PATH=/opt/aws/neuron/bin:$PATH
# Install Python venv
sudo apt-get install -y python3.10-venv g++
# Create Python venv
python3.10 -m venv aws_neuron_venv_pytorch
# Activate Python venv
source aws_neuron_venv_pytorch/bin/activate
# Install Jupyter notebook kernel
pip install ipykernel
python3.10 -m ipykernel install --user --name aws_neuron_venv_pytorch --display-name "Python (torch-neuronx)"
pip install jupyter notebook
pip install environment_kernels
# Set pip repository pointing to the Neuron repository
python -m pip config set global.extra-index-url https://pip.repos.neuron.amazonaws.com
# Install wget, awscli
python -m pip install wget
python -m pip install awscli
# Update Neuron Compiler and Framework
python -m pip install --upgrade neuronx-cc==2.* --pre torch-neuronx==2.1.* torchvision transformers-neuronx
# Install vLLM from source
git clone https://github.com/vllm-project/vllm.git
# Create an empty __init__.py file in the neuron directory
touch ./vllm/model_executor/models/neuron/__init__.py
cd vllm
pip install -U -r requirements-neuron.txt
pip install .
# Install Gradio for web UI
pip install gradio openai
Here is how I run the server:
source aws_neuron_venv_pytorch/bin/activate
echo 'Starting vllm api server...'
export LD_LIBRARY_PATH="/opt/conda/lib/:$LD_LIBRARY_PATH"
export PATH=/opt/aws/neuron/bin:$PATH
export NEURON_RT_VISIBLE_CORES=0-1
# NOTE: --gpu-memory-utilization 0.95 needed for 4-GPU nodes.
python -u -m vllm.entrypoints.openai.api_server \
--port 8081 \
--model $MODEL_NAME \
--trust-remote-code \
--max-num-seqs 1 \
--device neuron \
--max-model-len 2048 \
2>&1 | tee api_server.log &
while ! grep -q 'Uvicorn running on' api_server.log; do
echo 'Waiting for vllm api server to start...'
sleep 5
done
echo 'Starting gradio server...'
git clone https://github.com/vllm-project/vllm.git || true
python vllm/examples/gradio_openai_chatbot_webserver.py \
-m $MODEL_NAME \
--port 8811 \
--model-url http://localhost:8081/v1 \
--stop-token-ids 128009,128001
The first thing that comes to mind is the NEURON_RT_VISIBLE_CORES environment variable. I tried increasing it to a range larger than 0-1, e.g. 0-3, but then the vLLM server fails and won't even start. This is on an inf2.8xlarge instance. Each inf2 accelerator has 8 cores (and the 8xlarge has a single Inferentia accelerator), so this should have been 0-7, yet even values smaller than that don't work?
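As a quick sanity check (a hedged suggestion: neuron-ls comes with the aws-neuronx-tools package installed in the setup script above), the runtime only accepts core ranges that the instance actually exposes:

```shell
# List the Neuron devices and the NeuronCores each one exposes;
# NEURON_RT_VISIBLE_CORES must stay within the range reported here.
neuron-ls
```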
I tried increasing max-model-len to 4096, but even that makes the vLLM server fail to start:
(task, pid=34615) performing partition vectorization on AG_2[[0, 1032, 0, 0, 0, 0]]{2 nodes (1 sources, 0 stops)}. dags covered: {dag_1036_TC_SRC, dag_1032}
(task, pid=34615) ..Waiting for vllm api server to start...
(task, pid=34615) root = /opt/conda/lib/python3.10/multiprocessing/process.py
(task, pid=34615) root = /opt/conda/lib/python3.10/multiprocessing
(task, pid=34615) root = /opt/conda/lib/python3.10
(task, pid=34615) root = /opt/conda/lib
(task, pid=34615) root = /opt/conda
(task, pid=34615) root = /opt
(task, pid=34615)
(task, pid=34615) 2024-06-21 09:33:40.000866: 38168 ERROR ||NEURON_CC_WRAPPER||: Failed compilation with ['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/270aa309-fc42-41dc-8e08-d69177b6ded8/model.MODULE_39ea0777b538681fb821+2c2d707e.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/270aa309-fc42-41dc-8e08-d69177b6ded8/model.MODULE_39ea0777b538681fb821+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']: 2024-06-21T09:33:40Z [PGT002] Too many instructions after unroll! - Compiling under --optlevel=1 may result in smaller graphs. If you are using a transformer model, try using a smaller context_length_estimate value.
(task, pid=34615)
(task, pid=34615) 2024-06-21 09:33:40.000866: 38168 ERROR ||NEURON_CC_WRAPPER||: Compilation failed for /tmp/ubuntu/neuroncc_compile_workdir/270aa309-fc42-41dc-8e08-d69177b6ded8/model.MODULE_39ea0777b538681fb821+2c2d707e.hlo_module.pb after 0 retries.
(task, pid=34615) 2024-06-21 09:33:40.000867: 38168 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
(task, pid=34615) Waiting for vllm api server to start...
(task, pid=34615) Compiler status PASS
(task, pid=34615) 2024-06-21 09:36:42.000494: 38167 INFO ||NEURON_CACHE||: Compile cache path: /var/tmp/neuron-compile-cache
(task, pid=34615) concurrent.futures.process._RemoteTraceback:
(task, pid=34615) """
(task, pid=34615) Traceback (most recent call last):
(task, pid=34615) File "/opt/conda/lib/python3.10/concurrent/futures/process.py", line 246, in _process_worker
(task, pid=34615) r = call_item.fn(*call_item.args, **call_item.kwargs)
(task, pid=34615) File "/home/ubuntu/sky_workdir/aws_neuron_venv_pytorch/lib/python3.10/site-packages/libneuronxla/neuron_cc_wrapper.py", line 163, in call_neuron_compiler
(task, pid=34615) raise subprocess.CalledProcessError(res.returncode, cmd, stderr=error_info)
(task, pid=34615) subprocess.CalledProcessError: Command '['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/270aa309-fc42-41dc-8e08-d69177b6ded8/model.MODULE_39ea0777b538681fb821+2c2d707e.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/270aa309-fc42-41dc-8e08-d69177b6ded8/model.MODULE_39ea0777b538681fb821+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']' returned non-zero exit status 70.
(task, pid=34615) subprocess.CalledProcessError: Command '['neuronx-cc', 'compile', '--target=trn1', '--framework=XLA', '/tmp/ubuntu/neuroncc_compile_workdir/270aa309-fc42-41dc-8e08-d69177b6ded8/model.MODULE_39ea0777b538681fb821+2c2d707e.hlo_module.pb', '--output', '/tmp/ubuntu/neuroncc_compile_workdir/270aa309-fc42-41dc-8e08-d69177b6ded8/model.MODULE_39ea0777b538681fb821+2c2d707e.neff', '--model-type=transformer', '--auto-cast=none', '--verbose=35']' returned non-zero exit status 70.
Increasing --max-num-seqs to >1 also makes the vLLM server fail to start. Could someone help me with this? 🙏
I have tried many things, but most of them fail on the vLLM side. 😦
Please help me with the issues above!
6 Answers
iswrvxsc1#
CC @liangfu
toe950272#
@liangfu hoping you can help with the issue above!
jdgnovmf3#
@aws-patlange could you please look into this?
hlswsv354#
We currently don't support paged attention in the neuron integration. You need to explicitly set --block-size to --max-model-len. See https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/transformers-neuronx/transformers-neuronx-developer-guide-for-continuous-batching.html . This may require some edits here so that it can be passed through to one of the API entrypoints provided in vLLM.
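Concretely, that would mean launching with --block-size equal to --max-model-len. A sketch based on the run script in the question (not a verified invocation; the value must match whatever --max-model-len is set to, here 2048):

```shell
# Launch with --block-size equal to --max-model-len so the KV cache is a
# single contiguous block per sequence (no paged attention on neuron).
python -u -m vllm.entrypoints.openai.api_server \
  --port 8081 \
  --model $MODEL_NAME \
  --trust-remote-code \
  --max-num-seqs 1 \
  --device neuron \
  --max-model-len 2048 \
  --block-size 2048
```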
ybzsozfc5#
Please try the following after editing the argument parser, which currently restricts --block-size to certain specific values:
mspsb9vt6#
Hi, I used your command but ran into an error:
TypeError: Can't instantiate abstract class NeuronWorker with abstract method execute_worker. Any suggestions? Thanks!
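For reference, the parser edit suggested by ybzsozfc5# can be sketched with a self-contained stand-in. This mimics, rather than reproduces, the option defined in vllm/engine/arg_utils.py; the real choices list and default vary by vLLM version, so treat the values below as placeholders:

```python
import argparse

# Stand-in for vLLM's engine argument parser: --block-size is normally
# restricted via `choices`, so a large value such as 2048 is rejected
# until it is appended to the list.
parser = argparse.ArgumentParser()
parser.add_argument("--block-size",
                    type=int,
                    default=16,
                    choices=[8, 16, 32, 2048],  # 2048 appended for neuron
                    help="token block size; set equal to max-model-len on neuron")

args = parser.parse_args(["--block-size", "2048"])
print(args.block_size)  # prints 2048
```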