[用法]:如何设置vllm以在k8s/openshift集群中工作

o4hqfura 于 5个月前发布在其他

关注(0)|答案(9)|浏览(90)

当前环境

编辑1

Collecting environment information...
PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.14.0-284.59.1.el9_2.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA L40S
GPU 1: NVIDIA L40S

Nvidia driver version: 550.54.14
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             128
On-line CPU(s) list:                0-127
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 9334 32-Core Processor
CPU family:                         25
Model:                              17
Thread(s) per core:                 2
Core(s) per socket:                 32
Socket(s):                          2
Stepping:                           1
Frequency boost:                    enabled
CPU max MHz:                        3910.2529
CPU min MHz:                        1500.0000
BogoMIPS:                           5399.76
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization:                     AMD-V
L1d cache:                          2 MiB (64 instances)
L1i cache:                          2 MiB (64 instances)
L2 cache:                           64 MiB (64 instances)
L3 cache:                           256 MiB (8 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-31,64-95
NUMA node1 CPU(s):                  32-63,96-127
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.19.3
[pip3] torch==2.2.1
[pip3] triton==2.2.0
[pip3] vllm-nccl-cu12==2.18.1.0.3.0
[conda] Could not collectROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     SYS     32-63,96-127    1               N/A
GPU1    SYS      X      SYS     SYS     32-63,96-127    1               N/A
NIC0    SYS     SYS      X      PIX
NIC1    SYS     SYS     PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

您希望如何使用vllm

我想在一个已经安装了nvidia GPU operator的openshift集群上运行一个mixtral-8x7b-instruct的推理。当我运行以下yaml文件时，我得到以下冻结的日志输出：

apiVersion: apps/v1
kind: Deployment
metadata:
  name: mixtral-8x7b-instruct-vllm-pod
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mixtral-8x7b-instruct-vllm-pod
  template:
    metadata:
      labels:
        app: mixtral-8x7b-instruct-vllm-pod
    spec:
      containers:
      - name: mixtral-8x7b-instruct-vllm-pod
        image:  vllm/vllm-openai:v0.2.7
        args: ["--model", "mistralai/Mixtral-8x7B-Instruct-v0.1", "--tensor-parallel-size", "2", "--dtype", "half"]
        ports:
        - containerPort: 8000
        volumeMounts:
        - name: huggingface-cache
          mountPath: /root/.cache/huggingface
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          value: xxxxxxx
        resources:
          limits:
            nvidia.com/gpu: "2"
      volumes:
      - name: huggingface-cache
        persistentVolumeClaim:
          claimName: example-pv-filesystem
      hostIPC: true

注意如果我使用较小的mistral模型和一个GPU,它按预期工作。只有在添加2个或更多GPU时，它才会冻结。

vllm

来源：https://github.com/vllm-project/vllm/issues/4462

9条答案

按热度按时间

edqdpe6u1#

v0.2.7版本相当旧；您可以尝试使用当前的v0.4.1版本吗？
此外，从容器内部的'collect_env.py'输出也会很有帮助，例如：
$ kubectl exec -- python3 -c 'import requests; exec(requests.get(" https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py").text.text) )'

赞(0）回复(0）举报 5个月前

flvlnr442#

v0.2.7相当旧；你能尝试使用当前的v0.4.1吗？
此外，从容器内部的'collect_env.py'输出也会很有帮助，例如：
$ kubectl exec -- python3 -c 'import requests; exec(requests.get(" https://raw.githubusercontent.com/vllm-project/vllm/main/collect_env.py").text .text))'
感谢您的回复！
我应该提到我测试了标签latest、v0.4.1、v0.4.0、v0.3.2和v0.2.7,因为这与#4455有关的一个问题。
然而，在vllm/vllm-openai:latest pod内，我运行了collect_env.py

Collecting environment information...
PyTorch version: 2.2.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.3 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.14.0-284.59.1.el9_2.x86_64-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: 
GPU 0: NVIDIA L40S
GPU 1: NVIDIA L40S

Nvidia driver version: 550.54.14
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      52 bits physical, 57 bits virtual
Byte Order:                         Little Endian
CPU(s):                             128
On-line CPU(s) list:                0-127
Vendor ID:                          AuthenticAMD
Model name:                         AMD EPYC 9334 32-Core Processor
CPU family:                         25
Model:                              17
Thread(s) per core:                 2
Core(s) per socket:                 32
Socket(s):                          2
Stepping:                           1
Frequency boost:                    enabled
CPU max MHz:                        3910.2529
CPU min MHz:                        1500.0000
BogoMIPS:                           5399.76
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization:                     AMD-V
L1d cache:                          2 MiB (64 instances)
L1i cache:                          2 MiB (64 instances)
L2 cache:                           64 MiB (64 instances)
L3 cache:                           256 MiB (8 instances)
NUMA node(s):                       2
NUMA node0 CPU(s):                  0-31,64-95
NUMA node1 CPU(s):                  32-63,96-127
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; Safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.19.3
[pip3] torch==2.2.1
[pip3] triton==2.2.0
[pip3] vllm-nccl-cu12==2.18.1.0.3.0
[conda] Could not collectROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0    GPU1    NIC0    NIC1    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      SYS     SYS     SYS     32-63,96-127    1               N/A
GPU1    SYS      X      SYS     SYS     32-63,96-127    1               N/A
NIC0    SYS     SYS      X      PIX
NIC1    SYS     SYS     PIX      X 

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

NIC Legend:

  NIC0: mlx5_0
  NIC1: mlx5_1

赞(0）回复(0）举报 5个月前

bttbmeg03#

vllm 0.4.1 + qwen-14b-chat the yaml as below:

apiVersion: apps/v1 
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: llm-model
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: "2Gi"
      containers:
      - name: vllm
        image: vllm/vllm-openai:0.4.1
        command:
        - "python3 -m vllm.entrypoints.openai.api_server  --port 8080  --trust-remote-code --served-model-name qwen --model /mnt/models/Qwen-14B-Chat --gpu-memory-utilization 0.95 --tensor-parallel-size 2"
        ports:
        - containerPort: 8080
        readinessProbe:
          tcpSocket:
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 30
        resources:
          limits:
            nvidia.com/gpu: "2"
          requests:
            cpu: 4
            memory: 8Gi
            nvidia.com/gpu: "2"
        volumeMounts:
        - mountPath: /mnt/models
          name: model
        - name: dshm
          mountPath: /dev/shm

赞(0）回复(0）举报 5个月前

dojqjjoe4#

好的，我按照你的示例进行了一些修改

apiVersion: apps/v1 
kind: Deployment
metadata:
  name: vllm
  labels:
    app: vllm
spec:
  replicas: 1
  revisionHistoryLimit: 1
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: vllm
  template:
    metadata:
      labels:
        app: vllm
    spec:
      volumes:
      - name: model
        persistentVolumeClaim:
          claimName: example-pv-filesystem
      - name: dshm
        emptyDir:
          medium: Memory
          sizeLimit: "15Gi"
      containers:
      - name: vllm
        image: vllm/vllm-openai:v0.4.1
        command:
        - "python3 -m vllm.entrypoints.openai.api_server  --port 8080  --trust-remote-code --served-model-name mistral --model /mnt/models/models--mistralai--Mixtral-8x7B-Instruct-v0.1 --gpu-memory-utilization 0.95 --tensor-parallel-size 2"
        ports:
        - containerPort: 8080
        readinessProbe:
          tcpSocket:
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 30
        resources:
          limits:
            nvidia.com/gpu: "2"
          requests:
            cpu: 4
            memory: 8Gi
            nvidia.com/gpu: "2"
        volumeMounts:
        - mountPath: /mnt/models
          name: model
        - name: dshm
          mountPath: /dev/shm

图片 vllm/vllm-openai:0.4.1 不存在，但 vllm/vllm-openai:v0.4.1 存在
应用程序无法启动，我得到了错误

Error: container create failed: time="2024-05-16T02:11:29Z" level=error msg="runc create failed: unable to start container process: exec: \"python3 -m vllm.entrypoints.openai.api_server  --port 8080  --trust-remote-code --served-model-name mistral --model /mnt/models/models--mistralai--Mixtral-8x7B-Instruct-v0.1 --gpu-memory-utilization 0.95 --tensor-parallel-size 2\": stat python3 -m vllm.entrypoints.openai.api_server  --port 8080  --trust-remote-code --served-model-name mistral --model /mnt/models/models--mistralai--Mixtral-8x7B-Instruct-v0.1 --gpu-memory-utilization 0.95 --tensor-parallel-size 2: no such file or directory"

但是当我注解掉命令并在pod内搜索 /mnt/models/models--mistralai--Mixtral-8x7B-Instruct-v0.1 时，我可以找到该目录

# pwd
/mnt/models/models--mistralai--Mixtral-8x7B-Instruct-v0.1
# ls
blobs  refs  snapshots
#

赞(0）回复(0）举报 5个月前

y0u0uwnf5#

我遇到了同样的问题

python3 -m vllm.entrypoints.openai.api_server  --model /model/model.file --port 8001 --trust-remote-code --gpu-memory-utilization 0.95: no such file or directory

@jayteaftw ,你在这方面有进展吗？

赞(0）回复(0）举报 5个月前

ou6hu8tu6#

我正在遇到相同的问题

python3 -m vllm.entrypoints.openai.api_server  --model /model/model.file --port 8001 --trust-remote-code --gpu-memory-utilization 0.95: no such file or directory

在这个问题上你有什么进展吗@jayteaftw ?

我不明白为什么我们会有同样的问题，但是的，我的这个问题仍然存在，即使是在4.0.3版本中。

赞(0）回复(0）举报 5个月前

xkrw2x1b7#

@jayteaftw
我看到RH有一个ubi vllm镜像，它对我有效。你可能也想试试这个。quay.io/rh-aiservices-bu/vllm-openai-ubi9:0.4.2
这将帮助你从huggingface下载镜像，所以对于你的情况，在container.args中设置--model mistralai/Mixtral-8x7B-Instruct-v0.1

赞(0）回复(0）举报 5个月前

2o7dmzc58#

@jayteaftw 我看到RH有一个ubi vllm镜像，它对我有效，你可能也想试试这个。quay.io/rh-aiservices-bu/vllm-openai-ubi9:0.4.2
这将帮助你从huggingface下载镜像，所以对于你的情况，在container.args中设置--model mistralai/Mixtral-8x7B-Instruct-v0.1
感谢你的建议。然而，当我尝试在Openshift上运行他们的镜像时，我仍然遇到了同样的问题。当使用超过1个GPU时，它会卡住。我甚至尝试从源代码编译并将其更改为使用新的0.4.3版本，但结果仍然是一样的

赞(0）回复(0）举报 5个月前

djp7away9#

遇到了与OP相同的问题，尝试过0.4.2和0.4.3版本，但都没有成功。希望VLLM能提供一些关于正确实现的反馈

赞(0）回复(0）举报 5个月前