Current environment
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.31
Python version: 3.11.3 (main, Apr 19 2024, 17:22:27) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-177-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 10.1.243
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A40
GPU 1: NVIDIA A40
GPU 2: NVIDIA A40
Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 40 bits physical, 48 bits virtual
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 64
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC-Milan Processor
Stepping: 1
CPU MHz: 2994.374
BogoMIPS: 5988.74
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 2 MiB
L1i cache: 2 MiB
L2 cache: 32 MiB
L3 cache: 2 GiB
NUMA node0 CPU(s): 0-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nrip_save umip vaes vpclmulqdq rdpid arch_capabilities
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.18.1
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     PHB     0-63            0               N/A
GPU1    PHB      X      PHB     0-63            0               N/A
GPU2    PHB     PHB      X      0-63            0               N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
I am using vLLM with several models in the same Python session (one model at a time), running on a multi-GPU setup. After each model run I need to clear the GPU memory to make room for the next model, which means (among other things) shutting down the Ray cluster via ray.shutdown().
That works, but it only clears the GPU memory on one of the GPUs.
Minimal example:
from vllm import LLM
import ray
# Instantiate first model
llm = LLM("mhenrichsen/danskgpt-tiny", tensor_parallel_size=2)
# Destroy Ray cluster; this only clears the GPU memory on one of the GPUs
# Note that adding any combination of `torch.cuda.empty_cache()`,
# `gc.collect()` or `destroy_model_parallel()` doesn't help here
ray.shutdown()
# Instantiate second model; this now causes OOM errors
llm = LLM("mhenrichsen/danskgpt-tiny-chat", tensor_parallel_size=2)
This is a known Ray issue, where both the solution mentioned in that issue and the official Ray documentation suggest including max_calls=1
in the ray.remote
call, which supposedly fixes the problem. In vLLM, those calls live in the vllm.executor.ray_gpu_executor module and the vllm.engine.async_llm_engine module. However, in those places the decorator wraps a class (an "actor" in Ray), which does not allow the max_calls
argument, so I am not sure this solution applies here.
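For context, here is a minimal sketch of the distinction being described, assuming a toy single-GPU workload; run_once and Worker are made-up names for illustration and are not vLLM's actual executor code. max_calls is only accepted on remote task functions, where Ray recycles the worker process after each call and so frees its GPU memory; for actors the closest equivalent is killing the actor explicitly.

import ray

ray.init()

# max_calls=1 is only valid for task functions: Ray tears down the worker
# process after each call, which releases any GPU memory the task held.
@ray.remote(num_gpus=1, max_calls=1)
def run_once(prompt: str) -> str:
    # ... load a model, run inference, return the result ...
    return prompt.upper()

print(ray.get(run_once.remote("hello")))

# For actor classes (which is how vLLM wraps its Ray workers), max_calls is
# not accepted. The closest equivalent is to kill the actor so its worker
# process exits and frees its GPU memory.
@ray.remote(num_gpus=1)
class Worker:
    def ping(self) -> str:
        return "pong"

w = Worker.remote()
print(ray.get(w.ping.remote()))
ray.kill(w)  # terminates the actor process, releasing its GPU

ray.shutdown()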
7 Answers
c2e8gylq1#
Actually, I don't think this is a Ray issue; it comes from CUDA itself (i.e., CUDA's cache is not being cleaned up). You can try calling this before the new initialization.
When you call ray.shutdown, it kills the Ray worker processes that use the GPUs and cleans them up. But in vLLM the driver (your Python script) is using the first GPU, and it is not cleaned up automatically unless you call these two APIs.
Hi @rkooo567. I just tried your solution; unfortunately, it still does not clear the GPU memory on the first GPU.
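The exact calls referred to above are not included in this copy of the thread. As an assumption, the sketch below combines the driver-side cleanup steps already mentioned in the report (destroy_model_parallel(), gc.collect(), torch.cuda.empty_cache()) with ray.shutdown(); the destroy_model_parallel import path is a guess for vLLM 0.4.x and has moved in later versions.

import gc
import ray
import torch
from vllm import LLM
# Import path is version-dependent; this is roughly where it lived in vLLM 0.4.x.
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

llm = LLM("mhenrichsen/danskgpt-tiny", tensor_parallel_size=2)

# Driver-side cleanup before shutting down Ray and re-instantiating.
destroy_model_parallel()   # tear down vLLM's tensor-parallel process groups
del llm                    # drop the driver-side reference to the model
gc.collect()               # let Python actually release the objects
torch.cuda.empty_cache()   # return cached CUDA blocks held by the driver process
ray.shutdown()             # stop the Ray workers holding the other GPUs

llm = LLM("mhenrichsen/danskgpt-tiny-chat", tensor_parallel_size=2)

Note that the original report already states that adding these calls did not resolve the leak on the first GPU, so this is only a restatement of what has been tried, not a confirmed fix.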
x33g5p2x2#
I see. Let me try to reproduce it as soon as possible. One more thing: can you try running it with tensor_parallel_size=1 and see whether the cleanup happens (with and without emptying the cache)?
7ajki6be3#
One more thing: can you try running it with tensor_parallel_size=1 and see whether the cleanup happens?
In that case everything works fine: the GPU memory is fully reset, and re-initializing the LLM instance is not a problem.
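For reference, a sketch of the tensor_parallel_size=1 case that is reported to work; with a single GPU, no Ray workers are involved in this setup, so only the driver process holds GPU memory.

import gc
import torch
from vllm import LLM

llm = LLM("mhenrichsen/danskgpt-tiny", tensor_parallel_size=1)

del llm                    # release the only reference to the model
gc.collect()
torch.cuda.empty_cache()

# Reported to work: with a single GPU the memory is fully reset and a
# second model can be created without OOM errors.
llm = LLM("mhenrichsen/danskgpt-tiny-chat", tensor_parallel_size=1)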
1u4esq0p4#
Any news on this?
ldxq2e6h5#
Sorry, I haven't had time to look into this yet. I will probably get to it in a few weeks.
643ylb086#
The same thing happens in multi-GPU deployments as well.
tuwxkamq7#
Actually, I don't think this is a Ray issue; it comes from CUDA itself (i.e., CUDA's cache is not being cleaned up). You can try calling this before the new initialization.
When you call ray.shutdown, it kills the Ray worker processes that use the GPUs and cleans them up. But in vLLM the driver (your Python script) is using the first GPU, and it is not cleaned up automatically unless you call these two APIs.