Current environment
PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.31
Python version: 3.11.3 (main, Apr 19 2024, 17:22:27) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-177-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 10.1.243
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A40
GPU 1: NVIDIA A40
GPU 2: NVIDIA A40
Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 40 bits physical, 48 bits virtual
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 64
NUMA node(s): 1
Vendor ID: AuthenticAMD
CPU family: 25
Model: 1
Model name: AMD EPYC-Milan Processor
Stepping: 1
CPU MHz: 2994.374
BogoMIPS: 5988.74
Virtualization: AMD-V
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 2 MiB
L1i cache: 2 MiB
L2 cache: 32 MiB
L3 cache: 2 GiB
NUMA node0 CPU(s): 0-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nrip_save umip vaes vpclmulqdq rdpid arch_capabilities
Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.18.1
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
        GPU0    GPU1    GPU2    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PHB     PHB     0-63            0               N/A
GPU1    PHB      X      PHB     0-63            0               N/A
GPU2    PHB     PHB      X      0-63            0               N/A
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
🐛 Describe the bug
I am using vLLM with several models in the same Python session (one model at a time), running on a multi-GPU setup. After each model run I need to clear the GPU memory to make room for the next model, which means (among other things) shutting down the Ray cluster via ray.shutdown().
That works, but it only clears the GPU memory on one of the GPUs.
Minimal example:
from vllm import LLM
import ray
# Instantiate first model
llm = LLM("mhenrichsen/danskgpt-tiny", tensor_parallel_size=2)
# Destroy Ray cluster; this only clears the GPU memory on one of the GPUs
# Note that adding any combination of `torch.cuda.empty_cache()`,
# `gc.collect()` or `destroy_model_parallel()` doesn't help here
ray.shutdown()
# Instantiate second model; this now causes OOM errors
llm = LLM("mhenrichsen/danskgpt-tiny-chat", tensor_parallel_size=2)
This is a known Ray issue, where both the solution mentioned in that issue and the official Ray documentation suggest including max_calls=1
in the ray.remote
call, which supposedly fixes the problem. In vLLM, those calls live in the vllm.executor.ray_gpu_executor module and the vllm.engine.async_llm_engine module. However, in those places the decorator wraps a class (an "actor" in Ray), which does not allow the max_calls
argument, so I am not sure this solution applies here.
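For context, here is a minimal sketch of the distinction being described, assuming a toy single-GPU workload; run_once and Worker are made-up names for illustration and are not vLLM's actual executor code. max_calls is only accepted on remote task functions, where Ray recycles the worker process after each call and so frees its GPU memory; for actors the closest equivalent is killing the actor explicitly.

import ray

ray.init()

# max_calls=1 is only valid for task functions: Ray tears down the worker
# process after each call, which releases any GPU memory the task held.
@ray.remote(num_gpus=1, max_calls=1)
def run_once(prompt: str) -> str:
    # ... load a model, run inference, return the result ...
    return prompt.upper()

print(ray.get(run_once.remote("hello")))

# For actor classes (which is how vLLM wraps its Ray workers), max_calls is
# not accepted. The closest equivalent is to kill the actor so its worker
# process exits and frees its GPU memory.
@ray.remote(num_gpus=1)
class Worker:
    def ping(self) -> str:
        return "pong"

w = Worker.remote()
print(ray.get(w.ping.remote()))
ray.kill(w)  # terminates the actor process, releasing its GPU

ray.shutdown()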
7 Answers
c2e8gylq1#
Actually, I don't think this is a Ray issue; it comes from CUDA itself (i.e., CUDA's cache is not being cleaned up). You can try calling this before the new initialization.
When you call ray.shutdown, it kills the Ray worker processes that use the GPUs and cleans them up. But in vLLM the driver (your Python script) is using the first GPU, and it is not cleaned up automatically unless you call these two APIs.
Hi @rkooo567. I just tried your solution; unfortunately, it still does not clear the GPU memory on the first GPU.
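The exact calls referred to above are not included in this copy of the thread. As an assumption, the sketch below combines the driver-side cleanup steps already mentioned in the report (destroy_model_parallel(), gc.collect(), torch.cuda.empty_cache()) with ray.shutdown(); the destroy_model_parallel import path is a guess for vLLM 0.4.x and has moved in later versions.

import gc
import ray
import torch
from vllm import LLM
# Import path is version-dependent; this is roughly where it lived in vLLM 0.4.x.
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

llm = LLM("mhenrichsen/danskgpt-tiny", tensor_parallel_size=2)

# Driver-side cleanup before shutting down Ray and re-instantiating.
destroy_model_parallel()   # tear down vLLM's tensor-parallel process groups
del llm                    # drop the driver-side reference to the model
gc.collect()               # let Python actually release the objects
torch.cuda.empty_cache()   # return cached CUDA blocks held by the driver process
ray.shutdown()             # stop the Ray workers holding the other GPUs

llm = LLM("mhenrichsen/danskgpt-tiny-chat", tensor_parallel_size=2)

Note that the original report already states that adding these calls did not resolve the leak on the first GPU, so this is only a restatement of what has been tried, not a confirmed fix.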
x33g5p2x2#
I see. Let me try to reproduce it as soon as possible. One more thing: can you try running it with tensor_parallel_size=1 and see whether the cleanup happens (with and without emptying the cache)?
7ajki6be3#
One more thing: can you try running it with tensor_parallel_size=1 and see whether the cleanup happens?
In that case everything works fine: the GPU memory is fully reset, and re-initializing the LLM instance is not a problem.
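For reference, a sketch of the tensor_parallel_size=1 case that is reported to work; with a single GPU, no Ray workers are involved in this setup, so only the driver process holds GPU memory.

import gc
import torch
from vllm import LLM

llm = LLM("mhenrichsen/danskgpt-tiny", tensor_parallel_size=1)

del llm                    # release the only reference to the model
gc.collect()
torch.cuda.empty_cache()

# Reported to work: with a single GPU the memory is fully reset and a
# second model can be created without OOM errors.
llm = LLM("mhenrichsen/danskgpt-tiny-chat", tensor_parallel_size=1)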
1u4esq0p4#
Any news on this?
ldxq2e6h5#
Sorry, I haven't had time to look into this yet. I will probably get to it in a few weeks.
643ylb086#
The same thing happens in multi-GPU deployments as well.
tuwxkamq7#
Actually, I don't think this is a Ray issue; it comes from CUDA itself (i.e., CUDA's cache is not being cleaned up). You can try calling this before the new initialization.
When you call ray.shutdown, it kills the Ray worker processes that use the GPUs and cleans them up. But in vLLM the driver (your Python script) is using the first GPU, and it is not cleaned up automatically unless you call these two APIs.