vllm [Bug]: Ray memory leak

ars1skjm · posted 4 months ago

Current environment

PyTorch version: 2.1.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.29.2
Libc version: glibc-2.31

Python version: 3.11.3 (main, Apr 19 2024, 17:22:27) [GCC 9.4.0] (64-bit runtime)
Python platform: Linux-5.4.0-177-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 10.1.243
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A40
GPU 1: NVIDIA A40
GPU 2: NVIDIA A40

Nvidia driver version: 535.161.08
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      40 bits physical, 48 bits virtual
CPU(s):                             64
On-line CPU(s) list:                0-63
Thread(s) per core:                 1
Core(s) per socket:                 1
Socket(s):                          64
NUMA node(s):                       1
Vendor ID:                          AuthenticAMD
CPU family:                         25
Model:                              1
Model name:                         AMD EPYC-Milan Processor
Stepping:                           1
CPU MHz:                            2994.374
BogoMIPS:                           5988.74
Virtualization:                     AMD-V
Hypervisor vendor:                  KVM
Virtualization type:                full
L1d cache:                          2 MiB
L1i cache:                          2 MiB
L2 cache:                           32 MiB
L3 cache:                           2 GiB
NUMA node0 CPU(s):                  0-63
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm rep_good nopl cpuid extd_apicid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm cmp_legacy svm cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw topoext perfctr_core ssbd ibrs ibpb stibp vmmcall fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves clzero xsaveerptr wbnoinvd arat npt nrip_save umip vaes vpclmulqdq rdpid arch_capabilities

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] nvidia-nccl-cu12==2.18.1
[pip3] torch==2.1.2
[pip3] triton==2.1.0
[conda] Could not collect
ROCM Version: Could not collect
Neuron SDK Version: N/A
vLLM Version: 0.4.0.post1
vLLM Build Flags:
CUDA Archs: Not Set; ROCm: Disabled; Neuron: Disabled
GPU Topology:
GPU0	GPU1	GPU2	CPU Affinity	NUMA Affinity	GPU NUMA ID
GPU0	 X 	PHB	PHB	0-63	0		N/A
GPU1	PHB	 X 	PHB	0-63	0		N/A
GPU2	PHB	PHB	 X 	0-63	0		N/A

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

🐛 Describe the bug

I am using vLLM with several models in the same Python session (one model at a time), running on a multi-GPU setup. After each model run I need to clear the GPU memory to make room for the next model, which means (among other things) shutting down the Ray cluster via ray.shutdown(). That works, but it only clears the GPU memory on one of the GPUs.
Minimal example:

from vllm import LLM
import ray

# Instantiate first model
llm = LLM("mhenrichsen/danskgpt-tiny", tensor_parallel_size=2)

# Destroy Ray cluster; this only clears the GPU memory on one of the GPUs
# Note that adding any combination of `torch.cuda.empty_cache()`, 
# `gc.collect()` or `destroy_model_parallel()` doesn't help here
ray.shutdown()

# Instantiate second model; this now causes OOM errors
llm = LLM("mhenrichsen/danskgpt-tiny-chat", tensor_parallel_size=2)

This is a known Ray issue, where both the workaround mentioned in that issue and the official Ray documentation suggest passing max_calls=1 to the ray.remote call, which supposedly fixes the problem. In vLLM, the relevant ray.remote calls live in the vllm.executor.ray_gpu_executor and vllm.engine.async_llm_engine modules. However, in those places the decorator wraps a class (an "actor" in Ray terms), and the max_calls argument is not allowed on actors, so I am not sure this solution applies here. A sketch of the difference follows below.
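
For illustration only (run_model_once and InferenceWorker are made-up names, not vLLM's actual code), here is a minimal sketch of the Ray API difference being described: max_calls is accepted on remote functions, where it forces a fresh worker process per call and so releases GPU memory, but it is not a valid option when decorating a class.

import ray

# Workaround from the Ray docs: max_calls=1 makes Ray tear down the worker
# process after each call, releasing any GPU memory it held.
@ray.remote(num_gpus=1, max_calls=1)
def run_model_once(prompt):
    ...  # load model, run inference, return result

# vLLM's workers are Ray *actors* (ray.remote applied to a class).
# Actor options do not include max_calls, so it cannot be used here.
@ray.remote(num_gpus=1)
class InferenceWorker:
    def generate(self, prompt):
        ...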

c2e8gylq #1

Actually, I think this is not a Ray problem but comes from CUDA itself (i.e. CUDA's cache is not being cleaned up). You could try calling this before the new initialization.

When you call ray.shutdown, it kills the Ray worker processes that are using the GPUs and cleans them up. But in vLLM the driver (your Python script) is using the first GPU, and that one is not cleaned up automatically unless you call those two APIs.

Hi @rkooo567. I just tried your solution, and unfortunately it still does not clear the GPU memory on the first GPU.
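
For reference, the driver-side cleanup discussed here roughly amounts to the following sketch (llm is the engine from the minimal example above; the destroy_model_parallel import path matches vLLM 0.4.x and has moved under vllm.distributed in later releases; as reported above, this still does not free the first GPU when tensor_parallel_size > 1):

import gc
import ray
import torch
# Import path as of vLLM 0.4.x; later versions expose this under vllm.distributed.
from vllm.model_executor.parallel_utils.parallel_state import destroy_model_parallel

destroy_model_parallel()   # tear down vLLM's tensor-parallel process groups
del llm                    # drop the driver's reference to the engine
gc.collect()               # let Python release the model objects
torch.cuda.empty_cache()   # return cached CUDA blocks on the driver GPU
ray.shutdown()             # kill the Ray workers holding the other GPUs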

x33g5p2x #2

I see. Let me try to reproduce it as soon as I can. One more thing: could you try running it with tensor_parallel_size=1 and check whether the cleanup happens (with and without emptying the cache)?

7ajki6be #3

One more thing: could you try running it with tensor_parallel_size=1 and check whether the cleanup happens?
In that case everything works fine: the GPU memory is fully reset and re-initializing the LLM instance is not a problem.
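
In other words, the single-GPU variant of the minimal example reportedly runs without OOM; a sketch using the same models as above:

from vllm import LLM
import gc
import torch

# With tensor_parallel_size=1 no Ray workers are involved; after dropping the
# engine and emptying the CUDA cache, the GPU is fully freed and a second
# model loads without OOM.
llm = LLM("mhenrichsen/danskgpt-tiny", tensor_parallel_size=1)
del llm
gc.collect()
torch.cuda.empty_cache()

llm = LLM("mhenrichsen/danskgpt-tiny-chat", tensor_parallel_size=1)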

1u4esq0p #4

Any news on this?

ldxq2e6h #5

Sorry, I have not had time to look into this yet. I will probably get to it in a few weeks.

643ylb08 #6

The same thing happens in a multi-GPU deployment.

tuwxkamq #7

Actually, I think this is not a Ray problem but comes from CUDA itself (i.e. CUDA's cache is not being cleaned up). You could try calling this before the new initialization.

When you call ray.shutdown, it kills the Ray worker processes that are using the GPUs and cleans them up. But in vLLM the driver (your Python script) is using the first GPU, and that one is not cleaned up automatically unless you call those two APIs.
