Problem running on WSL2, possibly NCCL-related? (line misc/cudawrap.cc:179 NCCL WARN Failed to find CUDA library libcuda.so (NCCL_CUDA_PATH='') : libcuda.so: cannot open shared object file: No such file or directory)
- The problem occurs with --tensor-parallel-size 2; vLLM essentially only works with 1 GPU.
Most of the error output:
INFO 04-27 01:36:40 utils.py:608] Found nccl from library /home/ch/.config/vllm/nccl/cu12/libnccl.so.2.18.1
(RayWorkerWrapper pid=212395) INFO 04-27 01:36:40 utils.py:608] Found nccl from library /home/ch/.config/vllm/nccl/cu12/libnccl.so.2.18.1
WARNING 04-27 01:36:40 utils.py:414] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
(RayWorkerWrapper pid=212395) WARNING 04-27 01:36:40 utils.py:414] Using 'pin_memory=False' as WSL is detected. This may slow down the performance.
INFO 04-27 01:36:40 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
INFO 04-27 01:36:40 selector.py:33] Using XFormers backend.
(RayWorkerWrapper pid=212395) INFO 04-27 01:36:40 selector.py:77] Cannot use FlashAttention backend because the flash_attn package is not found. Please install it for better performance.
(RayWorkerWrapper pid=212395) INFO 04-27 01:36:40 selector.py:33] Using XFormers backend.
INFO 04-27 01:36:41 pynccl_utils.py:43] vLLM is using nccl==2.18.1
00127-desktop:212030:212030 [0] NCCL INFO Bootstrap : Using eth0:172.18.78.11<0>
00127-desktop:212030:212030 [0] NCCL INFO NET/Plugin : Plugin load (libnccl-net.so) returned 2 : libnccl-net.so: cannot open shared object file: No such file or directory
00127-desktop:212030:212030 [0] NCCL INFO NET/Plugin : No plugin found, using internal implementation
00127-desktop:212030:212030 [0] misc/cudawrap.cc:179 NCCL WARN Failed to find CUDA library libcuda.so (NCCL_CUDA_PATH='') : libcuda.so: cannot open shared object file: No such file or directory
NCCL version 2.18.1+cuda12.0
(RayWorkerWrapper pid=212395) INFO 04-27 01:36:41 pynccl_utils.py:43] vLLM is using nccl==2.18.1
00127-desktop:212030:212030 [0] NCCL INFO NET/IB : No device found.
00127-desktop:212030:212030 [0] NCCL INFO NET/Socket : Using [0]eth0:172.18.78.11<0>
00127-desktop:212030:212030 [0] NCCL INFO Using network Socket
00127-desktop:212030:212030 [0] NCCL INFO Channel 00/02 : 0 1
00127-desktop:212030:212030 [0] NCCL INFO Channel 01/02 : 0 1
00127-desktop:212030:212030 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1
00127-desktop:212030:212030 [0] NCCL INFO P2P Chunksize set to 131072
00127-desktop:212030:212030 [0] NCCL INFO Channel 00 : 0[1000] -> 1[5000] via SHM/direct/direct
00127-desktop:212030:212030 [0] NCCL INFO Channel 01 : 0[1000] -> 1[5000] via SHM/direct/direct
00127-desktop:212030:212030 [0] transport.cc:154 NCCL WARN Cuda failure 'invalid argument'
00127-desktop:212030:212030 [0] NCCL INFO init.cc:1032 -> 1
00127-desktop:212030:212030 [0] NCCL INFO init.cc:1309 -> 1
00127-desktop:212030:212030 [0] NCCL INFO init.cc:1549 -> 1
00127-desktop:212030:212030 [0] NCCL INFO init.cc:1587 -> 1
ERROR 04-27 01:36:41 worker_base.py:157] Error executing method init_device. This might cause deadlock in distributed execution.
ERROR 04-27 01:36:41 worker_base.py:157] Traceback (most recent call last):
ERROR 04-27 01:36:41 worker_base.py:157] File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
ERROR 04-27 01:36:41 worker_base.py:157] return executor(*args, **kwargs)
ERROR 04-27 01:36:41 worker_base.py:157] File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 110, in init_device
ERROR 04-27 01:36:41 worker_base.py:157] init_worker_distributed_environment(self.parallel_config, self.rank,
ERROR 04-27 01:36:41 worker_base.py:157] File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 303, in init_worker_distributed_environment
ERROR 04-27 01:36:41 worker_base.py:157] pynccl_utils.init_process_group()
ERROR 04-27 01:36:41 worker_base.py:157] File "/home/ch/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
ERROR 04-27 01:36:41 worker_base.py:157] comm = NCCLCommunicator(group=group)
ERROR 04-27 01:36:41 worker_base.py:157] File "/home/ch/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 256, in __init__
ERROR 04-27 01:36:41 worker_base.py:157] NCCL_CHECK(
ERROR 04-27 01:36:41 worker_base.py:157] File "/home/ch/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 72, in NCCL_CHECK
ERROR 04-27 01:36:41 worker_base.py:157] raise RuntimeError(f"NCCL error: {error_str}")
ERROR 04-27 01:36:41 worker_base.py:157] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157] Error executing method init_device. This might cause deadlock in distributed execution.
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157] Traceback (most recent call last):
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157] File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157] return executor(*args, **kwargs)
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157] File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 110, in init_device
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157] init_worker_distributed_environment(self.parallel_config, self.rank,
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157] File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 303, in init_worker_distributed_environment
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157] pynccl_utils.init_process_group()
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157] File "/home/ch/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157] comm = NCCLCommunicator(group=group)
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157] File "/home/ch/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 256, in __init__
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157] NCCL_CHECK(
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157] File "/home/ch/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 72, in NCCL_CHECK
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157] raise RuntimeError(f"NCCL error: {error_str}")
(RayWorkerWrapper pid=212395) ERROR 04-27 01:36:41 worker_base.py:157] RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/ch/.local/lib/python3.10/site-packages/vllm/entrypoints/openai/api_server.py", line 159, in <module>
engine = AsyncLLMEngine.from_engine_args(
File "/home/ch/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 361, in from_engine_args
engine = cls(
File "/home/ch/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 319, in __init__
self.engine = self._init_engine(*args, **kwargs)
File "/home/ch/.local/lib/python3.10/site-packages/vllm/engine/async_llm_engine.py", line 437, in _init_engine
return engine_class(*args, **kwargs)
File "/home/ch/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 148, in __init__
self.model_executor = executor_class(
File "/home/ch/.local/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 382, in __init__
super().__init__(*args, **kwargs)
File "/home/ch/.local/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
self._init_executor()
File "/home/ch/.local/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 45, in _init_executor
self._init_workers_ray(placement_group)
File "/home/ch/.local/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 181, in _init_workers_ray
self._run_workers("init_device")
File "/home/ch/.local/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 318, in _run_workers
driver_worker_output = self.driver_worker.execute_method(
File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 158, in execute_method
raise e
File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
return executor(*args, **kwargs)
File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 110, in init_device
init_worker_distributed_environment(self.parallel_config, self.rank,
File "/home/ch/.local/lib/python3.10/site-packages/vllm/worker/worker.py", line 303, in init_worker_distributed_environment
pynccl_utils.init_process_group()
File "/home/ch/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl_utils.py", line 44, in init_process_group
comm = NCCLCommunicator(group=group)
File "/home/ch/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 256, in __init__
NCCL_CHECK(
File "/home/ch/.local/lib/python3.10/site-packages/vllm/distributed/device_communicators/pynccl.py", line 72, in NCCL_CHECK
raise RuntimeError(f"NCCL error: {error_str}")
RuntimeError: NCCL error: unhandled cuda error (run with NCCL_DEBUG=INFO for details)
*** SIGSEGV received at time=1714207001 on cpu 0 ***
PC: @ 0x7f633b07e905 (unknown) ncclProxyService()
@ 0x7f6520405520 (unknown) (unknown)
[2024-04-27 01:36:41,979 E 212030 212471] logging.cc:361: *** SIGSEGV received at time=1714207001 on cpu 0 ***
[2024-04-27 01:36:41,979 E 212030 212471] logging.cc:361: PC: @ 0x7f633b07e905 (unknown) ncclProxyService()
[2024-04-27 01:36:41,979 E 212030 212471] logging.cc:361: @ 0x7f6520405520 (unknown) (unknown)
Fatal Python error: Segmentation fault
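The `Failed to find CUDA library libcuda.so` warning in the log above means the dynamic loader could not open the library. Below is a minimal sketch for checking this outside vLLM. It assumes that Python's `ctypes` uses the same loader search path as NCCL does in the failing process. The helper name `can_load_shared_library` is made up for illustration.

```python
import ctypes


def can_load_shared_library(name: str) -> bool:
    """Return True if the dynamic loader can open `name`, False otherwise."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Same class of failure as "cannot open shared object file".
        return False
    return True


if __name__ == "__main__":
    # Check both the unversioned name NCCL asks for and the versioned one.
    for lib in ("libcuda.so", "libcuda.so.1"):
        status = "loadable" if can_load_shared_library(lib) else "NOT loadable"
        print(f"{lib}: {status}")
```

If `libcuda.so` is reported as not loadable, the library directory is probably missing from the loader search path. That would be consistent with the NCCL warning.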
2 answers
wgeznvg7 #1
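Problem running on WSL2, possibly NCCL-related? (misc/cudawrap.cc:179 NCCL WARN Failed to find CUDA library libcuda.so (NCCL_CUDA_PATH='') : libcuda.so: cannot open shared object file: No such file or directory)
It looks like NCCL cannot find libcuda.so. Could you try following the guide and manually setting NCCL_CUDA_PATH to point to its path?
nuypyhwy #2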
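Problem running on WSL2, possibly NCCL-related? (misc/cudawrap.cc:179 NCCL WARN Failed to find CUDA library libcuda.so (NCCL_CUDA_PATH='') : libcuda.so: cannot open shared object file: No such file or directory)
It looks like NCCL cannot find libcuda.so. Could you try following the guide and manually setting NCCL_CUDA_PATH to point to its path?
That makes sense. I'm willing to try this and take some steps to fix it.
But I believe I'm running a fairly straightforward, stock installation on WSL2.
If this is not user error, I'd like to know whether WSL2 is discouraged or actually unsupported (that would help me gauge how much effort to invest).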
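The suggestion to point NCCL_CUDA_PATH at libcuda.so could be sketched roughly as follows. This is only an illustration under several assumptions. The candidate directories are guesses (`/usr/lib/wsl/lib` is where WSL2 normally exposes the Windows NVIDIA driver libraries). The helper name `find_libcuda_dir` is hypothetical. The exact format NCCL expects for NCCL_CUDA_PATH (directory or full path, with or without a trailing slash) is not confirmed in this thread and should be checked against the NCCL version in use.

```python
from pathlib import Path
from typing import Iterable, Optional

# Assumed candidate directories; adjust for your system.
CANDIDATE_DIRS = (
    "/usr/lib/wsl/lib",
    "/usr/lib/x86_64-linux-gnu",
    "/usr/local/cuda/lib64",
)


def find_libcuda_dir(candidates: Iterable[str] = CANDIDATE_DIRS) -> Optional[str]:
    """Return the first directory that contains a file named libcuda.so."""
    for directory in candidates:
        if (Path(directory) / "libcuda.so").exists():
            return directory
    return None


if __name__ == "__main__":
    found = find_libcuda_dir()
    if found is None:
        print("libcuda.so not found in the candidate directories")
    else:
        # Assumption: NCCL_CUDA_PATH takes the directory. Verify the expected
        # format for your NCCL version before relying on this.
        print(f"export NCCL_CUDA_PATH={found}")
```

The variable has to be set in the shell that launches vLLM so that the Ray workers inherit it. For that reason the sketch prints an export line rather than modifying its own environment.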