[Bug]: Error when running distributed inference with vLLM + Ray

vfh0ocws · asked 2 months ago

I ran into a problem while Gloo was setting up the full-mesh connection and have not found a solution. The script is as follows:

from vllm import LLM
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
# engine construction (arguments as shown in the traceback below)
llm = LLM(model="/mnt/d/llm/qwen/qwen1.5_0.5b",
          trust_remote_code=True,
          gpu_memory_utilization=0.4,
          enforce_eager=True,
          tensor_parallel_size=2,
          swap_space=1)
outputs = llm.generate(prompts)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

The error message is as follows:

[rank0]: Traceback (most recent call last):
[rank0]: File "/data/vllm_test.py", line 13, in <module>
[rank0]: llm = LLM(model="/mnt/d/llm/qwen/qwen1.5_0.5b", trust_remote_code=True, gpu_memory_utilization=0.4,enforce_eager=True,tensor_parallel_size=2,swap_space=1)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 144, in __init__
[rank0]: self.llm_engine = LLMEngine.from_engine_args(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 363, in from_engine_args
[rank0]: engine = cls(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in __init__
[rank0]: super().__init__(*args, **kwargs)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/executor_base.py", line 41, in __init__
[rank0]: self._init_executor()
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 40, in _init_executor
[rank0]: self._init_workers_ray(placement_group)
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 171, in _init_workers_ray
[rank0]: self._run_workers("init_device")
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/executor/ray_gpu_executor.py", line 246, in _run_workers
[rank0]: driver_worker_output = self.driver_worker.execute_method(
[rank0]: File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/vllm/worker/worker_base.py", line 149, in execute_method
[rank0]: raise e
[rank0]: File "/home/jky/miniconda3/envs/ray

5us2dqdw1#

I ran into the same problem. I solved it by setting these variables:

os.environ['GLOO_SOCKET_IFNAME'] = 'ib0'
os.environ['TP_SOCKET_IFNAME'] = 'ib0'

In addition, I had to remove these variables from the environment:

http_proxy
https_proxy
ftp_proxy

ray==2.24.0
vllm==0.4.3
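
For convenience, here is a minimal sketch that applies both steps in one place before vLLM/Ray is imported. The interface name 'ib0' is only an example; use the NIC that actually carries traffic between your nodes (for instance 'eth0'):

import os

# Bind Gloo (and TensorPipe) to the NIC used for inter-node traffic
os.environ['GLOO_SOCKET_IFNAME'] = 'ib0'   # adjust to your interface name
os.environ['TP_SOCKET_IFNAME'] = 'ib0'

# Make sure collective traffic is not routed through a proxy
for var in ('http_proxy', 'https_proxy', 'ftp_proxy'):
    os.environ.pop(var, None)

from vllm import LLM  # import vLLM only after the environment has been adjusted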


e4eetjau2#

@thies1006, have you tried https://docs.vllm.ai/en/latest/getting_started/debugging.html, especially the sanity check script? I think it should solve your problem.
I believe the problem is caused by GLOO_SOCKET_IFNAME.
TP_SOCKET_IFNAME is for https://github.com/pytorch/tensorpipe, and Gloo should not be using http/https.


lsmepo6l3#

(Quoting the workaround from #1 above: set GLOO_SOCKET_IFNAME and TP_SOCKET_IFNAME and remove the http_proxy/https_proxy/ftp_proxy variables; ray==2.24.0, vllm==0.4.3.)

Thank you very much for the workaround. After trying it I still get the same error. These are the two lines I added; I am not sure whether they are correct:

os.environ['GLOO_SOCKET_IFNAME'] = 'eth0'
os.environ['TP_SOCKET_IFNAME'] = 'eth0'
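
One thing worth double-checking is whether 'eth0' is really the interface that owns the IP address the nodes use to reach each other (192.168.41.79 in the tracebacks). A quick, standard-library way to list the interfaces on Linux is sketched below; pick the one whose address matches that IP:

import socket

# List every network interface the kernel knows about.
# GLOO_SOCKET_IFNAME should name the one bound to the inter-node IP.
for index, name in socket.if_nameindex():
    print(index, name)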


vohkndzv4#

(Quoting the suggestion from #2 above about the vLLM debugging guide and its sanity-check script.)

Thank you very much for the suggestion. I tested GPU communication with the script and it raised an error, so that is presumably what is blocking the experiment. Do you have any idea how I could go about fixing it?

import torch
import torch.distributed as dist

# torchrun supplies rank/world size; the process group is created over NCCL
dist.init_process_group(backend="nccl")
local_rank = dist.get_rank() % torch.cuda.device_count()
data = torch.FloatTensor([1,] * 128).to(f"cuda:{local_rank}")
# after the all-reduce every element should equal the world size
dist.all_reduce(data, op=dist.ReduceOp.SUM)
torch.cuda.synchronize()
value = data.mean().item()
assert value == dist.get_world_size()
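
For context, the script has to be started with torchrun on every node, with the head node as the rendezvous endpoint. Based on the command that appears later in this thread, the invocation that produced the error below was presumably along these lines (the --rdzv_backend=c10d flag is inferred from the c10d rendezvous frames in the traceback):

torchrun --nnodes 2 --nproc-per-node 1 --rdzv_backend=c10d --rdzv_endpoint=192.168.41.79:29502 test.py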

The error is as follows:

[E socket.cpp:957] [c10d] The client socket has timed out after 60s while trying to connect to (192.168.41.79, 29502).
Traceback (most recent call last):
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 157, in _create_tcp_store
    store = TCPStore(
torch.distributed.DistNetworkError: The client socket has timed out after 60s while trying to connect to (192.168.41.79, 29502).

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jky/miniconda3/envs/ray/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 235, in launch_agent
    rdzv_handler=rdzv_registry.get_rendezvous_handler(rdzv_parameters),
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 66, in get_rendezvous_handler
    return handler_registry.create_handler(params)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/api.py", line 263, in create_handler
    handler = creator(params)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36, in _create_c10d_handler
    backend, store = create_backend(params)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 255, in create_backend
    store = _create_tcp_store(params)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 181, in _create_tcp_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
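
The timeout above means the worker node could not open a TCP connection to the c10d store at 192.168.41.79:29502, which usually points to a firewall rule, a wrong IP, or traffic leaving through the wrong interface rather than a GPU problem. A minimal probe (IP and port taken from the error message; run it from the worker node while the head node's torchrun is still waiting) could look like this:

import socket

# Try to reach the rendezvous/TCPStore endpoint from the worker node.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.settimeout(5)
try:
    sock.connect(("192.168.41.79", 29502))
    print("TCP connection to the store succeeded")
except OSError as exc:
    print("cannot reach the store:", exc)
finally:
    sock.close()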

ru9i0ody5#

Hi @JKYtydt,
here are the steps I run on two nodes:

  • First make sure Ray is not running (run ray stop on all nodes)
  • export GLOO_SOCKET_IFNAME='<device_name>' (not sure, but I think setting this only on the head node is enough)
  • Start Ray on all nodes, e.g. ray start --head --node-ip-address <ip_address head> on the head node and ray start --address='<ip_address head>' on the other nodes
  • Start vLLM: python -m vllm.entrypoints.openai.api_server --model <model_name> --tensor-parallel-size <world_size>

9udxz4iz6#

Hello, I have now tried every workaround I could find, but I still cannot resolve this problem. I don't know whether you have any other suggestions, or what additional information I could provide to help track it down. Thank you very much for your help.

I am running this on two Windows machines, each with an Ubuntu environment (presumably under WSL). The two nodes can ping each other. Based on your first reply, the root cause seems to be that the GPUs on the two nodes cannot communicate with each other. I hope this information is helpful.


kzmpq1sx7#

Thank you very much for your reply, but I still have not been able to solve the problem.


twh00eeo8#

While testing with the same test.py script as above, I found another problem: if the --rdzv_backend=c10d argument is not set, running the following command fails with the error below:

torchrun --nnodes 2 --nproc-per-node 1 --rdzv_endpoint=192.168.41.79:29502 test.py
Traceback (most recent call last):
  File "/home/jky/miniconda3/envs/ray/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 879, in main
    run(args)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/run.py", line 870, in run
    elastic_launch(
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 254, in launch_agent
    result = agent.run()
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 733, in run
    result = self._invoke_run(role)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 870, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 705, in _initialize_workers
    self._rendezvous(worker_group)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 548, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/home/jky/miniconda3/envs/ray/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
torch.distributed.DistStoreError: Timed out after 901 seconds waiting for clients. 1/2 clients joined.
