I am practicing multi-node DDP with PyTorch in a Docker container. My program runs fine when I launch it with
torchrun \
--nnodes=1 \
--node_rank=0 \
--nproc_per_node=gpu \
--rdzv_id=123 \
--rdzv-backend=c10d \
--rdzv-endpoint=localhost:10000 \
test_code.py
However, when I run
torchrun \
--nnodes=1 \
--node_rank=0 \
--nproc_per_node=gpu \
--rdzv_id=1024 \
--rdzv-backend=c10d \
--rdzv-endpoint=192.168.9.225:10000 \
07-5-pytorch-ddp-multiple-nodes.py
it hangs and then fails with the following error:
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[E socket.cpp:860] [c10d] The client socket has timed out after 60s while trying to connect to (192.168.9.225, 10000).
Traceback (most recent call last):
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 155, in _create_tcp_store
store = TCPStore(
TimeoutError: The client socket has timed out after 60s while trying to connect to (192.168.9.225, 10000).
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/bin/torchrun", line 33, in <module>
sys.exit(load_entry_point('torch==2.0.1', 'console_scripts', 'torchrun')())
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
run(args)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 223, in launch_agent
rdzv_handler=rdzv_registry.get_rendezvous_handler(rdzv_parameters),
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 65, in get_rendezvous_handler
return handler_registry.create_handler(params)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/api.py", line 257, in create_handler
handler = creator(params)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36, in _create_c10d_handler
backend, store = create_backend(params)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 250, in create_backend
store = _create_tcp_store(params)
File "/opt/conda/lib/python3.10/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 175, in _create_tcp_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
My Docker container was created with
docker run -it --gpus=all --ipc=host --network=host --cap-add=NET_ADMIN --name=pytorch-2.0-examples -v=pytorch-2.0-examples:/pytorch-2.0-examples pytorch/pytorch /bin/bash
Ping works, and the firewall was disabled during the test.
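How do I run PyTorch torchrun with an IP address other than 127.0.0.1?
When --rdzv-endpoint is localhost or 127.0.0.1, my program runs fine, but it fails when I use one of my machine's other IP addresses, which start with 192 or 172.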
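A successful ping only shows that the host answers ICMP; it does not show that anything is accepting TCP connections on port 10000. The sketch below is one way to probe the rendezvous endpoint at the TCP level using only the Python standard library. The helper name tcp_port_open is hypothetical and is not part of PyTorch.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling tcp_port_open("192.168.9.225", 10000) while torchrun is waiting would indicate whether any process is listening on that address. A False result while ping succeeds suggests that no store was started on that IP, rather than a routing problem.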
1 Answer
It worked once I added the peer servers' IP addresses and hostnames to the /etc/hosts file (on Ubuntu).
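A possible explanation, offered as an assumption rather than something confirmed in this thread: with the c10d backend, torchrun appears to decide whether the local node should host the rendezvous store by checking whether the endpoint host matches the local machine's hostname or an address that hostname resolves to. If the hostname does not resolve to 192.168.9.225, the node would act only as a client and time out, which matches the error above. The sketch below checks what a hostname resolves to using only the Python standard library. The names addresses_for_hostname and endpoint_is_local are hypothetical helpers, not PyTorch APIs.

```python
import socket
from typing import Optional, Set


def addresses_for_hostname(hostname: str) -> Set[str]:
    """Return the IP addresses `hostname` resolves to, or an empty set on failure."""
    try:
        infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return set()
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return {info[4][0] for info in infos}


def endpoint_is_local(endpoint_ip: str, hostname: Optional[str] = None) -> bool:
    """Return True if `endpoint_ip` is among the addresses the hostname resolves to.

    When `hostname` is omitted, the local machine's hostname is used.
    """
    if hostname is None:
        hostname = socket.gethostname()
    return endpoint_ip in addresses_for_hostname(hostname)
```

Running endpoint_is_local("192.168.9.225") inside the container before and after editing /etc/hosts should show whether the added entry changes what the local hostname resolves to.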