PyTorch error: Some NCCL operations have failed or timed out

ybzsozfc | asked on 2023-01-09

I get the following error when running distributed training on 4 A6000 GPUs:

[E ProcessGroupNCCL.cpp:630] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1803710 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:390] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(OpType=BROADCAST, Timeout(ms)=1800000) ran for 1804406 milliseconds before timing out.

[E ProcessGroupNCCL.cpp:390] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.

I am using the standard NVIDIA PyTorch Docker image. Interestingly, training works fine on small datasets, but with larger datasets I get this error, so I can confirm the training code itself is correct and does work.
There is no actual runtime error or any other information anywhere that points to the real cause.

oxcyiej7 1#

The following two things fixed the problem:

  • Increase the default SHM (shared memory) for CUDA to 10g (I think 1g would have been enough as well). You can do this by passing --shm-size=10g to the docker run command. I also passed --ulimit memlock=-1; see the sketch after this list.
  • export NCCL_P2P_LEVEL=NVL
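A minimal sketch of such a docker run invocation, assuming the NGC PyTorch image; the image tag nvcr.io/nvidia/pytorch:22.12-py3 and the trailing command are placeholders, not values from the original post:

# Start the container with a larger /dev/shm and unlimited locked memory
# (image tag and command are placeholders).
docker run --gpus all \
    --shm-size=10g \
    --ulimit memlock=-1 \
    -it nvcr.io/nvidia/pytorch:22.12-py3 bash
# Inside the container, before launching training:
#   export NCCL_P2P_LEVEL=NVL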
Debugging tips

To check the current SHM size:

df -h
# see the row for shm

To see NCCL debug messages:

export NCCL_DEBUG=INFO
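For example, the debug variables can be set when launching the job. A minimal sketch, where train.py and --nproc_per_node=4 are placeholders for your own script and launcher arguments; NCCL_DEBUG_FILE is optional and writes one log per process (%h = hostname, %p = pid):

# Run training with NCCL debug logging enabled (script and launcher args are placeholders).
NCCL_DEBUG=INFO \
NCCL_DEBUG_FILE=/tmp/nccl_%h_%p.log \
python -m torch.distributed.launch --nproc_per_node=4 train.py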

To run a P2P bandwidth test over the GPU-to-GPU communication links:

cd /usr/local/cuda/samples/1_Utilities/p2pBandwidthLatencyTest
sudo make
./p2pBandwidthLatencyTest

For a 4×A6000 GPU box, this prints a matrix showing the bandwidth between every pair of GPUs; with P2P enabled, the bandwidth should be high.

46qrfjad 2#

https://pytorch.org/docs/stable/distributed.html#torch.distributed.init_process_group
Set the timeout argument of torch.distributed.init_process_group(); the default is 30 minutes:

torch.distributed.init_process_group(backend, init_method=None, timeout=datetime.timedelta(seconds=1800), world_size=-1, rank=-1, store=None, group_name='', pg_options=None)

vql8enpb 3#

For me, the problem was the torchrun command in PyTorch 1.10.1. I simply had to switch to the python -m torch.distributed.launch command and everything worked. I spent a lot of time on StackOverflow and the PyTorch forums, but nobody mentioned this solution, so I am sharing it to save people time; see the sketch below.
torchrun seems to work fine in PyTorch 1.11 and later.
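For illustration, a minimal sketch of the switch; train.py and --nproc_per_node=4 are placeholders for your own script and GPU count:

# Instead of launching with torchrun (problematic here on PyTorch 1.10.1):
#   torchrun --nproc_per_node=4 train.py
# launch with the older module-based launcher:
python -m torch.distributed.launch --nproc_per_node=4 train.py
# Note: torch.distributed.launch passes a --local_rank argument to the script
# unless --use_env is given, so the script must accept that argument.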
