I have seen several questions about this error:
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1614378083779/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
but none of them seem to solve it for me:
- https://github.com/pytorch/pytorch/issues/54550
- https://github.com/pytorch/pytorch/issues/47885
- https://github.com/pytorch/pytorch/issues/50921
- https://github.com/pytorch/pytorch/issues/54823
I have tried manually running torch.cuda.set_device(device) at the start of every script. That does not seem to work for me. I have tried different GPUs. I have tried downgrading my PyTorch and CUDA versions, in various combinations of 1.6.0, 1.7.1, 1.8.0 with CUDA 10.2, 11.0, 11.1. I am not sure what else to do. What have people done to solve this?
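For reference, here is a minimal sketch of the per-process setup I am describing (the port and function name are illustrative, not my exact code):

import os
import torch
import torch.distributed as dist

def setup_process(rank: int, world_size: int) -> None:
    # Illustrative rendezvous settings; any free port works.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    # Pin this process to its own GPU before NCCL is initialized.
    torch.cuda.set_device(rank)
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)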
Perhaps this is relevant. The more complete error message:
('jobid', 4852)
('slurm_jobid', -1)
('slurm_array_task_id', -1)
('condor_jobid', 4852)
('current_time', 'Mar25_16-27-35')
('tb_dir', PosixPath('/home/miranda9/data/logs/logs_Mar25_16-27-35_jobid_4852/tb'))
('gpu_name', 'GeForce GTX TITAN X')
('PID', '30688')
torch.cuda.device_count()=2
opts.world_size=2
ABOUT TO SPAWN WORKERS
done setting sharing strategy...next mp.spawn
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 1
INFO:root:Added key: store_based_barrier_key:1 to store for rank: 0
rank=0
mp.current_process()=<SpawnProcess name='SpawnProcess-1' parent=30688 started>
os.getpid()=30704
setting up rank=0 (with world_size=2)
MASTER_ADDR='127.0.0.1'
59264
backend='nccl'
--> done setting up rank=0
setup process done for rank=0
Traceback (most recent call last):
File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 279, in <module>
main_distributed()
File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 188, in main_distributed
spawn_return = mp.spawn(fn=train, args=(opts,), nprocs=opts.world_size)
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 230, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method='spawn')
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 188, in start_processes
while not context.join():
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 150, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 0 terminated with the following error:
Traceback (most recent call last):
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/multiprocessing/spawn.py", line 59, in _wrap
fn(i, *args)
File "/home/miranda9/ML4Coq/ml4coq-proj/embeddings_zoo/tree_nns/main_brando.py", line 212, in train
tactic_predictor = move_to_ddp(rank, opts, tactic_predictor)
File "/home/miranda9/ultimate-utils/ultimate-utils-project/uutils/torch/distributed.py", line 162, in move_to_ddp
model = DistributedDataParallel(model, find_unused_parameters=True, device_ids=[opts.gpu])
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 446, in __init__
self._sync_params_and_buffers(authoritative_rank=0)
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 457, in _sync_params_and_buffers
self._distributed_broadcast_coalesced(
File "/home/miranda9/miniconda3/envs/metalearning11.1/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1155, in _distributed_broadcast_coalesced
dist._broadcast_coalesced(
RuntimeError: NCCL error in: /opt/conda/conda-bld/pytorch_1616554793803/work/torch/lib/c10d/ProcessGroupNCCL.cpp:825, unhandled cuda error, NCCL version 2.7.8
ncclUnhandledCudaError: Call to CUDA function failed.
Bonus 1:
I also got the error:
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
Traceback (most recent call last):
File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1423, in <module>
main()
File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1365, in main
train(args=args)
File "/home/miranda9/diversity-for-predictive-success-of-meta-learning/div_src/diversity_src/experiment_mains/main_dist_maml_l2l.py", line 1385, in train
args.opt = move_opt_to_cherry_opt_and_sync_params(args) if is_running_parallel(args.rank) else args.opt
File "/home/miranda9/ultimate-utils/ultimate-utils-proj-src/uutils/torch_uu/distributed.py", line 456, in move_opt_to_cherry_opt_and_sync_params
args.opt = cherry.optim.Distributed(args.model.parameters(), opt=args.opt, sync=syn)
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/cherry/optim.py", line 62, in __init__
self.sync_parameters()
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/cherry/optim.py", line 78, in sync_parameters
dist.broadcast(p.data, src=root)
File "/home/miranda9/miniconda3/envs/meta_learning_a100/lib/python3.9/site-packages/torch/distributed/distributed_c10d.py", line 1090, in broadcast
work = default_pg.broadcast([tensor], opts)
RuntimeError: NCCL error in: ../torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
One of the answers suggested that nvcc and pytorch.version.cuda should match, but they do not:
(meta_learning_a100) [miranda9@hal-dgx ~]$ python -c "import torch;print(torch.version.cuda)"
11.1
(meta_learning_a100) [miranda9@hal-dgx ~]$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2020 NVIDIA Corporation
Built on Wed_Jul_22_19:09:09_PDT_2020
Cuda compilation tools, release 11.0, V11.0.221
Build cuda_11.0_bu.TC445_37.28845127_0
How do I match them?
4 Answers
jei2mxaa1#
I did have a correct CUDA install; by that I mean the version checks reported some version of NCCL (e.g., 2.10.3).
The fix was to remove that system NCCL install: afterwards the libnccl version check no longer reported any version, but DDP training worked fine!
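One way to see both NCCL versions in play (a sketch of the check, not necessarily the answerer's exact commands): torch.cuda.nccl.version() reports the NCCL that PyTorch itself was built with, while the system-wide libnccl is the one your package manager lists (e.g., dpkg -l | grep nccl on Ubuntu), and that system copy is what the fix above removes.

import torch
# NCCL version PyTorch was built with: an int such as 2708 on older builds,
# a tuple such as (2, 10, 3) on newer ones.
print(torch.cuda.nccl.version())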
evrscar22#
This is not a very satisfying answer, but it is what ended up working for me. I just use PyTorch 1.7.1 with its CUDA 10.2 build. As long as CUDA 11.0 is loaded, it seems to work. To install that version, do:
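For example (a sketch; this matches the 1.7.1 + CUDA 10.2 entry on the PyTorch previous-versions page, so verify it there):

conda install pytorch==1.7.1 torchvision==0.8.2 torchaudio==0.7.2 cudatoolkit=10.2 -c pytorch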
If you are on an HPC, run
module avail
to make sure the correct CUDA version is loaded. You may also need bash and other bits for your job submission; my setup looks roughly as sketched below. I also echo other useful things, such as the nvcc version, to make sure the right toolkit actually loaded (note that the top of nvidia-smi does not show the correct CUDA version).
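A sketch of what such a submission preamble might contain; the module name is hypothetical and site-specific, so use whatever module avail actually lists on your cluster:

module load cuda-toolkit/10.2   # hypothetical module name
nvcc -V                         # confirm the loaded CUDA version took effect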
Note: I think this might just be a bug, since CUDA 11.1 + PyTorch 1.8.1 were new at the time of writing. I did try it, but I cannot say it always works, or why it does not. I do have it in my current code, but I think I still got the error with PyTorch 1.8.x + CUDA 11.x.
Have a look at my conda list in case it helps:
For an A100, this seemed to work at some point:
wfveoks03#
You should find the answer at https://pytorch.org/get-started/locally/
For me, the following setup worked:
pip3 install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
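Afterwards, a quick sanity check (my addition, not from the original answer) confirms the installed wheel sees a working CUDA runtime:

python -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"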
ar7v8xwq4#
As discussed in the related question Pytorch "NCCL error": unhandled system error, NCCL version 2.4.8,
unhandled cuda error, NCCL version ...
means that something went wrong on the NCCL side. You need to set the environment variable NCCL_DEBUG=INFO to ask NCCL to print its log, so that you can find out the exact cause of the problem. (Hint: look for the first WARN line in the NCCL log.) As for the OP's problem, it is most likely caused by a mismatch between
driver version / cuda version / cuda version pytorch is compiled with
In that case the NCCL log will point at the underlying CUDA failure, which is why we need NCCL_DEBUG=INFO when debugging an unhandled cuda error.
Update:
Q: How do I set NCCL_DEBUG=INFO?
A: Option 1: prepend it to the command line, e.g.
NCCL_DEBUG=INFO python yourscript.py
Option 2: set it inside the Python script, for example:
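A minimal version of that snippet, assuming it runs before any torch.distributed call:

import os
# Must be set before dist.init_process_group() so that NCCL picks it up.
os.environ["NCCL_DEBUG"] = "INFO"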
Option 3: set it in the shell, e.g.
export NCCL_DEBUG=INFO
Q: How do I match the versions of CUDA and PyTorch?
A: The OP seems to be using CUDA 11.0, which is a bit tricky because PyTorch no longer provides prebuilt packages for CUDA 11.0. You either need an older prebuilt PyTorch (I believe the last release for CUDA 11.0 was PyTorch 1.7.1), or you need to update your system CUDA version, or you can try building PyTorch from source.
If you can accept an older PyTorch:
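A sketch of the matching command (this is what the PyTorch previous-versions page listed for 1.7.1 + CUDA 11.0; double-check it there):

pip install torch==1.7.1+cu110 torchvision==0.8.2+cu110 torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html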