PyTorch CUDA unavailable: CUDA initialization: CUDA unknown error

Asked by cygmwpex on 2022-11-09

I was running PyTorch successfully, but after a system reboot, calling torch.cuda.is_available() now produces the following warning:
UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1616554782469/work/c10/cuda/CUDAFunctions.cpp:109.)
Output of nvidia-smi:

nvidia-smi

Thu Jun 24 09:11:39 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05    Driver Version: 455.23.05    CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  On   | 00000000:00:04.0 Off |                    0 |
| N/A   40C    P0    26W / 250W |      0MiB / 16280MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Environment information:

python collect_env.py

Collecting environment information...
/lib/python3.9/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at  /opt/conda/conda-bld/pytorch_1616554782469/work/c10/cuda/CUDAFunctions.cpp:109.)
  return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 1.8.1
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 10 (buster) (x86_64)
GCC version: (Debian 8.3.0-6) 8.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.28
Python version: 3.9 (64-bit runtime)
Python platform: Linux-4.19.0-17-cloud-amd64-x86_64-with-glibc2.28
Is CUDA available: False
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla P100-PCIE-16GB
Nvidia driver version: 455.23.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.8.1
[pip3] torchaudio==0.8.0a0+e4e171a
[pip3] torchmetrics==0.3.2
[pip3] torchvision==0.9.1
[conda] _tflow_select             2.3.0                       mkl  
[conda] blas                      1.0                         mkl    conda-forge
[conda] cudatoolkit               11.1.74              h6bb024c_0    nvidia
[conda] mkl                       2020.2                      256  
[conda] mkl-service               2.3.0            py39he8ac12f_0  
[conda] mkl_fft                   1.3.0            py39h54f3939_0  
[conda] mkl_random                1.0.2            py39h63df603_0  
[conda] numpy                     1.19.2           py39h89c1606_0  
[conda] numpy-base                1.19.2           py39h2ae0177_0  
[conda] pytorch                   1.8.1           py3.9_cuda11.1_cudnn8.0.5_0    pytorch
[conda] tensorflow                2.4.1           mkl_py39h4683426_0  
[conda] tensorflow-base           2.4.1           mkl_py39h43e0292_0  
[conda] torchaudio                0.8.1                      py39    pytorch
[conda] torchmetrics              0.3.2              pyhd8ed1ab_0    conda-forge
[conda] torchvision               0.9.1                py39_cu111    pytorch

Answer 1, by lb3vh1jj:

I ran into this error recently while migrating my GPU containers from nvidia-docker to podman. In my case, the root cause was that the /dev/nvidia-uvm* files were missing, which CUDA apparently requires. Check whether you have these files:


# ls -ld /dev/nvidia*

drwxr-x--- 2 root root       80 Oct  6 21:11 /dev/nvidia-caps
crw-rw-rw- 1 root root 195, 254 Oct  6 21:08 /dev/nvidia-modeset
crw-rw-rw- 1 root root 237,   0 Oct  6 21:13 /dev/nvidia-uvm       <-IMPORTANT
crw-rw-rw- 1 root root 237,   1 Oct  6 21:13 /dev/nvidia-uvm-tools <-IMPORTANT
crw-rw-rw- 1 root root 195,   0 Oct  6 21:08 /dev/nvidia0
crw-rw-rw- 1 root root 195, 255 Oct  6 21:08 /dev/nvidiactl
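The presence check above can be scripted. A minimal sketch (the check_uvm function name is my own, and the optional directory argument exists only so the check can be exercised against a test directory instead of the real /dev):

```shell
#!/bin/sh
# Report which of the /dev nodes CUDA needs are missing.
# check_uvm is a hypothetical helper; with no argument it
# inspects /dev, the directory the driver populates.
check_uvm() {
    devdir="${1:-/dev}"
    missing=""
    for node in nvidia-uvm nvidia-uvm-tools nvidiactl nvidia0; do
        [ -e "$devdir/$node" ] || missing="$missing $node"
    done
    if [ -z "$missing" ]; then
        echo "all required /dev nodes present"
    else
        echo "missing:$missing"
    fi
}
```

Running `check_uvm` on an affected machine should name nvidia-uvm and nvidia-uvm-tools as the missing nodes.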

If you don't see them, sudo nvidia-modprobe -c 0 -u should load the kernel module and create these dev files. Alternatively, look up the /sbin/create-uvm-dev-node script from Ubuntu, which they wrote to fix the same problem.
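The essence of the create-uvm-dev-node approach is to look up the dynamically assigned major number for nvidia-uvm in /proc/devices and create the nodes with mknod. A rough sketch (uvm_major is my own helper name; the file argument defaults to /proc/devices and is a parameter only so the parsing can be tested against sample input):

```shell
#!/bin/sh
# Print the kernel major number assigned to the nvidia-uvm
# character device, as listed in /proc/devices.
uvm_major() {
    awk '$2 == "nvidia-uvm" {print $1}' "${1:-/proc/devices}"
}

# As root, the missing nodes would then be created with minor 0
# for nvidia-uvm and minor 1 for nvidia-uvm-tools, matching the
# ls output above:
#   major=$(uvm_major)
#   mknod -m 666 /dev/nvidia-uvm       c "$major" 0
#   mknod -m 666 /dev/nvidia-uvm-tools c "$major" 1
```

The mknod lines are left commented because they require root and a loaded nvidia-uvm kernel module.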
If you use the GPU inside a container/VM, these dev files also need to exist inside the container. Normally the nvidia runtime scripts take care of passing them through. If that isn't happening, you can try passing them explicitly with --device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools to docker/podman run.
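A full invocation might look like the following sketch (the image name is only an example; substitute whatever CUDA image you actually run, and keep the nvidia runtime flags your setup already uses):

```shell
# Pass the UVM device nodes into the container explicitly if the
# nvidia runtime hook is not doing it for you. Works the same for
# docker run. The image tag here is an assumption, not prescribed.
podman run --rm \
    --device /dev/nvidia-uvm \
    --device /dev/nvidia-uvm-tools \
    --device /dev/nvidiactl \
    --device /dev/nvidia0 \
    nvidia/cuda:11.1-base nvidia-smi
```

If nvidia-smi inside the container then sees the GPU and torch.cuda.is_available() returns True, the missing device nodes were the problem.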
