I had PyTorch working, but after a system reboot, calling torch.cuda.is_available() triggers the following error: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1616554782469/work/c10/cuda/CUDAFunctions.cpp:109.)
Output of nvidia-smi:
nvidia-smi
Thu Jun 24 09:11:39 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.23.05 Driver Version: 455.23.05 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla P100-PCIE... On | 00000000:00:04.0 Off | 0 |
| N/A 40C P0 26W / 250W | 0MiB / 16280MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
Environment information:
python collect_env.py
Collecting environment information...
/lib/python3.9/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /opt/conda/conda-bld/pytorch_1616554782469/work/c10/cuda/CUDAFunctions.cpp:109.)
return torch._C._cuda_getDeviceCount() > 0
PyTorch version: 1.8.1
Is debug build: False
CUDA used to build PyTorch: 11.1
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 10 (buster) (x86_64)
GCC version: (Debian 8.3.0-6) 8.3.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.28
Python version: 3.9 (64-bit runtime)
Python platform: Linux-4.19.0-17-cloud-amd64-x86_64-with-glibc2.28
Is CUDA available: False
CUDA runtime version: Could not collect
GPU models and configuration: GPU 0: Tesla P100-PCIE-16GB
Nvidia driver version: 455.23.05
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.2
[pip3] torch==1.8.1
[pip3] torchaudio==0.8.0a0+e4e171a
[pip3] torchmetrics==0.3.2
[pip3] torchvision==0.9.1
[conda] _tflow_select 2.3.0 mkl
[conda] blas 1.0 mkl conda-forge
[conda] cudatoolkit 11.1.74 h6bb024c_0 nvidia
[conda] mkl 2020.2 256
[conda] mkl-service 2.3.0 py39he8ac12f_0
[conda] mkl_fft 1.3.0 py39h54f3939_0
[conda] mkl_random 1.0.2 py39h63df603_0
[conda] numpy 1.19.2 py39h89c1606_0
[conda] numpy-base 1.19.2 py39h2ae0177_0
[conda] pytorch 1.8.1 py3.9_cuda11.1_cudnn8.0.5_0 pytorch
[conda] tensorflow 2.4.1 mkl_py39h4683426_0
[conda] tensorflow-base 2.4.1 mkl_py39h43e0292_0
[conda] torchaudio 0.8.1 py39 pytorch
[conda] torchmetrics 0.3.2 pyhd8ed1ab_0 conda-forge
[conda] torchvision 0.9.1 py39_cu111 pytorch
1 Answer
I recently ran into this error while migrating my GPU containers from nvidia-docker to podman. In my case the root cause was that the /dev/nvidia-uvm* device files were missing, which CUDA apparently requires. Check whether you have them. Running
sudo nvidia-modprobe -c 0 -u
should load the kernel module and create those device files if they are absent. Alternatively, look at Ubuntu's /sbin/create-uvm-dev-node script, which they wrote to fix the same problem. If you are using the GPU inside a container or VM, these device files also need to exist inside the container; normally the nvidia runtime scripts take care of passing them through. If that is not happening, you can try passing them explicitly with
--device /dev/nvidia-uvm --device /dev/nvidia-uvm-tools
to docker/podman run.
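As a quick sanity check before reaching for nvidia-modprobe, a small shell sketch along these lines reports whether the UVM device nodes are present (the check_dev helper name is purely illustrative, not part of any NVIDIA tooling):

```shell
#!/bin/sh
# Report whether a given device node exists (helper is illustrative only).
check_dev() {
  if [ -e "$1" ]; then
    echo "$1 present"
  else
    echo "$1 missing"
  fi
}

# The nodes CUDA needs, per the answer above.
check_dev /dev/nvidia-uvm
check_dev /dev/nvidia-uvm-tools
```

If either node is reported missing, running sudo nvidia-modprobe -c 0 -u should recreate it; inside a container, add the corresponding --device flags to your docker/podman run command instead.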