When I try to create a Pod that can use the GPU, I get the error: exec: "nvidia-smi": executable file not found in $PATH. To explain the error from the beginning: my main goal is to create a JupyterHub environment that can use the GPU. I installed Zero to JupyterHub for Kubernetes and followed its steps to enable GPU support. When I check the node, the GPU appears to be schedulable by Kubernetes, so everything looks fine up to this point.
kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'
NAME GPUs
arge-server 1
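As a sanity check outside JupyterHub, a minimal pod that requests the GPU can be used (a sketch; the pod name and the CUDA image tag are assumptions, pick an image matching your CUDA version). Pipe the printed manifest to `kubectl apply -f -` to create it:

```shell
# Print a minimal manifest for a pod that requests one GPU and runs
# nvidia-smi once. The image tag is an assumption; adjust it to match
# the CUDA version installed on the node.
manifest=$(cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
  - name: cuda
    image: nvidia/cuda:11.4.2-base-ubuntu20.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
)
printf '%s\n' "$manifest"
```

If this pod also fails to schedule with "Insufficient nvidia.com/gpu", the problem is in the operator stack itself rather than in JupyterHub's profile configuration.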
However, when I log in to JupyterHub and try to open a profile that uses the GPU, I get an error: [Warning] 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. So, I checked the Pods and found that they were all in the "Waiting: PodInitializing" state.
kubectl get pods -n gpu-operator-resources
NAME READY STATUS RESTARTS AGE
nvidia-dcgm-x5rqs 0/1 Init:0/1 2 6d20h
nvidia-device-plugin-daemonset-jhjhb 0/1 Init:0/1 0 6d20h
gpu-feature-discovery-pd4xv 0/1 Init:0/1 2 6d20h
nvidia-dcgm-exporter-7mjgt 0/1 Init:0/1 2 6d20h
nvidia-operator-validator-9xjmv 0/1 Init:Error 10 26m
After that, I took a closer look at the pod nvidia-operator-validator-9xjmv, which is where the errors start, and saw that the toolkit-validation init container was failing with CrashLoopBackOff.
kubectl describe pod nvidia-operator-validator-9xjmv -n gpu-operator-resources
Name: nvidia-operator-validator-9xjmv
Namespace: gpu-operator-resources
.
.
.
Controlled By: DaemonSet/nvidia-operator-validator
Init Containers:
.
.
.
toolkit-validation:
Container ID: containerd://e7d004f0809cbefdae5407ea42eb659972ea7eefa5dd6e45e968cbf3ed22bf2e
Image: nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2
Image ID: nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:a07fd1c74e3e469ac316d17cf79635173764fdab3b681dbc282027a23dbbe227
Port: <none>
Host Port: <none>
Command:
sh
-c
Args:
nvidia-validator
State: Waiting
Reason: CrashLoopBackOff
Last State: Terminated
Reason: Error
Exit Code: 1
Started: Thu, 18 Nov 2021 12:55:00 +0300
Finished: Thu, 18 Nov 2021 12:55:00 +0300
Ready: False
Restart Count: 16
Environment:
WITH_WAIT: false
COMPONENT: toolkit
Mounts:
/run/nvidia/validations from run-nvidia-validations (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hx7ls (ro)
.
.
.
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 58m default-scheduler Successfully assigned gpu-operator-resources/nvidia-operator-validator-9xjmv to arge-server
Normal Pulled 58m kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
Normal Created 58m kubelet Created container driver-validation
Normal Started 58m kubelet Started container driver-validation
Normal Pulled 56m (x5 over 58m) kubelet Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
Normal Created 56m (x5 over 58m) kubelet Created container toolkit-validation
Normal Started 56m (x5 over 58m) kubelet Started container toolkit-validation
Warning BackOff 3m7s (x255 over 58m) kubelet Back-off restarting failed container
Then I looked at the logs of that container and got the following error.
kubectl logs -n gpu-operator-resources -f nvidia-operator-validator-9xjmv -c toolkit-validation
time="2021-11-18T09:29:24Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
toolkit is not ready
For similar issues, it was suggested to delete the failed Pods and Deployments. However, doing so did not solve my problem. Do you have any suggestions?
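For reference, the suggested cleanup looked roughly like this (a sketch: the pod names are the ones from my cluster, and the commands are only printed so they can be reviewed first; the DaemonSets recreate the pods automatically after deletion):

```shell
# Build and print the delete commands for the failed operator pods.
# The pod names below are from this cluster; substitute your own.
NS=gpu-operator-resources
cmds=""
for pod in nvidia-operator-validator-9xjmv \
           nvidia-device-plugin-daemonset-jhjhb \
           gpu-feature-discovery-pd4xv; do
  cmds="${cmds}kubectl delete pod -n $NS $pod
"
done
printf '%s' "$cmds"
```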
I have:
- Ubuntu 20.04
- Kubernetes v1.21.6
- Docker 20.10
- NVIDIA-SMI 470.82.01
- CUDA 11.4
- CPU: Intel Xeon E5-2683 v4 (32) @ 2.097GHz
- GPU: NVIDIA GeForce RTX 2080 Ti
- Memory: 13815MiB / 48280MiB
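The versions above can be re-collected with the commands below (a sketch; each tool is skipped with a note if it is not present on the current machine):

```shell
# Gather the version information listed above. Missing tools are noted
# rather than treated as errors, so this runs on any machine.
report=""
for cmd in "kubectl version --short" \
           "docker --version" \
           "nvidia-smi --query-gpu=driver_version,name --format=csv"; do
  out=$($cmd 2>/dev/null) || out="(not available on this machine)"
  report="${report}\$ ${cmd}
${out}
"
done
printf '%s' "$report"
```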
Thanks in advance.
1 Answer
In case you are still facing this issue: we just hit the same problem on our cluster, and worked around it with a "dirty" fix.
The reason is that the init pod of nvidia-operator-validator tries to execute nvidia-smi within a chroot from /run/nvidia/driver, which is a tmpfs (so it does not persist across reboots) and which is not populated when the driver is installed manually.
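The fix commands themselves were not preserved above, but from that explanation the workaround presumably amounts to making /run/nvidia/driver point at the host root, where the manually installed driver (and its nvidia-smi) actually lives, and then recreating the validator pod. A sketch under those assumptions, printed rather than executed; the /usr/bin location and the app= label are guesses, not confirmed:

```shell
# If nvidia-smi is not visible under the validator's chroot root, print a
# candidate fix: symlink the host root there and recreate the validator pod.
# The /usr/bin path and the app= label are assumptions; review before running.
DRIVER_ROOT=/run/nvidia/driver
fix=""
if [ ! -x "$DRIVER_ROOT/usr/bin/nvidia-smi" ]; then
  fix="ln -s / $DRIVER_ROOT
kubectl delete pod -n gpu-operator-resources -l app=nvidia-operator-validator"
fi
printf '%s\n' "$fix"
```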
Hopefully NVIDIA will come up with a better solution.