Kubernetes GPU Pod error: validating toolkit installation: exec: "nvidia-smi": executable file not found in $PATH

inkz8wg9, posted 2023-01-12 in Kubernetes

When I try to create a Pod that can use the GPU, I get the error exec: "nvidia-smi": executable file not found in $PATH. To explain the problem from the beginning: my main goal is to set up a JupyterHub environment that can use GPUs. I installed Zero to JupyterHub for Kubernetes and then followed the steps for enabling GPU support (the NVIDIA GPU Operator setup). When I check the node, the GPU appears to be schedulable by Kubernetes, so up to this point everything looks fine.

kubectl get nodes -o=custom-columns=NAME:.metadata.name,GPUs:.status.capacity.'nvidia\.com/gpu'

NAME          GPUs
arge-server   1
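
For context, the GPU-enablement steps I followed were essentially the standard NVIDIA GPU Operator install via Helm; roughly the following (chart name and namespace are from memory, so treat this as a sketch rather than the exact commands I ran):

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator nvidia/gpu-operator -n gpu-operator-resources --create-namespace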

However, when I log into JupyterHub and try to start a profile that uses the GPU, I get the error: [Warning] 0/1 nodes are available: 1 Insufficient nvidia.com/gpu. So I checked the Pods and found that they were all stuck in the "Waiting: PodInitializing" state.

kubectl get pods -n gpu-operator-resources

NAME                                   READY   STATUS       RESTARTS   AGE
nvidia-dcgm-x5rqs                      0/1     Init:0/1     2          6d20h
nvidia-device-plugin-daemonset-jhjhb   0/1     Init:0/1     0          6d20h
gpu-feature-discovery-pd4xv            0/1     Init:0/1     2          6d20h
nvidia-dcgm-exporter-7mjgt             0/1     Init:0/1     2          6d20h
nvidia-operator-validator-9xjmv        0/1     Init:Error   10         26m
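
As a side note, the JupyterHub profile behind the "Insufficient nvidia.com/gpu" warning simply requests one GPU via KubeSpawner, along the lines of the following snippet in the Zero to JupyterHub config.yaml (the display name is only an example):

singleuser:
  profileList:
    - display_name: "GPU server"
      kubespawner_override:
        extra_resource_limits:
          nvidia.com/gpu: "1"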

After that, I took a closer look at the Pod nvidia-operator-validator-9xjmv, where the error starts, and saw that the toolkit-validation init container was in CrashLoopBackOff.

kubectl describe pod nvidia-operator-validator-9xjmv -n gpu-operator-resources

    Name:                 nvidia-operator-validator-9xjmv
    Namespace:            gpu-operator-resources
        .   
        .
        .
    Controlled By:  DaemonSet/nvidia-operator-validator
    Init Containers:
        .
        .
        .
      toolkit-validation:
        Container ID:  containerd://e7d004f0809cbefdae5407ea42eb659972ea7eefa5dd6e45e968cbf3ed22bf2e
        Image:         nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2
        Image ID:      nvcr.io/nvidia/cloud-native/gpu-operator-validator@sha256:a07fd1c74e3e469ac316d17cf79635173764fdab3b681dbc282027a23dbbe227
        Port:          <none>
        Host Port:     <none>
        Command:
          sh
          -c
        Args:
          nvidia-validator
        State:          Waiting
          Reason:       CrashLoopBackOff
        Last State:     Terminated
          Reason:       Error
          Exit Code:    1
          Started:      Thu, 18 Nov 2021 12:55:00 +0300
          Finished:     Thu, 18 Nov 2021 12:55:00 +0300
        Ready:          False
        Restart Count:  16
        Environment:
          WITH_WAIT:  false
          COMPONENT:  toolkit
        Mounts:
          /run/nvidia/validations from run-nvidia-validations (rw)
          /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-hx7ls (ro)
        .   
        .
        .
    
    Events:
      Type     Reason     Age                   From               Message
      ----     ------     ----                  ----               -------
      Normal   Scheduled  58m                   default-scheduler  Successfully assigned gpu-operator-resources/nvidia-operator-validator-9xjmv to arge-server
      Normal   Pulled     58m                   kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
      Normal   Created    58m                   kubelet            Created container driver-validation
      Normal   Started    58m                   kubelet            Started container driver-validation
      Normal   Pulled     56m (x5 over 58m)     kubelet            Container image "nvcr.io/nvidia/cloud-native/gpu-operator-validator:v1.8.2" already present on machine
      Normal   Created    56m (x5 over 58m)     kubelet            Created container toolkit-validation
      Normal   Started    56m (x5 over 58m)     kubelet            Started container toolkit-validation
      Warning  BackOff    3m7s (x255 over 58m)  kubelet            Back-off restarting failed container

Then I looked at that container's logs and got the following error.

kubectl logs -n gpu-operator-resources -f nvidia-operator-validator-9xjmv -c toolkit-validation

time="2021-11-18T09:29:24Z" level=info msg="Error: error validating toolkit installation: exec: \"nvidia-smi\": executable file not found in $PATH"
toolkit is not ready

For similar issues, people have suggested deleting the failed Pods and deployments. However, doing that did not solve my problem. Do you have any suggestions?
My environment:

  • Ubuntu 20.04
  • Kubernetes v1.21.6
  • Docker 20.10.10
  • NVIDIA driver 470.82.01
  • CUDA 11.4
  • CPU: Intel Xeon E5-2683 v4 (32) @ 2.097 GHz
  • GPU: NVIDIA GeForce RTX 2080 Ti
  • Memory: 13815 MiB / 48280 MiB

Thanks in advance.

hlswsv35 (answer 1):

In case you are still facing this: we just ran into the same problem on our cluster, and the "dirty" fix was:

rm /run/nvidia/driver
ln -s / /run/nvidia/driver
kubectl delete pod -n gpu-operator nvidia-operator-validator-xxxxx
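
The first two commands are run as root directly on the GPU node; the kubectl delete just forces the validator Pod to be recreated. As a quick check (assuming the driver really is installed on the host), nvidia-smi should now resolve through the symlinked root and the operator Pods should come up:

chroot /run/nvidia/driver nvidia-smi
kubectl get pods -n gpu-operator-resources   # or whichever namespace your operator pods run in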

The reason is that the init container of nvidia-operator-validator tries to execute nvidia-smi inside a chroot into /run/nvidia/driver, which is a tmpfs (so it does not survive a reboot) and which is not populated when the driver has been installed manually on the host.
Hopefully NVIDIA will come up with a better solution.
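
In the meantime, since /run is a tmpfs the symlink has to be recreated after every reboot. One way to automate that is a small systemd oneshot unit; this is just a sketch (the unit name is arbitrary and untested on the cluster in question):

cat <<'EOF' > /etc/systemd/system/nvidia-run-driver-symlink.service
[Unit]
Description=Recreate /run/nvidia/driver symlink for the GPU operator validator
After=local-fs.target

[Service]
Type=oneshot
# /run is a tmpfs, so the directory and symlink vanish on reboot
ExecStart=/bin/sh -c 'mkdir -p /run/nvidia && ln -sfn / /run/nvidia/driver'

[Install]
WantedBy=multi-user.target
EOF
systemctl enable nvidia-run-driver-symlink.service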
