Kubernetes NVIDIA device plugin daemonset not working

w8f9ii69 · posted 2022-12-11 in Kubernetes

I want Kubernetes to recognize and use the GPUs on servers/PCs equipped with NVIDIA GPUs.
So I tried enabling GPU support in Kubernetes by following NVIDIA device plugin for Kubernetes, but the daemonset is not working.

Things done

  1. Install cri-dockerd
    confirm: $ cri-dockerd --version
cri-dockerd 0.2.6 (d8accf7)

confirm: $ systemctl status cri-docker.socket

● cri-docker.socket - CRI Docker Socket for the API
     Loaded: loaded (/etc/systemd/system/cri-docker.socket; enabled; vendor preset: enabled)
     Active: active (running) since Mon 2022-12-05 15:00:35 KST; 18h ago
   Triggers: ● cri-docker.service
     Listen: /run/cri-dockerd.sock (Stream)
      Tasks: 0 (limit: 18968)
     Memory: 4.0K
     CGroup: /system.slice/cri-docker.socket

12월 05 15:00:35 hibernation systemd[1]: Starting CRI Docker Socket for the API.
12월 05 15:00:35 hibernation systemd[1]: Listening on CRI Docker Socket for the API.
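
For reference, a typical way to install cri-dockerd from a GitHub release looks roughly like this (the release URL, archive layout, and unit-file locations are assumptions based on the cri-dockerd repository and may differ for other versions):

$ wget https://github.com/Mirantis/cri-dockerd/releases/download/v0.2.6/cri-dockerd-0.2.6.amd64.tgz   # example release asset
$ tar xzf cri-dockerd-0.2.6.amd64.tgz
$ sudo install -o root -g root -m 0755 cri-dockerd/cri-dockerd /usr/local/bin/cri-dockerd
# copy cri-docker.service and cri-docker.socket from the repo's packaging/systemd/ directory into /etc/systemd/system/
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now cri-docker.socket
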
  2. Install nvidia-docker
    confirm: $ sudo docker run --rm --gpus all nvidia/cuda:11.3.1-base-ubuntu20.04 nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.161.03   Driver Version: 470.161.03   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:07:00.0 Off |                  N/A |
|  0%   28C    P8    14W / 180W |     64MiB / 12052MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+
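
The nvidia runtime used above comes from NVIDIA's container packages. A minimal sketch of installing them on Ubuntu 20.04, assuming the NVIDIA driver is already installed and the libnvidia-container apt repository has been set up as described in NVIDIA's install guide:

$ sudo apt-get update
$ sudo apt-get install -y nvidia-docker2    # pulls in nvidia-container-toolkit and nvidia-container-runtime
$ sudo systemctl restart docker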

confirm: /etc/docker/daemon.json

{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
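
After editing /etc/docker/daemon.json, Docker has to be restarted for the default runtime to take effect. One way to check it:

$ sudo systemctl restart docker
$ sudo docker info | grep -i runtime    # should report "Default Runtime: nvidia"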

confirm: /etc/containerd/config.toml

# disabled_plugins = ["cri"]        # commented out
version = 2
[plugins]
  [plugins."io.containerd.grpc.v1.cri"]
    [plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvidia"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runc.v2"
          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
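
Changes to /etc/containerd/config.toml only take effect after restarting containerd; if the file does not exist yet, a full default config can be generated first (a sketch):

$ containerd config default | sudo tee /etc/containerd/config.toml    # only if you need a fresh default config to edit
$ sudo systemctl restart containerd
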
  3. Install kubectl=1.22.13-00 kubelet=1.22.13-00 kubeadm=1.22.13-00
    confirm: $ kubectl version --client && kubeadm version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.13", GitCommit:"a43c0904d0de10f92aa3956c74489c45e6453d6e", GitTreeState:"clean", BuildDate:"2022-08-17T18:28:56Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}
kubeadm version: &version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.13", GitCommit:"a43c0904d0de10f92aa3956c74489c45e6453d6e", GitTreeState:"clean", BuildDate:"2022-08-17T18:27:51Z", GoVersion:"go1.16.15", Compiler:"gc", Platform:"linux/amd64"}

confirm: $ systemctl status kubelet

● kubelet.service - kubelet: The Kubernetes Node Agent
     Loaded: loaded (/lib/systemd/system/kubelet.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/kubelet.service.d
             └─10-kubeadm.conf
     Active: active (running) since Mon 2022-12-05 15:11:30 KST; 18h ago
       Docs: https://kubernetes.io/docs/home/
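
For reference, the pinned versions can be installed and held on Ubuntu like this (assuming the Kubernetes apt repository is already configured):

$ sudo apt-get update
$ sudo apt-get install -y kubelet=1.22.13-00 kubeadm=1.22.13-00 kubectl=1.22.13-00
$ sudo apt-mark hold kubelet kubeadm kubectl    # prevent unintended upgrades
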
  4. Initialize the master node
$ sudo kubeadm init \
  --pod-network-cidr=10.244.0.0/16 \
  --apiserver-advertise-address 192.168.219.100 \
  --cri-socket /run/cri-dockerd.sock

confirm: $ kubectl get nodes

NAME             STATUS   ROLES                  AGE     VERSION
hibernation      Ready    control-plane,master   3m21s   v1.22.13
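
For completeness: the node only reports Ready after the kubeconfig is set up and a pod network add-on is installed. With --pod-network-cidr=10.244.0.0/16, flannel is the usual choice (a sketch; the manifest URL may change between releases):

$ mkdir -p $HOME/.kube
$ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
$ sudo chown $(id -u):$(id -g) $HOME/.kube/config
$ kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml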

What I tried

Enabling GPU support in Kubernetes

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.13.0/nvidia-device-plugin.yml

confirm GPU support: $ kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
I expect GPU to show 1, but I get this:

NAME             GPU
hibernation      <none>

This command prints nothing: $ kubectl get pod -A | grep nvidia
confirm: $ kubectl describe daemonset nvidia-device-plugin-daemonset -n kube-system

Name:           nvidia-device-plugin-daemonset
Selector:       name=nvidia-device-plugin-ds
Node-Selector:  <none>
Labels:         <none>
Annotations:    deprecated.daemonset.template.generation: 1
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  name=nvidia-device-plugin-ds
  Containers:
   nvidia-device-plugin-ctr:
    Image:      nvcr.io/nvidia/k8s-device-plugin:v0.13.0
    Port:       <none>
    Host Port:  <none>
    Environment:
      FAIL_ON_INIT_ERROR:  false
    Mounts:
      /var/lib/kubelet/device-plugins from device-plugin (rw)
  Volumes:
   device-plugin:
    Type:               HostPath (bare host directory volume)
    Path:               /var/lib/kubelet/device-plugins
    HostPathType:       
  Priority Class Name:  system-node-critical
Events:                 <none>

I've noticed this status, but I don't know what I can do about it:

Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0

env:

Ubuntu 20.04
GPU: NVIDIA GeForce RTX 3060

Since this is the first installation after formatting the desktop, there are no other unnecessary programs.

oxiaedzo (answer 1):

I found a way!
Since Kubernetes 1.6, DaemonSets are not scheduled on master nodes by default, because the kubeadm control-plane node carries a node-role.kubernetes.io/master:NoSchedule taint. In order to schedule the plugin on the master node, I had to add a toleration to the pod spec.
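
You can see the taint that blocks scheduling with:

$ kubectl describe node hibernation | grep -i taints    # a kubeadm master node typically shows node-role.kubernetes.io/master:NoSchedule

(On a single-node cluster an alternative is to remove the taint entirely, e.g. $ kubectl taint nodes hibernation node-role.kubernetes.io/master:NoSchedule- , but adding a toleration as in the steps below keeps the taint in place.)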

  1. Check the daemonset name
$ kubectl get daemonset -n kube-system
  2. Add a toleration to the pod spec
$ kubectl edit daemonset nvidia-device-plugin-daemonset -n kube-system
and add under spec.template.spec.tolerations:
- key: node-role.kubernetes.io/master
  effect: NoSchedule
  3. Restart the daemonset
$ kubectl rollout restart daemonset nvidia-device-plugin-daemonset -n kube-system

If you want to confirm the changes:

$ kubectl get daemonset nvidia-device-plugin-daemonset -n kube-system -o yaml
tolerations:        
     - effect: NoSchedule                         # line added
       key: node-role.kubernetes.io/master        # line added
  4. Confirm that GPU support is enabled in Kubernetes
$ kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
NAME             GPU
hibernation      1

Confirm the plugin Pod is running:

$ kubectl get pod -A | grep nvidia
kube-system   nvidia-device-plugin-daemonset-k228k     1/1     Running   0              16m
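
As a final check, a small pod can request the GPU through the plugin. A minimal sketch (gpu-test.yaml; the pod name is arbitrary and the image is the same one used for the docker test in the question):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test                                  # example name
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda
    image: nvidia/cuda:11.3.1-base-ubuntu20.04    # image reused from the docker test
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1                         # allocated by the device plugin

$ kubectl apply -f gpu-test.yaml
$ kubectl logs gpu-test    # should print the same nvidia-smi table as on the host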
