NVIDIA device plugin 0.15.0 and newer don't run on Talos nodes by default #12103

@rothgar


Bug Report

When following the docs for either the proprietary or the OSS NVIDIA drivers, the device plugin doesn't work with any version newer than 0.14.5.

Description

Using the default Helm install, the DaemonSet pods never get scheduled on any node.

helm install nvidia-device-plugin nvdp/nvidia-device-plugin --version=0.14.5 --set=runtimeClassName=nvidia --namespace kube-system

https://github.com/NVIDIA/k8s-device-plugin/releases/tag/v0.15.0
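
One way to narrow down what changed between the last working chart and the first failing one is to render both versions without installing them and diff the manifests. This is only a sketch: it assumes the nvdp repo alias from the install command above is already configured, it picks 0.14.5 and 0.15.0 as the versions to compare, and render_chart and diff_chart_versions are hypothetical helper names, not part of Helm or the chart.

```shell
# Render the device plugin chart at a given version without installing it.
# Assumes the "nvdp" Helm repo alias used by the install command above.
render_chart() {
  helm template nvidia-device-plugin nvdp/nvidia-device-plugin \
    --version "$1" \
    --set runtimeClassName=nvidia \
    --namespace kube-system
}

# Print a unified diff of the rendered manifests for two chart versions.
# Returns diff's exit status: 0 if identical, 1 if the manifests differ.
diff_chart_versions() {
  old_out=$(mktemp) || return 2
  new_out=$(mktemp) || { rm -f "$old_out"; return 2; }
  render_chart "$1" > "$old_out"
  render_chart "$2" > "$new_out"
  status=0
  diff -u "$old_out" "$new_out" || status=$?
  rm -f "$old_out" "$new_out"
  return "$status"
}
```

Running diff_chart_versions 0.14.5 0.15.0 should show every manifest change between the two releases. Any scheduling constraint that appears only in the newer output would be a candidate explanation for the zero desired pods, but that stays a guess until the diff confirms it.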

Logs

Here's the kubectl describe output for the device plugin DaemonSet at 0.18.0. The controller manager never creates any pods for it.

Name:           nvidia-device-plugin
Namespace:      kube-system
Selector:       app.kubernetes.io/instance=nvidia-device-plugin,app.kubernetes.io/name=nvidia-device-plugin
Node-Selector:  <none>
Labels:         app.kubernetes.io/instance=nvidia-device-plugin
                app.kubernetes.io/managed-by=Helm
                app.kubernetes.io/name=nvidia-device-plugin
                app.kubernetes.io/version=0.18.0
                helm.sh/chart=nvidia-device-plugin-0.18.0
Annotations:    deprecated.daemonset.template.generation: 1
                meta.helm.sh/release-name: nvidia-device-plugin
                meta.helm.sh/release-namespace: kube-system
Desired Number of Nodes Scheduled: 0
Current Number of Nodes Scheduled: 0
Number of Nodes Scheduled with Up-to-date Pods: 0
Number of Nodes Scheduled with Available Pods: 0
Number of Nodes Misscheduled: 0
Pods Status:  0 Running / 0 Waiting / 0 Succeeded / 0 Failed
Pod Template:
  Labels:  app.kubernetes.io/instance=nvidia-device-plugin
           app.kubernetes.io/name=nvidia-device-plugin
  Containers:
   nvidia-device-plugin-ctr:
    Image:      nvcr.io/nvidia/k8s-device-plugin:v0.18.0
    Port:       <none>
    Host Port:  <none>
    Command:
      nvidia-device-plugin
    Environment:
      MPS_ROOT:                    /run/nvidia/mps
      NVIDIA_VISIBLE_DEVICES:      all
      NVIDIA_DRIVER_CAPABILITIES:  compute,utility
    Mounts:
      /dev/shm from mps-shm (rw)
      /mps from mps-root (rw)
      /var/lib/kubelet/device-plugins from kubelet-device-plugins-dir (rw)
      /var/run/cdi from cdi-root (rw)
  Volumes:
   kubelet-device-plugins-dir:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/kubelet/device-plugins
    HostPathType:  Directory
   mps-root:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps
    HostPathType:  DirectoryOrCreate
   mps-shm:
    Type:          HostPath (bare host directory volume)
    Path:          /run/nvidia/mps/shm
    HostPathType:  
   cdi-root:
    Type:               HostPath (bare host directory volume)
    Path:               /var/run/cdi
    HostPathType:       DirectoryOrCreate
  Priority Class Name:  system-node-critical
  Node-Selectors:       <none>
  Tolerations:          CriticalAddonsOnly op=Exists
                        nvidia.com/gpu:NoSchedule op=Exists
Events:                 <none>
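
The output above shows zero desired pods with no node selector and no events, which suggests the DaemonSet controller considers no node eligible. The describe view lists Node-Selectors and Tolerations but does not appear to include node affinity rules, so reading the pod template directly may reveal more. The following is a rough diagnostic sketch, not a confirmed root cause: the function names are hypothetical, and it assumes the release name and namespace shown in the output above.

```shell
# Extract "Desired Number of Nodes Scheduled" from describe output on stdin.
# The "+ 0" coerces the field to a number, dropping trailing whitespace.
desired_nodes() {
  awk -F': *' '/^Desired Number of Nodes Scheduled:/ { print $2 + 0 }'
}

# Print the nodeSelector and affinity of the DaemonSet pod template,
# one per line. Empty lines mean the field is not set.
show_scheduling_constraints() {
  kubectl -n kube-system get daemonset nvidia-device-plugin \
    -o jsonpath='{.spec.template.spec.nodeSelector}{"\n"}{.spec.template.spec.affinity}{"\n"}'
}

# List node labels so they can be compared against any required labels
# found in the affinity printed by show_scheduling_constraints.
show_node_labels() {
  kubectl get nodes --show-labels
}
```

For example, kubectl -n kube-system describe daemonset nvidia-device-plugin | desired_nodes prints the desired count. If it is 0, comparing the output of show_scheduling_constraints with show_node_labels would show whether the template requires labels that the Talos nodes do not carry.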

Environment

  • Talos version: 1.11.3
  • Kubernetes version: 1.34.1
  • Platform: metal
