computedomain-daemon DaemonSet missing imagePullSecrets causes ImagePullBackOff on Kubernetes 1.35+ #958

@shengnuo

Description

Summary

The computedomain-daemon DaemonSet created by the compute-domain-controller does not include imagePullSecrets or imagePullPolicy in its pod spec. When using a private container registry and running on Kubernetes 1.35+ with the KubeletEnsureSecretPulledImages feature gate enabled (default), these pods fail with ImagePullBackOff even when the same image is already present on the node.

Affected Components

  • Template: templates/compute-domain-daemon.tmpl.yaml – no imagePullSecrets or imagePullPolicy in the pod spec
  • Controller: cmd/compute-domain-controller/daemonset.go – DaemonSetTemplateData does not include image pull settings; only ImageName is passed

Behavior Before Kubernetes 1.35

Previously, container images were effectively shared at the node level. If one pod (e.g., the nvidia-dra-driver-gpu-kubelet-plugin) pulled a private image using imagePullSecrets, the image was cached on the node. Other pods on the same node (e.g., computedomain-daemon) could use that cached image without specifying credentials, because the kubelet did not enforce that the pod had valid pull credentials for already-pulled images.

This allowed the computedomain-daemon to run successfully without imagePullSecrets as long as the kubelet-plugin (or another pod) had already pulled the same image on that node.

Behavior on Kubernetes 1.35+

In Kubernetes 1.35, the KubeletEnsureSecretPulledImages feature gate (KEP-2535) is enabled by default (beta). This feature ensures that images pulled with imagePullSecrets are only used by pods that have valid credentials for that image.

When enabled, the kubelet:

  1. Tracks which credentials were used to pull each image.
  2. Requires that any pod using an image must have appropriate imagePullSecrets (or equivalent credentials) before the kubelet allows use of that image.
  3. Rejects pods that lack credentials for a private image, even if the image is already cached on the node.

As a result:

  • nvidia-dra-driver-gpu-kubelet-plugin – Has imagePullSecrets and imagePullPolicy: Always from the Helm chart → runs successfully.
  • computedomain-daemon – Has no imagePullSecrets and defaults to imagePullPolicy: IfNotPresent → fails with ImagePullBackOff because the kubelet will not allow use of the cached image without credentials.

Both pods use the same image on the same node, but only the kubelet-plugin has imagePullSecrets.
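To illustrate the difference, a minimal pod spec carrying the credentials the kubelet now verifies might look like the sketch below. The secret name, image reference, and container name are placeholders, not the actual values used by the driver:

```yaml
# Illustrative sketch: the fields the kubelet-plugin pod carries but the
# computedomain-daemon pod lacks. On Kubernetes 1.35+ with
# KubeletEnsureSecretPulledImages enabled, the kubelet checks these
# credentials even when the image is already cached on the node.
apiVersion: v1
kind: Pod
metadata:
  name: example-with-pull-credentials
spec:
  imagePullSecrets:
    - name: my-registry-secret        # placeholder secret name
  containers:
    - name: daemon
      image: private.example.com/nvidia/dra-driver-gpu:v0.1.0  # placeholder
      imagePullPolicy: Always
```

A pod with an identical image line but no imagePullSecrets is the failing case described above.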

Proposed Fix

  1. Fetch the image pull settings (imagePullSecrets and imagePullPolicy) from the kubelet-plugin pod spec and store them in memory. (TBD)
  2. Add imagePullSecrets and imagePullPolicy to the computedomain-daemon template (templates/compute-domain-daemon.tmpl.yaml).
  3. Extend DaemonSetTemplateData in cmd/compute-domain-controller/daemonset.go to include the image pull settings.

The fix should align the computedomain-daemon with how the kubelet-plugin, controller, and webhook are configured in deployments/helm/nvidia-dra-driver-gpu/templates/ (which use .Values.imagePullSecrets and .Values.image.pullPolicy).
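A sketch of what the template change (step 2) could look like, assuming the controller passes hypothetical PullSecrets and PullPolicy fields on DaemonSetTemplateData (these field names are illustrative, not the actual implementation):

```yaml
# templates/compute-domain-daemon.tmpl.yaml (excerpt, illustrative only)
# .PullSecrets and .PullPolicy are assumed new fields on
# DaemonSetTemplateData; only .ImageName exists today.
spec:
  template:
    spec:
      {{- if .PullSecrets }}
      imagePullSecrets:
      {{- range .PullSecrets }}
        - name: {{ . }}
      {{- end }}
      {{- end }}
      containers:
        - name: compute-domain-daemon
          image: {{ .ImageName }}
          imagePullPolicy: {{ .PullPolicy }}
```

This mirrors how the Helm chart templates render .Values.imagePullSecrets and .Values.image.pullPolicy for the kubelet-plugin, controller, and webhook.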

Workaround

Until the fix is available, users can:

  1. Disable the feature gate (not recommended for production): Set imagePullCredentialsVerificationPolicy: NeverVerify in the kubelet config. This restores the old behavior but weakens image access control.
  2. Use a public image for the driver, if acceptable for the environment.
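For workaround 1, the kubelet configuration change would look roughly like this (the imagePullCredentialsVerificationPolicy field comes from KEP-2535; verify the exact location of your kubelet config file for your distribution):

```yaml
# Kubelet configuration excerpt (commonly /var/lib/kubelet/config.yaml).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Restores the pre-1.35 behavior: never verify pull credentials for
# images already present on the node. This weakens image access control
# and is not recommended for production.
imagePullCredentialsVerificationPolicy: NeverVerify
```

The kubelet must be restarted on each node for the change to take effect.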

Metadata

Labels: kind/bug (Categorizes issue or PR as related to a bug.)
Status: Backlog
Milestone: none
Development: no branches or pull requests