Summary
The computedomain-daemon DaemonSet created by the compute-domain-controller does not include imagePullSecrets or imagePullPolicy in its pod spec. When using a private container registry and running on Kubernetes 1.35+ with the KubeletEnsureSecretPulledImages feature gate enabled (default), these pods fail with ImagePullBackOff even when the same image is already present on the node.
Affected Components
- Template: templates/compute-domain-daemon.tmpl.yaml – no imagePullSecrets or imagePullPolicy in the pod spec
- Controller: cmd/compute-domain-controller/daemonset.go – DaemonSetTemplateData does not include image pull settings; only ImageName is passed
Behavior Before Kubernetes 1.35
Previously, container images were effectively shared at the node level. If one pod (e.g., the nvidia-dra-driver-gpu-kubelet-plugin) pulled a private image using imagePullSecrets, the image was cached on the node. Other pods on the same node (e.g., computedomain-daemon) could use that cached image without specifying credentials, because the kubelet did not enforce that the pod had valid pull credentials for already-pulled images.
This allowed the computedomain-daemon to run successfully without imagePullSecrets as long as the kubelet-plugin (or another pod) had already pulled the same image on that node.
Behavior on Kubernetes 1.35+
In Kubernetes 1.35, the KubeletEnsureSecretPulledImages feature gate (KEP-2535) is enabled by default (beta). This feature ensures that images pulled with imagePullSecrets are only used by pods that have valid credentials for that image.
When enabled, the kubelet:
- Tracks which credentials were used to pull each image.
- Requires that any pod using an image must have appropriate imagePullSecrets (or equivalent credentials) before the kubelet allows use of that image.
- Rejects pods that lack credentials for a private image, even if the image is already cached on the node.
As a result:
- nvidia-dra-driver-gpu-kubelet-plugin – Has imagePullSecrets and imagePullPolicy: Always from the Helm chart → runs successfully.
- computedomain-daemon – Has no imagePullSecrets and defaults to imagePullPolicy: IfNotPresent → fails with ImagePullBackOff because the kubelet will not allow use of the cached image without credentials.
Both pods use the same image on the same node, but only the kubelet-plugin has imagePullSecrets.
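The difference comes down to the image-pull fields in each pod spec. A minimal sketch of the two specs (registry host, tag, and secret name are placeholders, not the chart's actual values):

```yaml
# kubelet-plugin pod (from the Helm chart) – pull credentials are
# present, so the kubelet verifies them and the pod runs.
spec:
  imagePullSecrets:
    - name: my-registry-secret        # placeholder secret name
  containers:
    - name: plugin
      image: private.example.com/nvidia-dra-driver-gpu:v1  # placeholder
      imagePullPolicy: Always
---
# computedomain-daemon pod (generated from the template) – no
# credentials, so on 1.35+ the kubelet refuses to reuse the cached
# private image and the pod ends up in ImagePullBackOff.
spec:
  containers:
    - name: compute-domain-daemon
      image: private.example.com/nvidia-dra-driver-gpu:v1  # same image
      # imagePullPolicy defaults to IfNotPresent; no imagePullSecrets
```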
Proposed Fix
- Fetch the imagePullSecrets and imagePullPolicy values from the kubelet-plugin pod spec and cache them in memory. (TBD)
- Add imagePullSecrets and imagePullPolicy to the computedomain-daemon template (templates/compute-domain-daemon.tmpl.yaml).
- Extend DaemonSetTemplateData in cmd/compute-domain-controller/daemonset.go to include image pull settings.
The fix should align the computedomain-daemon with how the kubelet-plugin, controller, and webhook are configured in deployments/helm/nvidia-dra-driver-gpu/templates/ (which use .Values.imagePullSecrets and .Values.image.pullPolicy).
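One way the template change could look. This is a sketch, not the final patch: the DaemonSetTemplateData field names ImagePullSecrets and ImagePullPolicy are assumptions, chosen to mirror the existing ImageName field:

```yaml
# templates/compute-domain-daemon.tmpl.yaml (sketch; only the
# relevant portion of the pod template is shown)
spec:
  template:
    spec:
      {{- if .ImagePullSecrets }}
      imagePullSecrets:
      {{- range .ImagePullSecrets }}
        - name: {{ . }}
      {{- end }}
      {{- end }}
      containers:
        - name: compute-domain-daemon
          image: {{ .ImageName }}
          imagePullPolicy: {{ .ImagePullPolicy }}
```

The controller would populate both new fields from the same chart values the kubelet-plugin already consumes (.Values.imagePullSecrets and .Values.image.pullPolicy), keeping the two workloads in sync.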
Workaround
Until the fix is available, users can:
- Relax pull-credential verification (not recommended for production): set imagePullCredentialsVerificationPolicy: NeverVerify in the kubelet config. This restores the old behavior but weakens image access control.
- Use a public image for the driver, if acceptable for the environment.
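The first workaround above maps to a small KubeletConfiguration change, applied via the kubelet's config file followed by a kubelet restart (a sketch; only the one field is relevant here):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Skip pull-credential verification for already-pulled images.
# Restores pre-1.35 behavior but weakens image access control.
imagePullCredentialsVerificationPolicy: NeverVerify
```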