Retry device enumeration on startup to prevent empty ResourceSlices #1009

Open
kasia-kujawa wants to merge 1 commit into kubernetes-sigs:main from kasia-kujawa:kkujawa_resoruceslice_empty

kasia-kujawa commented Apr 10, 2026

Fixes #1008

Adds a retry loop to NewDeviceState().
If the first enumeration returns 0 devices, the plugin retries every 5 seconds for up to 5 minutes, then proceeds with whatever the last enumeration returned.
Enumeration errors still propagate immediately, without retry.

Before the fix, nothing was logged after "Traverse GPU devices", and the empty ResourceSlice was published silently.

With the fix (nvidiaDriverRoot: /home/kubernetes/bin/nvidia/, GKE COS, Tesla T4):

I0410 08:11:16.614833  1 nvlib.go:197] Traverse GPU devices
I0410 08:11:16.779628  1 device_state.go:96] No GPU devices found yet (driver may still be initializing), retrying in 5s...
I0410 08:11:21.780288  1 nvlib.go:197] Traverse GPU devices
I0410 08:11:23.111832  1 nvlib.go:278] Adding device gpu-0 to allocatable devices
I0410 08:11:23.111862  1 allocatable.go:254] Adding allocatables for PCI bus ID: 0000:00:04.0

Full logs with the fix from nvidia-dra-driver-gpu-kubelet-plugin when GPU initialization needed slightly more time:
https://gist.github.com/kasia-kujawa/1082b48357a0ae80d663f12ee665e34c

Signed-off-by: Katarzyna Kujawa <katarzyna@cast.ai>
k8s-ci-robot added the "cncf-cla: yes" label (Indicates the PR's author has signed the CNCF CLA.) Apr 10, 2026
k8s-ci-robot requested a review from klueska Apr 10, 2026 11:46
@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: kasia-kujawa
Once this PR has been reviewed and has the lgtm label, please assign shivamerla for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

k8s-ci-robot requested a review from shivamerla Apr 10, 2026 11:46
k8s-ci-robot added the "size/L" label (Denotes a PR that changes 100-499 lines, ignoring generated files.) Apr 10, 2026
Development

Successfully merging this pull request may close these issues.

[Bug]: ResourceSlice published with no devices on GKE