Increase startupProbe rate for plugin pods#872

Merged
jgehrcke merged 1 commit intokubernetes-sigs:mainfrom
jgehrcke:jp/plugin-startup
Feb 13, 2026
Conversation

@jgehrcke
Contributor

@jgehrcke jgehrcke commented Feb 12, 2026

In many cases, both the GPU and the CD plugin pods are ready to serve requests a couple of hundred milliseconds after startup.

The default for `initialDelaySeconds` (for a startup probe) is zero. That is, the first startup probe fires almost immediately and is likely to fail.

With `periodSeconds: 10`, we then detect readiness only on the next probe, ~10 seconds after pod startup.

The probe itself is cheap; we can safely increase the probing rate for snappier readiness detection.

The overall timeout duration has to remain ~long, for the reasons laid out in #774.

--

Downsides?

  • Each probe currently triggers a no-op NodePrepareResources request, which in turn emits log messages (at least at verbosity level 4 and higher). During an actually slow startup, more of those messages would therefore be logged per unit of time. That is a tolerable downside which, if it turns out to be a problem, can be addressed in its own way without reducing the probe frequency.
  • Does anything else come to mind?
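
For illustration, a faster startup probe along these lines could look as follows. This is a sketch: the exec handler and the concrete values below are assumptions for demonstration, not necessarily the exact probe handler or numbers used in this PR.

```yaml
startupProbe:
  # Hypothetical health-check command; the real plugin may use a
  # different probe handler entirely.
  exec:
    command: ["/usr/bin/check-plugin-health"]
  # initialDelaySeconds defaults to 0: the first probe fires almost
  # immediately and may fail. A short period keeps the worst-case
  # readiness detection delay close to periodSeconds.
  periodSeconds: 1
  # Keep the overall budget (periodSeconds * failureThreshold) long,
  # for the reasons laid out in #774.
  failureThreshold: 60
```

With these numbers, a genuinely slow pod still gets periodSeconds * failureThreshold = 60 seconds before the startup probe gives up, while a fast pod is detected as ready within about a second.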

In many cases, both the GPU as well as the CD plugin pods
are ready to serve requests a couple hundred milliseconds
after startup.

The default for `initialDelaySeconds` is zero. That is,
the first startup probe happens pretty much immediately,
and is likely to fail.

With `periodSeconds: 10`, we hence typically fail the
first startup probe and then detect readiness only
~10 seconds after pod startup.

The probe itself is cheap; we can safely increase the
probing rate to allow for more snappy readiness detection.

The overall timeout duration has to remain ~long, for the
reasons laid out in kubernetes-sigs#774.

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
@jgehrcke
Contributor Author

jgehrcke commented Feb 12, 2026

Because we often re-install the Helm chart (and then wait for kubelet plugin readiness), this change has a significant impact on the overall duration of the test suite. Some examples follow:

Overall CI check duration bats-k8s134:

  • before: ~9 min
  • after: ~6 min

GPU plugin upgrade test:

  • before: ~ 45 s
  • after: ~30 s

Helm install + wait (explaining the speedups above):

  • before: ~15 s
  • after: ~6 s

@jgehrcke jgehrcke self-assigned this Feb 12, 2026
@jgehrcke jgehrcke added the perf issue/pr related to performance label Feb 12, 2026
@jgehrcke jgehrcke added this to the v26.4.0 milestone Feb 12, 2026
@jgehrcke jgehrcke merged commit bed305f into kubernetes-sigs:main Feb 13, 2026
25 of 27 checks passed

Labels

perf issue/pr related to performance
