Increase startupProbe rate for plugin pods#872

Merged
jgehrcke merged 1 commit intokubernetes-sigs:mainfrom
jgehrcke:jp/plugin-startup
Feb 13, 2026
Conversation

@jgehrcke
Contributor

@jgehrcke jgehrcke commented Feb 12, 2026

In many cases, both the GPU and the CD plugin pods are ready to serve requests a couple of hundred milliseconds after startup.

The default for `initialDelaySeconds` (for a startup probe) is zero. That is, the first startup probe fires almost immediately and is likely to fail.

With `periodSeconds: 10`, we then detect readiness only on the next probe, ~10 seconds after pod startup.

The probe itself is cheap; we can safely increase the probing rate for snappier readiness detection.

The overall timeout duration has to remain ~long, for the reasons laid out in #774.

--

Downsides?

  • Each probe currently triggers a no-op NodePrepareResources request, which in turn emits log messages (at least at verbosity level 4 and higher). During an actually slow startup, more of those messages would therefore be logged per unit of time. That is a tolerable downside which, if it turns out to be a problem, can be addressed in its own way without reducing the probe frequency.
  • Does anything else come to mind?
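
For illustration, a faster startup probe along these lines could look as follows. This is a sketch: the exec handler and the concrete values below are assumptions for demonstration, not necessarily the exact probe handler or numbers used in this PR.

```yaml
startupProbe:
  # Hypothetical health-check command; the real plugin may use a
  # different probe handler entirely.
  exec:
    command: ["/usr/bin/check-plugin-health"]
  # initialDelaySeconds defaults to 0: the first probe fires almost
  # immediately and may fail. A short period keeps the worst-case
  # readiness detection delay close to periodSeconds.
  periodSeconds: 1
  # Keep the overall budget (periodSeconds * failureThreshold) long,
  # for the reasons laid out in #774.
  failureThreshold: 60
```

With these numbers, a genuinely slow pod still gets periodSeconds * failureThreshold = 60 seconds before the startup probe gives up, while a fast pod is detected as ready within about a second.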

In many cases, both the GPU as well as the CD plugin pods
are ready to serve requests a couple hundred milliseconds
after startup.

The default for `initialDelaySeconds` is zero. That is,
the first startup probe happens pretty much immediately,
and is likely to fail.

With `periodSeconds: 10`, we hence typically fail the
first startup probe and then detect readiness only
~10 seconds after pod startup.

The probe itself is cheap; we can safely increase the
probing rate to allow for more snappy readiness detection.

The overall timeout duration has to remain ~long, for the
reasons laid out in kubernetes-sigs#774.

Signed-off-by: Dr. Jan-Philip Gehrcke <jgehrcke@nvidia.com>
@jgehrcke
Contributor Author

jgehrcke commented Feb 12, 2026

Because we often re-install the Helm chart (and then wait for kubelet plugin readiness), this change has a significant impact on the overall duration of the test suite. Some examples follow:

Overall CI check duration bats-k8s134:

  • before: ~9 min
  • after: ~6 min

GPU plugin upgrade test:

  • before: ~ 45 s
  • after: ~30 s

Helm install + wait (explaining the speedups above):

  • before: ~15 s
  • after: ~6 s

@jgehrcke jgehrcke self-assigned this Feb 12, 2026
@jgehrcke jgehrcke added the perf issue/pr related to performance label Feb 12, 2026
@jgehrcke jgehrcke added this to the v26.4.0 milestone Feb 12, 2026
@jgehrcke jgehrcke merged commit bed305f into kubernetes-sigs:main Feb 13, 2026
25 of 27 checks passed

Labels

perf issue/pr related to performance
