Skip to content

feat(operator): register TrainJobStatus server with controller manager healthz and readyz probes#3338

Open
krishdef7 wants to merge 4 commits intokubeflow:masterfrom
krishdef7:feat/statusserver-healthz-probe
Open

feat(operator): register TrainJobStatus server with controller manager healthz and readyz probes#3338
krishdef7 wants to merge 4 commits intokubeflow:masterfrom
krishdef7:feat/statusserver-healthz-probe

Conversation

@krishdef7
Copy link
Contributor

@krishdef7 krishdef7 commented Mar 14, 2026

What this PR does / why we need it

The TrainJobStatus server introduced in #3227 runs in the same process as the controller manager but was not registered with its /healthz and /readyz probes. This meant:

  • If the status server crashed mid-job, the controller pod stayed Running with no signal from the Kubernetes liveness probe
  • Training pods had no way to verify the server was ready before sending their first status update, causing silent failures during startup while TLS and the OIDC provider were still initializing

This PR follows the same pattern already used by the webhook server in the same process.

Changes

  • Add ready atomic.Bool to Server, set to true after ListenAndServeTLS starts and false on shutdown
  • Implement healthz.Checker via Check(*http.Request) error on Server
  • Register healthz.Ping for liveness and a closure-based readyz check with mgr.AddHealthzCheck/mgr.AddReadyzCheck in main.go, before mgr.Start(), probe registration must happen before the manager starts
  • Call statusserver.SetupServer inside setupManagerComponents goroutine after <-certsReady, since SetupTLSConfig requires cert files to exist on disk
  • Add TestServerCheck covering not-ready → ready → not-ready transitions

Fixes #3337

Checklist

  • Docs included if any changes are user facing
    No docs change needed, this is an internal health signaling improvement with no new user-facing API surface.

Copilot AI review requested due to automatic review settings March 14, 2026 21:49
@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@krishdef7 krishdef7 changed the title feat(statusserver): register TrainJobStatus server with controller manager healthz and readyz probes feat(controller): register TrainJobStatus server with controller manager healthz and readyz probes Mar 14, 2026
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR wires the in-process TrainJobStatus HTTPS server into the controller manager’s health and readiness probe endpoints so Kubernetes can detect startup readiness and runtime failures.

Changes:

  • Register the status server as a /healthz and /readyz checker in SetupServer.
  • Add an atomic.Bool readiness flag to the status server and expose it via Check(*http.Request) error.
  • Add a unit test covering Check() readiness transitions.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
pkg/statusserver/setup.go Registers the status server with manager healthz/readyz checks.
pkg/statusserver/server.go Tracks readiness state and implements a probe checker.
pkg/statusserver/server_test.go Adds a unit test for the new checker behavior.

@krishdef7 krishdef7 force-pushed the feat/statusserver-healthz-probe branch from bf2f59b to 8fb0b1f Compare March 14, 2026 21:52
@krishdef7 krishdef7 changed the title feat(controller): register TrainJobStatus server with controller manager healthz and readyz probes feat(operator): register TrainJobStatus server with controller manager healthz and readyz probes Mar 14, 2026
…r readyz and healthz probes

Register the TrainJobStatus server with mgr.AddHealthzCheck and
mgr.AddReadyzCheck so Kubernetes liveness/readiness probes reflect
status server health.

Adds atomic ready field to Server, set to true after TLS initialization
and before ListenAndServeTLS, set to false on shutdown. Implements
healthz.Checker via Check(*http.Request) error.

Fixes kubeflow#3337

Signed-off-by: krishdef7 <gargkrish06@gmail.com>
@krishdef7 krishdef7 force-pushed the feat/statusserver-healthz-probe branch from 8fb0b1f to 4f1be1d Compare March 14, 2026 21:58
@krishdef7
Copy link
Contributor Author

/cc @andreyvelich @robert-bell

@google-oss-prow google-oss-prow bot requested a review from andreyvelich March 14, 2026 22:00
@google-oss-prow
Copy link

@krishdef7: GitHub didn't allow me to request PR reviews from the following users: robert-bell.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc @andreyvelich @robert-bell

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@krishdef7
Copy link
Contributor Author

The E2E failures appear to be pre-existing from #3227 rather than introduced by this PR. The controller crashes within ~2s with --feature-gates=TrainJobStatus=true before any probes fire (probes have 10-15s delays), which was already the case when #3227 was merged manually due to GPU runner outages. This PR only adds AddHealthzCheck/AddReadyzCheck calls in setup.go after all existing crash-prone initialization, SetupTLSConfig, createClient, and NewServer are unchanged. Happy to investigate the root cause of the pre-existing E2E failure if helpful.

@krishdef7
Copy link
Contributor Author

Correction to my earlier comment: the E2E failures are introduced by this PR, not pre-existing from #3227. Root cause identified: mgr.AddHealthzCheck and mgr.AddReadyzCheck are called from inside setupManagerComponents, which runs as a goroutine concurrently with mgr.Start(). controller-runtime rejects these registrations once the manager has started, returning a non-nil error that propagates to os.Exit(1). This explains the deterministic ~2s crash, the race is reliably lost in the E2E environment. Unit tests mock the manager so they never hit this path. Fix: move statusserver.SetupServer to the synchronous setup path in main.go, before the goroutine starts. Working on it now.

…r.Start()

Signed-off-by: krishdef7 <gargkrish06@gmail.com>
Signed-off-by: krishdef7 <gargkrish06@gmail.com>
Signed-off-by: krishdef7 <gargkrish06@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat: wire TrainJobStatus server into controller manager readyz and healthz probes

2 participants