feat(operator): register TrainJobStatus server with controller manager healthz and readyz probes by krishdef7 · Pull Request #3338 · kubeflow/trainer

krishdef7 · 2026-03-14T21:49:03Z

What this PR does / why we need it

The TrainJobStatus server introduced in #3227 runs in the same process as the controller manager but was not registered with its /healthz and /readyz probes. This meant:

If the status server crashed mid-job, the controller pod stayed Running with no signal from the Kubernetes liveness probe
Training pods had no way to verify the server was ready before sending their first status update, causing silent failures during startup while TLS and the OIDC provider were still initializing

This PR follows the same pattern already used by the webhook server in the same process.

Changes

Add ready atomic.Bool to Server, set to true after ListenAndServeTLS starts and false on shutdown
Implement healthz.Checker via Check(*http.Request) error on Server
Register healthz.Ping for liveness and a closure-based readyz check with mgr.AddHealthzCheck/mgr.AddReadyzCheck in main.go, before mgr.Start(), probe registration must happen before the manager starts
Call statusserver.SetupServer inside setupManagerComponents goroutine after <-certsReady, since SetupTLSConfig requires cert files to exist on disk
Add TestServerCheck covering not-ready → ready → not-ready transitions

Fixes #3337

Checklist

Docs included if any changes are user facing
No docs change needed, this is an internal health signaling improvement with no new user-facing API surface.

google-oss-prow · 2026-03-14T21:49:09Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign astefanutti for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Copilot

Pull request overview

This PR wires the in-process TrainJobStatus HTTPS server into the controller manager’s health and readiness probe endpoints so Kubernetes can detect startup readiness and runtime failures.

Changes:

Register the status server as a /healthz and /readyz checker in SetupServer.
Add an atomic.Bool readiness flag to the status server and expose it via Check(*http.Request) error.
Add a unit test covering Check() readiness transitions.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
pkg/statusserver/setup.go	Registers the status server with manager healthz/readyz checks.
pkg/statusserver/server.go	Tracks readiness state and implements a probe checker.
pkg/statusserver/server_test.go	Adds a unit test for the new checker behavior.

pkg/statusserver/server.go

pkg/statusserver/setup.go

pkg/statusserver/server.go

…r readyz and healthz probes Register the TrainJobStatus server with mgr.AddHealthzCheck and mgr.AddReadyzCheck so Kubernetes liveness/readiness probes reflect status server health. Adds atomic ready field to Server, set to true after TLS initialization and before ListenAndServeTLS, set to false on shutdown. Implements healthz.Checker via Check(*http.Request) error. Fixes kubeflow#3337 Signed-off-by: krishdef7 <gargkrish06@gmail.com>

krishdef7 · 2026-03-14T22:00:44Z

/cc @andreyvelich @robert-bell

google-oss-prow · 2026-03-14T22:00:48Z

@krishdef7: GitHub didn't allow me to request PR reviews from the following users: robert-bell.

Note that only kubeflow members and repo collaborators can review this PR, and authors cannot review their own PRs.

Details

In response to this:

/cc @andreyvelich @robert-bell

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

krishdef7 · 2026-03-14T22:26:06Z

The E2E failures appear to be pre-existing from #3227 rather than introduced by this PR. The controller crashes within ~2s with --feature-gates=TrainJobStatus=true before any probes fire (probes have 10-15s delays), which was already the case when #3227 was merged manually due to GPU runner outages. This PR only adds AddHealthzCheck/AddReadyzCheck calls in setup.go after all existing crash-prone initialization, SetupTLSConfig, createClient, and NewServer are unchanged. Happy to investigate the root cause of the pre-existing E2E failure if helpful.

krishdef7 · 2026-03-14T22:44:56Z

Correction to my earlier comment: the E2E failures are introduced by this PR, not pre-existing from #3227. Root cause identified: mgr.AddHealthzCheck and mgr.AddReadyzCheck are called from inside setupManagerComponents, which runs as a goroutine concurrently with mgr.Start(). controller-runtime rejects these registrations once the manager has started, returning a non-nil error that propagates to os.Exit(1). This explains the deterministic ~2s crash, the race is reliably lost in the E2E environment. Unit tests mock the manager so they never hit this path. Fix: move statusserver.SetupServer to the synchronous setup path in main.go, before the goroutine starts. Working on it now.

…r.Start() Signed-off-by: krishdef7 <gargkrish06@gmail.com>

Signed-off-by: krishdef7 <gargkrish06@gmail.com>

Copilot AI review requested due to automatic review settings March 14, 2026 21:49

google-oss-prow bot requested review from jinchihe and kuizhiqing March 14, 2026 21:49

google-oss-prow bot added the size/M label Mar 14, 2026

Copilot started reviewing on behalf of krishdef7 March 14, 2026 21:49 View session

krishdef7 changed the title ~~feat(statusserver): register TrainJobStatus server with controller manager healthz and readyz probes~~ feat(controller): register TrainJobStatus server with controller manager healthz and readyz probes Mar 14, 2026

Copilot AI reviewed Mar 14, 2026

View reviewed changes

pkg/statusserver/server.go Outdated Show resolved Hide resolved

pkg/statusserver/server.go Outdated Show resolved Hide resolved

pkg/statusserver/setup.go Outdated Show resolved Hide resolved

pkg/statusserver/server.go Show resolved Hide resolved

krishdef7 force-pushed the feat/statusserver-healthz-probe branch from bf2f59b to 8fb0b1f Compare March 14, 2026 21:52

krishdef7 changed the title ~~feat(controller): register TrainJobStatus server with controller manager healthz and readyz probes~~ feat(operator): register TrainJobStatus server with controller manager healthz and readyz probes Mar 14, 2026

krishdef7 force-pushed the feat/statusserver-healthz-probe branch from 8fb0b1f to 4f1be1d Compare March 14, 2026 21:58

google-oss-prow bot requested a review from andreyvelich March 14, 2026 22:00

krishdef7 added 3 commits March 15, 2026 04:19

fix(operator): move status server setup to synchronous path before mg…

30e050b

…r.Start() Signed-off-by: krishdef7 <gargkrish06@gmail.com>

fix(operator): defer status server TLS setup until certs ready

9d6bff1

Signed-off-by: krishdef7 <gargkrish06@gmail.com>

chore: fix trailing whitespace in main.go

40f47f1

Signed-off-by: krishdef7 <gargkrish06@gmail.com>

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(operator): register TrainJobStatus server with controller manager healthz and readyz probes#3338

feat(operator): register TrainJobStatus server with controller manager healthz and readyz probes#3338
krishdef7 wants to merge 4 commits intokubeflow:masterfrom
krishdef7:feat/statusserver-healthz-probe

krishdef7 commented Mar 14, 2026 •

edited

Loading

Uh oh!

google-oss-prow bot commented Mar 14, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

krishdef7 commented Mar 14, 2026

Uh oh!

google-oss-prow bot commented Mar 14, 2026

Uh oh!

krishdef7 commented Mar 14, 2026

Uh oh!

krishdef7 commented Mar 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

krishdef7 commented Mar 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

google-oss-prow bot commented Mar 14, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

krishdef7 commented Mar 14, 2026

Uh oh!

google-oss-prow bot commented Mar 14, 2026

Uh oh!

krishdef7 commented Mar 14, 2026

Uh oh!

krishdef7 commented Mar 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

krishdef7 commented Mar 14, 2026 •

edited

Loading