feat(operator): register TrainJobStatus server with controller manager healthz and readyz probes #3338
Pull request overview
This PR wires the in-process TrainJobStatus HTTPS server into the controller manager’s health and readiness probe endpoints so Kubernetes can detect startup readiness and runtime failures.
Changes:
- Register the status server as a `/healthz` and `/readyz` checker in `SetupServer`.
- Add an `atomic.Bool` readiness flag to the status server and expose it via `Check(*http.Request) error`.
- Add a unit test covering `Check()` readiness transitions.
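The readiness flag and checker described in this list can be sketched roughly as follows. This is a minimal illustration, not the code from `pkg/statusserver/server.go`: the field name `ready` and the error text are assumptions, and the real `Server` also carries the TLS listener and handlers.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sync/atomic"
)

// Server is a stand-in for the status server. Only the readiness
// bookkeeping is shown; the field name "ready" is an assumption.
type Server struct {
	ready atomic.Bool // zero value is false, so a new Server is not ready
}

// Check has the func(*http.Request) error shape that controller-runtime's
// healthz.Checker expects: a nil return means the probe passes.
func (s *Server) Check(_ *http.Request) error {
	if !s.ready.Load() {
		return errors.New("status server not ready")
	}
	return nil
}

func main() {
	s := &Server{}
	fmt.Println("initial:", s.Check(nil))

	s.ready.Store(true)
	fmt.Println("after ready:", s.Check(nil))

	s.ready.Store(false)
	fmt.Println("after shutdown:", s.Check(nil))
}
```

Using `atomic.Bool` keeps the probe handler lock-free: the probe goroutine only reads the flag, while the serving goroutine is the only writer.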
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/statusserver/setup.go | Registers the status server with manager healthz/readyz checks. |
| pkg/statusserver/server.go | Tracks readiness state and implements a probe checker. |
| pkg/statusserver/server_test.go | Adds a unit test for the new checker behavior. |
…r readyz and healthz probes

Register the TrainJobStatus server with mgr.AddHealthzCheck and mgr.AddReadyzCheck so Kubernetes liveness/readiness probes reflect status server health. Adds atomic ready field to Server, set to true after TLS initialization and before ListenAndServeTLS, set to false on shutdown. Implements healthz.Checker via Check(*http.Request) error.

Fixes kubeflow#3337

Signed-off-by: krishdef7 <gargkrish06@gmail.com>
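The ordering in this commit message (flag set after TLS initialization and before the blocking serve call, cleared on shutdown) could look roughly like the sketch below. The `serve` field is a hypothetical stand-in for `ListenAndServeTLS` so the example runs without certificates or a network listener, and `observe` is an illustrative helper, not part of the PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/http"
	"sync/atomic"
)

type Server struct {
	ready atomic.Bool
	// serve is a hypothetical stand-in for the blocking ListenAndServeTLS call.
	serve func(ctx context.Context) error
}

func (s *Server) Check(_ *http.Request) error {
	if !s.ready.Load() {
		return errors.New("status server not ready")
	}
	return nil
}

// Start marks the server ready just before serving and clears the flag
// when serving ends, whether it ends cleanly or with an error.
func (s *Server) Start(ctx context.Context) error {
	// TLS configuration would be initialized here in the real server.
	s.ready.Store(true)
	defer s.ready.Store(false)
	return s.serve(ctx)
}

// observe runs one start/stop cycle and returns what Check reported
// before start, while serving, and after shutdown.
func observe() (before, during, after error) {
	serving := make(chan struct{})
	s := &Server{serve: func(ctx context.Context) error {
		close(serving) // signal that the serve call has been entered
		<-ctx.Done()   // block until shutdown is requested
		return nil
	}}
	before = s.Check(nil)

	ctx, cancel := context.WithCancel(context.Background())
	stopped := make(chan error, 1)
	go func() { stopped <- s.Start(ctx) }()

	<-serving
	during = s.Check(nil)

	cancel()
	<-stopped
	after = s.Check(nil)
	return before, during, after
}

func main() {
	before, during, after := observe()
	fmt.Println("before start:", before)
	fmt.Println("while serving:", during)
	fmt.Println("after shutdown:", after)
}
```

Clearing the flag in a `defer` means a serve call that returns an error also flips the probe back to not-ready, which is the runtime-failure case the probes are meant to surface.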
The E2E failures appear to be pre-existing from #3227 rather than introduced by this PR. The controller crashes within ~2s with

Correction to my earlier comment: the E2E failures are introduced by this PR, not pre-existing from #3227. Root cause identified:
…r.Start() Signed-off-by: krishdef7 <gargkrish06@gmail.com>
What this PR does / why we need it
The TrainJobStatus server introduced in #3227 runs in the same process as the controller manager but was not registered with its /healthz and /readyz probes. This meant:
This PR follows the same pattern already used by the webhook server in the same process.
Changes
- Add `atomic.Bool` to `Server`, set to `true` after `ListenAndServeTLS` starts and `false` on shutdown
- Implement `healthz.Checker` via `Check(*http.Request) error` on `Server`
- Register `healthz.Ping` for liveness and a closure-based readyz check with `mgr.AddHealthzCheck`/`mgr.AddReadyzCheck` in main.go, before `mgr.Start()`; probe registration must happen before the manager starts
- Call `statusserver.SetupServer` inside the `setupManagerComponents` goroutine after `<-certsReady`, since `SetupTLSConfig` requires cert files to exist on disk
- Add `TestServerCheck` covering not-ready → ready → not-ready transitions

Fixes #3337
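The closure-based readyz check exists because the probe has to be registered before `mgr.Start()`, while the server itself is only constructed later, once the certificates are on disk. A minimal sketch of that pattern, under stated assumptions: `deferredCheck` and the `Checker` type are illustrative names (the latter mirrors the signature of controller-runtime's `healthz.Checker`), and the goroutine only imitates what `setupManagerComponents` does.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sync/atomic"
)

// Checker mirrors the signature of controller-runtime's healthz.Checker.
type Checker func(*http.Request) error

type Server struct{ ready atomic.Bool }

func (s *Server) Check(_ *http.Request) error {
	if !s.ready.Load() {
		return errors.New("status server not ready")
	}
	return nil
}

// deferredCheck (hypothetical helper) returns a readyz closure that can be
// registered immediately, plus a setter the setup goroutine calls once the
// server actually exists.
func deferredCheck() (Checker, func(*Server)) {
	var srv atomic.Pointer[Server]
	check := func(req *http.Request) error {
		s := srv.Load()
		if s == nil {
			return errors.New("status server not set up yet")
		}
		return s.Check(req)
	}
	return check, srv.Store
}

func main() {
	// In main.go the closure would be handed to mgr.AddReadyzCheck here,
	// before mgr.Start().
	check, set := deferredCheck()

	certsReady := make(chan struct{})
	done := make(chan struct{})
	go func() { // imitates the setupManagerComponents goroutine
		defer close(done)
		<-certsReady // TLS setup needs the cert files to exist first
		s := &Server{}
		s.ready.Store(true)
		set(s)
	}()

	fmt.Println("before certs:", check(nil))
	close(certsReady)
	<-done
	fmt.Println("after setup:", check(nil))
}
```

Until the setter runs, the readiness probe fails with an explicit error instead of dereferencing a nil server, so the pod simply reports not-ready during certificate bootstrap.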
Checklist
No docs change needed: this is an internal health signaling improvement with no new user-facing API surface.