feat: wire TrainJobStatus server into controller manager readyz and healthz probes #3337

@krishdef7

What you would like to be added?

The TrainJobStatus server introduced in #3227 runs in the same process as the controller manager but is not registered with its /readyz and /healthz probes.

The status server should implement healthz.Checker and register it via mgr.AddHealthzCheck and mgr.AddReadyzCheck in pkg/statusserver/setup.go.
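A rough sketch of what this wiring could look like, not the actual implementation. To keep it runnable without dependencies, `Checker` is redeclared locally with the same shape as controller-runtime's `healthz.Checker` (`func(*http.Request) error`), and `probeRegistrar` stands in for the two manager methods named above. `StatusServer`, its `ready`/`failed` flags, `SetupProbes`, `fakeManager`, and the probe name are all hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"sync/atomic"
)

// Checker has the same shape as controller-runtime's healthz.Checker.
// It is redeclared here only so the sketch compiles on its own.
type Checker func(req *http.Request) error

// probeRegistrar is a hypothetical subset of the manager interface,
// covering just the two registration methods this issue refers to.
type probeRegistrar interface {
	AddHealthzCheck(name string, check Checker) error
	AddReadyzCheck(name string, check Checker) error
}

// StatusServer is a hypothetical stand-in for the TrainJobStatus server.
type StatusServer struct {
	ready  atomic.Bool // assumed to be set once TLS and OIDC setup finish
	failed atomic.Bool // assumed to be set if the serving loop exits
}

// HealthzCheck reports an error once the server has stopped serving.
func (s *StatusServer) HealthzCheck(_ *http.Request) error {
	if s.failed.Load() {
		return errors.New("status server has stopped serving")
	}
	return nil
}

// ReadyzCheck reports an error until initialization has completed,
// and also fails whenever the liveness check fails.
func (s *StatusServer) ReadyzCheck(req *http.Request) error {
	if !s.ready.Load() {
		return errors.New("status server is still initializing")
	}
	return s.HealthzCheck(req)
}

// probeName is an illustrative name for the registered checks.
const probeName = "trainjob-status-server"

// SetupProbes registers both checks with the manager.
func SetupProbes(mgr probeRegistrar, s *StatusServer) error {
	if err := mgr.AddHealthzCheck(probeName, s.HealthzCheck); err != nil {
		return fmt.Errorf("adding healthz check: %w", err)
	}
	if err := mgr.AddReadyzCheck(probeName, s.ReadyzCheck); err != nil {
		return fmt.Errorf("adding readyz check: %w", err)
	}
	return nil
}

// fakeManager records registered checks so the sketch can run standalone.
type fakeManager struct {
	healthz map[string]Checker
	readyz  map[string]Checker
}

func newFakeManager() *fakeManager {
	return &fakeManager{healthz: map[string]Checker{}, readyz: map[string]Checker{}}
}

func (m *fakeManager) AddHealthzCheck(name string, c Checker) error {
	m.healthz[name] = c
	return nil
}

func (m *fakeManager) AddReadyzCheck(name string, c Checker) error {
	m.readyz[name] = c
	return nil
}

func main() {
	srv := &StatusServer{}
	mgr := newFakeManager()
	if err := SetupProbes(mgr, srv); err != nil {
		panic(err)
	}
	fmt.Println("readyz before init:", mgr.readyz[probeName](nil))
	srv.ready.Store(true)
	fmt.Println("readyz after init:", mgr.readyz[probeName](nil))
}
```

In the real code the manager would be the controller-runtime `ctrl.Manager` and the checks would use `healthz.Checker` directly. How the server tracks readiness and failure is left open here.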

Discussed in #3227 and flagged as a follow-up by @andreyvelich.

/kind feature
/area controller

Why is this needed?

Training pods have no way to verify that the status server is ready before sending their first update, so updates can fail silently while the server is still initializing TLS or the OIDC provider. If the server crashes mid-job, the controller pod stays Running and the existing liveness probe gives no signal.

Wiring into readyz/healthz is a low-risk pattern already used by the webhook server in the same process.

Love this feature?

Give it a 👍. We prioritize the features with the most 👍.
