feat: add support for tracking TrainJob progress and training metrics #3227
andreyvelich merged 65 commits into kubeflow:master
Conversation
cc @andreyvelich @tenzen-y @astefanutti @akshaychitneni The progress tracking implementation is ready for an initial review once you have bandwidth. I still have some bits to work through (I'm updating the task list as I go along), but I'd appreciate any early feedback on the approach I've taken. Particular areas I'd be keen for feedback on:
Pull request overview
Implements the TrainJobProgress KEP by adding a progress/metrics reporting surface to TrainJob and wiring an authenticated HTTPS “progress server” plus a runtime plugin that injects the required runtime configuration into training pods.
Changes:
- Add `status.trainerStatus` (progress %, ETA, metrics, timestamp) to the TrainJob API and generated clients/specs.
- Introduce the `TrainJobProgress` alpha feature gate and a controller-side HTTPS progress server with authn/authz and a rate-limited client.
- Add a runtime framework "progress" plugin that injects env vars, a projected SA token, and a CA bundle ConfigMap mount into trainer pods; add e2e coverage.
Reviewed changes
Copilot reviewed 74 out of 75 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| test/integration/framework/framework.go | Pass default controller configuration into runtime initialization for integration tests. |
| test/e2e/testdata/progress.py | E2E helper script that posts a progress payload from inside a training container. |
| test/e2e/e2e_test.go | Adds an e2e that validates trainerStatus gets updated via the progress endpoint. |
| pkg/webhooks/trainjob_webhook_test.go | Updates webhook tests for new runtime initialization signature (config param). |
| pkg/util/testing/config.go | Adds a test helper to build a defaulted Configuration object. |
| pkg/util/runtime/runtime.go | Adds shared helper to detect operator namespace (with local default). |
| pkg/util/cert/cert.go | Refactors namespace lookup and adds cert-watcher-backed TLS config helper. |
| pkg/runtime/framework/plugins/volcano/volcano_test.go | Updates plugin constructor signature in tests (adds cfg param). |
| pkg/runtime/framework/plugins/volcano/volcano.go | Updates plugin constructor signature to accept config. |
| pkg/runtime/framework/plugins/torch/torch_test.go | Updates plugin constructor signature in tests (adds cfg param). |
| pkg/runtime/framework/plugins/torch/torch.go | Updates plugin constructor signature to accept config. |
| pkg/runtime/framework/plugins/registry.go | Extends plugin factory signature to pass config; conditionally registers Progress plugin behind feature gate. |
| pkg/runtime/framework/plugins/progress/progress_test.go | Unit tests for Progress plugin injection and ConfigMap creation behavior. |
| pkg/runtime/framework/plugins/progress/progress.go | New Progress plugin that injects env/token/CA mount + creates CA ConfigMap per TrainJob. |
| pkg/runtime/framework/plugins/plainml/plainml_test.go | Updates plugin constructor signature in tests (adds cfg param). |
| pkg/runtime/framework/plugins/plainml/plainml.go | Updates plugin constructor signature to accept config. |
| pkg/runtime/framework/plugins/mpi/mpi_test.go | Updates plugin constructor signature in tests (adds cfg param). |
| pkg/runtime/framework/plugins/mpi/mpi.go | Updates plugin constructor signature to accept config. |
| pkg/runtime/framework/plugins/jobset/jobset_test.go | Updates plugin constructor signature in tests (adds cfg param). |
| pkg/runtime/framework/plugins/jobset/jobset.go | Updates plugin constructor signature to accept config. |
| pkg/runtime/framework/plugins/jax/jax_test.go | Updates plugin constructor signature in tests (adds cfg param). |
| pkg/runtime/framework/plugins/jax/jax.go | Updates plugin constructor signature to accept config. |
| pkg/runtime/framework/plugins/coscheduling/coscheduling_test.go | Updates plugin constructor signature in tests (adds cfg param). |
| pkg/runtime/framework/plugins/coscheduling/coscheduling.go | Updates plugin constructor signature to accept config. |
| pkg/runtime/framework/core/framework_test.go | Updates framework construction in tests to pass default config. |
| pkg/runtime/framework/core/framework.go | Threads config through plugin factories during framework initialization. |
| pkg/runtime/core/trainingruntime_test.go | Updates runtime tests to pass config into runtime construction. |
| pkg/runtime/core/trainingruntime.go | Threads config into framework initialization. |
| pkg/runtime/core/registry.go | Updates runtime registrar factory signature to accept config. |
| pkg/runtime/core/core.go | Threads config through runtime registry initialization. |
| pkg/runtime/core/clustertrainingruntime_test.go | Updates cluster runtime tests to pass config into runtime construction. |
| pkg/runtime/core/clustertrainingruntime.go | Updates cluster runtime constructor signature to accept config. |
| pkg/progress/setup.go | Adds manager wiring for the progress server (TLS watcher, verifier, separate client). |
| pkg/progress/server_test.go | Adds server tests for success + error responses and auth behavior. |
| pkg/progress/server.go | Implements HTTPS progress server endpoint that patches status.trainerStatus with authz based on pod label + token claims. |
| pkg/progress/middleware_test.go | Adds recovery middleware test. |
| pkg/progress/middleware.go | Adds middleware chain: recovery, logging, authn, body-size limits. |
| pkg/progress/constants.go | Adds shared label/audience constants and status URL helper. |
| pkg/progress/auth.go | Adds projected SA token OIDC verification + issuer discovery from in-cluster token. |
| pkg/features/features.go | Introduces TrainJobProgress alpha feature gate defaulting to disabled. |
| pkg/controller/trainjob_controller.go | Avoids reconciler overwriting externally-managed status.trainerStatus updates. |
| pkg/config/validation.go | Adds validation for ProgressServer config (port/qps/burst). |
| pkg/config/config_test.go | Extends config tests to cover ProgressServer defaults and file loading. |
| pkg/client/applyconfiguration/utils.go | Adds applyconfiguration mappings for Metric and TrainJobTrainerStatus. |
| pkg/client/applyconfiguration/trainer/v1alpha1/trainjobtrainerstatus.go | Generated applyconfiguration for TrainJobTrainerStatus. |
| pkg/client/applyconfiguration/trainer/v1alpha1/trainjobstatus.go | Adds TrainerStatus to applyconfiguration for TrainJobStatus. |
| pkg/client/applyconfiguration/trainer/v1alpha1/metric.go | Generated applyconfiguration for Metric. |
| pkg/apis/trainer/v1alpha1/zz_generated.openapi.go | Updates OpenAPI defs for Metric/ProgressStatus/TrainJobTrainerStatus and status field. |
| pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go | Adds deepcopy support for new API types/fields. |
| pkg/apis/trainer/v1alpha1/trainjob_types.go | Adds status.trainerStatus, Metric, ProgressStatus, and TrainJobTrainerStatus types + validations. |
| pkg/apis/config/v1alpha1/zz_generated.deepcopy.go | Adds deepcopy support for ProgressServer config. |
| pkg/apis/config/v1alpha1/defaults.go | Adds defaulting for ProgressServer config. |
| pkg/apis/config/v1alpha1/configuration_types.go | Adds ProgressServer config API type and Configuration field. |
| manifests/base/rbac/role.yaml | Grants controller permission to get pods (required for progress server authz). |
| manifests/base/manager/manager.yaml | Exposes the progress-server container port and service port. |
| manifests/base/manager/controller_manager_config.yaml | Adds progressServer config defaults to the base config. |
| manifests/base/crds/trainer.kubeflow.org_trainjobs.yaml | Updates CRD schema to include status.trainerStatus. |
| hack/e2e-setup-gpu-cluster.sh | Enables TrainJobProgress feature gate for e2e cluster setup. |
| hack/e2e-setup-cluster.sh | Enables TrainJobProgress feature gate for e2e cluster setup. |
| go.sum | Adds module checksums for OIDC dependencies. |
| go.mod | Adds go-oidc dependency (and go-jose indirect). |
| cmd/trainer-controller-manager/main.go | Adds --feature-gates, threads config into runtimes, and conditionally starts progress server. |
| charts/kubeflow-trainer/values.yaml | Adds progressServer config values. |
| charts/kubeflow-trainer/templates/manager/service.yaml | Adds Service port for progress-server. |
| charts/kubeflow-trainer/templates/manager/deployment.yaml | Adds progress-server container port. |
| charts/kubeflow-trainer/templates/manager/configmap.yaml | Renders progressServer config into manager configmap. |
| charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainjobs.yaml | Updates Helm CRD schema to include status.trainerStatus. |
| charts/kubeflow-trainer/README.md | Updates values documentation to include progressServer config. |
| api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_train_job_trainer_status.py | Adds generated Python model for TrainJobTrainerStatus. |
| api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_train_job_status.py | Adds TrainerStatus field to generated Python TrainJobStatus model. |
| api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_progress_status.py | Adds generated Python model for ProgressStatus. |
| api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_metric.py | Adds generated Python model for Metric. |
| api/python_api/kubeflow_trainer_api/models/init.py | Exports newly generated Python models. |
| api/openapi-spec/swagger.json | Updates published OpenAPI spec with new schemas/fields. |
| Makefile | Includes pkg/progress in controller-gen targets (CRDs/RBAC generation). |
astefanutti left a comment
Thanks @robert-bell!
I would suggest using "TrainJobStatus" or "TrainJobRuntimeStatus" instead of TrainJobProgress to be more future-proof.
pkg/status/server.go
Outdated
		TrainerStatus: progressStatus.TrainerStatus,
	},
}
if err := s.client.Status().Patch(r.Context(), &trainJob, client.Merge); err != nil {
Use SSA with a dedicated manager.
I think I've updated to SSA, but I'm not sure if I used the right approach - would you mind double checking? I've updated the client to use apply and the generated applyconfiguration client code, and reverted the change in the trainjob reconciler that was unsetting the trainer status.
I couldn't get the existing unit tests to work with SSA - the client apply is failing with `Operation cannot be fulfilled on trainjobs.trainer.kubeflow.org "test-job": object was modified`. Is that expected - is it what the TODO on the trainjob controller is referring to?
// TODO(astefanutti): Consider using SSA once controller-runtime client has SSA support
// for sub-resources. See: kubernetes-sigs/controller-runtime#3183
Yes, you've taken the right approach on the status server side with:
s.client.Status().Apply(r.Context(), trainJob, client.ForceOwnership,
Now you can apply it to the controller side, which is indeed what that TODO is about.
Thanks for confirming. If it's OK, can we address that TODO in a separate PR to keep this one focused? This PR doesn't depend on that TODO being fixed.
pkg/progress/setup.go
Outdated
	"github.com/kubeflow/trainer/v2/pkg/util/cert"
)

func SetupServer(mgr ctrl.Manager, cfg *configapi.ProgressServer) error {
Should there be a network policy that only authorizes ingress from TrainJob Pods?
Yeah, that'd be a nice extra layer, though the controller pod is also serving metrics and the webhook, so I think the netpol would also need to explicitly allow ingress into those ports from any IP. I also don't think I can disable the netpol with the feature gate, so it'd always be active.
Are we OK with that?
The ingress rules for metrics and webhooks could probably be restricted to allowed incoming namespaces.
Also, the NetworkPolicy could be applied dynamically at start-time.
This can be tackled in a follow-up PR to keep the scope of this PR focused on the KEP.
pkg/status/server.go
Outdated
}

// Verify the pod has the label identifying it belongs to this TrainJob
trainJobNameFromLabel, ok := pod.Labels[LabelTrainJobName]
Would using a finer-grained audience avoid getting Pods?
Are you thinking something like putting the TrainJob name in the audience? That would let us avoid looking up the pod. It'd also mean I wouldn't need to decode the SA token, which is cleaner.
Did you have any thoughts about the token format? It might be worth including the whole path in there for future extensibility - something like this? Or is this overkill?
trainer.kubeflow.org/v1alpha1/namespace/{namespace}/trainjobs/{name}/status
Just to add: I think encoding information in the audience is non-conventional. It's normally just supposed to be a static string identifying the target service.
I think it works for us because the pods should only ever be updating their parent TrainJob, but I'm flagging it because it could introduce brittleness in the future.
Are you thinking something like putting the train job name in the audience?
Yes
I think encoding information in the audience is non-conventional. It's normally just supposed to be a static string identifying the target service.
That's a fair point I agree, though I'm not sure if "conventional" usage is very well defined either.
Curious: would we lose the pod binding check? Would tokens from deleted pods still be allowed to update status?
I've pushed an update that uses a different audience per TrainJob. @astefanutti please take a look.
Curious if we would lose the pod binding check? would it mean that the tokens used with deleted pods will be allowed to update status?
Yeah, we do lose this, but I think the practical impact is limited because the endpoint itself has fairly limited capabilities and the token expiry is short (1 hr). Happy to discuss alternatives though.
@robert-bell That looks great, I agree with you the impact is limited in practice.
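For illustration, here is a minimal sketch of what a per-TrainJob audience helper could look like, assuming the path-style format floated above. The names `AudienceFor` and `ParseAudience` are hypothetical, not the merged implementation:

```go
package main

import (
	"fmt"
	"strings"
)

const audiencePrefix = "trainer.kubeflow.org/v1alpha1"

// AudienceFor builds a per-TrainJob audience string in the path-style
// format discussed in the review thread.
func AudienceFor(namespace, name string) string {
	return fmt.Sprintf("%s/namespace/%s/trainjobs/%s/status", audiencePrefix, namespace, name)
}

// ParseAudience extracts the namespace and TrainJob name back out,
// returning ok=false for anything that does not match the expected shape
// (e.g. a conventional static-service audience).
func ParseAudience(aud string) (namespace, name string, ok bool) {
	rest, found := strings.CutPrefix(aud, audiencePrefix+"/namespace/")
	if !found {
		return "", "", false
	}
	parts := strings.Split(rest, "/")
	if len(parts) != 4 || parts[1] != "trainjobs" || parts[3] != "status" {
		return "", "", false
	}
	return parts[0], parts[2], true
}

func main() {
	aud := AudienceFor("team-a", "llm-sft")
	fmt.Println(aud)
	ns, name, ok := ParseAudience(aud)
	fmt.Println(ns, name, ok)
}
```

With this shape, the server can authorize a request purely from the verified token's audience, which is what removes the need to fetch the pod.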
pkg/status/auth.go
Outdated
}

verifier := provider.Verifier(&oidc.Config{
	ClientID: TokenAudience,
Does it also verify the issuer?
Yep - the issuer is configured when the provider is created, and the verifier only accepts that issuer.
pkg/status/server.go
Outdated
// authorizeRequest checks whether the service account token bearer token used by this request comes from
// a pod that is part of the TrainJob that is being updated.
func (s *Server) authorizeRequest(r *http.Request, namespace, trainJobName string) bool {
	token, ok := serviceAccountTokenFromContext(r.Context())
Any reason to have auth in the handler, saving the token in the context, vs using middleware?
No strong reason. I kept it in the handler mainly because the authz logic depends on the job name/namespace from the request path. Keeping it in the handler makes that dependency more explicit, whereas in middleware it would be a bit more implicit and coupled to path parsing.
Btw, in the latest version I'm not storing the parsed token in the context. The authn has also moved to the handler because it now depends on job name/namespace too.
akshaychitneni left a comment
Thanks @robert-bell
andreyvelich left a comment
Thank you for this awesome work @robert-bell! Overall looks great!
I left a few comments, will check tests soon as well.
}

mux := http.NewServeMux()
mux.HandleFunc("POST "+StatusUrl("{namespace}", "{name}"), s.handleTrainJobRuntimeStatus)
Do we need to implement health check endpoint, so our instrumentation can check if server is healthy before reporting metrics?
Similar to /readyz for Kubernetes API server: https://kubernetes.io/docs/reference/using-api/health-checks/
Good shout: yes, it's needed. It has to hook into the existing health checks because the server runs in the same process as the controllers and webhooks, but it looks like the controller Manager will let me add extra checks to the probes.
It looks like it might need a bit of rejigging. Leave it with me and I'll try to implement it this week.
@robert-bell Sounds good, we can also do that in the followup PR. Let's just create an issue to track it.
	WithTrainerStatus(toApplyConfig(runtimeStatus.TrainerStatus)),
)

if err := s.client.Status().Apply(r.Context(), trainJob, client.ForceOwnership, client.FieldOwner("trainer-status")); err != nil {
What if client calls server multiple times?
Do we need to have any guardrails for client calls?
Client-side rate limiting gets triggered, which limits the API calls; I've tested that it works manually. FYI, I deliberately gave the server a separate client to prevent any rate limiting from affecting the reconcile loop.
We could look at additional protection, but I wonder whether it'd be more effective to put that protection in the instrumentation code going into the SDK? WDYT?
Probably. Let's try to discuss it once we have initial implementation.
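For intuition, client-side rate limiting behaves like a token bucket: a burst of calls is absorbed up front, and anything beyond the sustained QPS is throttled. The real implementation relies on the Kubernetes client's built-in QPS/Burst limiter; this is only a stand-in sketch of the behavior:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// limiter is a tiny token bucket: tokens refill at `rate` per second up to
// `max`, and each allowed call spends one token.
type limiter struct {
	mu     sync.Mutex
	tokens float64
	max    float64
	rate   float64
	last   time.Time
}

func newLimiter(qps float64, burst int) *limiter {
	return &limiter{tokens: float64(burst), max: float64(burst), rate: qps, last: time.Now()}
}

// Allow reports whether a call may proceed right now (non-blocking).
func (l *limiter) Allow() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	l.tokens += now.Sub(l.last).Seconds() * l.rate
	if l.tokens > l.max {
		l.tokens = l.max
	}
	l.last = now
	if l.tokens >= 1 {
		l.tokens--
		return true
	}
	return false
}

func main() {
	lim := newLimiter(1, 3) // 1 QPS with a burst of 3
	allowed := 0
	for i := 0; i < 10; i++ {
		if lim.Allow() {
			allowed++
		}
	}
	// The burst is consumed; the remaining rapid calls are throttled.
	fmt.Println(allowed)
}
```

Separating the server's client from the reconciler's client, as the PR does, means a misbehaving training pod can exhaust only the server's bucket, never the reconcile loop's.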
@andreyvelich @astefanutti @akshaychitneni just a quick update - I'm still working through the changes from your comments. I've tried to remember to resolve the threads I've addressed so you could take a look if you wanted. Otherwise I'm hoping I can have the existing comments addressed before the end of the week.
@andreyvelich just FYI - some of the tests still need updating/have been temporarily removed. I'll cycle back to them once the implementation settles down. Do feel free to check out the existing tests though; I'd really welcome any feedback.
Hey @andreyvelich, @astefanutti, @akshaychitneni, I've pushed another update and rebased, and I think this is ready for re-review. I think I've addressed all the comments bar the ones that sounded like they needed more discussion; I'm happy to update those once folks agree on the way forward. Please take another look at the threads.
I've kept track of the follow-on issues that have been requested; I'll open them once the PR is merged, if that's OK.
It'd be nice to have envtest integration tests, but I haven't included them. Could we leave them for a follow-on PR? They'll be fairly chunky.
Signed-off-by: Rob Bell <robell@redhat.com>
/lgtm
astefanutti left a comment
Thanks @robert-bell!
/lgtm
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull request has been approved by: astefanutti.
@robert-bell here's the prototype: https://github.com/krishdef7/kubeflow-llm-trainer-prototype Still a work in progress and actively evolving, but it covers the core instrumentation. The callback reports loss, reward/KL divergence, and step/epoch progress on each logging step, covering the metrics I mentioned in my earlier comment. Two things I noticed from the client side while building this:
Also flagging a known gap: QLoRA with TRL requires a
The prototype also includes TorchTune and Unsloth backends with the same callback wired in. Happy to iterate on any of this once #3227 lands.
Thanks @andreyvelich, and thank you and @astefanutti and @akshaychitneni for your reviews. I'm really excited to have this over the line. 🎉
What this PR does / why we need it:
This PR implements TrainJobProgress (#2905), enabling real-time progress and metrics tracking for TrainJobs. Training pods can now push status updates (progress %, estimated time remaining, custom metrics), which are exposed via `status.trainerStatus` in the TrainJob CR. It's still a WIP, and there are a few bits that still need working out, but I'd be keen for any feedback on the current approach.
Headline changes:
- Adds a `trainerStatus` field to the TrainJob status, using the spec from feat(docs): KEP-2779: Track TrainJob progress and expose training metrics #2905
- Adds a `TrainJobProgress` feature gate, defaulting to disabled. Everything is disabled if the gate is disabled.
- Exposes the progress endpoint on the `kubeflow-trainer-controller-manager` service, which writes updates to the `trainerStatus` field.

The implementation mostly follows what we agreed in #2905, with a few changes worth pointing out:
- Error responses use a `metav1.Status` object to align with the k8s API server. I'd be happy to take any input on this.

There are a few TODOs left before this is ready for actual merging; I'll work through these, but please do start reviewing these changes as I'd like to check folk are happy with the general approach.
Which issue(s) this PR fixes (optional, in `Fixes #<issue number>, #<issue number>, ...` format, will close the issue(s) when PR gets merged):
Part of #2779
Checklist: