feat: add support for tracking TrainJob progress and training metrics#3227

Merged
andreyvelich merged 65 commits into kubeflow:master from robert-bell:trainjob-progress
Mar 13, 2026

Conversation

@robert-bell
Contributor

@robert-bell robert-bell commented Feb 19, 2026

What this PR does / why we need it:

This PR implements TrainJobProgress (#2905), enabling real-time progress and metrics tracking for TrainJobs. Training pods can now push status updates (progress %, estimated time remaining, custom metrics), which are exposed via status.trainerStatus in the TrainJob CR.

It's still a WIP and there are a few bits that still need working out, but I'd be keen for any feedback on the current approach.

Headline changes:

  • added a new trainerStatus field to the TrainJob status, using the spec from feat(docs): KEP-2779: Track TrainJob progress and expose training metrics #2905
  • added a new alpha feature gate TrainJobProgress, which defaults to disabled. Everything is off when the gate is disabled.
  • added a new "progress server" in the controller that exposes an HTTPS endpoint for collecting progress updates from the runtime pods.
    • server reuses the existing kubeflow-trainer-controller-manager service
    • the server has TLS configured and reuses the webhook certs. Cert rotation is handled automatically using the existing cert rotator pattern.
    • there's basic middleware: auth-n, logging, panic recovery, body size limits
    • auth-n checks the request has a valid projected service account token.
    • auth-z checks that the auth token was granted to a pod that is part of the TrainJob (using a label on the pod)
    • the server directly updates the trainerStatus field.
  • a new "progress" plugin that injects the runtime config into the training pods:
    • env vars: KUBEFLOW_TRAINER_STATUS_URL, KUBEFLOW_TRAINER_STATUS_CA_CERT, KUBEFLOW_TRAINER_STATUS_TOKEN
    • the projected service account token for authenticating with the progress server
    • the control plane ca.crt is copied into a ConfigMap, which the runtime pods mount so they can trust the progress server's TLS certificate
    • adds a label to the pods so we can identify which TrainJob they belong to (for auth-z).
  • adds new config section for the progress server
  • added the ability to set feature gates via a command-line argument
    • the command-line argument takes precedence over the config file

The implementation mostly follows what we agreed in #2905, with a few changes worth pointing out:

  1. the auth-n uses OIDC discovery rather than the TokenReview API. I've done this to avoid the auth-n path needing to make a request to the API server. I read into this a bit more, and it looks like we're able to do this because the tokens are always projected service account tokens, so they'll always be JWTs signed by the API server. I'd be happy to discuss this if anyone has concerns.
  2. I've given the progress server a separate go-client so it gets separate client-side rate limits and won't impact the client used in the main reconciler.
  3. We never discussed the response types from the progress server, so I've suggested successful requests return a 200 status code and the original payload, and error responses return a metav1.Status object to align with the k8s API server. I'd be happy to take any input on this.

There are a few TODOs left before this is actually ready for merging. I'll work through these, but please do start reviewing as I'd like to check folks are happy with the general approach.

  • wiring up the progress plugin to use the config (e.g. the server port, the service name). I'd actually appreciate some guidance on how best to pass the config through.
  • tidy up some of the constants, e.g. to avoid duplication, move to more sensible location.
  • add e2e test
  • better test coverage for server auth-z
  • (probably?) update the existing integration tests so the progress feature gate is enabled.
  • docs updates (possibly leave until we've got consensus on the implementation?)
  • sdk updates to help instrument the runtime (again, leave until we've got consensus on the implementation?)

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Part of #2779

Checklist:

  • Docs included if any changes are user facing

@robert-bell
Contributor Author

robert-bell commented Feb 19, 2026

cc @andreyvelich @tenzen-y @astefanutti @akshaychitneni

The progress tracking implementation is ready for an initial review once you have bandwidth. I've still some bits to work through (I'm updating the task list as I go along), but I'd appreciate any early feedback on the approach I've taken.

Particular areas I'd be keen for feedback on -

  1. I made a new plugin to inject the runtime config. This felt like a clean way to do it, but I appreciate this isn't really an ML-framework plugin, so I'd be happy to take a different approach.
  2. Because I've gone with the plugin approach, I needed access to a config object in the plugin, so I updated the plugin interface, the registry wiring code, and all the plugins. Giving plugins access to the config doesn't feel unreasonable, but it has caused a pretty big cascade of changes. I've kept it all isolated in a single commit, so I'd be happy to change the implementation if there's a better approach.
  3. I'm reusing the existing service and the cfg.CertManagement.WebhookServiceName config setting to get the progress endpoint service name. The name of that config setting now feels wrong; it's not just the webhook service. Could we consider making a breaking change to the name of that config setting, maybe ControllerManagerServiceName? Or is there another approach we could take?

@robert-bell robert-bell force-pushed the trainjob-progress branch 3 times, most recently from 9914660 to 0181d6a Compare February 24, 2026 16:44
@robert-bell robert-bell marked this pull request as ready for review February 24, 2026 16:45
Copilot AI review requested due to automatic review settings February 24, 2026 16:45
@google-oss-prow google-oss-prow bot requested a review from kuizhiqing February 24, 2026 16:45
@robert-bell robert-bell changed the title [WIP] feat: add support for tracking TrainJob progress and training metrics feat: add support for tracking TrainJob progress and training metrics Feb 24, 2026
Contributor

Copilot AI left a comment


Pull request overview

Implements the TrainJobProgress KEP by adding a progress/metrics reporting surface to TrainJob and wiring an authenticated HTTPS “progress server” plus a runtime plugin that injects the required runtime configuration into training pods.

Changes:

  • Add status.trainerStatus (progress %, ETA, metrics, timestamp) to the TrainJob API and generated clients/specs.
  • Introduce TrainJobProgress alpha feature gate and a controller-side HTTPS progress server with authn/authz + rate-limited client.
  • Add a runtime framework “progress” plugin that injects env vars, a projected SA token, and a CA bundle ConfigMap mount into trainer pods; add e2e coverage.

Reviewed changes

Copilot reviewed 74 out of 75 changed files in this pull request and generated 5 comments.

File Description
test/integration/framework/framework.go Pass default controller configuration into runtime initialization for integration tests.
test/e2e/testdata/progress.py E2E helper script that posts a progress payload from inside a training container.
test/e2e/e2e_test.go Adds an e2e that validates trainerStatus gets updated via the progress endpoint.
pkg/webhooks/trainjob_webhook_test.go Updates webhook tests for new runtime initialization signature (config param).
pkg/util/testing/config.go Adds a test helper to build a defaulted Configuration object.
pkg/util/runtime/runtime.go Adds shared helper to detect operator namespace (with local default).
pkg/util/cert/cert.go Refactors namespace lookup and adds cert-watcher-backed TLS config helper.
pkg/runtime/framework/plugins/volcano/volcano_test.go Updates plugin constructor signature in tests (adds cfg param).
pkg/runtime/framework/plugins/volcano/volcano.go Updates plugin constructor signature to accept config.
pkg/runtime/framework/plugins/torch/torch_test.go Updates plugin constructor signature in tests (adds cfg param).
pkg/runtime/framework/plugins/torch/torch.go Updates plugin constructor signature to accept config.
pkg/runtime/framework/plugins/registry.go Extends plugin factory signature to pass config; conditionally registers Progress plugin behind feature gate.
pkg/runtime/framework/plugins/progress/progress_test.go Unit tests for Progress plugin injection and ConfigMap creation behavior.
pkg/runtime/framework/plugins/progress/progress.go New Progress plugin that injects env/token/CA mount + creates CA ConfigMap per TrainJob.
pkg/runtime/framework/plugins/plainml/plainml_test.go Updates plugin constructor signature in tests (adds cfg param).
pkg/runtime/framework/plugins/plainml/plainml.go Updates plugin constructor signature to accept config.
pkg/runtime/framework/plugins/mpi/mpi_test.go Updates plugin constructor signature in tests (adds cfg param).
pkg/runtime/framework/plugins/mpi/mpi.go Updates plugin constructor signature to accept config.
pkg/runtime/framework/plugins/jobset/jobset_test.go Updates plugin constructor signature in tests (adds cfg param).
pkg/runtime/framework/plugins/jobset/jobset.go Updates plugin constructor signature to accept config.
pkg/runtime/framework/plugins/jax/jax_test.go Updates plugin constructor signature in tests (adds cfg param).
pkg/runtime/framework/plugins/jax/jax.go Updates plugin constructor signature to accept config.
pkg/runtime/framework/plugins/coscheduling/coscheduling_test.go Updates plugin constructor signature in tests (adds cfg param).
pkg/runtime/framework/plugins/coscheduling/coscheduling.go Updates plugin constructor signature to accept config.
pkg/runtime/framework/core/framework_test.go Updates framework construction in tests to pass default config.
pkg/runtime/framework/core/framework.go Threads config through plugin factories during framework initialization.
pkg/runtime/core/trainingruntime_test.go Updates runtime tests to pass config into runtime construction.
pkg/runtime/core/trainingruntime.go Threads config into framework initialization.
pkg/runtime/core/registry.go Updates runtime registrar factory signature to accept config.
pkg/runtime/core/core.go Threads config through runtime registry initialization.
pkg/runtime/core/clustertrainingruntime_test.go Updates cluster runtime tests to pass config into runtime construction.
pkg/runtime/core/clustertrainingruntime.go Updates cluster runtime constructor signature to accept config.
pkg/progress/setup.go Adds manager wiring for the progress server (TLS watcher, verifier, separate client).
pkg/progress/server_test.go Adds server tests for success + error responses and auth behavior.
pkg/progress/server.go Implements HTTPS progress server endpoint that patches status.trainerStatus with authz based on pod label + token claims.
pkg/progress/middleware_test.go Adds recovery middleware test.
pkg/progress/middleware.go Adds middleware chain: recovery, logging, authn, body-size limits.
pkg/progress/constants.go Adds shared label/audience constants and status URL helper.
pkg/progress/auth.go Adds projected SA token OIDC verification + issuer discovery from in-cluster token.
pkg/features/features.go Introduces TrainJobProgress alpha feature gate defaulting to disabled.
pkg/controller/trainjob_controller.go Avoids reconciler overwriting externally-managed status.trainerStatus updates.
pkg/config/validation.go Adds validation for ProgressServer config (port/qps/burst).
pkg/config/config_test.go Extends config tests to cover ProgressServer defaults and file loading.
pkg/client/applyconfiguration/utils.go Adds applyconfiguration mappings for Metric and TrainJobTrainerStatus.
pkg/client/applyconfiguration/trainer/v1alpha1/trainjobtrainerstatus.go Generated applyconfiguration for TrainJobTrainerStatus.
pkg/client/applyconfiguration/trainer/v1alpha1/trainjobstatus.go Adds TrainerStatus to applyconfiguration for TrainJobStatus.
pkg/client/applyconfiguration/trainer/v1alpha1/metric.go Generated applyconfiguration for Metric.
pkg/apis/trainer/v1alpha1/zz_generated.openapi.go Updates OpenAPI defs for Metric/ProgressStatus/TrainJobTrainerStatus and status field.
pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go Adds deepcopy support for new API types/fields.
pkg/apis/trainer/v1alpha1/trainjob_types.go Adds status.trainerStatus, Metric, ProgressStatus, and TrainJobTrainerStatus types + validations.
pkg/apis/config/v1alpha1/zz_generated.deepcopy.go Adds deepcopy support for ProgressServer config.
pkg/apis/config/v1alpha1/defaults.go Adds defaulting for ProgressServer config.
pkg/apis/config/v1alpha1/configuration_types.go Adds ProgressServer config API type and Configuration field.
manifests/base/rbac/role.yaml Grants controller permission to get pods (required for progress server authz).
manifests/base/manager/manager.yaml Exposes the progress-server container port and service port.
manifests/base/manager/controller_manager_config.yaml Adds progressServer config defaults to the base config.
manifests/base/crds/trainer.kubeflow.org_trainjobs.yaml Updates CRD schema to include status.trainerStatus.
hack/e2e-setup-gpu-cluster.sh Enables TrainJobProgress feature gate for e2e cluster setup.
hack/e2e-setup-cluster.sh Enables TrainJobProgress feature gate for e2e cluster setup.
go.sum Adds module checksums for OIDC dependencies.
go.mod Adds go-oidc dependency (and go-jose indirect).
cmd/trainer-controller-manager/main.go Adds --feature-gates, threads config into runtimes, and conditionally starts progress server.
charts/kubeflow-trainer/values.yaml Adds progressServer config values.
charts/kubeflow-trainer/templates/manager/service.yaml Adds Service port for progress-server.
charts/kubeflow-trainer/templates/manager/deployment.yaml Adds progress-server container port.
charts/kubeflow-trainer/templates/manager/configmap.yaml Renders progressServer config into manager configmap.
charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainjobs.yaml Updates Helm CRD schema to include status.trainerStatus.
charts/kubeflow-trainer/README.md Updates values documentation to include progressServer config.
api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_train_job_trainer_status.py Adds generated Python model for TrainJobTrainerStatus.
api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_train_job_status.py Adds TrainerStatus field to generated Python TrainJobStatus model.
api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_progress_status.py Adds generated Python model for ProgressStatus.
api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_metric.py Adds generated Python model for Metric.
api/python_api/kubeflow_trainer_api/models/init.py Exports newly generated Python models.
api/openapi-spec/swagger.json Updates published OpenAPI spec with new schemas/fields.
Makefile Includes pkg/progress in controller-gen targets (CRDs/RBAC generation).

Contributor

@astefanutti astefanutti left a comment


Thanks @robert-bell!

I would suggest using "TrainJobStatus" or "TrainJobRuntimeStatus" instead of TrainJobProgress to be more future-proof.

TrainerStatus: progressStatus.TrainerStatus,
},
}
if err := s.client.Status().Patch(r.Context(), &trainJob, client.Merge); err != nil {
Contributor

Use SSA with a dedicated manager.

Contributor Author

I think I've updated to SSA, but I'm not sure if I used the right approach - would you mind double checking? I've updated the client to use apply and the generated applyconfiguration client code, and reverted the change in the trainjob reconciler that was unsetting the trainer status.

I couldn't get the existing unit tests to work with SSA - the client apply is failing with Operation cannot be fulfilled on trainjobs.trainer.kubeflow.org "test-job": object was modified. Is that expected - is it what the TODO on the trainjob controller is referring to?

// TODO(astefanutti): Consider using SSA once controller-runtime client has SSA support
// for sub-resources. See: kubernetes-sigs/controller-runtime#3183

Contributor
@astefanutti astefanutti Mar 6, 2026

Yes you've taken the right approach on the status server side with:

s.client.Status().Apply(r.Context(), trainJob, client.ForceOwnership,

Now you can apply it to the controller side, which is indeed what that TODO is about.

Contributor Author

Thanks for confirming. If it's OK, can we address that TODO in a separate PR to keep this one focused? This PR doesn't depend on that TODO being fixed.

"github.com/kubeflow/trainer/v2/pkg/util/cert"
)

func SetupServer(mgr ctrl.Manager, cfg *configapi.ProgressServer) error {
Contributor

Should there be a network policy that only authorizes ingress from TrainJob Pods?

Contributor Author

Yeah, that'd be a nice extra layer, though the controller pod is also serving metrics and the webhook, so I think the netpol would also need to explicitly allow ingress on those ports from any IP. I also don't think I can disable the netpol with the feature gate, so it'd always be active.

Are we OK with that?

Contributor

The ingress rules for metrics and webhooks could probably be restricted to allowed incoming namespaces.

Also, the NetworkPolicy could be dynamically applied at start time.

This can be tackled in a follow-up PR to keep the scope of this PR focused on the KEP.

}

// Verify the pod has the label identifying it belongs to this TrainJob
trainJobNameFromLabel, ok := pod.Labels[LabelTrainJobName]
Contributor

Would using a finer-grained audience avoid getting Pods?

Contributor Author

Are you thinking something like putting the TrainJob name in the audience? That would let us avoid looking up the pod. It'd also mean I wouldn't need to decode the SA token, which is cleaner.

Did you have any thoughts about the audience format? It might be worth including the whole path in there for future extensibility - something like this? Or is this overkill?

trainer.kubeflow.org/v1alpha1/namespace/{namespace}/trainjobs/{name}/status

Contributor Author

just to add - I think encoding information in the audience is non-conventional. It's normally just supposed to be a static string identifying the target service.

I think it works for us because the pods should only ever be updating their parent train job, but just flagging because it could introduce brittleness in the future.

Contributor

Are you thinking something like putting the train job name in the audience?

Yes

I think encoding information in the audience is non-conventional. It's normally just supposed to be a static string identifying the target service.

That's a fair point I agree, though I'm not sure if "conventional" usage is very well defined either.

Contributor

Curious if we would lose the pod binding check? would it mean that the tokens used with deleted pods will be allowed to update status?

Contributor Author

I've pushed an update that uses a different audience per TrainJob. @astefanutti please take a look.

Curious if we would lose the pod binding check? would it mean that the tokens used with deleted pods will be allowed to update status?

Yeah, we do lose this, but I think the practical impact is limited because the endpoint itself has fairly limited capabilities and the token expiry is fairly short (1hr). Happy to discuss alternatives though.

Contributor

@robert-bell That looks great, I agree with you the impact is limited in practice.

}

verifier := provider.Verifier(&oidc.Config{
ClientID: TokenAudience,
Contributor

Does it also verify issuer?

Contributor Author

Yep - the issuer is configured when the provider is created, and the verifier only accepts that issuer.

// authorizeRequest checks whether the service account token bearer token used by this request comes from
// a pod that is part of the TrainJob that is being updated.
func (s *Server) authorizeRequest(r *http.Request, namespace, trainJobName string) bool {
token, ok := serviceAccountTokenFromContext(r.Context())
Contributor

Any reason to have auth in the handler, saving the token in the context, vs using middleware?

Contributor Author

No strong reason. I kept it in the handler mainly because the authz logic depends on the job name/namespace from the request path. Keeping it in the handler makes that dependency more explicit, whereas in middleware it would be a bit more implicit and coupled to path parsing.

Btw, in the latest version I'm not storing the parsed token in the context. The authn has also moved to the handler because it now depends on job name/namespace too.

Contributor
@akshaychitneni akshaychitneni left a comment

Thanks @robert-bell

Member
@andreyvelich andreyvelich left a comment

Thank you for this awesome work @robert-bell! Overall looks great!
I left a few comments, will check tests soon as well.

}

mux := http.NewServeMux()
mux.HandleFunc("POST "+StatusUrl("{namespace}", "{name}"), s.handleTrainJobRuntimeStatus)
Member

Do we need to implement a health check endpoint, so our instrumentation can check whether the server is healthy before reporting metrics?
Similar to /readyz for Kubernetes API server: https://kubernetes.io/docs/reference/using-api/health-checks/

Contributor Author

Good shout - yes, it's needed. It needs to be integrated into the existing health checks because the server runs in the same process as the controllers and webhooks, but it looks like the controller-runtime Manager will let me add extra checks to the probes.

It looks like it might need a bit of rejigging. Leave it with me and I'll try to implement it this week.

Member

@robert-bell Sounds good, we can also do that in the followup PR. Let's just create an issue to track it.

WithTrainerStatus(toApplyConfig(runtimeStatus.TrainerStatus)),
)

if err := s.client.Status().Apply(r.Context(), trainJob, client.ForceOwnership, client.FieldOwner("trainer-status")); err != nil {
Member

What if the client calls the server multiple times?
Do we need any guardrails for client calls?

Contributor Author

Client-side rate limiting gets triggered, which limits the API calls - I've tested that it works manually. FYI - I deliberately made a separate client for the server so that any rate limiting doesn't affect the reconcile loop.

We could look at additional protection, but I wonder whether it'd be more effective to put that protection in the instrumentation code that will go in the SDK? WDYT?

Member

Probably. Let's try to discuss it once we have initial implementation.

@robert-bell
Contributor Author

robert-bell commented Mar 3, 2026

@andreyvelich @astefanutti @akshaychitneni just a quick update - I'm still working through the changes from your comments. I've tried to remember to resolve the threads I've addressed so you could take a look if you wanted. Otherwise I'm hoping I can have the existing comments addressed before the end of the week.

I left a few comments, will check tests soon as well.

@andreyvelich just fyi - some of the tests still need updating/have been temporarily removed. I'll cycle back to them once the implementation settles down. Do feel free to check out the existing tests though, I'd really welcome any feedback.

@robert-bell
Contributor Author

robert-bell commented Mar 6, 2026

Hey @andreyvelich, @astefanutti, @akshaychitneni, I've pushed another update and rebased and I think this is ready for re-review.

I think I've addressed all the comments bar the ones that sounded like more discussion was needed -- I'm happy to update these once folk are happy with the way forward. Please can you take another look at the threads.

  • Name of Feature Gate - TrainJobStatus? thread
  • Name of server package - statusserver? thread
  • Name of plugin - trainjob-status? thread
  • Name for port in manifests - trainjob-status-server? thread
  • Package for OperatorNamespace - revert my change? thread
  • Rate limiting for new endpoint - move to follow on issue? thread

I've kept track of the follow-on issues that have been requested - I'll open them once the PR is merged, if that's OK.

  • Protect server with netpol thread
  • Rename CertManagement.WebhookServiceName config thread
  • Move enableHTTP2 to Trainer config thread
  • Integrate status server into readyz and healthz thread
  • Use SSA in TrainJob reconcile loop thread
  • New interface for plugin - thread

It'd be nice to have envtest integration tests, but I haven't included them. Could we leave them for a follow-on PR? They'll be fairly chunky.

Signed-off-by: Rob Bell <robell@redhat.com>
@andreyvelich
Member

/lgtm
/assign @astefanutti

@google-oss-prow google-oss-prow bot added the lgtm label Mar 13, 2026
Contributor
@astefanutti astefanutti left a comment

Thanks @robert-bell!

/lgtm
/approve

@google-oss-prow

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: astefanutti

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@krishdef7
Contributor

@robert-bell here's the prototype: https://github.com/krishdef7/kubeflow-llm-trainer-prototype

Still a work in progress and actively evolving, but it covers the core instrumentation. The progress.py module implements KubeflowProgressReporter and KubeflowTrainerCallback using the updated env var names from @astefanutti's suggestion (KUBEFLOW_TRAINER_SERVER_URL, KUBEFLOW_TRAINER_SERVER_CA_CERT, KUBEFLOW_TRAINER_SERVER_TOKEN). The trl_runner.py and unsloth_runner.py entrypoints automatically inject the callback when those vars are detected; no user configuration is needed.

The callback reports loss, reward/KL divergence, and step/epoch progress on each logging step, covering the metrics I mentioned in my earlier comment.

Two things I noticed from the client side while building this:

  1. CA cert ambiguity: is KUBEFLOW_TRAINER_SERVER_CA_CERT injected as a file path (the ConfigMap mount) or as the inline PEM content? I've implemented it as inline PEM in the SSL context builder, but if it's a path I'll need to adjust.

  2. No readiness signal: the entrypoint has no way to know if the status server is ready before training starts, so the first report() call can fail silently if the server hasn't initialized yet. A /readyz endpoint (which @andreyvelich already flagged as a follow-up) would also let the SDK do a health check before the first update.

Also flagging a known gap: QLoRA with TRL requires a BitsAndBytesConfig at model load time, which means the model can't be loaded internally by TRL from a string path. The current entrypoint handles SFT/DPO/ORPO/KTO correctly but QLoRA needs explicit model loading first, happy to discuss the right approach.

The prototype also includes TorchTune and Unsloth backends with the same callback wired in. Happy to iterate on any of this once #3227 lands.

@andreyvelich
Member

I will merge this PR manually given that we have some outages with GPU runners (cc @jaiakash @koksay)

@andreyvelich andreyvelich merged commit 66d0b0b into kubeflow:master Mar 13, 2026
32 of 33 checks passed
@google-oss-prow google-oss-prow bot added this to the v2.2 milestone Mar 13, 2026
@andreyvelich andreyvelich deleted the trainjob-progress branch March 13, 2026 20:01
@robert-bell
Contributor Author

Thanks @andreyvelich, and thank you and @astefanutti and @akshaychitneni for your reviews. I'm really excited to have this over the line. 🎉
