feat(docs): KEP-2779: Track TrainJob progress and expose training metrics #2905
base: master
Conversation
Signed-off-by: Rob Bell <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
> ### Runtime: emitting trainer status messages
>
> The "primary" training runtime pod will write trainer status messages to stdout in the following format:
We should consider taking inspiration from / aligning with the metricsCollectorSpec mechanism from the Katib API.
We can prioritize support for the StdOut collector type for the first implementation, but that'll give the API enough flexibility and consistency across components.
Hey @astefanutti, please correct me here:
As described in Katib's natively supported metrics-collector approaches (pull/push based): https://www.kubeflow.org/docs/components/katib/user-guides/metrics-collector/
There are 3 collector types (StdOut / TensorFlowEvent / file based).
So for the initial phase we can start by adding a metricsCollectorSpec.kind field to our API now, keeping simple regex parsing handled by a metrics collector sidecar, something like this, which is an exact match for Katib's approach:

```yaml
spec:
  metricsCollectorSpec:
    kind: StdOut
    source:
      filter: "([\w|-]+)=([+-]?\d+\.?\d*)"
```

Then we can introduce file / TensorFlowEvent and custom metrics collector approaches in the next phases?
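For illustration only, here is a minimal sketch (not part of the KEP) of the kind of regex-based stdout parsing such a StdOut collector sidecar could perform; the sample log line and the `parse_metrics` helper are hypothetical, and the filter pattern is taken from the snippet above:

```python
import re

# Filter pattern from the example metricsCollectorSpec above:
# captures "<metric-name>=<numeric-value>" pairs from a stdout line.
FILTER = re.compile(r"([\w|-]+)=([+-]?\d+\.?\d*)")

def parse_metrics(line: str) -> dict[str, float]:
    """Extract metric name/value pairs from a single log line."""
    return {name: float(value) for name, value in FILTER.findall(line)}

# Hypothetical line written by a trainer process to stdout:
print(parse_metrics("epoch=3 loss=0.245 accuracy=0.8125"))
# {'epoch': 3.0, 'loss': 0.245, 'accuracy': 0.8125}
```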
@abhijeet-dhumal exactly, I suggest the KEP define the API accordingly, but not all the collector types have to be supported in the first implementation.
Yeah, that's really an awesome insight. Thanks @astefanutti 🙌
We will update the proposal promptly!
cc: @robert-bell
Ah I wasn't aware of the metricsCollectorSpec in Katib. Thanks @astefanutti for pointing it out!
Are you thinking about aligning with the Katib approach for consistency and flexibility, or about making it easier for train jobs to send metrics to Katib (e.g. the same metric collection approach can be used by both Katib and KFT)? Or possibly both or something else entirely.
If making it easier for Katib to consume the metrics, would it make sense to consider extending Katib to watch the TrainJob resource instead, possibly from a sidecar or maybe directly in the Katib control plane? The control plane approach could be particularly appealing for user experience as it would mean they would not need to modify their training code at all (provided they use a supported runtime), or depend on the Katib SDK. The train job service account would also not need any additional RBAC permissions.
If you're thinking about consistency and flexibility, do you have an example in mind where the extra flexibility would be beneficial, or where the proposed approach would be limiting? I agree there's a need for flexibility, but I wonder whether that flexibility can be moved entirely into the runtime code? This may be easier to maintain and extend to add support for other ML frameworks, and would allow us to keep the KFT controller implementation simpler. We could even package some helper utilities into a Python package (e.g. kubeflow-utils) to help users instrument their code.
Yes, I'm thinking metrics collection should become the responsibility of the trainer / TrainJob API so these are exposed in a standard way for Katib to consume.
The Trainer would collect metrics from runtimes according to the metricsCollectorSpec pattern and expose them in a standard way, e.g. in the TrainJob status. Then the Katib trial controller would watch the TrainJob to retrieve the metrics related to the experiment. This would enable Katib to operate without a database.
The metricsCollectorSpec pattern could be configured in the ClusterTrainingRuntime in a way that's compatible with the runtimes. For custom trainers, the SDK would be responsible for customizing it depending on the provided trainer.
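As a hypothetical illustration of the watch-based consumption described above, a client could watch TrainJob objects and read metrics from their status as they change. This is a sketch using the Python kubernetes client; the `trainer.kubeflow.org/v1alpha1` group/version, the `trainjobs` plural and the status field paths are assumptions, not the API defined by the KEP:

```python
from kubernetes import client, config, watch

# Watch TrainJob objects in a namespace and print metrics from their status.
config.load_kube_config()  # or load_incluster_config() inside a pod
api = client.CustomObjectsApi()

w = watch.Watch()
for event in w.stream(api.list_namespaced_custom_object,
                      group="trainer.kubeflow.org", version="v1alpha1",
                      namespace="team-a", plural="trainjobs"):
    obj = event["object"]
    trainer_status = obj.get("status", {}).get("trainerStatus", {})  # hypothetical field
    print(obj["metadata"]["name"], event["type"], trainer_status.get("trainMetrics"))
```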
Thanks for the clarification @astefanutti. I've looked at Katib's metricsCollectorSpec and thought about it in more detail, and I don't think we need to, or should, use the same pattern here.
Rather than putting flexibility in the control plane, we're proposing that we use the runtime pods to get flexibility. Each ML framework would have different instrumentation code, but all frameworks would communicate using the same "protocol" (e.g. the specific log messages). This means that we don't need any flexibility in the control plane.
The advantages of using code in the runtime are:
- much more flexibility than can be provided by metricsCollectorSpec,
- it's much easier for data scientists to consume: configuration is all in code, rather than split across a YAML file and code,
- the control plane can be kept simpler (in particular, I think Katib needs a sidecar to collect the metrics).
To make it easier for users to instrument their code, we can leverage the SDK to automatically instrument submitted training jobs if the framework supports it (e.g. for frameworks that have callback mechanisms).
For frameworks that don't support callbacks, or for users that aren't creating jobs via the SDK, we could provide a runtime utility library that implements the protocol and allows users to add the instrumentation themselves (e.g. it could be embedded in their runtime images). Or they are free to simply not instrument their training loop.
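To make the "protocol" idea a bit more concrete, here is a minimal, hypothetical sketch of a helper such a runtime utility library could provide. The JSON-per-line framing, the `kubeflow.org/trainer-status` key and the `report_progress` name are illustrative assumptions only; the actual message format would be whatever the KEP defines:

```python
import json
import sys
import time

def report_progress(step: int, total_steps: int, epoch: int, total_epochs: int,
                    train_metrics: dict[str, float] | None = None,
                    eval_metrics: dict[str, float] | None = None) -> None:
    """Write one structured status line to stdout for the controller to pick up.

    The field names and line framing here are assumptions for illustration only.
    """
    message = {
        "kubeflow.org/trainer-status": {
            "step": step,
            "totalSteps": total_steps,
            "epoch": epoch,
            "totalEpochs": total_epochs,
            "trainMetrics": train_metrics or {},
            "evalMetrics": eval_metrics or {},
            "timestamp": int(time.time()),
        }
    }
    print(json.dumps(message), file=sys.stdout, flush=True)

# Hypothetical use inside a training loop or a framework callback:
report_progress(step=450, total_steps=1000, epoch=2, total_epochs=5,
                train_metrics={"loss": 0.8, "auroc": 0.7})
```

An SDK-injected callback (for frameworks that expose callback hooks) would simply call a helper like this at each logging step.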
@robert-bell thanks that sounds great!
I should have expressed the requirement for flexibility rather than jumping directly to the implementation details :)
> }
>
> type TrainJobTrainerStatus struct {
So #2909 and this PR seem to conflict with each other.
In this one, we want to make TrainJob a general solution for deep learning training.
But in #2909, we are hoping to make TrainJob a general solution for HPC workloads (not just deep learning frameworks).
cc @vsoch
I'm not sure it's blocking, as this project is called TrainJob, but I do want to call this out as a discrepancy.
I think the intention is to better support training jobs that use HPC technologies (MPI, topology and fine-grained scheduling, etc.). It's no more strange than having the MPI plugin.
@kannon92 is the conflict that #2909 would allow TrainJob to be used for more general HPC/non-training workloads, whereas this proposal is specifically for model training workloads?
If so, I'm not sure it's a problem because, as you say, TrainJob is designed for model training. Plus, the extra status information we're proposing is optional, so users can use TrainJob for more general HPC workloads if they want.
Also just to clarify, this proposal isn't limited to just deep learning frameworks. Other training frameworks, e.g. XGBoost, LightGBM, should be able to use it too.
I agree with @robert-bell, the metrics status might also be very useful for HPC jobs. We just need to figure out how to make these statuses agnostic to the job you run.
On the other hand, TrainJob's APIs are more appropriate for training jobs (e.g. trainer, initializer, potentially exporter/checkpointer cc @rst0git).
At some point, we need to discuss whether these abstractions make sense for HPC jobs or whether we need to re-think something.
These should map quite well to HPC use cases! We tend to have Figures of Merit (FOMs) that are derived from different steps or stages, and there are often multiple FOMs. The metrics look relevant here with the exception of epoch - not everything you run in HPC is going to have an epoch, but AI/ML jobs run with MPI will.
Is there a reason to hard-code metric types (e.g., TrainMetric vs EvalMetric) versus having a common metric type, and then within that type, you could have an optional variable for the context? That would be more flexible for different kinds of applications I think, and still work well for train/eval.
Honestly, if it's not required that each framework populates these statuses, it is probably fine.
It's a bit odd to have a status field that wouldn't be populated for all frameworks, but I can live with that.
> Is there a reason to hard-code metric types (e.g., TrainMetric vs EvalMetric) versus having a common metric type, and then within that type, you could have an optional variable for the context? That would be more flexible for different kinds of applications I think, and still work well for train/eval.

+1 to design a more flexible metric typing mechanism.
> It's a bit odd to have a status field that wouldn't be populated for all frameworks, but I can live with that.

@kannon92 - The intention is that the field is populated by all frameworks, but in practice we don't want to make it mandatory, to avoid introducing a barrier to adding new frameworks.
> Is there a reason to hard-code metric types (e.g., TrainMetric vs EvalMetric) versus having a common metric type, and then within that type, you could have an optional variable for the context? That would be more flexible for different kinds of applications I think, and still work well for train/eval.
@vsoch Were you thinking something like this?
```yaml
trainingStatus:
  metrics:
  - type: train
    values:
      loss: 0.8
      auroc: 0.7
      ...
  - type: eval
    values:
      loss: 0.7
      auroc: 0.6
      ...
```
`type` here could be an arbitrary "label" for the group.
This approach would preserve readability for users doing `kubectl describe trainjob <name>`.
For context, my motivation for using separate fields for train and eval metrics comes from thinking about supporting a potential future integration with HPOs, where it may be useful to be able to compare a metric value for the train and eval data sets. E.g., it may be possible to detect overfitting by comparing the train and eval loss. I appreciate this isn't particularly strong motivation or completely thought through.
That could work! I was also thinking:

```yaml
trainingStatus:
  metrics:
  - name: loss
    value: 0.8
    type: train
  - name: auroc
    value: 0.7
    type: train
```

I like your design because finding a type is an O(1) lookup instead of needing to loop through an entire list (which could be large).
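For illustration, a small hypothetical sketch of how a client might read both shapes discussed above; the dictionaries simply mirror the two YAML examples:

```python
# Grouped shape (one entry per metric group, metric values as a dict):
grouped = {"metrics": [
    {"type": "train", "values": {"loss": 0.8, "auroc": 0.7}},
    {"type": "eval", "values": {"loss": 0.7, "auroc": 0.6}},
]}

# Flat shape (one entry per metric, tagged with a type label):
flat = {"metrics": [
    {"name": "loss", "value": 0.8, "type": "train"},
    {"name": "auroc", "value": 0.7, "type": "train"},
]}

# Grouped: scan the short list of groups once, then metric lookups are dict lookups.
train_values = next(m["values"] for m in grouped["metrics"] if m["type"] == "train")
print(train_values["loss"])  # 0.8

# Flat: every metric lookup scans the whole (potentially large) list.
train_loss = next(m["value"] for m in flat["metrics"]
                  if m["type"] == "train" and m["name"] == "loss")
print(train_loss)  # 0.8
```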
Pull Request Test Coverage Report for Build 18878218330
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
💛 - Coveralls
szaher left a comment:
Thanks very much @robert-bell @abhijeet-dhumal
This is a great proposal; it brings a lot of value:
- Excellent user experience: the goal is perfect. A data scientist seeing PROGRESS: 45% and ETA: 15m is a massive win.
- No dependency on external components like Prometheus or sidecars
- Nice SDK integration with the transformers library

But it has a few problems too:
- Architectural anti-pattern, by allowing the control plane to access the data plane and stream logs
- While the footprint of the goroutines on the operator itself is minimal, it abuses the control plane by maintaining a constant GET request to the /api/v1/namespaces/.../pods/.../log?follow=true connection for every single active TrainJob (so scalability to 1000 jobs is questionable)
- Broad security permissions for the training operator to access cluster-wide pod logs
- Error-prone: if training is verbose, some datasets might have very long example lines, or there might be extremely long log messages
Another proposed solution is as follows:
Enable the KubeFlowCallBack to send status to the TrainJob directly via the Kubernetes API. When the operator starts a new training job, it creates a ServiceAccount and injects it into the master/primary pod, and lets the callback get the status in a clean way and patch or update the TrainJob directly, injecting the required status updates.
This approach is much cleaner:
- No log parsing
- No goroutines
- No impact on the kube-apiserver from fetching and streaming logs
- Only the master/primary pod can update the status; other worker pods check whether they have the right permissions and skip updating the status if they don't
- Lets the actual train job update its own status
Overall, this is excellent and I think we should push it forward.
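For illustration of the alternative sketched in the comment above, here is a minimal, hypothetical example of a callback in the primary pod patching the TrainJob status directly, assuming the injected ServiceAccount has patch rights on the TrainJob status subresource. The `trainer.kubeflow.org/v1alpha1` group/version and the status field names are assumptions, not the API defined by the KEP:

```python
from kubernetes import client, config

def patch_trainjob_status(name: str, namespace: str,
                          progress_percentage: int, current_epoch: int) -> None:
    """Patch hypothetical progress fields on the TrainJob status subresource."""
    config.load_incluster_config()  # use the ServiceAccount injected into the pod
    api = client.CustomObjectsApi()
    api.patch_namespaced_custom_object_status(
        group="trainer.kubeflow.org",   # assumed group/version for TrainJob
        version="v1alpha1",
        namespace=namespace,
        plural="trainjobs",
        name=name,
        body={"status": {"trainerStatus": {   # hypothetical status fields
            "progressPercentage": progress_percentage,
            "currentEpoch": current_epoch,
        }}},
    )

# Hypothetical call from a training callback in the primary pod:
# patch_trainjob_status("my-trainjob", "team-a", progress_percentage=45, current_epoch=2)
```

The trade-off, as noted later in the thread, is that this grants pods running arbitrary user code write access to the API server.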
> ### Goals
>
> 1. **Expose real-time progress information and training metrics through the TrainJobs CR.** For example: percentage complete, estimated time remaining, current step/epoch, total steps/epochs, eval metrics.
I would keep training and eval metrics outside the status, as data scientists have dedicated tools for this like MLflow, TensorBoard, wandb, etc.
Also, it seems the proposed solution will capture only the last value from the logs, which isn't too helpful for the data scientist or ML persona, who would prefer something like TensorBoard for understanding what happened during their training process.
This is unlike Katib or HPO, where we have an optimization problem and the final metrics are important.
> Also, it seems the proposed solution will capture only the last value from the logs, which isn't too helpful for the data scientist or ML persona, who would prefer something like TensorBoard for understanding what happened during their training process.

I think the main target of this KEP is client integration, e.g. a CLI (e.g. kubectl), a GUI or Katib that would consume those metrics, not the data scientist persona, and it's indeed not the goal to overlap with tools like TensorBoard.

> This is unlike Katib or HPO, where we have an optimization problem and the final metrics are important.

Exactly, Katib would indeed be a candidate client to consume those metrics to carry out the optimizations.
> ### Story 2: Data Scientist / ML Engineer monitoring training jobs
>
> As a data scientist or ML Engineer, I want to see real-time information about my training jobs so I can make decisions on whether those jobs are likely to succeed or whether intervention is required.
Will the controller always stream the logs to provide real-time information?
- How does that scale?
- Is this design an anti-pattern, since the control plane accesses the data plane? The control plane's job is to make decisions. It includes the API server, schedulers, and your controller/operator. Its job is to "establish and enforce policy." It is not designed for high-throughput, high-volume data processing; it's designed for state reconciliation.
It's a fair point that the log streaming approach mixes the control plane with the data plane, though I wonder what the max throughput would be in practice? I'd anticipate that applications would add at most a few lines to the logs per second, meaning throughput from a single job would be ~kB/s, and throughput from 1000 active train jobs may be ~MB/s, which should not be particularly onerous on the controller. This is something that could be benchmarked, of course.
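As a rough, hypothetical back-of-envelope check of that estimate (the line rate and line size are assumptions, not measurements):

```python
lines_per_second = 3     # status lines per job per second (assumed)
bytes_per_line = 100     # approximate size of one status line (assumed)
active_jobs = 1000

per_job = lines_per_second * bytes_per_line   # ~300 B/s per job (~kB/s order)
total = per_job * active_jobs                 # ~300 kB/s across 1000 jobs (~MB/s order)
print(f"per job: {per_job} B/s, total: {total / 1e6:.2f} MB/s")
```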
> trainMetrics: dict[str, float] | None
> evalMetrics: dict[str, float] | None
Those are very generic and not well defined. What metrics are we expecting? Since we are going to develop the callback in the SDK, should this be better defined or structured, to identify which metrics will be available?
I believe the last value of the training metrics may not be of much value to data scientists, who are more interested in how a specific metric behaves over the training period. Also, this overlaps with things like MLflow, TensorBoard, etc., which we already have a proposal to include in Kubeflow: kubeflow/community#892
progressPercentage, estimatedRemainingSeconds, currentEpoch, etc. already give very good information about the training process, so I suggest we get rid of trainMetrics and evalMetrics in favor of MLflow.
> I believe the last value of the training metrics may not be of much value to data scientists, who are more interested in how a specific metric behaves over the training period. Also, this overlaps with things like MLflow, TensorBoard, etc., which we already have a proposal to include in Kubeflow: kubeflow/community#892

As discussed in the comment above, the main audience / target of this KEP is not the data scientist persona, and it's indeed not the goal to overlap with tools like TensorBoard nor with the scope of experiment tracking.
The main target of this KEP is client integration, e.g. a GUI or Katib that would consume those metrics.
To add to this analysis, given the TrainJob "primary" Pods run arbitrary user code and extra packages, granting these Pods privileged access to the API server would warrant hardened security sandboxing.
Granting permissions so the train job itself can directly update the training status comes with a few more downsides -
Hi folks, thanks again for this effort! This feature should significantly improve the experience of AI workloads on Kubernetes, and has a lot of future potential for checkpointing, scheduling, etc. 🎉
@astefanutti @robert-bell Do we need to update the KEP based on our conversation with @astefanutti and @tenzen-y at KubeCon, so we can review and move this forward?
cc @romanbaron @EkinKarabulut @Ronkahn21 As we discussed, this should address the problems with detecting the running status of a TrainJob mentioned by Ron in this talk: https://sched.co/28D8Z
What this PR does / why we need it:
This is a KEP for adding real-time progress and exposing training metrics on TrainJob.
Which issue(s) this PR fixes (optional, in `Fixes #<issue number>, #<issue number>, ...` format, will close the issue(s) when PR gets merged):
Part of #2779