feat(docs): KEP-2779: Track TrainJob progress and expose training metrics #2905
base: master
Conversation
Signed-off-by: Rob Bell <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment.
> ### Runtime: emitting trainer status messages
>
> The "primary" training runtime pod will write trainer status messages to stdout in the following format:
We should consider taking inspiration from / aligning with the metricsCollectorSpec mechanism from the Katib API.
We can prioritize support for the StdOut collector type for the first implementation, but that'll give the API enough flexibility and consistency across components.
Hey @astefanutti, please correct me here:
As described in Katib's natively supported metrics-collector approaches (pull/push based): https://www.kubeflow.org/docs/components/katib/user-guides/metrics-collector/
There are 3 collector types (StdOut / TensorFlowEvent / file based).
So for the initial phase we can start by adding a metricsCollectorSpec.kind field to our API now, keeping simple regex parsing handled by a metrics collector sidecar, something like this, which is an exact match for Katib's approach:

```yaml
spec:
  metricsCollectorSpec:
    kind: StdOut
    source:
      filter: "([\w|-]+)=([+-]?\d+\.?\d*)"
```

Then we can introduce file / TensorFlowEvent and custom metrics collector approaches in the next phases?
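For illustration only, here is a minimal sketch (not part of the KEP) of the kind of regex-based stdout parsing such a StdOut collector sidecar could perform; the sample log line and the `parse_metrics` helper are hypothetical, and the filter pattern is taken from the snippet above:

```python
import re

# Filter pattern from the example metricsCollectorSpec above:
# captures "<metric-name>=<numeric-value>" pairs from a stdout line.
FILTER = re.compile(r"([\w|-]+)=([+-]?\d+\.?\d*)")

def parse_metrics(line: str) -> dict[str, float]:
    """Extract metric name/value pairs from a single log line."""
    return {name: float(value) for name, value in FILTER.findall(line)}

# Hypothetical line written by a trainer process to stdout:
print(parse_metrics("epoch=3 loss=0.245 accuracy=0.8125"))
# {'epoch': 3.0, 'loss': 0.245, 'accuracy': 0.8125}
```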
@abhijeet-dhumal exactly, I suggest the KEP define the API accordingly, but not all the collector types have to be supported in the first implementation.
Yeah, that's really an awesome insight. Thanks @astefanutti 🙌
We will update the proposal promptly!
cc: @robert-bell
Ah I wasn't aware of the metricsCollectorSpec in Katib. Thanks @astefanutti for pointing it out!
Are you thinking about aligning with the Katib approach for consistency and flexibility, or about making it easier for train jobs to send metrics to Katib (e.g. the same metric collection approach can be used by both Katib and KFT)? Or possibly both or something else entirely.
If making it easier for Katib to consume the metrics, would it make sense to consider extending Katib to watch the TrainJob resource instead, possibly from a sidecar or maybe directly in the Katib control plane? The control plane approach could be particularly appealing for user experience as it would mean they would not need to modify their training code at all (provided they use a supported runtime), or depend on the Katib SDK. The train job service account would also not need any additional RBAC permissions.
If you're thinking about consistency and flexibility, do you have an example in mind where the extra flexibility would be beneficial, or where the proposed approach would be limiting? I agree there's a need for flexibility, but I wonder whether that flexibility can be moved entirely into the runtime code? This may be easier to maintain and extend to add support for other ML frameworks, and would allow us to keep the KFT controller implementation simpler. We could even package some helper utilities into a Python package (e.g. kubeflow-utils) to help users instrument their code.
Yes, I'm thinking metrics collection should become the responsibility of the trainer / TrainJob API so these are exposed in a standard way for Katib to consume.
The Trainer would collect metrics from runtimes according to the metricsCollectorSpec pattern and expose them in a standard way, e.g. in the TrainJob status. Then the Katib trial controller would watch the TrainJob to retrieve the metrics related to the experiment. This would enable Katib to operate without a database.
The metricsCollectorSpec pattern could be configured in the ClusterTrainingRuntime in a way that's compatible with the runtimes. For custom trainers, the SDK would be responsible for customizing it depending on the provided trainer.
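As a hypothetical illustration of the watch-based consumption described above, a client could watch TrainJob objects and read metrics from their status as they change. This is a sketch using the Python kubernetes client; the `trainer.kubeflow.org/v1alpha1` group/version, the `trainjobs` plural and the status field paths are assumptions, not the API defined by the KEP:

```python
from kubernetes import client, config, watch

# Watch TrainJob objects in a namespace and print metrics from their status.
config.load_kube_config()  # or load_incluster_config() inside a pod
api = client.CustomObjectsApi()

w = watch.Watch()
for event in w.stream(api.list_namespaced_custom_object,
                      group="trainer.kubeflow.org", version="v1alpha1",
                      namespace="team-a", plural="trainjobs"):
    obj = event["object"]
    trainer_status = obj.get("status", {}).get("trainerStatus", {})  # hypothetical field
    print(obj["metadata"]["name"], event["type"], trainer_status.get("trainMetrics"))
```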
Thanks for the clarification @astefanutti. I've looked at Katib's metricsCollectorSpec and thought about it in more detail, and I don't think we need to, or should, use the same pattern here.
Rather than putting flexibility in the control plane, we're proposing that we use the runtime pods to get flexibility. Each ML framework would have different instrumentation code, but all frameworks would communicate using the same "protocol" (e.g. the specific log messages). This means that we don't need any flexibility in the control plane.
The advantages of using code in the runtime are:
- much more flexibility than can be provided by metricsCollectorSpec,
- it's much easier for data scientists to consume: configuration is all in code, rather than split across a YAML file and code,
- the control plane can be kept simpler (in particular, I think Katib needs a sidecar to collect the metrics).
To make it easier for users to instrument their code, we can leverage the SDK to automatically instrument submitted training jobs if the framework supports it (e.g. for frameworks that have callback mechanisms).
For frameworks that don't support callbacks, or for users that aren't creating jobs via the SDK, we could provide a runtime utility library that implements the protocol and allows users to add the instrumentation themselves (e.g. it could be embedded in their runtime images). Or they are free to simply not instrument their training loop.
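To make the "protocol" idea a bit more concrete, here is a minimal, hypothetical sketch of a helper such a runtime utility library could provide. The JSON-per-line framing, the `kubeflow.org/trainer-status` key and the `report_progress` name are illustrative assumptions only; the actual message format would be whatever the KEP defines:

```python
import json
import sys
import time

def report_progress(step: int, total_steps: int, epoch: int, total_epochs: int,
                    train_metrics: dict[str, float] | None = None,
                    eval_metrics: dict[str, float] | None = None) -> None:
    """Write one structured status line to stdout for the controller to pick up.

    The field names and line framing here are assumptions for illustration only.
    """
    message = {
        "kubeflow.org/trainer-status": {
            "step": step,
            "totalSteps": total_steps,
            "epoch": epoch,
            "totalEpochs": total_epochs,
            "trainMetrics": train_metrics or {},
            "evalMetrics": eval_metrics or {},
            "timestamp": int(time.time()),
        }
    }
    print(json.dumps(message), file=sys.stdout, flush=True)

# Hypothetical use inside a training loop or a framework callback:
report_progress(step=450, total_steps=1000, epoch=2, total_epochs=5,
                train_metrics={"loss": 0.8, "auroc": 0.7})
```

An SDK-injected callback (for frameworks that expose callback hooks) would simply call a helper like this at each logging step.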
@robert-bell thanks that sounds great!
I should have expressed the requirement for flexibility rather than jumping directly to the implementation details :)
> }
>
> type TrainJobTrainerStatus struct {
So #2909 and this PR seem to conflict with each other.
In this one, we want to make TrainJob a general solution for deep learning training.
But in #2909, we are hoping to make TrainJob a general solution for HPC workloads (not just deep learning frameworks).
cc @vsoch
I'm not sure it's blocking, as this project is called TrainJob, but I do want to call this out as a discrepancy.
I think the intention is to better support training jobs that use HPC technologies (MPI, topology and fine-grained scheduling, etc.). It's no more strange than having the MPI plugin.
@kannon92 is the conflict that #2909 would allow TrainJob to be used for more general HPC/non-training workloads, whereas this proposal is specifically for model training workloads?
If so, I'm not sure it's a problem because, as you say, TrainJob is designed for model training. Plus, the extra status information we're proposing is optional, so users can use TrainJob for more general HPC workloads if they want.
Also just to clarify, this proposal isn't limited to just deep learning frameworks. Other training frameworks, e.g. XGBoost, LightGBM, should be able to use it too.
I agree with @robert-bell, the metrics status might also be very useful for HPC jobs. We just need to figure out how to make these statuses agnostic to the job you run.
On the other hand, TrainJob's APIs are more appropriate for training jobs (e.g. trainer, initializer, potentially exporter/checkpointer cc @rst0git).
At some point, we need to discuss whether these abstractions make sense for HPC jobs or whether we need to re-think something.
These should map quite well to HPC use cases! We tend to have Figures of Merit (FOMs) that are derived from different steps or stages, and there are often multiple FOMs. The metrics look relevant here with the exception of epoch - not everything you run in HPC is going to have an epoch, but AI/ML jobs run with MPI will.
Is there a reason to hard-code metric types (e.g., TrainMetric vs EvalMetric) versus having a common metric type, and then within that type, you could have an optional variable for the context? That would be more flexible for different kinds of applications I think, and still work well for train/eval.
Honestly, if it's not required that each framework populates these statuses, it is probably fine.
It's a bit odd to have a status field that wouldn't be populated for all frameworks, but I can live with that.
> Is there a reason to hard-code metric types (e.g., TrainMetric vs EvalMetric) versus having a common metric type, and then within that type, you could have an optional variable for the context? That would be more flexible for different kinds of applications I think, and still work well for train/eval.

+1 to design a more flexible metric typing mechanism.
> It's a bit odd to have a status field that wouldn't be populated for all frameworks, but I can live with that.

@kannon92 - The intention is that the field is populated by all frameworks, but in practice we don't want to make it mandatory, to avoid introducing a barrier to adding new frameworks.
> Is there a reason to hard-code metric types (e.g., TrainMetric vs EvalMetric) versus having a common metric type, and then within that type, you could have an optional variable for the context? That would be more flexible for different kinds of applications I think, and still work well for train/eval.
@vsoch Were you thinking something like this?
```yaml
trainingStatus:
  metrics:
  - type: train
    values:
      loss: 0.8
      auroc: 0.7
      ...
  - type: eval
    values:
      loss: 0.7
      auroc: 0.6
      ...
```
`type` here could be an arbitrary "label" for the group.
This approach would preserve readability for users doing `kubectl describe trainjob <name>`.
For context, my motivation for using separate fields for train and eval metrics comes from thinking about supporting a potential future integration with HPOs, where it may be useful to be able to compare a metric value for the train and eval data sets. E.g., it may be possible to detect overfitting by comparing the train and eval loss. I appreciate this isn't particularly strong motivation or completely thought through.
That could work! I was also thinking:

```yaml
trainingStatus:
  metrics:
  - name: loss
    value: 0.8
    type: train
  - name: auroc
    value: 0.7
    type: train
```

I like your design because finding a type is an O(1) lookup instead of needing to loop through an entire list (which could be large).
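For illustration, a small hypothetical sketch of how a client might read both shapes discussed above; the dictionaries simply mirror the two YAML examples:

```python
# Grouped shape (one entry per metric group, metric values as a dict):
grouped = {"metrics": [
    {"type": "train", "values": {"loss": 0.8, "auroc": 0.7}},
    {"type": "eval", "values": {"loss": 0.7, "auroc": 0.6}},
]}

# Flat shape (one entry per metric, tagged with a type label):
flat = {"metrics": [
    {"name": "loss", "value": 0.8, "type": "train"},
    {"name": "auroc", "value": 0.7, "type": "train"},
]}

# Grouped: scan the short list of groups once, then metric lookups are dict lookups.
train_values = next(m["values"] for m in grouped["metrics"] if m["type"] == "train")
print(train_values["loss"])  # 0.8

# Flat: every metric lookup scans the whole (potentially large) list.
train_loss = next(m["value"] for m in flat["metrics"]
                  if m["type"] == "train" and m["name"] == "loss")
print(train_loss)  # 0.8
```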
Pull Request Test Coverage Report for Build 18878218330
Warning: This coverage report may be inaccurate. This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
💛 - Coveralls
szaher left a comment:
Thanks very much @robert-bell @abhijeet-dhumal
This is a great proposal; it brings a lot of value:
- Excellent user experience: the goal is perfect. A data scientist seeing PROGRESS: 45% and ETA: 15m is a massive win.
- No dependency on external components like Prometheus or sidecars
- Nice SDK integration with the transformers library

But it has a few problems too:
- Architectural anti-pattern, by allowing the control plane to access the data plane and stream logs
- While the footprint of the goroutines on the operator itself is minimal, it abuses the control plane by maintaining a constant GET request to the /api/v1/namespaces/.../pods/.../log?follow=true connection for every single active TrainJob (so scalability to 1000 jobs is questionable)
- Broad security permissions for the training operator to access cluster-wide pod logs
- Error-prone: if training is verbose, some datasets might have very long example lines, or there might be extremely long log messages
Another proposed solution is as follows:
Enable the KubeFlowCallBack to send status to the TrainJob directly via the Kubernetes API. When the operator starts a new training job, it creates a ServiceAccount and injects it into the master/primary pod, and lets the callback get the status in a clean way and patch or update the TrainJob directly, injecting the required status updates.
This approach is much cleaner:
- No log parsing
- No goroutines
- No impact on the kube-apiserver from fetching and streaming logs
- Only the master/primary pod can update the status; other worker pods check whether they have the right permissions and skip updating the status if they don't
- Lets the actual train job update its own status
Overall, this is excellent and I think we should push it forward.
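For illustration of the alternative sketched in the comment above, here is a minimal, hypothetical example of a callback in the primary pod patching the TrainJob status directly, assuming the injected ServiceAccount has patch rights on the TrainJob status subresource. The `trainer.kubeflow.org/v1alpha1` group/version and the status field names are assumptions, not the API defined by the KEP:

```python
from kubernetes import client, config

def patch_trainjob_status(name: str, namespace: str,
                          progress_percentage: int, current_epoch: int) -> None:
    """Patch hypothetical progress fields on the TrainJob status subresource."""
    config.load_incluster_config()  # use the ServiceAccount injected into the pod
    api = client.CustomObjectsApi()
    api.patch_namespaced_custom_object_status(
        group="trainer.kubeflow.org",   # assumed group/version for TrainJob
        version="v1alpha1",
        namespace=namespace,
        plural="trainjobs",
        name=name,
        body={"status": {"trainerStatus": {   # hypothetical status fields
            "progressPercentage": progress_percentage,
            "currentEpoch": current_epoch,
        }}},
    )

# Hypothetical call from a training callback in the primary pod:
# patch_trainjob_status("my-trainjob", "team-a", progress_percentage=45, current_epoch=2)
```

The trade-off, as noted later in the thread, is that this grants pods running arbitrary user code write access to the API server.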
> ### Goals
>
> 1. **Expose real-time progress information and training metrics through the TrainJobs CR.** For example: percentage complete, estimated time remaining, current step/epoch, total steps/epochs, eval metrics.
I would keep training and eval metrics outside the status, as data scientists have dedicated tools for this like MLflow, TensorBoard, wandb, etc.
Also, it seems the proposed solution will capture only the last value from the logs, which isn't too helpful for the data scientist or ML persona, who would prefer something like TensorBoard for understanding what happened during their training process.
This is unlike Katib or HPO, where we have an optimization problem and the final metrics are important.
> Also, it seems the proposed solution will capture only the last value from the logs, which isn't too helpful for the data scientist or ML persona, who would prefer something like TensorBoard for understanding what happened during their training process.

I think the main target of this KEP is client integration, e.g. a CLI (e.g. kubectl), a GUI or Katib that would consume those metrics, not the data scientist persona, and it's indeed not the goal to overlap with tools like TensorBoard.

> This is unlike Katib or HPO, where we have an optimization problem and the final metrics are important.

Exactly, Katib would indeed be a candidate client to consume those metrics to carry out the optimizations.
> ### Story 2: Data Scientist / ML Engineer monitoring training jobs
>
> As a data scientist or ML Engineer, I want to see real-time information about my training jobs so I can make decisions on whether those jobs are likely to succeed or whether intervention is required.
Will the controller always stream the logs to provide real-time information?
- How does that scale?
- Is this design an anti-pattern, since the control plane accesses the data plane? The control plane's job is to make decisions. It includes the API server, schedulers, and your controller/operator. Its job is to "establish and enforce policy." It is not designed for high-throughput, high-volume data processing; it's designed for state reconciliation.
It's a fair point that the log streaming approach mixes the control plane with the data plane, though I wonder what the max throughput would be in practice? I'd anticipate that applications would add at most a few lines to the logs per second, meaning throughput from a single job would be ~kB/s, and throughput from 1000 active train jobs may be ~MB/s, which should not be particularly onerous on the controller. This is something that could be benchmarked, of course.
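As a rough, hypothetical back-of-envelope check of that estimate (the line rate and line size are assumptions, not measurements):

```python
lines_per_second = 3     # status lines per job per second (assumed)
bytes_per_line = 100     # approximate size of one status line (assumed)
active_jobs = 1000

per_job = lines_per_second * bytes_per_line   # ~300 B/s per job (~kB/s order)
total = per_job * active_jobs                 # ~300 kB/s across 1000 jobs (~MB/s order)
print(f"per job: {per_job} B/s, total: {total / 1e6:.2f} MB/s")
```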
> trainMetrics: dict[str, float] | None
> evalMetrics: dict[str, float] | None
Those are very generic and not well defined. What metrics are we expecting? Since we are going to develop the callback in the SDK, should this be better defined or structured, to identify which metrics will be available?
I believe the last value of the training metrics may not be of much value to data scientists, who are more interested in how a specific metric behaves over the training period. Also, this overlaps with things like MLflow, TensorBoard, etc., which we already have a proposal to include in Kubeflow: kubeflow/community#892
progressPercentage, estimatedRemainingSeconds, currentEpoch, etc. already give very good information about the training process, so I suggest we get rid of trainMetrics and evalMetrics in favor of MLflow.
> I believe the last value of the training metrics may not be of much value to data scientists, who are more interested in how a specific metric behaves over the training period. Also, this overlaps with things like MLflow, TensorBoard, etc., which we already have a proposal to include in Kubeflow: kubeflow/community#892

As discussed in the comment above, the main audience / target of this KEP is not the data scientist persona, and it's indeed not the goal to overlap with tools like TensorBoard nor with the scope of experiment tracking.
The main target of this KEP is client integration, e.g. a GUI or Katib that would consume those metrics.
To add to this analysis, given the TrainJob "primary" Pods run arbitrary user code and extra packages, granting these Pods privileged access to the API server would warrant hardened security sandboxing.
Granting permissions so the train job itself can directly update the training status comes with a few more downsides -
Hi folks, thanks again for this effort! This feature should significantly improve the experience of AI workloads on Kubernetes, and has a lot of future potential for checkpointing, scheduling, etc. 🎉
@astefanutti @robert-bell Do we need to update the KEP based on our conversation with @astefanutti and @tenzen-y at KubeCon, so we can review and move this forward?
cc @romanbaron @EkinKarabulut @Ronkahn21 As we discussed, this should address the problems with detecting the running status of a TrainJob mentioned by Ron in this talk: https://sched.co/28D8Z
What this PR does / why we need it:
This is a KEP for adding real-time progress and exposing training metrics on TrainJob.
Which issue(s) this PR fixes (optional, in `Fixes #<issue number>, #<issue number>, ...` format, will close the issue(s) when PR gets merged):
Part of #2779