feat(docs): KEP-2599: Mutable Runtimes #3428
robert-bell wants to merge 5 commits into kubeflow:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files; the full list of commands accepted by this bot can be found here.
Pull request overview
This PR adds KEP-2599 describing a design to make TrainingRuntime/ClusterTrainingRuntime effectively mutable by snapshotting runtime configuration into a per-TrainJob TrainingRuntimeSnapshot CRD on first reconciliation, decoupling job execution from subsequent runtime changes.
Changes:
- Introduces a new proposal document detailing `TrainingRuntimeSnapshot` and the reconciliation/migration approach.
- Describes removal of runtime finalizers and related controller/test implications (as part of the proposed implementation plan).
- Outlines test strategy (e2e/integration/unit) for snapshot creation and migration behavior.
Hey @VassilisVassiliadis - could I get some help understanding how this proposal interacts with your use case in #2599 (comment)? I see kubernetes-sigs/kueue#9792 was merged, but can you help me understand where things got on the trainer side? From my current understanding, I think there's a conflict with your workflow. Based on my reading, your controller:
My current proposal would break this because the TrainJob controller reconciles suspended jobs immediately and would create the TrainingRuntimeSnapshot before step 2, locking in the initial runtime instead of your updated one. I'd be happy to update the proposal - perhaps we only create the snapshot once the TrainJob is unsuspended. This would allow your controller to freely modify spec.runtimeRef while suspended, and the snapshot would capture the final configuration. Can you let me know if I've understood your workflow correctly? Would this fix address your needs, or are there other considerations? @andreyvelich do you have any additional info? Looks like
@robert-bell I guess it's worth noting that issue #2599 was originally filed to make runtimes immutable via webhook.
```go
// TrainingRuntimeSnapshot contains a point-in-time snapshot of a TrainingRuntime or ClusterTrainingRuntime as it was
// observed when a TrainJob was first reconciled.
type TrainingRuntimeSnapshot struct {
```
Thinking: can we add a source annotation like `trainer.kubeflow.org/snapshot-source: ClusterTrainingRuntime/pytorch-default` on the snapshot?
Without it, once a runtime is deleted, there's no way to know which runtime a TrainJob originally used. It may become useful for debugging, auditing, and the migration scenario where you need to verify what was captured.
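To make the suggestion concrete, here is a minimal sketch of how the controller could stamp the source annotation when it writes the snapshot. The helper name is illustrative, not part of the proposal; only the annotation key is the one suggested above.

```go
package main

import "fmt"

// stampSnapshotSource sets the suggested source annotation on a snapshot's
// annotation map, recording which runtime the snapshot was taken from
// (hypothetical helper; not controller code from the KEP).
func stampSnapshotSource(annotations map[string]string, kind, name string) map[string]string {
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations["trainer.kubeflow.org/snapshot-source"] = kind + "/" + name
	return annotations
}

func main() {
	a := stampSnapshotSource(nil, "ClusterTrainingRuntime", "pytorch-default")
	fmt.Println(a["trainer.kubeflow.org/snapshot-source"]) // prints "ClusterTrainingRuntime/pytorch-default"
}
```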
I did wonder this, but I wasn't sure it was necessary given the TrainJob's runtimeRef is currently immutable. There's already lineage via the TrainJob. I also wasn't sure how useful this info is given the runtime could have changed between the resource being created and being deleted. Wdyt?
Fair point that runtimeRef provides lineage. I'd still lean towards adding it - it's one annotation, and the value is in debugging production issues quickly. When you're using `kubectl get trainingruntimesnapshots` and trying to correlate with runtimes, having the source right on the snapshot saves a hop through the TrainJob. But it's not a hill I'll die on - fine to defer to implementation time if others feel the same. 👍
It's a good point - I can see the benefit of the annotation, and it's low risk given it's only an annotation. I've updated the text.
```yaml
- list
- patch
- update
- watch
```
nit: I guess the delete verb is missing here. The controller might need it for edge cases (e.g. snapshot recreation if the ownerReference is wrong, or test cleanup). The SSA FieldOwner("trainer") pattern used elsewhere in the controller also benefits from having delete available. WDYT?
Good spot, though it looks like the other resources don't have delete permissions and we should be consistent. I guess we technically don't need delete because the API server's garbage collection deletes the resources for us?
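For reference, the resulting rule might look something like this (a sketch, assuming the snapshot lives in the trainer.kubeflow.org API group; delete is deliberately omitted to stay consistent with the other resources):

```yaml
# Hypothetical RBAC rule sketch for the new resource; cleanup relies on
# ownerReference-based garbage collection instead of a delete verb.
- apiGroups:
    - trainer.kubeflow.org
  resources:
    - trainingruntimesnapshots
  verbs:
    - create
    - get
    - list
    - patch
    - update
    - watch
```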
We propose making the TrainJob only look up the runtime configuration on first reconciliation and instead store a "snapshot" of the runtime configuration in a separate object:
* create a new namespaced custom resource `TrainingRuntimeSnapshot` with the same API as the `TrainingRuntime` resource. This is an internal resource and should only be created or updated by the trainer controller. Each `TrainJob` will have one `TrainingRuntimeSnapshot` with the same name and namespace as the `TrainJob`.
Using the same name as the TrainJob is clean, but what happens if a TrainingRuntimeSnapshot with that name already exists when the controller tries to create it?
Good thought, though I think it shouldn't happen given the snapshot is internal and should only be created by this controller. For there to be a clash under normal operation, I guess there'd need to be two different TrainJobs with the same name and namespace. Could it happen if a TrainJob is deleted and quickly recreated with the same name but different config?
Btw - this is consistent with the pattern used elsewhere, e.g. MPI secrets are named `<trainjob-name>-mpi-ssh-auth`.
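One dependency-free way to sketch the delete-and-recreate race: compare the existing snapshot's owning UID with the current TrainJob's UID, so a stale snapshot from a recreated TrainJob is detected rather than silently reused. Names like `snapshotAction` and `ownerRef` are illustrative stand-ins, not proposal API.

```go
package main

import "fmt"

// ownerRef is a trimmed stand-in for metav1.OwnerReference, kept
// dependency-free for this sketch.
type ownerRef struct {
	Kind string
	UID  string
}

// snapshotAction decides what the controller should do when it wants a
// snapshot named after the TrainJob (hypothetical decision helper).
func snapshotAction(existing *ownerRef, trainJobUID string) string {
	switch {
	case existing == nil:
		return "create" // no snapshot yet: create it with an ownerReference to the TrainJob
	case existing.Kind == "TrainJob" && existing.UID == trainJobUID:
		return "reuse" // our own snapshot from a previous reconciliation
	default:
		// Name clash: a snapshot owned by something else, e.g. a deleted and
		// quickly recreated TrainJob whose old snapshot hasn't been GC'd yet.
		return "conflict"
	}
}

func main() {
	fmt.Println(snapshotAction(nil, "uid-1")) // prints "create"
}
```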
### API and RBAC changes
The `TrainingRuntimeSnapshot` will have the following API:
Following up on recent issues with Kueue: #2599 (comment)
The current TrainJob controller reconciles suspended jobs - it calls reconcileObjects for any non-finished TrainJob. IIUC, "snapshot on first reconciliation" means a suspended TrainJob gets snapshotted immediately on creation, before an external controller (e.g. the Kueue integration) can patch spec.runtimeRef?
Yes, this is something I'm wanting to discuss in today's community call.
AFAIK the runtimeRef field is currently immutable, so I think we can break this down into different questions:
- how do we decouple the lifecycle of a TrainingRuntime (RT)/ClusterTrainingRuntime (CRT) object from the TrainJob objects that reference them?
- how do we allow external controllers (e.g. Kueue) or users to patch "important" fields in a TrainJob?
- do we want to allow mutating the runtimeRef field?
For example, if the goal is to allow starting a TrainJob using:
- all options from a RT/CRT object, BUT
- with a different image
There's several ways to do this.
One is:
- create a new RT/CRT object which is a copy of the old one but with an updated image
- reference it
Another is:
- Reference the "old" RT/CRT object,
- override the image using the runtimePatches API (KEP-2170: Kubeflow Trainer V2 API #2170 does not support mutating this field right now)
If we treat a RT/CRT as the "lowest priority blueprint" for a TrainJob and allow the TrainJob itself, or even external controllers/users, to override the values of fields that exist in this "blueprint", then this KEP is a valid answer to the question:
- how do we decouple the lifecycle of a TrainingRuntime (RT)/ClusterTrainingRuntime (CRT) object from the TrainJob objects that reference them?
Technically, we still need to lift the immutability of the spec.runtimeRef field so that it points to the Snapshot, but we can do it in a way where the TrainJob webhook only allows changing the value from a RT/CRT to a RT/CRT Snapshot. This way it's immutable except during the first reconciliation.
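A sketch of that webhook rule follows. The helper is hypothetical; a real check would compare full object references, and the kind names assume the snapshot CRD is called TrainingRuntimeSnapshot as proposed.

```go
package main

import "fmt"

// runtimeRefLite is a trimmed stand-in for the TrainJob's runtime reference.
type runtimeRefLite struct{ Kind, Name string }

// allowRuntimeRefChange mirrors the rule described above: spec.runtimeRef may
// only change by repointing a TrainingRuntime/ClusterTrainingRuntime at a
// TrainingRuntimeSnapshot (the controller's first-reconciliation rewrite);
// every other mutation stays rejected.
func allowRuntimeRefChange(oldRef, newRef runtimeRefLite) bool {
	if oldRef == newRef {
		return true // unchanged: always fine
	}
	fromRuntime := oldRef.Kind == "TrainingRuntime" || oldRef.Kind == "ClusterTrainingRuntime"
	return fromRuntime && newRef.Kind == "TrainingRuntimeSnapshot"
}

func main() {
	ok := allowRuntimeRefChange(
		runtimeRefLite{Kind: "ClusterTrainingRuntime", Name: "pytorch"},
		runtimeRefLite{Kind: "TrainingRuntimeSnapshot", Name: "my-job"},
	)
	fmt.Println(ok) // prints "true"
}
```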
P.S. Some more context:
The above suggestion implies the following "blueprint" priority:
- highest priority: the runtimePatches API (this overrides everything else below)
- middle priority: the remaining fields of TrainJob (e.g. trainjob.spec.trainer, etc)
- lowest priority: the RT/CRT fields
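The precedence above can be sketched as a first-non-empty lookup. This is an illustrative model only: real merging would be a strategic merge over structured fields, not strings.

```go
package main

import "fmt"

// effectiveValue models the "blueprint" priority discussed above:
// runtimePatches beat TrainJob fields, which beat RT/CRT fields.
// "" means "unset" (illustrative simplification).
func effectiveValue(runtimePatch, trainJobField, runtimeField string) string {
	if runtimePatch != "" {
		return runtimePatch
	}
	if trainJobField != "" {
		return trainJobField
	}
	return runtimeField
}

func main() {
	// TrainJob overrides the runtime's image; no runtimePatch present.
	fmt.Println(effectiveValue("", "trainjob-image:v2", "runtime-image:v1")) // prints "trainjob-image:v2"
}
```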
+1 to @VassilisVassiliadis's framing of breaking this into three separate questions. For this KEP specifically, I think the cleanest path forward is: snapshot only when spec.suspend transitions from true to false (or on first reconciliation if already unsuspended). This gives external controllers the window they need to modify things while suspended, and the snapshot captures the final state. Would that work for both your use cases?
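Under the snapshot-on-unsuspend idea, the gating condition reduces to something like the following hypothetical helper (a sketch of the discussion, not controller code):

```go
package main

import "fmt"

// shouldSnapshot gates snapshot creation per the proposal above: take the
// snapshot the first time the job is reconciled unsuspended (which also
// covers the suspend true -> false transition); while suspended, external
// controllers may still mutate runtimeRef.
func shouldSnapshot(suspended, hasSnapshot bool) bool {
	return !suspended && !hasSnapshot
}

func main() {
	fmt.Println(shouldSnapshot(true, false))  // suspended: wait
	fmt.Println(shouldSnapshot(false, false)) // unsuspended, no snapshot yet: snapshot now
}
```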
I think this should work.
Thanks @VassilisVassiliadis — the breakdown into separate questions is very helpful.
For this KEP, I suggest we focus scope on the first question:
how do we decouple the lifecycle of a TrainingRuntime (RT)/ClusterTrainingRuntime (CRT) object from the TrainJob objects that reference them?
The other two:
how do we allow external controllers (e.g. Kueue) or users to patch "important" fields in a TrainJob?
do we want to allow mutating the runtimeRef field?
seem broader in scope and likely need more dedicated design, given their impact on API semantics, maintainability, and testing. It may be better to address those in a follow-up KEP or directly as part of your KEP.
Approaches like snapshot-on-unsuspend could be explored as part of that work, though we should also evaluate the runtimePatches API approach.
Importantly, I don't think this proposal blocks those future directions - it preserves the current immutability behaviour but simply decouples the job from the runtime.
@VassilisVassiliadis if we limit the scope to the first question, is there anything in the design that would impact the controller-driven workflow you're thinking of, or introduce constraints we should account for now?
@robert-bell I agree with everything you said.
Especially:
For this KEP, I suggest we focus scope on the first question: how do we decouple the lifecycle of a TrainingRuntime (RT)/ClusterTrainingRuntime (CRT) object from the TrainJob objects that reference them?
And:
[the other 2 questions] seem broader in scope and likely need more dedicated design, given their impact on API semantics, maintainability, and testing. It may be better to address those in a follow-up KEP or directly as part of your kep.
I think this KEP is a valid solution to the "1st question" without introducing any blockers to finding answers for the other 2 questions.
The one thing that's hard to give my opinion on is the introduction of the TrainingRuntimeSnapshot CRD. That is something the core maintainers should comment on because they have a better view of Trainer than I do. My 2 cents is that this is fine because this proposed approach addresses an existing problem (lifecycles of RT/CRT tied to TrainJob objects) in a way that's better than the alternatives, and since TrainingRuntimeSnapshot is an "internal" CRD, users won't interact with it and thus won't mind.
@andreyvelich what do you think ?
Thanks @abhijeet-dhumal - I've noted the divergence from the original v2 KEP in the summary; you're right, it needs calling out clearly.
Proposes allowing TrainingRuntimes and ClusterTrainingRuntimes to be mutable by introducing TrainingRuntimeSnapshot resources. TrainJobs snapshot their referenced runtime configuration on first reconciliation, decoupling job execution from runtime changes.

Key changes:
- New TrainingRuntimeSnapshot CRD to store point-in-time runtime config
- Remove finalizers from runtimes (no longer needed)
- Remove TrainJobWatcher interface and boilerplate
- Automatic migration for existing TrainJobs on upgrade

Addresses issue kubeflow#2599

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Rob Bell <robell@redhat.com>
Improvements to the mutable runtimes KEP:
- Clarify operational problems with finalizers (orphaned finalizers, namespace deletion issues)
- Better explain why runtime updates are risky (fetched on every reconciliation, no design-level guarantees)
- Add concrete examples of runtime proliferation (pytorch-2.0, pytorch-2.1)
- Add RBAC section showing required permissions
- Improve summary to clearly state API change (new CRD)
- Clarify test cases for migration scenarios

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Rob Bell <robell@redhat.com>
- Fix relative link to include README.md in Trainer v2 design reference
- Remove extra space in Goals heading
- Change 'Training Job' to 'TrainJobs' for consistency
- Fix 'lookup' to 'look up' (correct verb form)
- Add 'create' verb to RBAC permissions for TrainingRuntimeSnapshot

Signed-off-by: Rob Bell <robell@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Rob Bell <robell@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com> Signed-off-by: Rob Bell <robell@redhat.com>
578dd50 to caae12a
Summary
This KEP introduces a new `TrainingRuntimeSnapshot` CRD containing a point-in-time snapshot of the runtime configuration. TrainJobs create a snapshot of their runtime configuration on first reconciliation, decoupling job execution from runtime changes.

Proposed Changes

- `TrainingRuntimeSnapshot` CRD: stores a point-in-time snapshot of runtime configuration

Benefits
Migration
Migration is fully automatic with no manual intervention required. Existing non-finished TrainJobs automatically create snapshots on first reconciliation after upgrade. Runtime finalizers are progressively removed as snapshots are created.
Relates to #2599
🤖 Generated with Claude Code