
feat(docs): KEP-2599: Mutable Runtimes#3428

Open
robert-bell wants to merge 5 commits into kubeflow:master from robert-bell:kep-mutable-runtimes

Conversation

@robert-bell
Contributor

Summary

This KEP introduces a new TrainingRuntimeSnapshot CRD for containing a point-in-time snapshot of the runtime configuration. TrainJobs create a snapshot of their runtime configuration on first reconciliation, decoupling job execution from runtime changes.

Proposed Changes

  • New TrainingRuntimeSnapshot CRD: Stores point-in-time snapshot of runtime configuration
  • Snapshot on first reconciliation: TrainJobs capture their runtime config once and use it throughout their lifecycle
  • Remove runtime finalizers: No longer needed since TrainJobs only reference runtimes at creation time
  • Remove TrainJobWatcher interface: removes boilerplate that is no longer needed

Benefits

  • Platform admins can update runtime images (e.g., for security patches, newer PyTorch versions) without risk
  • Runtimes can be deleted without worrying about breaking existing jobs
  • No more namespace deletion issues due to stuck finalizers
  • TrainJobs become fully self-contained (only depend on resources they own)

Migration

Migration is fully automatic with no manual intervention required. Existing non-finished TrainJobs automatically create snapshots on first reconciliation after upgrade. Runtime finalizers are progressively removed as snapshots are created.
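The create-once semantics behind this migration can be sketched with simplified, hypothetical types (the real controller works against the Kubernetes API via client-go; this only illustrates why later runtime edits cannot affect a job once its snapshot exists):

```go
package main

import "fmt"

// Simplified stand-ins for the real API types.
type Runtime struct{ Image string }
type Snapshot struct{ Runtime Runtime }

type Cluster struct {
	runtimes  map[string]Runtime
	snapshots map[string]Snapshot
	finalized map[string]bool // runtimes still holding a finalizer
}

// reconcile creates a snapshot for the TrainJob on first reconciliation
// and then drops the runtime finalizer, as the migration section describes.
func (c *Cluster) reconcile(jobName, runtimeName string) error {
	if _, ok := c.snapshots[jobName]; ok {
		return nil // snapshot already exists; the runtime is no longer consulted
	}
	rt, ok := c.runtimes[runtimeName]
	if !ok {
		return fmt.Errorf("runtime %q not found", runtimeName)
	}
	c.snapshots[jobName] = Snapshot{Runtime: rt} // point-in-time copy
	delete(c.finalized, runtimeName)             // finalizer no longer needed
	return nil
}

func main() {
	c := &Cluster{
		runtimes:  map[string]Runtime{"pytorch": {Image: "pytorch:2.0"}},
		snapshots: map[string]Snapshot{},
		finalized: map[string]bool{"pytorch": true},
	}
	_ = c.reconcile("job-a", "pytorch")
	c.runtimes["pytorch"] = Runtime{Image: "pytorch:2.1"} // admin updates the runtime
	_ = c.reconcile("job-a", "pytorch")                   // no-op: the snapshot wins
	fmt.Println(c.snapshots["job-a"].Runtime.Image)       // prints "pytorch:2.0"
}
```

Because the snapshot is created at most once, subsequent runtime updates or deletions leave running jobs untouched.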

Relates to #2599

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 14, 2026 07:59
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment


Pull request overview

This PR adds KEP-2599 describing a design to make TrainingRuntime/ClusterTrainingRuntime effectively mutable by snapshotting runtime configuration into a per-TrainJob TrainingRuntimeSnapshot CRD on first reconciliation, decoupling job execution from subsequent runtime changes.

Changes:

  • Introduces a new proposal document detailing TrainingRuntimeSnapshot and the reconciliation/migration approach.
  • Describes removal of runtime finalizers and related controller/test implications (as part of the proposed implementation plan).
  • Outlines test strategy (e2e/integration/unit) for snapshot creation and migration behavior.

@robert-bell
Contributor Author

Hey @VassilisVassiliadis - could I get some help understanding how this proposal interacts with your use case in #2599 (comment)? I see kubernetes-sigs/kueue#9792 was merged, but can you help me understand where things got on the trainer side?

From my current understanding, I think there's a conflict with your workflow. Based on my reading, your controller:

  1. Creates a suspended TrainJob with an initial spec.runtimeRef
  2. Analyzes the job and patches spec.runtimeRef to select better resources
  3. Unsuspends the job for Kueue to schedule

My current proposal would break this because the TrainJob controller reconciles suspended jobs immediately and would create the TrainingRuntimeSnapshot before step 2, locking in the initial runtime instead of your updated one.

I'd be happy to update the proposal - perhaps we only create a snapshot once the TrainJob is unsuspended. This would allow your controller to freely modify spec.runtimeRef while suspended, and the snapshot would capture the final configuration.

Can you let me know if I've understood your workflow correctly? Would this fix address your needs or are there other considerations?

@andreyvelich do you have any additional info? Looks like runtimeRef is immutable in the API - do we have any design for making it mutable?

@abhijeet-dhumal
Member

@robert-bell I guess it's worth noting that issue #2599 was originally filed to make runtimes immutable via webhook validations. This KEP proposes an alternative approach - making runtimes freely mutable by decoupling TrainJobs from live runtime state via snapshots.
Is it OK to add a sentence here (or in Motivation) explicitly acknowledging this divergence?

```go
// TrainingRuntimeSnapshot contains a point-in-time snapshot of a TrainingRuntime or ClusterTrainingRuntime as it was
// observed when a TrainJob was first reconciled.
type TrainingRuntimeSnapshot struct {
```
Member


Can we add a source annotation like
`trainer.kubeflow.org/snapshot-source: ClusterTrainingRuntime/pytorch-default` on the snapshot?
Without it, once a runtime is deleted, there's no way to know which runtime a TrainJob originally used. It may become useful for debugging, auditing, and the migration scenario where you need to verify what was captured.

Contributor Author


I did wonder this, but I wasn't sure it was necessary given the TrainJob runtimeRef is currently immutable. There's already lineage via the TrainJob. I also wasn't sure how useful this info is given the runtime could have changed between the resource being created and being deleted. Wdyt?

Member


Fair point that runtimeRef provides lineage. I'd still lean towards adding it - it's one annotation and the value is in debugging production issues quickly. I think when you're using kubectl get trainingruntimesnapshots and trying to correlate with runtimes, having the source right on the snapshot saves a hop through the TrainJob. But not a hill I'll die on - fine to defer to implementation time if others feel the same. 👍

Contributor Author


It's a good point - I can see the benefit of the annotation, and it's low risk given it's only an annotation. I've updated the text.
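For illustration, the agreed annotation could look like this on a snapshot object (the `apiVersion` and the exact annotation key here are assumptions; the KEP text settles the final names):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainingRuntimeSnapshot
metadata:
  name: my-trainjob          # same name/namespace as the owning TrainJob
  namespace: team-a
  annotations:
    trainer.kubeflow.org/snapshot-source: ClusterTrainingRuntime/pytorch-default
```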

```yaml
- list
- patch
- update
- watch
```
Member

@abhijeet-dhumal abhijeet-dhumal Apr 15, 2026


nit: I guess the `delete` verb is missing here. The controller might need it for edge cases (e.g. snapshot recreation if the ownerReference is wrong, or test cleanup).
The SSA `FieldOwner("trainer")` pattern used elsewhere in the controller also benefits from having `delete` available. WDYT?

Contributor Author

@robert-bell robert-bell Apr 15, 2026


Good spot, though it looks like the other resources don't have delete permissions and we should be consistent. I guess we technically don't need delete because the API server's garbage collector deletes the resources for us?
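For context, the verbs under discussion would land in a rule along these lines (a sketch; the API group and resource plural are assumptions derived from the CRD name):

```yaml
rules:
  - apiGroups: ["trainer.kubeflow.org"]
    resources: ["trainingruntimesnapshots"]
    verbs: ["create", "get", "list", "patch", "update", "watch"]
    # no "delete": snapshots are owned by their TrainJob and are removed
    # by the API server's garbage collector via ownerReferences
```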


We propose making the TrainJob only look up the runtime configuration on first reconciliation and instead store a "snapshot" of the runtime configuration in a separate object:

* create a new namespaced custom resource `TrainingRuntimeSnapshot` with the same API as the `TrainingRuntime` resource. This is an internal resource and should only be created or updated by the trainer controller. Each `TrainJob` will have one `TrainingRuntimeSnapshot` with the same name and namespace as the `TrainJob`.
Member


Using the same name as the TrainJob is clean, but what happens if a TrainingRuntimeSnapshot with that name already exists when the controller tries to create it?

Contributor Author


Good thought, though I think it shouldn't happen given the snapshot is internal and should only be created by this controller. For there to be a clash under normal operation, I guess there'd need to be two different TrainJobs with the same name and namespace? Could it happen if a trainjob is deleted and quickly recreated with the same name but different config?

Btw - this is consistent with the pattern used elsewhere, e.g. mpi secrets are named <trainjob-name>-mpi-ssh-auth.


### API and RBAC changes

The `TrainingRuntimeSnapshot` will have the following API:
Member


Following up on recent issues with Kueue: #2599 (comment)
The current TrainJob controller reconciles suspended jobs - it calls reconcileObjects for any non-finished TrainJob. IIUC, "snapshot on first reconciliation" means a suspended TrainJob gets snapshotted immediately on creation, before an external controller (e.g. the Kueue integration) can patch spec.runtimeRef?

Contributor Author


Yes, this is something I'm wanting to discuss in today's community call.

cc. @VassilisVassiliadis

Contributor

@VassilisVassiliadis VassilisVassiliadis Apr 15, 2026


AFAIK the runtimeRef field is currently immutable so I think we can break this down into different questions:

  1. how do we decouple the lifecycle of a RuntimeTraining (RT)/ClusterRuntimeTraining (CRT) object from the TrainJob objects that reference them?
  2. how do we allow external controllers (e.g. Kueue) or users to patch "important" fields in a TrainJob ?
  3. do we want to allow mutating the runtimeRef field?

For example, if the goal is to allow starting a TrainJob using:

  • all options from a RT/CRT object, BUT
  • with a different image

There's several ways to do this.
One is:

  1. create a new RT/CRT object which is a copy of the old one but with an updated image
  2. reference it

Another is:

  1. Reference the "old" RT/CRT object,
  2. override the image using the runtimePatches API (KEP-2170: Kubeflow Trainer V2 API #2170 does not support mutating this field right now)

If we treat a RT/CRT as the "lowest priority blueprint" for a TrainJob and allow the TrainJob itself, or even external controllers/users, to override the values of fields that exist in this "blueprint", then this KEP is a valid answer to the question:

  1. how do we decouple the lifecycle of a RuntimeTraining (RT)/ClusterRuntimeTraining (CRT) object from the TrainJob objects that reference them?

Technically, we still need to lift the immutability of the spec.runtimeRef field so that it points to the Snapshot, but we can do it in a way where the TrainJob webhook only allows changing the value from a RT/CRT to a RT/CRT-Snapshot. This way it's immutable except during first reconciliation.
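That webhook rule could look roughly like this (a sketch with simplified, hypothetical types; the real webhook operates on admission requests for the TrainJob API):

```go
package main

import "fmt"

type RuntimeRef struct{ Kind, Name string }

// validateRuntimeRefChange allows runtimeRef to change only once: from a
// (Cluster)TrainingRuntime to the controller-created TrainingRuntimeSnapshot.
// Any other mutation is rejected, preserving effective immutability.
func validateRuntimeRefChange(old, updated RuntimeRef) error {
	if old == updated {
		return nil // no change
	}
	fromRuntime := old.Kind == "TrainingRuntime" || old.Kind == "ClusterTrainingRuntime"
	toSnapshot := updated.Kind == "TrainingRuntimeSnapshot"
	if fromRuntime && toSnapshot {
		return nil // the one permitted transition, on first reconciliation
	}
	return fmt.Errorf("spec.runtimeRef is immutable (attempted %s/%s -> %s/%s)",
		old.Kind, old.Name, updated.Kind, updated.Name)
}

func main() {
	err := validateRuntimeRefChange(
		RuntimeRef{Kind: "ClusterTrainingRuntime", Name: "pytorch-default"},
		RuntimeRef{Kind: "TrainingRuntimeSnapshot", Name: "job-a"},
	)
	fmt.Println(err == nil) // prints "true": runtime -> snapshot is allowed
}
```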

P.S. Some more context:

The above suggestion implies the following "blueprint" priority:

  • highest priority: The runtimePatches api (this overrides everything else below)
  • middle priority: the remaining fields of TrainJob (e.g. trainjob.spec.trainer, etc)
  • lowest priority: the RT/CRT fields
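The priority ordering above amounts to a layered merge, sketched here with plain maps (an illustration only; real merging would use strategic merge patch over the API types):

```go
package main

import "fmt"

// layeredMerge applies layers lowest-priority first; later layers override.
func layeredMerge(layers ...map[string]string) map[string]string {
	out := map[string]string{}
	for _, layer := range layers {
		for k, v := range layer {
			out[k] = v
		}
	}
	return out
}

func main() {
	runtime := map[string]string{"image": "pytorch:2.0", "replicas": "4"} // RT/CRT blueprint
	trainJob := map[string]string{"replicas": "8"}                        // TrainJob fields
	patches := map[string]string{"image": "pytorch:2.1"}                  // runtimePatches
	fmt.Println(layeredMerge(runtime, trainJob, patches))
}
```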

Member


+1 to @VassilisVassiliadis's framing of breaking this into three separate questions. For this KEP specifically, I think the cleanest path forward is: snapshot only when spec.suspend transitions from true to false (or on first reconciliation if already unsuspended). This gives external controllers the window they need to modify things while suspended, and the snapshot captures the final state. Would that work for both your use cases?
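The trigger condition suggested here reduces to a small predicate (a hypothetical helper, simplified to booleans):

```go
package main

import "fmt"

// shouldSnapshot reports whether the controller should capture the runtime
// snapshot now: only once, and only while the job is not suspended, so
// external controllers can still patch the job during the suspended window.
func shouldSnapshot(isSuspended, snapshotExists bool) bool {
	if snapshotExists {
		return false // already captured; never re-snapshot
	}
	return !isSuspended
}

func main() {
	fmt.Println(shouldSnapshot(false, false)) // unsuspended, no snapshot -> true
	fmt.Println(shouldSnapshot(true, false))  // still suspended -> false
	fmt.Println(shouldSnapshot(false, true))  // snapshot already exists -> false
}
```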

Contributor


I think this should work.

Contributor Author


Thanks @VassilisVassiliadis — the breakdown into separate questions is very helpful.

For this KEP, I suggest we focus scope on the first question:

how do we decouple the lifecycle of a RuntimeTraining (RT)/ClusterRuntimeTraining (CRT) object from the TrainJob objects that reference them?

The other two:

how do we allow external controllers (e.g. Kueue) or users to patch "important" fields in a TrainJob?

do we want to allow mutating the runtimeRef field?

seem broader in scope and likely need more dedicated design, given their impact on API semantics, maintainability, and testing. It may be better to address those in a follow-up KEP or directly as part of your KEP.

Approaches like snapshot-on-unsuspend could be explored as part of that work, though we should also evaluate the runtimePatches API approach.

Importantly, I don't think this proposal blocks those future directions - it preserves the current immutability behaviour and just decouples the job from the runtime.

@VassilisVassiliadis if we limit scope to the first question, is there anything in the design that would impact the controller-driven workflow you're thinking of, or introduce constraints we should account for now?

Contributor


@robert-bell I agree with everything you said.

Especially:

For this KEP, I suggest we focus scope on the first question: how do we decouple the lifecycle of a RuntimeTraining (RT)/ClusterRuntimeTraining (CRT) object from the TrainJob objects that reference them?

And:

[the other 2 questions] seem broader in scope and likely need more dedicated design, given their impact on API semantics, maintainability, and testing. It may be better to address those in a follow-up KEP or directly as part of your kep.

I think this KEP is a valid solution to the "1st question" without introducing any blockers to finding answers for the other 2 questions.

The one thing that's hard to give my opinion on is the introduction of the TrainingRuntimeSnapshot CRD. That is something the core maintainers should comment on because they have a better view of Trainer than I do. My 2 cents is that this is fine because this proposed approach addresses an existing problem (lifecycles of RT/CRT tied to TrainJob objects) in a way that's better than the alternatives and since TrainingRuntimeSnapshot is an "internal" CRD users won't interact with it and thus they won't mind.

@andreyvelich what do you think ?

@robert-bell
Contributor Author

Thanks @abhijeet-dhumal - I've noted the divergence from the original v2 KEP in the summary; you're right, it needs calling out clearly.

robert-bell and others added 5 commits April 21, 2026 16:40
Proposes allowing TrainingRuntimes and ClusterTrainingRuntimes to be
mutable by introducing TrainingRuntimeSnapshot resources. TrainJobs
snapshot their referenced runtime configuration on first reconciliation,
decoupling job execution from runtime changes.

Key changes:
- New TrainingRuntimeSnapshot CRD to store point-in-time runtime config
- Remove finalizers from runtimes (no longer needed)
- Remove TrainJobWatcher interface and boilerplate
- Automatic migration for existing TrainJobs on upgrade

Addresses issue kubeflow#2599

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Rob Bell <robell@redhat.com>
Improvements to the mutable runtimes KEP:
- Clarify operational problems with finalizers (orphaned finalizers,
  namespace deletion issues)
- Better explain why runtime updates are risky (fetched on every
  reconciliation, no design-level guarantees)
- Add concrete examples of runtime proliferation (pytorch-2.0, pytorch-2.1)
- Add RBAC section showing required permissions
- Improve summary to clearly state API change (new CRD)
- Clarify test cases for migration scenarios

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Rob Bell <robell@redhat.com>
- Fix relative link to include README.md in Trainer v2 design reference
- Remove extra space in Goals heading
- Change 'Training Job' to 'TrainJobs' for consistency
- Fix 'lookup' to 'look up' (correct verb form)
- Add 'create' verb to RBAC permissions for TrainingRuntimeSnapshot

Signed-off-by: Rob Bell <robell@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Rob Bell <robell@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Rob Bell <robell@redhat.com>
@robert-bell robert-bell force-pushed the kep-mutable-runtimes branch from 578dd50 to caae12a on April 21, 2026 15:40