
feat(docs): KEP-2599: Mutable Runtimes#3428

Open
robert-bell wants to merge 5 commits into kubeflow:master from robert-bell:kep-mutable-runtimes

Conversation

@robert-bell
Contributor

Summary

This KEP introduces a new TrainingRuntimeSnapshot CRD for containing a point-in-time snapshot of the runtime configuration. TrainJobs create a snapshot of their runtime configuration on first reconciliation, decoupling job execution from runtime changes.

Proposed Changes

  • New TrainingRuntimeSnapshot CRD: Stores point-in-time snapshot of runtime configuration
  • Snapshot on first reconciliation: TrainJobs capture their runtime config once and use it throughout their lifecycle
  • Remove runtime finalizers: No longer needed since TrainJobs only reference runtimes at creation time
  • Remove TrainJobWatcher interface: removes boilerplate that is no longer needed

Benefits

  • Platform admins can update runtime images (e.g., for security patches, newer PyTorch versions) without risk
  • Runtimes can be deleted without worrying about breaking existing jobs
  • No more namespace deletion issues due to stuck finalizers
  • TrainJobs become fully self-contained (only depend on resources they own)

Migration

Migration is fully automatic with no manual intervention required. Existing non-finished TrainJobs automatically create snapshots on first reconciliation after upgrade. Runtime finalizers are progressively removed as snapshots are created.
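The create-once semantics behind this migration can be sketched with simplified, hypothetical types (the real controller works against the Kubernetes API via client-go; this only illustrates why later runtime edits cannot affect a job once its snapshot exists):

```go
package main

import "fmt"

// Simplified stand-ins for the real API types.
type Runtime struct{ Image string }
type Snapshot struct{ Runtime Runtime }

type Cluster struct {
	runtimes  map[string]Runtime
	snapshots map[string]Snapshot
	finalized map[string]bool // runtimes still holding a finalizer
}

// reconcile creates a snapshot for the TrainJob on first reconciliation
// and then drops the runtime finalizer, as the migration section describes.
func (c *Cluster) reconcile(jobName, runtimeName string) error {
	if _, ok := c.snapshots[jobName]; ok {
		return nil // snapshot already exists; the runtime is no longer consulted
	}
	rt, ok := c.runtimes[runtimeName]
	if !ok {
		return fmt.Errorf("runtime %q not found", runtimeName)
	}
	c.snapshots[jobName] = Snapshot{Runtime: rt} // point-in-time copy
	delete(c.finalized, runtimeName)             // finalizer no longer needed
	return nil
}

func main() {
	c := &Cluster{
		runtimes:  map[string]Runtime{"pytorch": {Image: "pytorch:2.0"}},
		snapshots: map[string]Snapshot{},
		finalized: map[string]bool{"pytorch": true},
	}
	_ = c.reconcile("job-a", "pytorch")
	c.runtimes["pytorch"] = Runtime{Image: "pytorch:2.1"} // admin updates the runtime
	_ = c.reconcile("job-a", "pytorch")                   // no-op: the snapshot wins
	fmt.Println(c.snapshots["job-a"].Runtime.Image)       // prints "pytorch:2.0"
}
```

Because the snapshot is created at most once, subsequent runtime updates or deletions leave running jobs untouched.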

Relates to #2599

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 14, 2026 07:59
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Contributor

Copilot AI left a comment


Pull request overview

This PR adds KEP-2599 describing a design to make TrainingRuntime/ClusterTrainingRuntime effectively mutable by snapshotting runtime configuration into a per-TrainJob TrainingRuntimeSnapshot CRD on first reconciliation, decoupling job execution from subsequent runtime changes.

Changes:

  • Introduces a new proposal document detailing TrainingRuntimeSnapshot and the reconciliation/migration approach.
  • Describes removal of runtime finalizers and related controller/test implications (as part of the proposed implementation plan).
  • Outlines test strategy (e2e/integration/unit) for snapshot creation and migration behavior.

@robert-bell
Contributor Author

Hey @VassilisVassiliadis - could I get some help understanding how this proposal interacts with your use case in #2599 (comment)? I see kubernetes-sigs/kueue#9792 was merged, but can you help me understand where things got on the trainer side?

From my current understanding, I think there's a conflict with your workflow. Based on my reading, your controller:

  1. Creates a suspended TrainJob with an initial spec.runtimeRef
  2. Analyzes the job and patches spec.runtimeRef to select better resources
  3. Unsuspends the job for Kueue to schedule

My current proposal would break this because the TrainJob controller reconciles suspended jobs immediately and would create the TrainingRuntimeSnapshot before step 2, locking in the initial runtime instead of your updated one.

I'd be happy to update the proposal - perhaps we only create a snapshot once the TrainJob is unsuspended. This would allow your controller to freely modify spec.runtimeRef while suspended, and the snapshot would capture the final configuration.

Can you let me know if I've understood your workflow correctly? Would this fix address your needs or are there other considerations?

@andreyvelich do you have any additional info? Looks like runtimeRef is immutable in the API - do we have any design for making it mutable?

@abhijeet-dhumal
Member

@robert-bell I guess it's worth noting that issue #2599 was originally filed to make runtimes immutable via webhook validations. This KEP proposes an alternative approach - making runtimes freely mutable by decoupling TrainJobs from live runtime state via snapshots.
Is it OK to add a sentence here (or in Motivation) explicitly acknowledging this divergence?

```go
// TrainingRuntimeSnapshot contains a point-in-time snapshot of a TrainingRuntime or ClusterTrainingRuntime as it was
// observed when a TrainJob was first reconciled.
type TrainingRuntimeSnapshot struct {
```
Member


Can we add a source annotation like
`trainer.kubeflow.org/snapshot-source: ClusterTrainingRuntime/pytorch-default` on the snapshot?
Without it, once a runtime is deleted, there's no way to know which runtime a TrainJob originally used. It may become useful for debugging, auditing, and the migration scenario where you need to verify what was captured.

Contributor Author


I did wonder this, but I wasn't sure it was necessary given the TrainJob runtimeRef is currently immutable. There's already lineage via the TrainJob. I also wasn't sure how useful this info is given the runtime could have changed between the resource being created and being deleted. Wdyt?

Member


Fair point that runtimeRef provides lineage. I'd still lean towards adding it - it's one annotation and the value is in debugging production issues quickly. I think when you're using kubectl get trainingruntimesnapshots and trying to correlate with runtimes, having the source right on the snapshot saves a hop through the TrainJob. But not a hill I'll die on - fine to defer to implementation time if others feel the same. 👍

Contributor Author


It's a good point - I can see the benefit of the annotation, and it's low risk given it's only an annotation. I've updated the text.
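For illustration, the agreed annotation could look like this on a snapshot object (the `apiVersion` and the exact annotation key here are assumptions; the KEP text settles the final names):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainingRuntimeSnapshot
metadata:
  name: my-trainjob          # same name/namespace as the owning TrainJob
  namespace: team-a
  annotations:
    trainer.kubeflow.org/snapshot-source: ClusterTrainingRuntime/pytorch-default
```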

```yaml
- list
- patch
- update
- watch
```
Member

@abhijeet-dhumal abhijeet-dhumal Apr 15, 2026


nit: I guess the `delete` verb is missing here. The controller might need it for edge cases (e.g. snapshot recreation if the ownerReference is wrong, or test cleanup).
The SSA `FieldOwner("trainer")` pattern used elsewhere in the controller also benefits from having `delete` available. WDYT?

Contributor Author

@robert-bell robert-bell Apr 15, 2026


Good spot, though it looks like the other resources don't have delete permissions and we should be consistent. I guess we technically don't need delete because the API server's garbage collector deletes the resources for us?
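For context, the verbs under discussion would land in a rule along these lines (a sketch; the API group and resource plural are assumptions derived from the CRD name):

```yaml
rules:
  - apiGroups: ["trainer.kubeflow.org"]
    resources: ["trainingruntimesnapshots"]
    verbs: ["create", "get", "list", "patch", "update", "watch"]
    # no "delete": snapshots are owned by their TrainJob and are removed
    # by the API server's garbage collector via ownerReferences
```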


We propose making the TrainJob only look up the runtime configuration on first reconciliation and instead store a "snapshot" of the runtime configuration in a separate object:

* create a new namespaced custom resource `TrainingRuntimeSnapshot` with the same API as the `TrainingRuntime` resource. This is an internal resource and should only be created or updated by the trainer controller. Each `TrainJob` will have one `TrainingRuntimeSnapshot` with the same name and namespace as the `TrainJob`.
Member


Using the same name as the TrainJob is clean, but what happens if a TrainingRuntimeSnapshot with that name already exists when the controller tries to create it?

Contributor Author


Good thought, though I think it shouldn't happen given the snapshot is internal and should only be created by this controller. For there to be a clash under normal operation, I guess there'd need to be two different TrainJobs with the same name and namespace? Could it happen if a trainjob is deleted and quickly recreated with the same name but different config?

Btw - this is consistent with the pattern used elsewhere, e.g. mpi secrets are named <trainjob-name>-mpi-ssh-auth.


### API and RBAC changes

The `TrainingRuntimeSnapshot` will have the following API:
Member


Following up on recent issues with Kueue: #2599 (comment)
The current TrainJob controller reconciles suspended jobs - it calls reconcileObjects for any non-finished TrainJob. IIUC, "snapshot on first reconciliation" means a suspended TrainJob gets snapshotted immediately on creation, before an external controller (e.g. the Kueue integration) can patch spec.runtimeRef?

Contributor Author


Yes, this is something I'm wanting to discuss in today's community call.

cc. @VassilisVassiliadis

Contributor

@VassilisVassiliadis VassilisVassiliadis Apr 15, 2026


AFAIK the runtimeRef field is currently immutable so I think we can break this down into different questions:

  1. how do we decouple the lifecycle of a RuntimeTraining (RT)/ClusterRuntimeTraining (CRT) object from the TrainJob objects that reference them?
  2. how do we allow external controllers (e.g. Kueue) or users to patch "important" fields in a TrainJob ?
  3. do we want to allow mutating the runtimeRef field?

For example, if the goal is to allow starting a TrainJob using:

  • all options from a RT/CRT object, BUT
  • with a different image

There's several ways to do this.
One is:

  1. create a new RT/CRT object which is a copy of the old one but with an updated image
  2. reference it

Another is:

  1. Reference the "old" RT/CRT object,
  2. override the image using the runtimePatches API (KEP-2170: Kubeflow Trainer V2 API #2170 does not support mutating this field right now)

If we treat a RT/CRT as the "lowest priority blueprint" for a TrainJob and allow the TrainJob itself, or even external controllers/users, to override the values of fields that exist in this "blueprint", then this KEP is a valid answer to the question:

  1. how do we decouple the lifecycle of a RuntimeTraining (RT)/ClusterRuntimeTraining (CRT) object from the TrainJob objects that reference them?

Technically, we still need to lift the immutability of the spec.runtimeRef field so that it points to the Snapshot, but we can do it in a way where the TrainJob webhook only allows changing the value from a RT/CRT to a RT/CRT-Snapshot. This way it's immutable except during first reconciliation.
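That webhook rule could look roughly like this (a sketch with simplified, hypothetical types; the real webhook operates on admission requests for the TrainJob API):

```go
package main

import "fmt"

type RuntimeRef struct{ Kind, Name string }

// validateRuntimeRefChange allows runtimeRef to change only once: from a
// (Cluster)TrainingRuntime to the controller-created TrainingRuntimeSnapshot.
// Any other mutation is rejected, preserving effective immutability.
func validateRuntimeRefChange(old, updated RuntimeRef) error {
	if old == updated {
		return nil // no change
	}
	fromRuntime := old.Kind == "TrainingRuntime" || old.Kind == "ClusterTrainingRuntime"
	toSnapshot := updated.Kind == "TrainingRuntimeSnapshot"
	if fromRuntime && toSnapshot {
		return nil // the one permitted transition, on first reconciliation
	}
	return fmt.Errorf("spec.runtimeRef is immutable (attempted %s/%s -> %s/%s)",
		old.Kind, old.Name, updated.Kind, updated.Name)
}

func main() {
	err := validateRuntimeRefChange(
		RuntimeRef{Kind: "ClusterTrainingRuntime", Name: "pytorch-default"},
		RuntimeRef{Kind: "TrainingRuntimeSnapshot", Name: "job-a"},
	)
	fmt.Println(err == nil) // prints "true": runtime -> snapshot is allowed
}
```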

P.S. Some more context:

The above suggestion implies the following "blueprint" priority:

  • highest priority: The runtimePatches api (this overrides everything else below)
  • middle priority: the remaining fields of TrainJob (e.g. trainjob.spec.trainer, etc)
  • lowest priority: the RT/CRT fields
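The priority ordering above amounts to a layered merge, sketched here with plain maps (an illustration only; real merging would use strategic merge patch over the API types):

```go
package main

import "fmt"

// layeredMerge applies layers lowest-priority first; later layers override.
func layeredMerge(layers ...map[string]string) map[string]string {
	out := map[string]string{}
	for _, layer := range layers {
		for k, v := range layer {
			out[k] = v
		}
	}
	return out
}

func main() {
	runtime := map[string]string{"image": "pytorch:2.0", "replicas": "4"} // RT/CRT blueprint
	trainJob := map[string]string{"replicas": "8"}                        // TrainJob fields
	patches := map[string]string{"image": "pytorch:2.1"}                  // runtimePatches
	fmt.Println(layeredMerge(runtime, trainJob, patches))
}
```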

Member


+1 to @VassilisVassiliadis's framing of breaking this into three separate questions. For this KEP specifically, I think the cleanest path forward is: snapshot only when spec.suspend transitions from true to false (or on first reconciliation if already unsuspended). This gives external controllers the window they need to modify things while suspended, and the snapshot captures the final state. Would that work for both your use cases?
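The trigger condition suggested here reduces to a small predicate (a hypothetical helper, simplified to booleans):

```go
package main

import "fmt"

// shouldSnapshot reports whether the controller should capture the runtime
// snapshot now: only once, and only while the job is not suspended, so
// external controllers can still patch the job during the suspended window.
func shouldSnapshot(isSuspended, snapshotExists bool) bool {
	if snapshotExists {
		return false // already captured; never re-snapshot
	}
	return !isSuspended
}

func main() {
	fmt.Println(shouldSnapshot(false, false)) // unsuspended, no snapshot -> true
	fmt.Println(shouldSnapshot(true, false))  // still suspended -> false
	fmt.Println(shouldSnapshot(false, true))  // snapshot already exists -> false
}
```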

Contributor


I think this should work.

Contributor Author


Thanks @VassilisVassiliadis — the breakdown into separate questions is very helpful.

For this KEP, I suggest we focus scope on the first question:

how do we decouple the lifecycle of a RuntimeTraining (RT)/ClusterRuntimeTraining (CRT) object from the TrainJob objects that reference them?

The other two:

how do we allow external controllers (e.g. Kueue) or users to patch "important" fields in a TrainJob?

do we want to allow mutating the runtimeRef field?

seem broader in scope and likely need more dedicated design, given their impact on API semantics, maintainability, and testing. It may be better to address those in a follow-up KEP or directly as part of your KEP.

Approaches like snapshot-on-unsuspend could be explored as part of that work, though we should also evaluate the runtimePatches API approach.

Importantly, I don't think this proposal blocks those future directions - it preserves the current immutability behaviour and just decouples the job from the runtime.

@VassilisVassiliadis if we limit scope to the first question, is there anything in the design that would impact the controller-driven workflow you're thinking of, or introduce constraints we should account for now?

Contributor


@robert-bell I agree with everything you said.

Especially:

For this KEP, I suggest we focus scope on the first question: how do we decouple the lifecycle of a RuntimeTraining (RT)/ClusterRuntimeTraining (CRT) object from the TrainJob objects that reference them?

And:

[the other 2 questions] seem broader in scope and likely need more dedicated design, given their impact on API semantics, maintainability, and testing. It may be better to address those in a follow-up KEP or directly as part of your kep.

I think this KEP is a valid solution to the "1st question" without introducing any blockers to finding answers for the other 2 questions.

The one thing that's hard to give my opinion on is the introduction of the TrainingRuntimeSnapshot CRD. That is something the core maintainers should comment on because they have a better view of Trainer than I do. My 2 cents is that this is fine because this proposed approach addresses an existing problem (lifecycles of RT/CRT tied to TrainJob objects) in a way that's better than the alternatives and since TrainingRuntimeSnapshot is an "internal" CRD users won't interact with it and thus they won't mind.

@andreyvelich what do you think ?

@robert-bell
Contributor Author

Thanks @abhijeet-dhumal - I've noted the divergence from the original v2 KEP in the summary; you're right, it needs calling out clearly.

robert-bell and others added 5 commits April 21, 2026 16:40
Proposes allowing TrainingRuntimes and ClusterTrainingRuntimes to be
mutable by introducing TrainingRuntimeSnapshot resources. TrainJobs
snapshot their referenced runtime configuration on first reconciliation,
decoupling job execution from runtime changes.

Key changes:
- New TrainingRuntimeSnapshot CRD to store point-in-time runtime config
- Remove finalizers from runtimes (no longer needed)
- Remove TrainJobWatcher interface and boilerplate
- Automatic migration for existing TrainJobs on upgrade

Addresses issue kubeflow#2599

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Rob Bell <robell@redhat.com>
Improvements to the mutable runtimes KEP:
- Clarify operational problems with finalizers (orphaned finalizers,
  namespace deletion issues)
- Better explain why runtime updates are risky (fetched on every
  reconciliation, no design-level guarantees)
- Add concrete examples of runtime proliferation (pytorch-2.0, pytorch-2.1)
- Add RBAC section showing required permissions
- Improve summary to clearly state API change (new CRD)
- Clarify test cases for migration scenarios

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Rob Bell <robell@redhat.com>
- Fix relative link to include README.md in Trainer v2 design reference
- Remove extra space in Goals heading
- Change 'Training Job' to 'TrainJobs' for consistency
- Fix 'lookup' to 'look up' (correct verb form)
- Add 'create' verb to RBAC permissions for TrainingRuntimeSnapshot

Signed-off-by: Rob Bell <robell@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Rob Bell <robell@redhat.com>
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Signed-off-by: Rob Bell <robell@redhat.com>
@robert-bell robert-bell force-pushed the kep-mutable-runtimes branch from 578dd50 to caae12a on April 21, 2026 15:40