feat(docs): proposal for adding TTLSecondsAfterFinished and ActiveDeadlineSeconds fields to TrainJob CRD #3068
XploY04 wants to merge 11 commits into kubeflow:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files. The full list of commands accepted by this bot can be found here.
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification!
Pull request overview
This PR introduces a comprehensive design proposal (KEP-style document) for adding TTL-based automatic cleanup and runtime deadline enforcement to the TrainJob CRD. The proposal addresses resource management issues by enabling automatic deletion of finished jobs and preventing runaway training workloads.
Key Changes
- Proposes adding `TTLSecondsAfterFinished` field for automatic deletion of completed TrainJobs
- Proposes adding `ActiveDeadlineSeconds` field to enforce maximum runtime limits
- Includes a detailed implementation plan, test strategy, production readiness considerations, and upgrade/downgrade procedures
…dlineSeconds fields to TrainJob CRD Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Hey @andreyvelich
Added new fields to TrainJobStatus for resolved TTL and deadline values. Updated handling for clock skew and TrainingRuntime deletion scenarios. Signed-off-by: Yash Agarwal <2004agarwalyash@gmail.com>
Pull Request Test Coverage Report for Build 22064803787
💛 - Coveralls
…SecondsAfterFinished validation, and remove proposed status fields, SDK changes, and metrics. Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
andreyvelich left a comment:
Sorry for the late reply @XploY04!
Overall, looks great, I left a few thoughts.
- Expose `TTLSecondsAfterFinished` in the SDK (this is platform admin controlled)
- Automatically migrate existing TrainJobs to use new defaults
- Provide per-namespace TTL overrides
It would be helpful if you could add some user stories explaining why we want to add ActiveDeadlineSeconds to TrainJob and TTLSecondsAfterFinished to Runtime.
Ref: https://github.com/kubeflow/trainer/tree/master/docs/proposals/2442-jax-runtime-trainer-v2#user-stories
```go
// +kubebuilder:validation:Minimum=0
TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`
```
```go
// ActiveDeadlineSeconds specifies the default maximum runtime for TrainJobs
// using this runtime. Individual TrainJobs can override this value by setting
// their own ActiveDeadlineSeconds.
// +optional
// +kubebuilder:validation:Minimum=1
ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty"`
```
@XploY04 I would suggest removing activeDeadlineSeconds from the Runtime spec initially, and telling users to configure the timeout in trainJob.spec directly.
Once we get feedback that users want to configure the timeout in the Runtime for all TrainJobs, we can extend it easily.
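For illustration, a minimal sketch of what configuring the timeout in trainJob.spec directly could look like (the runtime name and exact field placement are assumptions, not a finalized API):

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: my-train-job
spec:
  runtimeRef:
    name: torch-distributed   # hypothetical runtime name
  # Fail the TrainJob if it runs longer than 1 hour
  activeDeadlineSeconds: 3600
```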
Add new condition reason in `pkg/apis/trainer/v1alpha1/trainjob_types.go`:
```go
const (
	// TrainJobDeadlineExceededReason is used when ActiveDeadlineSeconds is exceeded
	TrainJobDeadlineExceededReason string = "DeadlineExceeded"
)
```
This should be set for the Failed condition in TrainJob, right?
Like in Job: https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup
| Field | TrainJob Value | Runtime Value | Effective Value |
|-------|----------------|---------------|-----------------|
| `ActiveDeadlineSeconds` | Set | Set | **TrainJob value** (override) |
| `ActiveDeadlineSeconds` | Set | Unset | TrainJob value |
| `ActiveDeadlineSeconds` | Unset | Set | Runtime value (default) |
| `ActiveDeadlineSeconds` | Unset | Unset | No deadline enforced |
| `TTLSecondsAfterFinished` | N/A | Set | Runtime value |
| `TTLSecondsAfterFinished` | N/A | Unset | No TTL cleanup |
You don’t need to include this table. Simply state that values defined in TrainJob take precedence over those specified in Runtime.
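That precedence rule is small enough to capture in a few lines. A hedged sketch (the helper name is assumed, not from the proposal):

```go
package main

import "fmt"

// resolveActiveDeadline illustrates the agreed precedence: a value set on the
// TrainJob always wins over the Runtime default; if neither is set, the result
// is nil and no deadline is enforced.
func resolveActiveDeadline(trainJobVal, runtimeVal *int64) *int64 {
	if trainJobVal != nil {
		return trainJobVal
	}
	return runtimeVal
}

func main() {
	jobDeadline := int64(3600)
	runtimeDeadline := int64(28800)
	fmt.Println(*resolveActiveDeadline(&jobDeadline, &runtimeDeadline)) // TrainJob overrides
	fmt.Println(*resolveActiveDeadline(nil, &runtimeDeadline))          // Runtime default applies
	fmt.Println(resolveActiveDeadline(nil, nil))                        // neither set: nil, no deadline
}
```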
```
# Uses runtime defaults: 8-hour deadline, 24-hour TTL
```

**TrainJob Overriding Deadline (Data Scientist):**
Could you also add a simple example with the Kubeflow SDK and train() API where AI practitioners can set a timeout:

```python
TrainerClient().train(
    trainer=CustomTrainer(
        func=get_torch_dist,
        num_nodes=3,
    ),
    initializer=Initializer(
        model=HuggingFaceDatasetInitializer(storage_uri="hf://qwen3.2-instruct")
    ),
    timeout=500,
)
```

cc @kubeflow/kubeflow-sdk-team
Sure, I will add this, and I can also take this up after the implementation is completed here.
### Implementation Overview

**Controller Changes** (`pkg/controller/trainjob_controller.go`):
@tenzen-y @XploY04 Do we need to implement any of this functionality in the runtime framework?
As of now we use Info and PodSets to merge parameters: https://github.com/kubeflow/trainer/blob/master/pkg/runtime/runtime.go#L36
I don't think any of this functionality is needed in the runtime framework, because:
- TTL: Must be handled at the TrainJob level because we need to delete the TrainJob object itself. Setting TTL on the JobSet would only delete the JobSet, leaving orphaned TrainJobs in etcd.
- Deadline: We could add the deadline as a secondary enforcement in the runtime framework, but the controller needs to set the `Failed` condition with `Reason: DeadlineExceeded` on the TrainJob, which can't be achieved from Job-level activeDeadlineSeconds alone.

So, I would not recommend passing any of this to the runtime framework for now.
Sounds good, we can define the logic in the TrainJob controller directly.
```go
// +optional
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="field is immutable"
ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty"`
```
As an alternative, we could consider using the TemplateOverrides or Overrides API to update this value in the Runtime: #3199
But that would force us to have something like this:

```go
type Override struct {
	Manager string `json:"manager,omitempty"`
	// runtimeSpecOverrides defines overrides that are applied to the Runtime spec
	RuntimeSpecOverrides []RuntimeSpecOverrides `json:"runtimeSpecOverrides,omitempty"`
	// jobTemplateOverrides defines overrides that are applied to the JobTemplateSpec
	JobTemplateOverrides []JobTemplateOverride `json:"jobTemplateOverrides,omitempty"`
	// podTemplateOverrides defines overrides that are applied to the PodTemplateSpec
	PodTemplateOverrides []PodTemplateOverride `json:"podTemplateOverrides,omitempty"`
}
```

Not sure if that makes sense, compared to a simple trainJob.spec.activeDeadlineSeconds.
That looks overly complicated at first glance.
True. An alternative to introducing RuntimeOverrides into the Override API would be to duplicate the relevant parameters directly in the TrainJob spec.
For example, if we decide that certain parameters should be overridable at the TrainJob level, we could define a dedicated field such as trainJob.spec.workloadSpec or trainJob.spec.podGroupPolicy.
@tenzen-y What do you think?
1. Controller-runtime triggers an initial sync, reconciling all TrainJobs
2. For each TrainJob, deadlines and TTL are recalculated from:
   - The last resume time (or `metadata.creationTimestamp` if never suspended) for deadline calculation
   - `LastTransitionTime` of the `Complete` or `Failed` condition for TTL calculation
   - The referenced TrainingRuntime (protected from deletion via the `ResourceInUse` finalizer)
3. If a deadline/TTL already expired during downtime, action is taken immediately
4. Otherwise, appropriate requeue times are set

This design ensures no TrainJobs are "forgotten" after a controller restart.
Do we know if Job has similar semantics?
cc @kannon92
The K8s Job controller has the same semantics:
- For deadlines, on every sync the controller calls `pastActiveDeadline()`, which recalculates the deadline from the persisted `job.status.startTime`:

```go
// From syncJob():
} else if jm.pastActiveDeadline(&job) {
	jobCtx.finishedCondition = jm.newFailureCondition(
		batch.JobReasonDeadlineExceeded,
		"Job was active longer than specified deadline",
	)
} else if job.Spec.ActiveDeadlineSeconds != nil && !jobSuspended(&job) {
	syncDuration := time.Duration(*job.Spec.ActiveDeadlineSeconds)*time.Second -
		jm.clock.Since(job.Status.StartTime.Time)
	jm.queue.AddAfter(key, syncDuration)
}
```

- For TTL, the `ttl-after-finished` controller re-lists all Jobs on startup and recalculates expiry from the persisted `completionTime + ttlSecondsAfterFinished`. If the TTL expired during downtime, deletion happens immediately.

Our proposal follows this exact same pattern, using persisted timestamps (`lastResumeTime`, condition `LastTransitionTime`) to recalculate on restart, with no in-memory timer state.
Let me know if any changes are required here.
- End-to-end TTL deletion from Runtime default
- End-to-end deadline from Runtime default
- TrainJob deadline overriding Runtime deadline
- Cascade deletion of owned resources
Let's also add integration tests for suspended TrainJobs
This design ensures no TrainJobs are "forgotten" after a controller restart.

**Validation:**
Do we need to validate that the deadline and TTL are not set in the JobSet and Job?
Yes, I think we should add it, because without it both levels might have different values that could cause conflicts. I will add this in the proposal.
Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Hi @andreyvelich
```python
initializer=Initializer(
    model=HuggingFaceDatasetInitializer(storage_uri="hf://qwen3.2-instruct")
),
timeout=28800,  # 8 hours max
```
`timeout` seems too generic; it may be useful to be more specific.
Should I change it to `active_deadline_seconds`?
Yeah, it looks like we agreed to have active_deadline_seconds for Katib SDK previously: kubeflow/katib#2568 (comment)
Ok, so I will change it to active_deadline_seconds.
That looks good / more specific to me. There might be other types of timeouts in the future.
Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
/ok-to-test
What this PR does / why we need it:
Fixes #2899
PR #3065