feat(docs): proposal for adding TTLSecondsAfterFinished and ActiveDeadlineSeconds fields to TrainJob CRD #3068

Open
XploY04 wants to merge 11 commits into kubeflow:master from XploY04:proposal

Conversation

@XploY04 XploY04 commented Jan 5, 2026

What this PR does / why we need it:

Fixes #2899
PR #3065

Copilot AI review requested due to automatic review settings January 5, 2026 21:10
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a comprehensive design proposal (KEP-style document) for adding TTL-based automatic cleanup and runtime deadline enforcement to the TrainJob CRD. The proposal addresses resource management issues by enabling automatic deletion of finished jobs and preventing runaway training workloads.

Key Changes

  • Proposes adding TTLSecondsAfterFinished field for automatic deletion of completed TrainJobs
  • Proposes adding ActiveDeadlineSeconds field to enforce maximum runtime limits
  • Includes detailed implementation plan, test strategy, production readiness considerations, and upgrade/downgrade procedures

…dlineSeconds fields to TrainJob CRD

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Author

XploY04 commented Jan 22, 2026

Hey @andreyvelich
I have made the required changes. Please take a look at it.

Added new fields to TrainJobStatus for resolved TTL and deadline values. Updated handling for clock skew and TrainingRuntime deletion scenarios.

Signed-off-by: Yash Agarwal <2004agarwalyash@gmail.com>
coveralls commented Feb 10, 2026

Pull Request Test Coverage Report for Build 22064803787

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 56.026%

Totals Coverage Status
Change from base Build 22051165353: 0.0%
Covered Lines: 1390
Relevant Lines: 2481


…SecondsAfterFinished validation, and remove proposed status fields, SDK changes, and metrics.

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Member

andreyvelich left a comment

Sorry for the late reply @XploY04!
Overall, looks great, I left a few thoughts.

- Expose `TTLSecondsAfterFinished` in the SDK (this is platform admin controlled)
- Automatically migrate existing TrainJobs to use new defaults
- Provide per-namespace TTL overrides

Member

It would be helpful to add some user stories explaining why we want to add ActiveDeadlineSeconds to TrainJob and TTLSecondsAfterFinished to Runtime.
Ref: https://github.com/kubeflow/trainer/tree/master/docs/proposals/2442-jax-runtime-trainer-v2#user-stories

Author

Sure, I will add.

Comment on lines 75 to 83
// +kubebuilder:validation:Minimum=0
TTLSecondsAfterFinished *int32 `json:"ttlSecondsAfterFinished,omitempty"`

// ActiveDeadlineSeconds specifies the default maximum runtime for TrainJobs
// using this runtime. Individual TrainJobs can override this value by setting
// their own ActiveDeadlineSeconds.
// +optional
// +kubebuilder:validation:Minimum=1
ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty"`
Member

@XploY04 I would suggest removing activeDeadlineSeconds from the Runtime spec initially and telling users to configure the timeout in trainJob.spec directly.
Once we get feedback that users want to configure a timeout in the Runtime for all TrainJobs, we can extend it easily.

Comment on lines 89 to 95
Add new condition reason in `pkg/apis/trainer/v1alpha1/trainjob_types.go`:

```go
const (
	// TrainJobDeadlineExceededReason is used when ActiveDeadlineSeconds is exceeded
	TrainJobDeadlineExceededReason string = "DeadlineExceeded"
)
Member

This should be set for the Failed condition in TrainJob, right?
Like in Job: https://kubernetes.io/docs/concepts/workloads/controllers/job/#job-termination-and-cleanup

Author

Yes, I will mention that here.

Comment on lines 102 to 110

| Field | TrainJob Value | Runtime Value | Effective Value |
|-------|---------------|---------------|-----------------|
| `ActiveDeadlineSeconds` | Set | Set | **TrainJob value** (override) |
| `ActiveDeadlineSeconds` | Set | Unset | TrainJob value |
| `ActiveDeadlineSeconds` | Unset | Set | Runtime value (default) |
| `ActiveDeadlineSeconds` | Unset | Unset | No deadline enforced |
| `TTLSecondsAfterFinished` | N/A | Set | Runtime value |
| `TTLSecondsAfterFinished` | N/A | Unset | No TTL cleanup |
Member

You don’t need to include this table. Simply state that values defined in TrainJob take precedence over those specified in Runtime.

# Uses runtime defaults: 8-hour deadline, 24-hour TTL
```

**TrainJob Overriding Deadline (Data Scientist):**
Member

andreyvelich commented Feb 12, 2026

Could you also add a simple example with the Kubeflow SDK and the train() API where AI practitioners can set a timeout:

TrainerClient().train(
    trainer=CustomTrainer(
        func=get_torch_dist,
        num_nodes=3,
    ),
    initializer=Initializer(
        model=HuggingFaceDatasetInitializer(storage_uri="hf://qwen3.2-instruct")
    ),
    timeout=500
)

cc @kubeflow/kubeflow-sdk-team

Author

Sure, I will add this, and I can also take this up after the implementation is completed here.


### Implementation Overview

**Controller Changes** (`pkg/controller/trainjob_controller.go`):
Member

andreyvelich commented Feb 12, 2026

@tenzen-y @XploY04 Do we need to implement any of this functionality in the runtime framework?
As of now, we use Info and PodSets to merge parameters: https://github.com/kubeflow/trainer/blob/master/pkg/runtime/runtime.go#L36

Author

Hi @andreyvelich

I don't think any of this functionality is needed in the runtime framework, because:

  • TTL: Must be handled at the TrainJob level because we need to delete the TrainJob object itself. Setting a TTL on the JobSet would only delete the JobSet, leaving orphaned TrainJobs in etcd.
  • Deadline: We could add the deadline as a secondary enforcement in the runtime framework, but the controller needs to set the Failed condition with reason DeadlineExceeded on the TrainJob, which cannot be achieved with the Job-level activeDeadlineSeconds alone.

So I would not recommend passing either value to the runtime framework for now.
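
A minimal sketch of the deadline point above, using simplified stand-in types instead of the real TrainJob API and `metav1.Condition`. The names `Condition`, `setCondition`, and `markDeadlineExceeded` are hypothetical and only illustrate the idea; the `Failed` condition type and the `DeadlineExceeded` reason are the only parts taken from this thread.

```go
package main

import (
	"fmt"
	"time"
)

// Condition is a simplified stand-in for metav1.Condition (assumption for this sketch).
type Condition struct {
	Type               string
	Status             string
	Reason             string
	Message            string
	LastTransitionTime time.Time
}

const (
	// Condition type and reason discussed in the proposal.
	trainJobFailed                 = "Failed"
	trainJobDeadlineExceededReason = "DeadlineExceeded"
)

// setCondition inserts or updates a condition by Type. LastTransitionTime is
// kept when Status does not change, so a later TTL calculation based on that
// timestamp stays stable across repeated reconciles.
func setCondition(conds []Condition, c Condition) []Condition {
	for i := range conds {
		if conds[i].Type != c.Type {
			continue
		}
		if conds[i].Status == c.Status {
			c.LastTransitionTime = conds[i].LastTransitionTime
		}
		conds[i] = c
		return conds
	}
	return append(conds, c)
}

// markDeadlineExceeded records the Failed condition that the TrainJob
// controller would set once ActiveDeadlineSeconds has elapsed.
func markDeadlineExceeded(conds []Condition, now time.Time) []Condition {
	return setCondition(conds, Condition{
		Type:               trainJobFailed,
		Status:             "True",
		Reason:             trainJobDeadlineExceededReason,
		Message:            "TrainJob was active longer than the specified deadline",
		LastTransitionTime: now,
	})
}

func main() {
	conds := markDeadlineExceeded(nil, time.Now())
	fmt.Printf("%s=%s reason=%s\n", conds[0].Type, conds[0].Status, conds[0].Reason)
}
```

Because the condition lives on the TrainJob itself, a Job-level `activeDeadlineSeconds` could stop the pods but would not produce this status, which is the argument made above for keeping the logic in the controller.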

Member

Sounds good, we can define the logic in the TrainJob controller directly.

// +optional
// +kubebuilder:validation:Minimum=1
// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="field is immutable"
ActiveDeadlineSeconds *int64 `json:"activeDeadlineSeconds,omitempty"`
Member

As an alternative, we could consider using the TemplateOverrides or Overrides API to update this value in the Runtime: #3199
But that would force us to have something like this:

type Override struct {
	Manager string `json:"manager,omitempty"`

	// runtimeSpecOverrides defines overrides that are applied to the Runtime spec
	RuntimeSpecOverrides []RuntimeSpecOverrides `json:"runtimeSpecOverrides,omitempty"`

	// jobTemplateOverrides defines overrides that are applied to JobTemplateSpec
	JobTemplateOverrides []JobTemplateOverride `json:"jobTemplateOverrides,omitempty"`

	// podTemplateOverrides defines overrides that are applied to PodTemplateSpec
	PodTemplateOverrides []PodTemplateOverride `json:"podTemplateOverrides,omitempty"`
}

Not sure that makes sense compared to a simple trainJob.spec.activeDeadlineSeconds.

cc @mimowo @kaisoz

Contributor

That looks overly complicated at first glance.

Member

True. An alternative to introducing RuntimeOverrides into the Override API would be to duplicate the relevant parameters directly in the TrainJob spec.

For example, if we decide that certain parameters should be overridable at the TrainJob level, we could define a dedicated field such as trainJob.spec.workloadSpec or trainJob.spec.podGroupPolicy.
@tenzen-y What do you think?

Comment on lines +219 to +227
1. Controller-runtime triggers initial sync, reconciling all TrainJobs
2. For each TrainJob, deadlines and TTL are recalculated from:
- The last resume time (or `metadata.creationTimestamp` if never suspended) for deadline calculation
- `LastTransitionTime` of the `Complete` or `Failed` condition for TTL calculation
- The referenced TrainingRuntime (protected from deletion via the `ResourceInUse` finalizer)
3. If deadline/TTL already expired during downtime, action is taken immediately
4. Otherwise, appropriate requeue times are set

This design ensures no TrainJobs are "forgotten" after a controller restart.
Member

Do we know if Job has similar semantics?
cc @kannon92

Author

The K8s Job controller has the same semantics:

  • For deadlines, on every sync the controller calls pastActiveDeadline(), which recalculates the deadline from the persisted job.status.startTime:

https://github.com/kubernetes/kubernetes/blob/150247a304f2cb290b8db8036f9dcab938983fb1/pkg/controller/job/job_controller.go#L968-L974

// From syncJob():
} else if jm.pastActiveDeadline(&job) {
    jobCtx.finishedCondition = jm.newFailureCondition(
        batch.JobReasonDeadlineExceeded,
        "Job was active longer than specified deadline",
    )
} else if job.Spec.ActiveDeadlineSeconds != nil && !jobSuspended(&job) {
    syncDuration := time.Duration(*job.Spec.ActiveDeadlineSeconds)*time.Second - 
        jm.clock.Since(job.Status.StartTime.Time)
    jm.queue.AddAfter(key, syncDuration)
}
  • For TTL, the ttl-after-finished controller re-lists all Jobs on startup and recalculates expiry from the persisted finish time (the LastTransitionTime of the Complete or Failed condition) plus ttlSecondsAfterFinished. If the TTL expired during downtime, deletion happens immediately.

Our proposal follows this exact same pattern using persisted timestamps (lastResumeTime, condition LastTransitionTime) to recalculate on restart, with no in-memory timer state.
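
The restart behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the controller code from the proposal: the `persisted` struct and the `remaining` helper are hypothetical names, and the timestamps are plain `time.Time` values standing in for the fields read back from the API server.

```go
package main

import (
	"fmt"
	"time"
)

// persisted holds timestamps that survive a controller restart because they
// are stored on the object, not in memory (hypothetical struct for this sketch).
type persisted struct {
	LastResumeTime time.Time // or metadata.creationTimestamp if never suspended
	FinishedTime   time.Time // LastTransitionTime of the Complete or Failed condition
}

// remaining reports how long is left until start+seconds, measured at now.
// expired is true when that instant has been reached or passed, for example
// while the controller was down.
func remaining(start time.Time, seconds int64, now time.Time) (left time.Duration, expired bool) {
	left = start.Add(time.Duration(seconds) * time.Second).Sub(now)
	return left, left <= 0
}

func main() {
	now := time.Now()

	// A running TrainJob with an 8-hour deadline, resumed 2 hours ago.
	running := persisted{LastResumeTime: now.Add(-2 * time.Hour)}
	if left, expired := remaining(running.LastResumeTime, 8*3600, now); expired {
		fmt.Println("deadline exceeded: mark the TrainJob as Failed now")
	} else {
		fmt.Println("requeue the deadline check after", left)
	}

	// A finished TrainJob with a 24-hour TTL that finished 25 hours ago.
	finished := persisted{FinishedTime: now.Add(-25 * time.Hour)}
	if left, expired := remaining(finished.FinishedTime, 24*3600, now); expired {
		fmt.Println("TTL expired: delete the TrainJob now")
	} else {
		fmt.Println("requeue the TTL check after", left)
	}
}
```

Since both results depend only on stored timestamps and the current time, running the same calculation after a restart gives the same answer as an uninterrupted controller would.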

@kannon92 @andreyvelich

Let me know if any changes are required here.

Member

Awesome, thanks for checking @XploY04!

- End-to-end TTL deletion from Runtime default
- End-to-end deadline from Runtime default
- TrainJob deadline overriding Runtime deadline
- Cascade deletion of owned resources
Member

Let's also add integration tests for suspended TrainJobs

Author

Sure, I will add these.


This design ensures no TrainJobs are "forgotten" after a controller restart.

**Validation:**
Member

Do we need to validate that deadline and TTL are not set in JobSet and Job?

Author

Yes, I think we should add it, because without it both levels might have different values that could cause conflicts. I will add this in the proposal.
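
One possible shape for such a check, as a rough sketch only. The thread does not say how the proposal will word or place this validation, so the stand-in types `JobSetSpec`, `ReplicatedJob`, and `JobSpec`, the function `validateNoLifecycleFields`, and the error messages are all hypothetical simplifications of the real JobSet and Job APIs.

```go
package main

import "fmt"

// JobSpec is a stand-in for the Job-level fields of interest (assumption).
type JobSpec struct {
	ActiveDeadlineSeconds   *int64
	TTLSecondsAfterFinished *int32
}

// ReplicatedJob is a stand-in for one replicated Job template in a JobSet.
type ReplicatedJob struct {
	Name     string
	Template JobSpec
}

// JobSetSpec is a stand-in for the JobSet template embedded in a Runtime.
type JobSetSpec struct {
	TTLSecondsAfterFinished *int32
	ReplicatedJobs          []ReplicatedJob
}

// validateNoLifecycleFields rejects deadline and TTL settings below the
// TrainJob level, so that only the TrainJob controller enforces them.
func validateNoLifecycleFields(js JobSetSpec) []string {
	var errs []string
	if js.TTLSecondsAfterFinished != nil {
		errs = append(errs, "JobSet ttlSecondsAfterFinished must not be set")
	}
	for _, rj := range js.ReplicatedJobs {
		if rj.Template.ActiveDeadlineSeconds != nil {
			errs = append(errs, fmt.Sprintf("replicated job %q: activeDeadlineSeconds must not be set", rj.Name))
		}
		if rj.Template.TTLSecondsAfterFinished != nil {
			errs = append(errs, fmt.Sprintf("replicated job %q: ttlSecondsAfterFinished must not be set", rj.Name))
		}
	}
	return errs
}

func main() {
	deadline := int64(3600)
	spec := JobSetSpec{ReplicatedJobs: []ReplicatedJob{
		{Name: "node", Template: JobSpec{ActiveDeadlineSeconds: &deadline}},
	}}
	for _, e := range validateNoLifecycleFields(spec) {
		fmt.Println(e)
	}
}
```

Collecting every violation instead of returning on the first one lets a user fix a Runtime in a single pass.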

Signed-off-by: XploY04 <2004agarwalyash@gmail.com>
Author

XploY04 commented Feb 12, 2026

Hi @andreyvelich
I have made the required changes in the proposal, let me know if any other changes are required.

@XploY04 XploY04 requested a review from andreyvelich February 12, 2026 16:23
initializer=Initializer(
model=HuggingFaceDatasetInitializer(storage_uri="hf://qwen3.2-instruct")
),
timeout=28800, # 8 hours max
Contributor

timeout seems too generic, it may be useful to be more specific.

Author

Should I change it to active_deadline_seconds?

Member

Yeah, it looks like we agreed to have active_deadline_seconds for Katib SDK previously: kubeflow/katib#2568 (comment)

Author

Ok, so I will change it to active_deadline_seconds.

Contributor

That looks good / more specific to me. There might be other types of timeouts in the future.

@andreyvelich
Member

/ok-to-test
