feat(api): Add terminationGracePeriodSeconds to PodSpecPatch in TrainJob #3324
krishdef7 wants to merge 1 commit into kubeflow:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of the changed files.
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Feel free to ask questions in the comments if you need any help or clarification!
Force-pushed from 5d3b20c to 73143f5
Pull request overview
This PR extends the TrainJob RuntimePatches API to allow per-TrainJob overrides of terminationGracePeriodSeconds via PodSpecPatch, enabling longer graceful shutdowns for workloads like PyTorch Elastic that need time to checkpoint on SIGTERM.
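For illustration, a per-TrainJob override using the new field might look like the sketch below. The exact nesting of `PodSpecPatch` under the TrainJob spec (`runtimePatches.podSpecPatch` here) is an assumption based on the PR description, not the authoritative schema; consult the generated CRD for the real layout.

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: llama-finetune
spec:
  runtimeRef:
    name: torch-distributed
  # Hypothetical field path: PodSpecPatch exposed via RuntimePatches,
  # as described in this PR.
  runtimePatches:
    podSpecPatch:
      # Allow up to 10 minutes for JIT checkpointing on SIGTERM.
      terminationGracePeriodSeconds: 600
```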
Changes:
- Add `TerminationGracePeriodSeconds *int64` (Minimum=0) to `PodSpecPatch` and propagate it through generated Go/OpenAPI/CRD artifacts.
- Update the Python client model to include `terminationGracePeriodSeconds`.
- Add integration and webhook coverage to validate schema enforcement and verify the field propagates into JobSet pod specs.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
Summary per file:
| File | Description |
|---|---|
| `pkg/apis/trainer/v1alpha1/trainjob_types.go` | Adds the new `PodSpecPatch` field with kubebuilder minimum validation. |
| `manifests/base/crds/trainer.kubeflow.org_trainjobs.yaml` | Exposes the field in the TrainJob CRD schema (min 0). |
| `pkg/client/applyconfiguration/trainer/v1alpha1/podspecpatch.go` | Extends server-side apply configuration with a `WithTerminationGracePeriodSeconds` builder. |
| `pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go` | Regenerates deepcopy logic to include the new pointer field. |
| `pkg/apis/trainer/v1alpha1/zz_generated.openapi.go` | Regenerates the OpenAPI schema to include the new field. |
| `api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_pod_spec_patch.py` | Updates the Python client model to support the new field. |
| `pkg/util/testing/wrapper.go` | Adds a `JobSetWrapper` helper to set `terminationGracePeriodSeconds` in expected JobSet pod specs. |
| `test/integration/webhooks/trainjob_test.go` | Adds webhook validation tests for valid and invalid (negative) values. |
| `test/integration/controller/trainjob_controller_test.go` | Adds integration coverage ensuring the patch propagates into the created JobSet pod specs. |
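The webhook tests listed above exercise the `Minimum=0` bound. Schematically, a patch like the following would be rejected by the CRD schema at admission (the field path is hypothetical, inferred from the PR description):

```yaml
# Invalid: terminationGracePeriodSeconds must be >= 0, so CRD
# validation rejects this patch at admission time.
runtimePatches:
  podSpecPatch:
    terminationGracePeriodSeconds: -5
```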
Force-pushed from d1848ce to 8a3aa76
Adds a terminationGracePeriodSeconds field to PodSpecPatch so users can configure the pod termination grace period per TrainJob via RuntimePatches.

This is needed for distributed training with PyTorch Elastic (torchrun), where large models (70B+ parameters) require more than the default 30s to complete JIT checkpointing before SIGKILL on node drain or TrainJob pause.

No changes to merge logic in trainingruntime.go are required, since the existing StrategicMergePatch applied at the batchv1.JobTemplateSpec level already handles this field automatically.

Closes kubeflow#3285

Signed-off-by: krishdef7 <gargkrish06@gmail.com>
Force-pushed from 8a3aa76 to 6095a15
What this PR does / why we need it
Adds `terminationGracePeriodSeconds` to `PodSpecPatch` so users can configure the pod termination grace period per TrainJob via `RuntimePatches`, without requiring cluster-admin access to modify the TrainingRuntime.

This is needed for distributed training with PyTorch Elastic (`torchrun`). When a TrainJob is paused or a node is drained, the kubelet sends SIGTERM to the pod. The TorchElastic agent propagates this to worker processes, which perform JIT checkpointing before shutdown. For large models (70B+ parameters), the default 30-second grace period is insufficient to complete checkpoint saves to disk or remote storage (S3/PVC).

Users need to set `terminationGracePeriodSeconds` alongside `TORCH_ELASTIC_SHUTDOWN_TIMEOUT` (configurable since pytorch/pytorch#172596) to give workers enough time to save state before SIGKILL.

Previously, this field could only be set in the TrainingRuntime template, affecting all jobs using that runtime. This PR exposes it as a per-TrainJob override, consistent with other `PodSpecPatch` fields like `nodeSelector` and `tolerations`.

No changes to merge logic in `trainingruntime.go` are required: the existing `StrategicMergePatch` applied at the `batchv1.JobTemplateSpec` level already handles this field automatically.

Which issue(s) this PR fixes
Fixes #3285
Changes
- Add `TerminationGracePeriodSeconds *int64` to `PodSpecPatch` with the `// +kubebuilder:validation:Minimum=0` marker
- Regenerate `zz_generated.deepcopy.go`, `zz_generated.openapi.go`, the `podspecpatch.go` applyconfiguration, the Python client model, and the CRD
- Add a `TerminationGracePeriodSeconds` helper to `JobSetWrapper` in test utilities

Checklist