Skip to content

feat(api): Add terminationGracePeriodSeconds to PodSpecPatch in TrainJob#3324

Open
krishdef7 wants to merge 1 commit intokubeflow:masterfrom
krishdef7:feat/termination-grace-period-patch
Open

feat(api): Add terminationGracePeriodSeconds to PodSpecPatch in TrainJob#3324
krishdef7 wants to merge 1 commit intokubeflow:masterfrom
krishdef7:feat/termination-grace-period-patch

Conversation

@krishdef7
Copy link
Contributor

What this PR does / why we need it

Adds terminationGracePeriodSeconds to PodSpecPatch so users can configure pod termination grace period per TrainJob via RuntimePatches, without requiring cluster-admin access to modify the TrainingRuntime.

This is needed for distributed training with PyTorch Elastic (torchrun). When a TrainJob is paused or a node is drained, Kubelet sends SIGTERM to the pod. The TorchElastic agent propagates this to worker processes, which perform JIT checkpointing before shutdown. For large models (70B+ parameters), the default 30-second grace period is insufficient to complete checkpoint saves to disk or remote storage (S3/PVC).

Users need to set terminationGracePeriodSeconds alongside TORCH_ELASTIC_SHUTDOWN_TIMEOUT (configurable since pytorch/pytorch#172596) to give workers enough time to save state before SIGKILL.

Previously, this field could only be set in the TrainingRuntime template, affecting all jobs using that runtime. This PR exposes it as a per-TrainJob override consistent with other PodSpecPatch fields like nodeSelector and tolerations.

No changes to merge logic in trainingruntime.go are required — the existing StrategicMergePatch applied at batchv1.JobTemplateSpec level already handles this field automatically.

Which issue(s) this PR fixes

Fixes #3285

Changes

  • Add TerminationGracePeriodSeconds *int64 to PodSpecPatch with // +kubebuilder:validation:Minimum=0 marker
  • Regenerate zz_generated.deepcopy.go, zz_generated.openapi.go, podspecpatch.go applyconfiguration, Python client model, and CRD
  • Add TerminationGracePeriodSeconds helper to JobSetWrapper in test utilities
  • Add integration test verifying propagation to JobSet pods
  • Add webhook tests for valid (300s) and invalid (-1) values

Checklist

  • Docs included if any changes are user facing

Copilot AI review requested due to automatic review settings March 13, 2026 10:55
@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions
Copy link

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR extends the TrainJob RuntimePatches API to allow per-TrainJob overrides of terminationGracePeriodSeconds via PodSpecPatch, enabling longer graceful shutdowns for workloads like PyTorch Elastic that need time to checkpoint on SIGTERM.

Changes:

  • Add TerminationGracePeriodSeconds *int64 (Minimum=0) to PodSpecPatch and propagate it through generated Go/OpenAPI/CRD artifacts.
  • Update the Python client model to include terminationGracePeriodSeconds.
  • Add integration + webhook coverage to validate schema enforcement and verify the field propagates into JobSet pod specs.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

Show a summary per file
File Description
pkg/apis/trainer/v1alpha1/trainjob_types.go Adds the new PodSpecPatch field with kubebuilder minimum validation.
manifests/base/crds/trainer.kubeflow.org_trainjobs.yaml Exposes the field in the TrainJob CRD schema (min 0).
pkg/client/applyconfiguration/trainer/v1alpha1/podspecpatch.go Extends server-side apply configuration with a WithTerminationGracePeriodSeconds builder.
pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go Regenerates deepcopy logic to include the new pointer field.
pkg/apis/trainer/v1alpha1/zz_generated.openapi.go Regenerates OpenAPI schema to include the new field.
api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_pod_spec_patch.py Updates the Python client model to support the new field.
pkg/util/testing/wrapper.go Adds a JobSetWrapper helper to set terminationGracePeriodSeconds in expected JobSet pod specs.
test/integration/webhooks/trainjob_test.go Adds webhook validation tests for valid and invalid (negative) values.
test/integration/controller/trainjob_controller_test.go Adds integration coverage ensuring the patch propagates into the created JobSet pod specs.

@krishdef7 krishdef7 force-pushed the feat/termination-grace-period-patch branch 4 times, most recently from d1848ce to 8a3aa76 Compare March 13, 2026 11:36
Adds terminationGracePeriodSeconds field to PodSpecPatch so users can
configure pod termination grace period per TrainJob via RuntimePatches.

This is needed for distributed training with PyTorch Elastic (torchrun)
where large models (70B+ parameters) require more than the default 30s
to complete JIT checkpointing before SIGKILL on node drain or TrainJob
pause.

No changes to merge logic in trainingruntime.go are required since the
existing StrategicMergePatch applied at batchv1.JobTemplateSpec level
already handles this field automatically.

Closes kubeflow#3285

Signed-off-by: krishdef7 <gargkrish06@gmail.com>
@krishdef7 krishdef7 force-pushed the feat/termination-grace-period-patch branch from 8a3aa76 to 6095a15 Compare March 14, 2026 05:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add terminationGracePeriodSeconds to PodSpecPatch in TrainJob

2 participants