feat(api): Add terminationGracePeriodSeconds to PodSpecPatch in TrainJob #3324
krishdef7 wants to merge 1 commit into kubeflow:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of the changed files.
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Feel free to ask questions in the comments if you need any help or clarification!
Force-pushed from 5d3b20c to 73143f5
Pull request overview
This PR extends the TrainJob RuntimePatches API to allow per-TrainJob overrides of terminationGracePeriodSeconds via PodSpecPatch, enabling longer graceful shutdowns for workloads like PyTorch Elastic that need time to checkpoint on SIGTERM.
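For illustration, a per-TrainJob override using the new field might look like the sketch below. The exact nesting of `PodSpecPatch` under the TrainJob spec (`runtimePatches.podSpecPatch` here) is an assumption based on the PR description, not the authoritative schema; consult the generated CRD for the real layout.

```yaml
apiVersion: trainer.kubeflow.org/v1alpha1
kind: TrainJob
metadata:
  name: llama-finetune
spec:
  runtimeRef:
    name: torch-distributed
  # Hypothetical field path: PodSpecPatch exposed via RuntimePatches,
  # as described in this PR.
  runtimePatches:
    podSpecPatch:
      # Allow up to 10 minutes for JIT checkpointing on SIGTERM.
      terminationGracePeriodSeconds: 600
```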
Changes:
- Add `TerminationGracePeriodSeconds *int64` (Minimum=0) to `PodSpecPatch` and propagate it through generated Go/OpenAPI/CRD artifacts.
- Update the Python client model to include `terminationGracePeriodSeconds`.
- Add integration and webhook coverage to validate schema enforcement and verify the field propagates into JobSet pod specs.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
Summary per file:
| File | Description |
|---|---|
| `pkg/apis/trainer/v1alpha1/trainjob_types.go` | Adds the new `PodSpecPatch` field with kubebuilder minimum validation. |
| `manifests/base/crds/trainer.kubeflow.org_trainjobs.yaml` | Exposes the field in the TrainJob CRD schema (min 0). |
| `pkg/client/applyconfiguration/trainer/v1alpha1/podspecpatch.go` | Extends server-side apply configuration with a `WithTerminationGracePeriodSeconds` builder. |
| `pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go` | Regenerates deepcopy logic to include the new pointer field. |
| `pkg/apis/trainer/v1alpha1/zz_generated.openapi.go` | Regenerates the OpenAPI schema to include the new field. |
| `api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_pod_spec_patch.py` | Updates the Python client model to support the new field. |
| `pkg/util/testing/wrapper.go` | Adds a `JobSetWrapper` helper to set `terminationGracePeriodSeconds` in expected JobSet pod specs. |
| `test/integration/webhooks/trainjob_test.go` | Adds webhook validation tests for valid and invalid (negative) values. |
| `test/integration/controller/trainjob_controller_test.go` | Adds integration coverage ensuring the patch propagates into the created JobSet pod specs. |
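The webhook tests listed above exercise the `Minimum=0` bound. Schematically, a patch like the following would be rejected by the CRD schema at admission (the field path is hypothetical, inferred from the PR description):

```yaml
# Invalid: terminationGracePeriodSeconds must be >= 0, so CRD
# validation rejects this patch at admission time.
runtimePatches:
  podSpecPatch:
    terminationGracePeriodSeconds: -5
```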
Force-pushed from d1848ce to 8a3aa76
Adds a terminationGracePeriodSeconds field to PodSpecPatch so users can configure the pod termination grace period per TrainJob via RuntimePatches.

This is needed for distributed training with PyTorch Elastic (torchrun), where large models (70B+ parameters) require more than the default 30s to complete JIT checkpointing before SIGKILL on node drain or TrainJob pause.

No changes to merge logic in trainingruntime.go are required, since the existing StrategicMergePatch applied at the batchv1.JobTemplateSpec level already handles this field automatically.

Closes kubeflow#3285

Signed-off-by: krishdef7 <gargkrish06@gmail.com>
Force-pushed from 8a3aa76 to 6095a15
What this PR does / why we need it
Adds `terminationGracePeriodSeconds` to `PodSpecPatch` so users can configure the pod termination grace period per TrainJob via `RuntimePatches`, without requiring cluster-admin access to modify the TrainingRuntime.

This is needed for distributed training with PyTorch Elastic (`torchrun`). When a TrainJob is paused or a node is drained, the kubelet sends SIGTERM to the pod. The TorchElastic agent propagates this to worker processes, which perform JIT checkpointing before shutdown. For large models (70B+ parameters), the default 30-second grace period is insufficient to complete checkpoint saves to disk or remote storage (S3/PVC).

Users need to set `terminationGracePeriodSeconds` alongside `TORCH_ELASTIC_SHUTDOWN_TIMEOUT` (configurable since pytorch/pytorch#172596) to give workers enough time to save state before SIGKILL.

Previously, this field could only be set in the TrainingRuntime template, affecting all jobs using that runtime. This PR exposes it as a per-TrainJob override, consistent with other `PodSpecPatch` fields like `nodeSelector` and `tolerations`.

No changes to merge logic in `trainingruntime.go` are required: the existing `StrategicMergePatch` applied at the `batchv1.JobTemplateSpec` level already handles this field automatically.

Which issue(s) this PR fixes
Fixes #3285
Changes
- Add `TerminationGracePeriodSeconds *int64` to `PodSpecPatch` with the `// +kubebuilder:validation:Minimum=0` marker
- Regenerate `zz_generated.deepcopy.go`, `zz_generated.openapi.go`, the `podspecpatch.go` applyconfiguration, the Python client model, and the CRD
- Add a `TerminationGracePeriodSeconds` helper to `JobSetWrapper` in test utilities

Checklist