Commit e28c5a0

elastic api and minor changes
1 parent c0255aa commit e28c5a0

1 file changed

Lines changed: 11 additions & 11 deletions

_posts/2026-3-20-introducing-kubeflow-trainer-v2.2.md

@@ -148,7 +148,7 @@ You can learn more about this in our [Flux Guide](https://www.kubeflow.org/docs/
148148
149149
Previously, TrainJob resources persisted in the cluster indefinitely after completion unless manually removed, which led to etcd bloat and resource contention, with no automatic garbage collection. A job could also get stuck or run indefinitely, wasting CPU/GPU capacity and reducing cluster efficiency. In v2.2, Kubeflow Trainer adds support for the ActiveDeadlineSeconds API in TrainJob. This field lets users set a hard timeout (in seconds) for a TrainJob’s active execution time. When the deadline is exceeded, Trainer marks the TrainJob as Failed (reason: `DeadlineExceeded`), terminates the running workload, and deletes the underlying JobSet.
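
The deadline behavior described above can be sketched in a few lines of Python. This is an illustration of the semantics only, not Trainer's controller code: the `JobStatus` type and the `evaluate_deadline` function are hypothetical names, and treating the deadline as exceeded only when elapsed time is strictly greater than the limit is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class JobStatus:
    """Hypothetical, simplified view of a TrainJob's state."""
    phase: str
    reason: Optional[str] = None


def evaluate_deadline(
    started_at: datetime,
    active_deadline_seconds: Optional[int],
    now: datetime,
) -> JobStatus:
    """Return Failed/DeadlineExceeded once active time passes the limit."""
    if active_deadline_seconds is None:
        # No deadline configured: the job keeps running.
        return JobStatus(phase="Running")
    elapsed = (now - started_at).total_seconds()
    # Assumption: "exceeded" means strictly greater than the limit.
    if elapsed > active_deadline_seconds:
        return JobStatus(phase="Failed", reason="DeadlineExceeded")
    return JobStatus(phase="Running")


start = datetime(2026, 1, 1, tzinfo=timezone.utc)
late = evaluate_deadline(start, 3600, start + timedelta(seconds=3601))
early = evaluate_deadline(start, 3600, start + timedelta(seconds=10))
```

In this sketch, `late` carries the `DeadlineExceeded` reason while `early` is still running. The real controller additionally terminates the workload and deletes the JobSet, which is out of scope here.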
150150
151-
### Technical example:
151+
### Get Started
152152
153153
There are a couple of ways to specify the timeout for a job. The first is to modify the TrainJob manifest directly:
154154
@@ -166,13 +166,13 @@ trainer:
166166
numNodes: 2
167167
```
168168
169-
## Explicit Ownership for TrainJobs with RuntimePatches API
169+
## RuntimePatches API to override TrainJob defaults
170170
171171
In many distributed learning environments, multiple controllers can interact with the same TrainJob manifest, so it is important to preserve ownership boundaries. The new RuntimePatches API replaces PodTemplateOverrides with a manager-keyed structure that makes explicit who applied what and when.
172172
173173
Each patch is scoped to a named manager and can target specific jobs or pods within the runtime, with both job-level and pod-level overrides supported. This means Kueue can inject node selectors and tolerations into the trainer pod without conflicting with another controller managing job-level metadata, and the full history of what was applied is preserved directly in the spec.
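
To make the manager-keyed idea concrete, here is a minimal Python sketch of patches indexed by their owning manager. It is not the RuntimePatches schema: the dictionary keys (`manager`, `job`, `pod`), the manager names, and the helper functions are all hypothetical, and rejecting duplicate manager entries is an assumption made for the illustration.

```python
from typing import Dict, List


def index_by_manager(patches: List[dict]) -> Dict[str, dict]:
    """Group patches by manager name so each manager owns one entry."""
    indexed: Dict[str, dict] = {}
    for patch in patches:
        manager = patch["manager"]
        if manager in indexed:
            # Assumption for this sketch: one entry per manager.
            raise ValueError(f"duplicate manager: {manager}")
        indexed[manager] = {
            "job": dict(patch.get("job", {})),  # job-level overrides
            "pod": dict(patch.get("pod", {})),  # pod-level overrides
        }
    return indexed


def owners_of(indexed: Dict[str, dict], level: str, field: str) -> List[str]:
    """List the managers that set a given field at the given level."""
    return sorted(m for m, entry in indexed.items() if field in entry[level])


# Hypothetical patches from two controllers touching different levels.
patches = [
    {"manager": "kueue", "pod": {"nodeSelector": {"pool": "gpu"}, "tolerations": []}},
    {"manager": "metadata-controller", "job": {"labels": {"team": "ml"}}},
]
indexed = index_by_manager(patches)
```

Because every override lives under its manager's own entry, `owners_of(indexed, "pod", "nodeSelector")` identifies the Kueue-style manager without inspecting the other controller's job-level metadata.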
174174
175-
### Technical example:
175+
### Get Started
176176
177177
In the new TrainJob manifest, every manager owns its own entry, and pod and job overrides are separate fields under that manager. Note that the manager field is **immutable** after creation.
178178
@@ -204,7 +204,7 @@ spec:
204204
205205
Note that the RuntimePatches API cannot be used to set environment variables for the node, dataset-initializer, or model-initializer containers, nor to override command, args, image, or resources on the trainer container.
206206
207-
For a more complete description of the API's structure, restrictions and use cases, check out the [RuntimePatches operator guide](https://www.kubeflow.org/docs/components/trainer/operator-guides/runtime-patches/#runtimepatches-overview).
207+
For a more complete description of the API's structure, restrictions and use cases, check out the [RuntimePatches Operator Guide](https://www.kubeflow.org/docs/components/trainer/operator-guides/runtime-patches/#runtimepatches-overview).
208208
209209
210210
@@ -213,23 +213,23 @@ For a more complete description of the API's structure, restrictions and use cas
213213
>PodTemplateOverrides has been removed in v2.2. If you’re currently using it in your TrainJob manifests, you’ll need to migrate to the RuntimePatches API.
214214
215215
216-
## Infrastructure & Breaking Changes
216+
## Breaking Changes
217217
218218
This release introduces a set of architectural improvements and breaking changes that lay the foundations for a more scalable and modularized Trainer. Please review the following when upgrading to Trainer v2.2:
219219
220-
### Required: Migrating to RuntimePatches API
220+
### Replace PodTemplateOverrides with RuntimePatches API
221221
222222
As mentioned above, PodTemplateOverrides has been replaced with RuntimePatches API to support manager-scoped customization and prevent conflicts when multiple controllers are patching the same TrainJob.
223223
224-
If you are using PodTemplateOverrides in your TrainJob manifests or SDK code, you will need to migrate to the manager-keyed RuntimePatches structure. See the [RuntimePatches](#customize-runtime-configs:-runtimepatchesapi) section above for the full API shape and examples.
224+
If you are using PodTemplateOverrides in your TrainJob manifests or SDK code, you will need to migrate to the manager-keyed RuntimePatches structure. See the [RuntimePatches Operator Guide](https://www.kubeflow.org/docs/components/trainer/operator-guides/runtime-patches/#runtimepatches-overview) and the [Options Reference](https://sdk.kubeflow.org/en/latest/train/options.html) for more information.
225225
226-
### Required: Remove numProcPerNode from Torch API
226+
### Remove numProcPerNode from the Torch MLPolicy API
227227
228228
The numProcPerNode field has been removed from the Torch MLPolicy. Process-per-node configuration is now handled directly through the container resources, so any TrainJob manifests or SDK calls that set numProcPerNode explicitly will need to be updated before upgrading to v2.2.
229229
230-
### Required: Remove ElasticPolicy API
230+
### Remove ElasticPolicy API
231231
232-
We no longer support the ElasticPolicy API from the MLPolicy as part of Trainer v2.2. If your TrainJobs rely on elastic training configuration through this API, you will need to migrate to the updated approach before upgrading.
232+
The ElasticPolicy API has been removed from MLPolicy in Trainer v2.2. Elastic training is not available in this release; we are actively working on a [redesigned implementation](https://github.com/kubeflow/trainer/issues/2903) for a future release. If your TrainJobs rely on elastic training configuration, please hold off on upgrading until that work lands.
233233
234234
### Some TrainJob API fields are now immutable
235235
@@ -278,5 +278,5 @@ The Kubeflow Trainer is built by and for the community. We welcome contributions
278278
* View the full [Changelog](https://github.com/kubeflow/trainer/blob/master/CHANGELOG.md).
279279
* Explore the [Kubeflow Trainer docs](https://www.kubeflow.org/docs/components/trainer/)
280280
281-
**Headed to [KubeCon](https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/)?** Stop by the Kubeflow booth to see these features in action 😸🧊\!\!
281+
**Headed to [KubeCon + CloudNativeCon 2026 EU](https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/)?** Stop by the Kubeflow booth to see these features in action 😸🧊\!\!
282282
