_posts/2026-3-20-introducing-kubeflow-trainer-v2.2.md
+11 -11 (11 additions & 11 deletions)
@@ -148,7 +148,7 @@ You can learn more about this in our [Flux Guide](https://www.kubeflow.org/docs/
Previously, TrainJob resources persisted in the cluster indefinitely after completion unless manually removed, which led to Etcd bloat, resource contention and no automatic garbage collection. A job could also get stuck or run indefinitely, wasting CPU/GPU capacity and reducing cluster efficiency. In v2.2, Kubeflow Trainer adds support for ActiveDeadlineSeconds API in TrainJob. This field lets users set a hard timeout (in seconds) for a TrainJob’s active execution timeline. When the deadline is exceeded, Trainer marks the TrainJob as Failed (reason: `DeadlineExceeded`), terminates the running workload, and deletes the underlying JobSet.
-### Technical example:
+### Get Started
There are a couple of ways to specify the timeout limit of a job. The first is to modify the TrainJob manifest directly:
@@ -166,13 +166,13 @@ trainer:
numNodes: 2
```
-## Explicit Ownership for TrainJobs with RuntimePatches API
+## RuntimePatches API to override TrainJob defaults
In many distributed learning environments, multiple controllers can interact with the same TrainJob manifest, so preserving ownership boundaries is important. The new RuntimePatches API replaces PodTemplateOverrides with a manager-keyed structure that makes explicit who applied what and when.
Each patch is scoped to a named manager and can target specific jobs or pods within the runtime, with both job-level and pod-level overrides supported. This means Kueue can inject node selectors and tolerations into the trainer pod without conflicting with another controller managing job-level metadata, and the full history of what was applied is preserved directly in the spec.
-### Technical example:
+### Get Started
In the new TrainJob manifest, every manager owns its own entry, and pod and job overrides are separate fields under that manager. Note that your manager field will be **immutable** after creation.
@@ -204,7 +204,7 @@ spec:
Note that the RuntimePatches API cannot be used to set environment variables for the node, dataset-initializer, or model-initializer containers, nor to override command, args, image, or resources on the trainer container.
-For a more complete description of the API's structure, restrictions and use cases, check out the [RuntimePatches operator guide](https://www.kubeflow.org/docs/components/trainer/operator-guides/runtime-patches/#runtimepatches-overview).
+For a more complete description of the API's structure, restrictions and use cases, check out the [RuntimePatches Operator Guide](https://www.kubeflow.org/docs/components/trainer/operator-guides/runtime-patches/#runtimepatches-overview).
@@ -213,23 +213,23 @@ For a more complete description of the API's structure, restrictions and use cas
>PodTemplateOverrides has been removed in v2.2. If you’re currently using it in your TrainJob manifests, you’ll need to migrate to the RuntimePatches API.
-## Infrastructure & Breaking Changes
+## Breaking Changes
This release introduces a set of architectural improvements and breaking changes that lay the foundations for a more scalable and modularized Trainer. Please review the following when upgrading to Trainer v2.2:
-### Required: Migrating to RuntimePatches API
+### Replace PodTemplateOverrides with RuntimePatches API
As mentioned above, PodTemplateOverrides has been replaced with RuntimePatches API to support manager-scoped customization and prevent conflicts when multiple controllers are patching the same TrainJob.
-If you are using PodTemplateOverrides in your TrainJob manifests or SDK code, you will need to migrate to the manager-keyed RuntimePatches structure. See the [RuntimePatches](#customize-runtime-configs:-runtimepatchesapi) section above for the full API shape and examples.
+If you are using PodTemplateOverrides in your TrainJob manifests or SDK code, you will need to migrate to the manager-keyed RuntimePatches structure. See the [RuntimePatches Operator Guide](https://www.kubeflow.org/docs/components/trainer/operator-guides/runtime-patches/#runtimepatches-overview), and [Options Reference](https://sdk.kubeflow.org/en/latest/train/options.html) for more information.
-### Required: Remove numProcPerNode from Torch API
+### Remove numProcPerNode from the Torch MLPolicy API
The numProcPerNode field has been removed from the Torch MLPolicy. Process-per-node configuration is now handled directly through the container resources, so any TrainJob manifests or SDK calls that set numProcPerNode explicitly will need to be updated before upgrading to v2.2.
-### Required: Remove ElasticPolicy API
+### Remove ElasticPolicy API
-We no longer support the ElasticPolicy API from the MLPolicy as part of Trainer v2.2. If your TrainJobs rely on elastic training configuration through this API, you will need to migrate to the updated approach before upgrading.
+The ElasticPolicy API has been removed from MLPolicy in Trainer v2.2. Elastic training is not yet available in this release, we are actively working on a [redesigned implementation](https://github.com/kubeflow/trainer/issues/2903) for future release. If your TrainJobs rely on elastic training configuration, please hold off on upgrading until that work lands.
### Some TrainJob API fields are now immutable
@@ -278,5 +278,5 @@ The Kubeflow Trainer is built by and for the community. We welcome contributions
* View the full [Changelog](https://github.com/kubeflow/trainer/blob/master/CHANGELOG.md).
* Explore the [Kubeflow Trainer docs](https://www.kubeflow.org/docs/components/trainer/)
-**Headed to [KubeCon](https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/)?** Stop by the Kubeflow booth to see these features in action 😸🧊\!\!
+**Headed to [KubeCon + CloudNativeCon 2026 EU](https://events.linuxfoundation.org/kubecon-cloudnativecon-europe/)?** Stop by the Kubeflow booth to see these features in action 😸🧊\!\!
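
The timeout behavior described in the ActiveDeadlineSeconds paragraph of the first hunk can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Trainer's controller code: the function name `check_deadline`, its parameters, the returned dictionary shape, and the choice to treat "elapsed equals deadline" as exceeded are all hypothetical. Only the `Failed` status and the `DeadlineExceeded` reason come from the post.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def check_deadline(
    started_at: datetime,
    active_deadline_seconds: Optional[int],
    now: datetime,
) -> Optional[dict]:
    """Return a Failed condition once the deadline has passed, else None.

    Hypothetical helper: it models only the comparison, not workload
    termination or JobSet deletion.
    """
    if active_deadline_seconds is None:
        # No deadline configured, so the job is never timed out.
        return None
    elapsed = now - started_at
    if elapsed < timedelta(seconds=active_deadline_seconds):
        # Still within the allowed active duration.
        return None
    # Status and reason strings follow the post's description.
    return {"type": "Failed", "reason": "DeadlineExceeded"}


start = datetime(2026, 3, 20, 12, 0, tzinfo=timezone.utc)
print(check_deadline(start, 3600, start + timedelta(minutes=30)))
print(check_deadline(start, 3600, start + timedelta(hours=2)))
```

With a one-hour deadline, the first call (30 minutes in) yields no condition and the second (two hours in) yields the failure condition.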