Skip to content

V2.1.0 odh rebase#18

Closed
sutaakar wants to merge 119 commits intoopendatahub-io:mainfrom
sutaakar:odh-rebase
Closed

V2.1.0 odh rebase#18
sutaakar wants to merge 119 commits intoopendatahub-io:mainfrom
sutaakar:odh-rebase

Conversation

@sutaakar
Copy link
Copy Markdown
Collaborator

@sutaakar sutaakar commented Nov 11, 2025

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #

Checklist:

  • Docs included if any changes are user facing

Summary by CodeRabbit

Release Notes

  • New Features

    • Added validation to prevent setting dependencies on the first replicated job in a training run.
  • Bug Fixes

    • Made PodSpecOverrides and runtimeRef immutable to prevent unintended runtime modifications.
  • Breaking Changes

    • Removed scheduling gates field from pod specification overrides.
    • Removed TorchTune runtime from default deployment.
  • Chores

    • Updated base container images: PyTorch (2.7.1 → 2.5.0), CUDA (12.8 → 12.4).
    • Updated JobSet dependency version (v0.8.2 → v0.8.0).
    • Removed RBAC bindings for platform-level Kubeflow integration.

kramaranya and others added 30 commits June 11, 2025 08:18
* fix(docs): convert commits to list in changelog.py for compatibility

Signed-off-by: kramaranya <kramaranya15@gmail.com>

* chore(docs): add Changelog for Trainer v2.0.0-rc.0

Signed-off-by: kramaranya <kramaranya15@gmail.com>

---------

Signed-off-by: kramaranya <kramaranya15@gmail.com>
…nShift (kubeflow#2682)

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
…#2685)

* chore(runtime): Bump Torch to 2.7.1 and DeepSpeed to 0.17.1

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update cuda to 12.8

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…w#2382)

* Add the manifests overlay for Kubeflow Training V2

Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>

* Update manifest: adjust permissions, and format changes

Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>

* Update manifest: rename overlay, adjust event permissions

Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>

* Update manifest: make namespace configurable

Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>

* Update manifest: move standalone, only-manager installation in namespace: kubeflow-system

Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>

* Update manifest: add overlay for Kubeflow Platform installation

Signed-off-by: Xinmin Du <2812493086@qq.com>

* add permission for pods log read & rm persistentvolumeclaims

Signed-off-by: Xinmin Du <2812493086@qq.com>

* create the runtimes before the webhooks

Signed-off-by: Xinmin Du <2812493086@qq.com>

* Specify sorting order: fifo

Signed-off-by: Xinmin Du <2812493086@qq.com>

* Deploy jobset first

Signed-off-by: Xinmin Du <2812493086@qq.com>

* remove edit permissions to runtimes; install runtimes after crds

Signed-off-by: Xinmin Du <2812493086@qq.com>

* remove pretraining directory

Signed-off-by: Xinmin Du <2812493086@qq.com>

* patch runtimes images

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: correct image

Signed-off-by: Xinmin Du <2812493086@qq.com>

* add image patch for more runtimes

Signed-off-by: Xinmin Du <2812493086@qq.com>

* Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>

* Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>

* Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>

* Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>

* Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml

Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>

* role_bind for notebook & profile

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: reorder images

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: reuse overlay/manager & runtimes

Signed-off-by: Xinmin Du <2812493086@qq.com>

* fix: remove namespace with patch

Signed-off-by: Xinmin Du <2812493086@qq.com>

---------

Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Du Xinmin <2812493086@qq.com>
Co-authored-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…ith CTR and TrainJob yaml files (kubeflow#2669)

* chore(mainfests): include torchtune runtimes.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(manifests): Update torchtune runtimes.:

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(manifests): Update mounting path in CTRs.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(manifests): Update output_dir.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(manifests): Update numProcPerNode to auto.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
…w#2675)

* fix(plugins): fix errors in trainer command mutation of torchtune.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(plugins): remove config file format suffix.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(test): update UTs.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(initializer): Update the workspace of dataset/model initializer.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(plugins): parse nproc_per_node from GPU resource.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(torchtune): Add bitsandbytes dependency in requirements.txt

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(lint): fix lint error.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(torchtune): Remove unnecessary num_proc_per_node calculation.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(torch): Update invalid parameters.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* feat: Mutable PodSpecOverrides for suspended TrainJob

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Include @tenzen-y review

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Add unit tests

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

---------

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* feat(example): Add alpaca-trianjob-yaml.ipynb.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(example): Update the overview of the torchtune llama3_2 example.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(example): Update the pvc description.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* chore(example): Add the get the fine-tuned model section.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(example): Fix some errors.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(example): fix some errors.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(manifests): Fix debug tag.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(example): Change PVC creation method to Python SDK.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(example): Remove config load.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* feat: Add schedulingGates to PodSpecOverrides

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Change desired job to target job in PodSpecOverrides comments

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

---------

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* fix(module): Change Go module name to v2

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Bump x/net to v0.38.0

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* chore(docs): Add Changelog for v2.0.0-rc.1

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Move example to misc

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Add Red Hat to ADOPTERS.md

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

* Update ADOPTERS.md

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>

---------

Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
…d to job (kubeflow#2719)

Signed-off-by: rudeigerc <rudeigerc@gmail.com>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* chore(ci): Add GitHub action to verify PR titles

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Use operator scope

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add examples scope

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add scripts to scope

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add exporter

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* add wip ignore label

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add PR title to the contrib guide

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Ignore dependencies label

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix text

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Use action only on master branch

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…ue template (kubeflow#2732)

Signed-off-by: rudeigerc <rudeigerc@gmail.com>
… jobset (kubeflow#2734)

Signed-off-by: rudeigerc <rudeigerc@gmail.com>
Signed-off-by: Koray Oksay <koray.oksay@gmail.com>
* chore(docs): Add Changelog for Kubeflow Trainer v2.0.0

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add links for blog post and migration guide

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add links for blog post and website

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* feat(docs): Kubeflow Trainer ROADMAP 2025

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update roadmap

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add issue for Trainer UI

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add issues for MPI and plugin extension

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add issues for builtin trainers

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
…ubeflow#2754)

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
astefanutti and others added 21 commits October 17, 2025 14:40
* chore(ci): Enable Kubernetes API Linter

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* lint files

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Generate

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Review

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Fix golangci-lint target

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Fix commands max length

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Re: Fix golangci-lint target

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Enable the integers linter

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Disable maxlength linter

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* make generate

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Make fields optional

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Fix Suspended field in JobStatus

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

---------

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* fix(api): Keep mpiImplementation field a pointer

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Fix integration test

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

---------

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
…s for data_cache (kubeflow#2890)

Signed-off-by: Akshay Chitneni <achitneni@apple.com>
Co-authored-by: Akshay Chitneni <achitneni@apple.com>
…obs (kubeflow#2653)

* feat(runtime): add support for launcher resource allocation in MPI jobs

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add unit tests

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Set numProcPerNode for MPI plugin

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Move util func to runtime package

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix torchtune plugin

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Inline if for GPU check

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Assign container resources once

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Add todo for test wrappers

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…obs (kubeflow#2722)

* feat(webhook): Add validation for required containers in replicatedJobs.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* test(webhook): Add UTs for validation in required containers.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(lint): fix lint error.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

* fix(webhook): add global map & remove launcher check.

Signed-off-by: Electronic-Waste <2690692950@qq.com>

---------

Signed-off-by: Electronic-Waste <2690692950@qq.com>
* feat(manager): add controller manager configuration and configmap support

Signed-off-by: kapil27 <knema@redhat.com>

* refactor: update configmap naming and leader election configuration

Signed-off-by: kapil27 <knema@redhat.com>

* chore: clean up unused lines in configmap and test files

Signed-off-by: kapil27 <knema@redhat.com>

---------

Signed-off-by: kapil27 <knema@redhat.com>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
kubeflow#2911)

* feat(initializer): add s3 model and dataset initializers

Signed-off-by: rudeigerc <rudeigerc@gmail.com>

* chore: refactor with opendal

Signed-off-by: rudeigerc <rudeigerc@gmail.com>

* chore: support `role_arn` and add `ignore_patterns` field in the Initializers configs

Signed-off-by: rudeigerc <rudeigerc@gmail.com>

---------

Signed-off-by: rudeigerc <rudeigerc@gmail.com>
Co-authored-by: rudeigerc <rudeigerc@gmail.com>
…ubeflow#2912)

* chore(operator): Use SSA throughout runtime framework

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Fix lint error

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

* Update go.mod file

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>

---------

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Co-authored-by: Antonin Stefanutti <antonin@stefanutti.fr>
…harts (kubeflow#2914)

Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Co-authored-by: Antonin Stefanutti <antonin@stefanutti.fr>
…branch (kubeflow#2917)

* feat(manifests): Publish Trainer Helm Charts (kubeflow#2906)

* Solve Remaining Error and bugs

Signed-off-by: adity1raut <araut7798@gmail.com>

* Solve the confige

Signed-off-by: adity1raut <araut7798@gmail.com>

* Update The Suggest Change

Signed-off-by: adity1raut <araut7798@gmail.com>

* Update After REview

Signed-off-by: adity1raut <araut7798@gmail.com>

* Update the Helm publish action

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update release doc

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Use 0.0.0 version for master branch

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update release doc

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: adity1raut <araut7798@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* fix(manifests): Fix Helm charts image name (kubeflow#2915)

* fix(manifests): Fix Helm charts image name

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Always insert appVersion to the Chart.yaml file

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix comment

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Simplify action

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* fix(manifests): Remove the default tag from the controller image (kubeflow#2916)

* fix(manifests): Remove the default tag from the controller image

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Fix README template

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: adity1raut <araut7798@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Aditya Raut <159172287+adity1raut@users.noreply.github.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…cache nodes (kubeflow#2920)

Signed-off-by: Akshay Chitneni <achitneni@apple.com>
Co-authored-by: Akshay Chitneni <achitneni@apple.com>
…#2924)

* add local docker training example

Signed-off-by: Brian Gallagher <briangal@gmail.com>

* feat: Adding local execution example notebook Co-authored-by Brian Gallagher <bgallagh@redhat.com>

Signed-off-by: Fiona Waters <fiwaters6@gmail.com>

---------

Signed-off-by: Brian Gallagher <briangal@gmail.com>
Signed-off-by: Fiona Waters <fiwaters6@gmail.com>
Co-authored-by: Brian Gallagher <briangal@gmail.com>
Co-authored-by: Fiona Waters <fiwaters6@gmail.com>
…ubeflow#2927)

* fix(ci): Fix the Kubeflow SDK installation with Docker

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Uncomment delete job in local Notebooks

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

* Update .github/workflows/test-e2e.yaml

Co-authored-by: Anya Kramar <akramar@redhat.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>

---------

Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Co-authored-by: Anya Kramar <akramar@redhat.com>
…e and example (kubeflow#2928)

Signed-off-by: Akshay Chitneni <achitneni@apple.com>
Co-authored-by: Akshay Chitneni <achitneni@apple.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
@sutaakar sutaakar changed the title Revert all commits from release-2.0 not in master V2.1.0 odh rebase Nov 11, 2025
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Nov 11, 2025

Walkthrough

This pull request performs a major module path migration from github.com/kubeflow/trainer/v2 to github.com/kubeflow/trainer, removes the schedulingGates field from PodSpecOverride, updates validation plugin signatures, modifies container image versions, and adjusts Kubernetes CRD validation rules and RBAC permissions.

Changes

Cohort / File(s) Summary
Module Path Migration (v2 → v1)
cmd/trainer-controller-manager/main.go, go.mod, hack/swagger/main.go, hack/update-codegen.sh, pkg/apis/trainer/v1alpha1/*, pkg/client/**/**, pkg/constants/constants.go, pkg/controller//, pkg/runtime/**/**, pkg/util/testing/*`
Updated import paths from github.com/kubeflow/trainer/v2/pkg/... to github.com/kubeflow/trainer/pkg/... across Go source files and dependencies. JobSet dependency also downgraded from v0.8.2 to v0.8.0.
PodSpecOverride Field Removal
api/openapi-spec/swagger.json, api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_pod_spec_override.py, pkg/apis/trainer/v1alpha1/trainjob_types.go, pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go, pkg/client/applyconfiguration/trainer/v1alpha1/podspecoverride.go
Removed schedulingGates field and associated serialization logic; updated container and initContainer descriptions from "target" to "desired" job templates.
Validation Plugin Signature Changes
pkg/runtime/core/clustertrainingruntime.go, pkg/runtime/core/trainingruntime.go, pkg/runtime/framework/core/framework.go, pkg/runtime/framework/interface.go, pkg/runtime/framework/plugins/jobset/jobset.go, pkg/runtime/framework/plugins/mpi/mpi.go, pkg/runtime/framework/plugins/torch/torch.go
Removed context.Context parameter from RunCustomValidationPlugins and Validate method signatures across framework and plugin implementations.
CRD Validation Rules
charts/kubeflow-trainer/crds/trainer.kubeflow.org_clustertrainingruntimes.yaml, charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainingruntimes.yaml, manifests/base/crds/trainer.kubeflow.org_clustertrainingruntimes.yaml, manifests/base/crds/trainer.kubeflow.org_trainingruntimes.yaml
Added validation rule preventing dependsOn field on first ReplicatedJob in job sequences; added immutability validations for podSpecOverrides and runtimeRef in TrainJob CRD.
RBAC and Permissions
charts/kubeflow-trainer/templates/rbac/clusterrole.yaml, manifests/base/rbac/role.yaml, manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml, manifests/overlays/kubeflow-platform/role_binding.yaml
Updated ClusterRole verbs (added create/update/patch for several resources); removed event resource permissions; deleted three ClusterRole definitions and two ClusterRoleBinding entries in platform overlay.
Container Image Updates
cmd/runtimes/deepspeed/Dockerfile, cmd/runtimes/deepspeed/requirements.txt, cmd/trainers/torchtune/Dockerfile, cmd/trainers/torchtune/requirements.txt, hack/e2e-setup-cluster.sh, manifests/base/runtimes/torch_distributed.yaml
Downgraded PyTorch from 2.7.1 to 2.5.0, CUDA from 12.8.1 to 12.4.1, DeepSpeed from 0.17.1 to 0.16.4; removed bitsandbytes dependency.
Dockerfile Working Directory Changes
cmd/initializers/dataset/Dockerfile, cmd/initializers/model/Dockerfile
Changed working directory from /app to /workspace.
Helm Chart Configuration
charts/kubeflow-trainer/Chart.yaml, manifests/base/runtimes/kustomization.yaml, manifests/overlays/manager/kustomization.yaml, manifests/overlays/runtimes/kustomization.yaml, manifests/third-party/jobset/kustomization.yaml
Updated JobSet repository and downgraded version; changed controller manager image tag from v2.0.0 to latest; removed torchtune, mlx-runtime, and deepspeed-runtime from kustomization.
TrainJob CRD Schema Restructuring
charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainjobs.yaml
Added datasetConfig and modelConfig fields for dataset and model management; expanded PodSpecOverride with envFrom, command, and args fields; removed legacy scheduling gates block.
Runtime Configuration Examples
manifests/base/runtimes/torchtune/llama3_2/llama3_2_1B.yaml, manifests/base/runtimes/torchtune/llama3_2/llama3_2_3B.yaml
Updated mount points from /workspace to /workspace/dataset and /workspace/model; changed config paths to include .yaml extension; removed rdzv_endpoint argument; updated numProcPerNode from auto to 2.
Python API and Configuration
api/python_api/pyproject.toml, examples/pytorch/image-classification/mnist.ipynb
Moved build-system block to top, removed Python 3.9+ requirement, added Python 3.8 to classifiers; downgraded PyTorch/torchvision in notebook examples.
Makefile and Build
Makefile
Removed CRD copy step for trainer chart directory.
Documentation and Cleanup
.gitignore, .github/workflows/build-and-push-images.yaml, .github/workflows/test-e2e.yaml, .golangci.yaml, CHANGELOG.md, docs/release/changelog.py, examples/torchtune/llama3_2/alpaca-trainjob-yaml.ipynb, manifests/overlays/kubeflow-platform/kustomization.yaml, manifests/overlays/kubeflow-platform/patches/remove-namespace.yaml
Updated runner specifications to use labels; removed dist/ from gitignore; reformatted golangci.yaml; simplified CHANGELOG; removed torchtune notebook example; deleted platform overlay configuration.

Sequence Diagram(s)

sequenceDiagram
    participant User as User/Webhook
    participant Framework as Framework
    participant Plugin as Validation Plugin
    participant Validator as Kubernetes Validator

    rect rgb(200, 220, 255)
    Note over Framework,Plugin: Old Flow (with context)
    User->>Framework: Validate(ctx, info, oldObj, newObj)
    Framework->>Plugin: RunCustomValidationPlugins(ctx, info, oldObj, newObj)
    Plugin->>Plugin: Validate(ctx, info, oldObj, newObj)
    Plugin-->>Framework: warnings, errors
    end

    rect rgb(220, 240, 200)
    Note over Framework,Plugin: New Flow (context removed)
    User->>Framework: Validate(info, oldObj, newObj)
    Framework->>Plugin: RunCustomValidationPlugins(info, oldObj, newObj)
    Plugin->>Plugin: Validate(info, oldObj, newObj)
    Plugin-->>Framework: warnings, errors
    end

    rect rgb(240, 220, 200)
    Note over User,Validator: CRD Validation Enhancement
    Validator->>Validator: Check dependsOn on replicatedJobs[0]
    alt First job has dependsOn
        Validator-->>User: ✗ Reject (DependsOn can't be set for first ReplicatedJob)
    else Only subsequent jobs have dependsOn
        Validator-->>User: ✓ Accept
    end
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Areas requiring extra attention:

  • Module Path Migration Impact: While mostly mechanical, verify all import chains resolve correctly, especially in generated code (zz_generated.deepcopy.go, apply configurations). Check for any missed import path updates in test files and build scripts.
  • Validation Plugin Signature Changes: Confirm all implementations of CustomValidationPlugin.Validate interface have been updated consistently (particularly JobSet, MPI, Torch plugins). Verify context removal doesn't break any async operations or logging that previously relied on it.
  • CRD Validation Rules: Review the new CEL validation rules (!(has(self.replicatedJobs[0].dependsOn))) for correctness and potential edge cases with empty replicatedJobs lists.
  • TrainJob CRD Schema Restructuring: The addition of datasetConfig and modelConfig is substantial; validate backward compatibility if existing TrainJob resources need migration.
  • Image Version Downgrades: PyTorch, CUDA, and dependency downgrades may introduce compatibility issues; verify test coverage reflects these versions.
  • TorchTune Configuration Changes: Path adjustments (.yaml extension, /workspace/model vs /workspace/dataset) need validation against actual TorchTune runtime behavior.

Poem

🐰 Hoppy migration from v2 to home,
Import paths leap with newfound roam,
Context removed, validations clear,
CRD rules now crystal here,
Jobs scheduled strong, first stays alone!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Title check ❓ Inconclusive The title 'V2.1.0 odh rebase' is vague and does not clearly describe the main changes in the pull request, which involve reverting commits, migrating from v2 module paths to v1, removing deprecated fields, and updating dependencies. Consider a more descriptive title such as 'Migrate module paths from v2 to v1 and revert release-2.0 commits' or 'Rebase ODH trainer to master and migrate module structure'.
✅ Passed checks (1 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@sutaakar sutaakar marked this pull request as draft November 11, 2025 08:03
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
.golangci.yaml (1)

1-18: Missing PR description: Cannot properly review this migration without context.

This PR appears to perform a significant module path migration (based on the summary and the new prefix(github.com/kubeflow/trainer) section), yet the PR description is completely empty—only a template stub remains. Without a clear explanation of what this PR does and why, reviewers cannot assess:

  • The scope and scale of the migration
  • The rationale for reverting commits (as the title suggests)
  • Breaking changes or migration impacts
  • Whether all dependent code has been updated consistently

This is a blocker for review and merge.

Please fill in the PR description with:

  • What this PR does (module path migration details)
  • Why it's needed
  • Any breaking changes
  • Migration guidance for users/developers
  • Issues being fixed (reference the tracking issue)
pkg/controller/trainjob_controller.go (1)

94-144: Restore events RBAC to keep EventRecorder working.

Dropping the kubebuilder RBAC rule for core/v1 events leaves the controller service account without permission to create/update/patch events, so every r.recorder.Event(...) call will start failing with Forbidden once deployed. Please add the events rule back.

 // +kubebuilder:rbac:groups=trainer.kubeflow.org,resources=trainjobs/status,verbs=get;update;patch
 // +kubebuilder:rbac:groups=trainer.kubeflow.org,resources=trainjobs/finalizers,verbs=get;update;patch
+// +kubebuilder:rbac:groups="",resources=events,verbs=create;watch;update;patch
charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainjobs.yaml (1)

1094-1099: Fix unclear description for nodeSelector override.

Line 1098 description states: "place Pod on the specific mode" which is semantically unclear. This should be "place Pod on specific nodes" to accurately describe the nodeSelector purpose.

- description: Override for the node selector to place Pod on
-   the specific mode.
+ description: Override for the node selector to place Pod on specific nodes.
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d0c78f5 and 70b5375.

⛔ Files ignored due to path filters (1)
  • go.sum is excluded by !**/*.sum
📒 Files selected for processing (107)
  • .github/workflows/build-and-push-images.yaml (1 hunks)
  • .github/workflows/test-e2e.yaml (1 hunks)
  • .gitignore (1 hunks)
  • .golangci.yaml (1 hunks)
  • CHANGELOG.md (0 hunks)
  • Makefile (0 hunks)
  • api/openapi-spec/swagger.json (2 hunks)
  • api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_pod_spec_override.py (1 hunks)
  • api/python_api/pyproject.toml (1 hunks)
  • charts/kubeflow-trainer/Chart.yaml (1 hunks)
  • charts/kubeflow-trainer/crds/trainer.kubeflow.org_clustertrainingruntimes.yaml (1 hunks)
  • charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainingruntimes.yaml (1 hunks)
  • charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainjobs.yaml (11 hunks)
  • charts/kubeflow-trainer/templates/rbac/clusterrole.yaml (1 hunks)
  • cmd/initializers/dataset/Dockerfile (1 hunks)
  • cmd/initializers/model/Dockerfile (1 hunks)
  • cmd/runtimes/deepspeed/Dockerfile (1 hunks)
  • cmd/runtimes/deepspeed/requirements.txt (1 hunks)
  • cmd/trainer-controller-manager/main.go (1 hunks)
  • cmd/trainers/torchtune/Dockerfile (1 hunks)
  • cmd/trainers/torchtune/requirements.txt (0 hunks)
  • docs/release/changelog.py (1 hunks)
  • examples/pytorch/image-classification/mnist.ipynb (4 hunks)
  • examples/torchtune/llama3_2/alpaca-trainjob-yaml.ipynb (0 hunks)
  • go.mod (2 hunks)
  • hack/e2e-setup-cluster.sh (2 hunks)
  • hack/swagger/main.go (2 hunks)
  • hack/update-codegen.sh (1 hunks)
  • manifests/base/crds/trainer.kubeflow.org_clustertrainingruntimes.yaml (1 hunks)
  • manifests/base/crds/trainer.kubeflow.org_trainingruntimes.yaml (1 hunks)
  • manifests/base/crds/trainer.kubeflow.org_trainjobs.yaml (3 hunks)
  • manifests/base/rbac/role.yaml (1 hunks)
  • manifests/base/runtimes/kustomization.yaml (0 hunks)
  • manifests/base/runtimes/torch_distributed.yaml (1 hunks)
  • manifests/base/runtimes/torchtune/llama3_2/llama3_2_1B.yaml (4 hunks)
  • manifests/base/runtimes/torchtune/llama3_2/llama3_2_3B.yaml (4 hunks)
  • manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml (0 hunks)
  • manifests/overlays/kubeflow-platform/kustomization.yaml (0 hunks)
  • manifests/overlays/kubeflow-platform/patches/remove-namespace.yaml (0 hunks)
  • manifests/overlays/kubeflow-platform/role_binding.yaml (0 hunks)
  • manifests/overlays/manager/kustomization.yaml (1 hunks)
  • manifests/overlays/runtimes/kustomization.yaml (1 hunks)
  • manifests/third-party/jobset/kustomization.yaml (1 hunks)
  • pkg/apis/trainer/v1alpha1/trainjob_types.go (2 hunks)
  • pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go (0 hunks)
  • pkg/client/applyconfiguration/trainer/v1alpha1/mpimlpolicysource.go (1 hunks)
  • pkg/client/applyconfiguration/trainer/v1alpha1/podspecoverride.go (0 hunks)
  • pkg/client/applyconfiguration/utils.go (1 hunks)
  • pkg/client/clientset/versioned/clientset.go (1 hunks)
  • pkg/client/clientset/versioned/fake/clientset_generated.go (1 hunks)
  • pkg/client/clientset/versioned/fake/register.go (1 hunks)
  • pkg/client/clientset/versioned/scheme/register.go (1 hunks)
  • pkg/client/clientset/versioned/typed/trainer/v1alpha1/clustertrainingruntime.go (1 hunks)
  • pkg/client/clientset/versioned/typed/trainer/v1alpha1/fake/fake_clustertrainingruntime.go (1 hunks)
  • pkg/client/clientset/versioned/typed/trainer/v1alpha1/fake/fake_trainer_client.go (1 hunks)
  • pkg/client/clientset/versioned/typed/trainer/v1alpha1/fake/fake_trainingruntime.go (1 hunks)
  • pkg/client/clientset/versioned/typed/trainer/v1alpha1/fake/fake_trainjob.go (1 hunks)
  • pkg/client/clientset/versioned/typed/trainer/v1alpha1/trainer_client.go (1 hunks)
  • pkg/client/clientset/versioned/typed/trainer/v1alpha1/trainingruntime.go (1 hunks)
  • pkg/client/clientset/versioned/typed/trainer/v1alpha1/trainjob.go (1 hunks)
  • pkg/client/informers/externalversions/factory.go (1 hunks)
  • pkg/client/informers/externalversions/generic.go (1 hunks)
  • pkg/client/informers/externalversions/internalinterfaces/factory_interfaces.go (1 hunks)
  • pkg/client/informers/externalversions/trainer/interface.go (1 hunks)
  • pkg/client/informers/externalversions/trainer/v1alpha1/clustertrainingruntime.go (1 hunks)
  • pkg/client/informers/externalversions/trainer/v1alpha1/interface.go (1 hunks)
  • pkg/client/informers/externalversions/trainer/v1alpha1/trainingruntime.go (1 hunks)
  • pkg/client/informers/externalversions/trainer/v1alpha1/trainjob.go (1 hunks)
  • pkg/client/listers/trainer/v1alpha1/clustertrainingruntime.go (1 hunks)
  • pkg/client/listers/trainer/v1alpha1/trainingruntime.go (1 hunks)
  • pkg/client/listers/trainer/v1alpha1/trainjob.go (1 hunks)
  • pkg/constants/constants.go (1 hunks)
  • pkg/controller/clustertrainingruntime_controller.go (2 hunks)
  • pkg/controller/clustertrainingruntime_controller_test.go (1 hunks)
  • pkg/controller/setup.go (1 hunks)
  • pkg/controller/trainingruntime_controller.go (2 hunks)
  • pkg/controller/trainingruntime_controller_test.go (1 hunks)
  • pkg/controller/trainjob_controller.go (1 hunks)
  • pkg/runtime/core/clustertrainingruntime.go (2 hunks)
  • pkg/runtime/core/clustertrainingruntime_test.go (1 hunks)
  • pkg/runtime/core/core.go (1 hunks)
  • pkg/runtime/core/registry.go (1 hunks)
  • pkg/runtime/core/trainingruntime.go (2 hunks)
  • pkg/runtime/core/trainingruntime_test.go (6 hunks)
  • pkg/runtime/framework/core/framework.go (2 hunks)
  • pkg/runtime/framework/core/framework_test.go (2 hunks)
  • pkg/runtime/framework/interface.go (2 hunks)
  • pkg/runtime/framework/plugins/coscheduling/coscheduling.go (1 hunks)
  • pkg/runtime/framework/plugins/coscheduling/indexer.go (1 hunks)
  • pkg/runtime/framework/plugins/coscheduling/indexer_test.go (1 hunks)
  • pkg/runtime/framework/plugins/jobset/builder.go (1 hunks)
  • pkg/runtime/framework/plugins/jobset/jobset.go (2 hunks)
  • pkg/runtime/framework/plugins/jobset/jobset_test.go (2 hunks)
  • pkg/runtime/framework/plugins/mpi/mpi.go (2 hunks)
  • pkg/runtime/framework/plugins/mpi/mpi_test.go (2 hunks)
  • pkg/runtime/framework/plugins/plainml/plainml.go (1 hunks)
  • pkg/runtime/framework/plugins/plainml/plainml_test.go (1 hunks)
  • pkg/runtime/framework/plugins/registry.go (1 hunks)
  • pkg/runtime/framework/plugins/torch/torch.go (4 hunks)
  • pkg/runtime/framework/plugins/torch/torch_test.go (11 hunks)
  • pkg/runtime/indexer/indexer.go (1 hunks)
  • pkg/runtime/indexer/indexer_test.go (1 hunks)
  • pkg/runtime/interface.go (1 hunks)
  • pkg/runtime/runtime.go (1 hunks)
  • pkg/runtime/runtime_test.go (1 hunks)
  • pkg/util/testing/client.go (1 hunks)
  • pkg/util/testing/compare.go (1 hunks)
⛔ Files not processed due to max files limit (20)
  • pkg/util/testing/wrapper.go
  • pkg/util/trainjob/trainjob.go
  • pkg/util/trainjob/trainjob_test.go
  • pkg/webhooks/clustertrainingruntime_webhook.go
  • pkg/webhooks/setup.go
  • pkg/webhooks/trainingruntime_webhook.go
  • pkg/webhooks/trainingruntime_webhook_test.go
  • pkg/webhooks/trainjob_webhook.go
  • pkg/webhooks/trainjob_webhook_test.go
  • test/e2e/e2e_test.go
  • test/e2e/suite_test.go
  • test/integration/controller/clustertrainingruntime_controller_test.go
  • test/integration/controller/suite_test.go
  • test/integration/controller/trainingruntime_controller_test.go
  • test/integration/controller/trainjob_controller_test.go
  • test/integration/framework/framework.go
  • test/integration/webhooks/clustertrainingruntime_webhook_test.go
  • test/integration/webhooks/suite_test.go
  • test/integration/webhooks/trainingruntime_webhook_test.go
  • test/integration/webhooks/trainjob_test.go
💤 Files with no reviewable changes (11)
  • pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go
  • Makefile
  • pkg/client/applyconfiguration/trainer/v1alpha1/podspecoverride.go
  • manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml
  • examples/torchtune/llama3_2/alpaca-trainjob-yaml.ipynb
  • cmd/trainers/torchtune/requirements.txt
  • manifests/overlays/kubeflow-platform/kustomization.yaml
  • manifests/base/runtimes/kustomization.yaml
  • CHANGELOG.md
  • manifests/overlays/kubeflow-platform/patches/remove-namespace.yaml
  • manifests/overlays/kubeflow-platform/role_binding.yaml
🧰 Additional context used
🧬 Code graph analysis (10)
pkg/runtime/core/trainingruntime_test.go (1)
pkg/constants/constants.go (3)
  • Node (44-44)
  • TorchTuneArgRdzvEndpoint (129-129)
  • TorchTuneFullFinetuneDistributed (138-138)
pkg/runtime/framework/plugins/torch/torch_test.go (3)
pkg/runtime/runtime.go (1)
  • WithPodSet (138-153)
pkg/constants/constants.go (2)
  • Node (44-44)
  • AncestorTrainer (41-41)
pkg/runtime/framework/interface.go (1)
  • CustomValidationPlugin (34-37)
pkg/runtime/framework/plugins/mpi/mpi_test.go (1)
pkg/runtime/framework/interface.go (1)
  • CustomValidationPlugin (34-37)
pkg/runtime/framework/core/framework.go (2)
pkg/runtime/runtime.go (1)
  • Info (38-49)
pkg/apis/trainer/v1alpha1/trainjob_types.go (1)
  • TrainJob (39-50)
pkg/runtime/framework/plugins/mpi/mpi.go (2)
pkg/runtime/runtime.go (1)
  • Info (38-49)
pkg/apis/trainer/v1alpha1/trainjob_types.go (1)
  • TrainJob (39-50)
pkg/runtime/framework/interface.go (2)
pkg/runtime/runtime.go (1)
  • Info (38-49)
pkg/apis/trainer/v1alpha1/trainjob_types.go (1)
  • TrainJob (39-50)
pkg/runtime/framework/plugins/torch/torch.go (3)
pkg/runtime/runtime.go (2)
  • Info (38-49)
  • PodSet (64-79)
pkg/apis/trainer/v1alpha1/trainjob_types.go (3)
  • TrainJob (39-50)
  • RuntimeRef (137-152)
  • Trainer (203-232)
pkg/constants/constants.go (3)
  • TorchTuneArgRdzvEndpoint (129-129)
  • Node (44-44)
  • ContainerTrainerPort (47-47)
api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_pod_spec_override.py (1)
api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_container_override.py (1)
  • TrainerV1alpha1ContainerOverride (27-105)
pkg/runtime/framework/plugins/jobset/jobset_test.go (3)
pkg/util/testing/client.go (1)
  • NewClientBuilder (34-45)
pkg/runtime/framework/plugins/jobset/jobset.go (1)
  • New (71-78)
pkg/runtime/framework/interface.go (1)
  • CustomValidationPlugin (34-37)
pkg/runtime/framework/plugins/jobset/jobset.go (2)
pkg/runtime/runtime.go (1)
  • Info (38-49)
pkg/apis/trainer/v1alpha1/trainjob_types.go (1)
  • TrainJob (39-50)
🪛 Checkov (3.2.334)
manifests/base/rbac/role.yaml

[high] 2-79: Minimize ClusterRoles that grant control over validating or mutating admission webhook configurations

(CKV_K8S_155)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: Test
  • GitHub Check: Generate

Comment thread api/python_api/pyproject.toml Outdated
"Intended Audience :: Developers",
"Intended Audience :: Education",
"Intended Audience :: Science/Research",
"Programming Language :: Python :: 3.8",
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Fix Python version support inconsistency.

Python 3.8 has been added to classifiers (line 19), but the AI summary indicates that requires-python = ">=3.9" was removed. This creates a critical inconsistency:

  • The classifiers declare support for Python 3.8, but without a requires-python constraint, users can install the package on unsupported versions.
  • The dependency pydantic>=2.10.0 must be verified to support Python 3.8; if it doesn't, the package will fail at runtime on Python 3.8.

Either:

  1. Restore requires-python = ">=3.9" and remove Python 3.8 from classifiers, or
  2. Add requires-python = ">=3.8" and verify all dependencies (especially pydantic) support Python 3.8.

Also applies to: 31-31


🌐 Web query:

Does pydantic version 2.10.0 or later support Python 3.8?

💡 Result:

Short answer: Yes for 2.10.x, no for later majors — Pydantic 2.10.x supports Python 3.8, but Python 3.8 support was dropped starting in v2.11 (released Mar 27, 2025). [1][2]

Sources:

  • Pydantic changelog (v2.10.2 notes referencing Python 3.8 compatibility). [1]
  • Pydantic v2.11 release announcement (notes that Python 3.8 support was dropped). [2]

🏁 Script executed:

#!/bin/bash
# Extract relevant sections from pyproject.toml
cat api/python_api/pyproject.toml | grep -A 20 "requires-python"
echo "---"
cat api/python_api/pyproject.toml | grep -A 5 "pydantic"
echo "---"
cat api/python_api/pyproject.toml | grep -B 5 -A 5 "3.8"

Length of output: 698


Fix Python version support inconsistency.

Python 3.8 is listed in classifiers but the configuration has critical issues:

  • No requires-python constraint exists, allowing installation on Python 3.8
  • pydantic>=2.10.0 permits v2.11+ installation, which dropped Python 3.8 support (Mar 27, 2025)
  • Result: Python 3.8 users can install the package but will face runtime failures with pydantic 2.11+

Fix by either:

  1. Restore requires-python = ">=3.9" and remove Python 3.8 from classifiers, or
  2. Add requires-python = ">=3.8" and pin pydantic to pydantic>=2.10.0,<2.11

Also applies to: 31-31

🤖 Prompt for AI Agents
In api/python_api/pyproject.toml around lines 19 and 31, the package classifiers
and dependency bounds are inconsistent with pydantic's dropping of Python 3.8
support; add a requires-python constraint and fix pydantic pin to prevent 3.8
users from getting incompatible versions: either set requires-python = ">=3.9"
and remove the "Programming Language :: Python :: 3.8" classifier, or set
requires-python = ">=3.8" and tighten the dependency to pydantic>=2.10.0,<2.11
(apply the same change where the pydantic dependency appears at line 31).

Comment on lines +10011 to +10012
- message: DependsOn can't be set for the first ReplicatedJob
rule: '!(has(self.replicatedJobs[0].dependsOn))'
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

PR lacks explanation and user-facing documentation updates.

This validation rule change affects the allowed configurations for ClusterTrainingRuntime. The PR description is empty, and the documentation checklist item remains unchecked. For a user-facing schema change that restricts valid configurations, both the PR description and user documentation should be updated to explain the rationale and impact.

Provide:

  1. A clear explanation in the PR description of why this restriction is being added
  2. Updates to user-facing documentation explaining the breaking change (if applicable)

Comment on lines +38 to +45
- secrets
verbs:
- get
- list
- watch
- create
- patch
- update
- watch
- patch
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

🧩 Analysis chain

Security review needed: Secrets access added.

The RBAC rules now grant get, list, watch, create, update, and patch permissions on secrets in the core API group. This is a security-sensitive permission expansion.

Please provide justification for why the trainer controller needs to manage secrets. If this is for webhook certificates or similar functionality, consider:

  1. Limiting to a specific namespace if possible
  2. Using more restrictive verbs (e.g., only get if writing isn't required)
  3. Documenting the security implications

Run the following script to check how secrets are used in the controller code:


🏁 Script executed:

#!/bin/bash
# Description: Search for Secret usage in controller code

# Find Secret creation/updates in controller
ast-grep --pattern 'Secret{$$$}' 
rg -n "Kind.*Secret|\.Secrets\(\)" pkg/controller/ pkg/runtime/ cmd/

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Broader search for secret-related code (case-insensitive)
# Search for lowercase references, webhook cert handling, and client patterns

echo "=== Searching for lowercase 'secret' references ==="
rg -in "secret" pkg/controller/ pkg/runtime/ cmd/ | head -30

echo -e "\n=== Searching for webhook/certificate related code ==="
rg -in "webhook|certificate|cert|ca\b" pkg/controller/ pkg/runtime/ cmd/ | head -20

echo -e "\n=== Searching for kubernetes client secret patterns ==="
rg -in "\.secrets\(|secretclient|secretinterface" pkg/controller/ pkg/runtime/ cmd/

echo -e "\n=== Looking for RBAC or permission justification docs ==="
find . -type f \( -name "*.md" -o -name "*.txt" -o -name "CHANGELOG*" -o -name "*RBAC*" \) 2>/dev/null | xargs grep -l "secret\|permission" 2>/dev/null | head -10

echo -e "\n=== Checking clusterrole.yaml for context around secrets rule ==="
head -100 charts/kubeflow-trainer/templates/rbac/clusterrole.yaml

Length of output: 7003



🏁 Script executed:

#!/bin/bash
# Verify how cert.ManageCerts uses the Secret (does it write?)
echo "=== Checking cert.ManageCerts implementation ==="
find . -name "cert.go" -o -name "cert*.go" | xargs grep -l "ManageCerts" 2>/dev/null

echo -e "\n=== Looking at cert package implementation ==="
fd "cert" pkg/util/ -type f | head -5
cat pkg/util/cert/cert.go 2>/dev/null | head -80

echo -e "\n=== Checking if ManageCerts performs write operations ==="
rg -A 20 "func ManageCerts" pkg/util/cert/

Length of output: 3866


Refine Secrets RBAC to match code requirements.

The Secrets permission is justified for webhook certificate management (via pkg/util/cert/cert.go), but the ClusterRole is overly permissive. The kubebuilder:rbac annotation specifies only get;list;watch;update, yet the ClusterRole grants get, list, watch, create, update, patch.

Reduce the Secrets rule to match the code's actual requirements:

- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - get
  - list
  - watch
  - update

The create and patch verbs are not used by the cert-controller library and should be removed.

🤖 Prompt for AI Agents
In charts/kubeflow-trainer/templates/rbac/clusterrole.yaml around lines 38 to
45, the ClusterRole grants secrets verbs "create" and "patch" in addition to
"get,list,watch,update" which is more permissive than the kubebuilder annotation
and the cert controller uses; remove the create and patch verbs so the secrets
rule only contains get, list, watch, and update to match the code requirements
and kubebuilder:rbac annotation.

Comment thread docs/release/changelog.py Outdated
github_repo = Github(args.token).get_repo(REPO_NAME)
comparison = github_repo.compare(previous_release, current_release)
commits = list(comparison.commits)
commits = comparison.commits
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify that comparison.commits supports indexing and reversing.

The removal of the explicit list() conversion may cause runtime errors. The code accesses commits[-1] (Line 29) and uses reversed(commits) (Line 35), which require the object to support negative indexing and reversal. PyGithub's Comparison.commits returns a PaginatedList, which may not support these operations without explicit conversion to a list.

Run the following script to verify the type and capabilities of comparison.commits:

Alternatively, search the web for PyGithub PaginatedList documentation:


🏁 Script executed:

#!/bin/bash
# Description: Check PyGithub documentation for PaginatedList capabilities

# Search for PaginatedList implementation to verify indexing support
rg -n "class PaginatedList" --type py -A 20

Length of output: 0


🌐 Web query:

Does PyGithub PaginatedList support negative indexing and reversed() function?

💡 Result:

Short answer:

  • Negative indexing: No — PaginatedList does not support Python-style negative indexes.
  • reversed(): Not via the built-in reversed() — use the PaginatedList.reversed attribute to iterate in reverse.

Sources: PyGithub docs showing indexing/slicing and the .reversed attribute, and changelog noting reversed iteration support. [1][2][3]

References
[1] PyGithub utilities (PaginatedList) — docs showing indexing/slicing and .reversed. (pygithub.readthedocs.io)
[2] Older PyGithub utilities docs (same behavior). (pygithub.readthedocs.io)
[3] Changelog noting addition of reversed iteration. (pygithub.readthedocs.io)

Citations:


🏁 Script executed:

#!/bin/bash
# Verify the actual operations at lines 29 and 35 in the file
sed -n '25,40p' docs/release/changelog.py

Length of output: 589


Restore list() conversion — code will crash on negative indexing and reversed().

PyGithub's PaginatedList does not support negative indexing or the built-in reversed() function. The code accesses commits[-1] at line 29 and calls reversed(commits) at line 35. Removing the list() conversion will cause runtime errors.

Restore line 26 to: commits = list(comparison.commits)

🤖 Prompt for AI Agents
In docs/release/changelog.py around line 26, the assignment removed the list
conversion so commits = comparison.commits returns a PyGithub PaginatedList
which does not support negative indexing or reversed(); restore the conversion
by making commits a concrete list (i.e., reintroduce list(...) around
comparison.commits) so subsequent uses like commits[-1] and reversed(commits)
work without runtime errors.

Comment on lines +51 to +52
"!pip install torch==2.5.0\n",
"!pip install torchvision==0.20.0"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify PyTorch version compatibility and security.

The pinned versions appear reasonable, but please confirm that torch 2.5.0 and torchvision 0.20.0 are compatible, currently supported, and free from known vulnerabilities.


🌐 Web query:

PyTorch 2.5.0 and torchvision 0.20.0 compatibility

💡 Result:

Yes — torchvision 0.20.0 is the supported companion for PyTorch 2.5.0. The official compatibility/install docs show torch==2.5.0 with torchvision==0.20.0 (conda and pip install commands are provided). [1][2]

References:

  • PyTorch previous versions / install instructions (lists torch==2.5.0 + torchvision==0.20.0). [1]
  • PyTorch versions compatibility matrix (shows 2.5.0 ↔ 0.20.0). [2]

[1] https://pytorch.org/get-started/previous-versions/
[2] https://github.com/pytorch/pytorch/wiki/PyTorch-Versions


🌐 Web query:

PyTorch 2.5.0 security vulnerabilities CVE

💡 Result:

Short answer: Yes — PyTorch 2.5.0 is affected by a critical RCE CVE: CVE-2025-32434 (torch.load with weights_only=True). It affects versions before 2.6.0 (i.e., 2.5.1 and earlier, including 2.5.0); the fix was released in 2.6.0. [1][2]

References

Would you like upgrade/mitigation steps or a short detection checklist?


Upgrade PyTorch to patch critical security vulnerability.

While torch 2.5.0 and torchvision 0.20.0 are verified as officially compatible, PyTorch 2.5.0 is affected by a critical RCE CVE (CVE-2025-32434) in torch.load with weights_only=True, fixed in version 2.6.0. Upgrade to torch 2.6.0+ (with torchvision 0.21.0+) to patch this vulnerability.

🤖 Prompt for AI Agents
In examples/pytorch/image-classification/mnist.ipynb around lines 51-52, the
notebook pins torch==2.5.0 and torchvision==0.20.0 which is vulnerable
(CVE-2025-32434); update the pip install lines to install torch>=2.6.0 and
torchvision>=0.21.0 (e.g., torch==2.6.0 and torchvision==0.21.0 or use >= pins),
run a quick compatibility test (import torch, torchvision, and load a model) to
confirm nothing breaks, and update any related docs/requirements to reflect the
new minimum versions.

Comment on lines +8 to +13
- name: ghcr.io/kubeflow/trainer/torchtune-trainer
newTag: v2.0.0
newTag: latest
- name: ghcr.io/kubeflow/trainer/dataset-initializer
newTag: v2.0.0
newTag: latest
- name: ghcr.io/kubeflow/trainer/model-initializer
newTag: v2.0.0
- name: ghcr.io/kubeflow/trainer/mlx-runtime
newTag: v2.0.0
- name: ghcr.io/kubeflow/trainer/deepspeed-runtime
newTag: v2.0.0
newTag: latest
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Reconsider using latest image tags for production trainer runtimes.

Same concern as the controller manager: using latest without version pinning introduces reproducibility and security risks for production deployments. Replace with explicit semantic versions.

🤖 Prompt for AI Agents
In manifests/overlays/runtimes/kustomization.yaml around lines 8 to 13, the
image patches use the floating tag "latest" for trainer/runtime images which is
unsafe for production; replace each "newTag: latest" with explicit semantic
version tags (e.g., vX.Y.Z) for ghcr.io/kubeflow/trainer/torchtune-trainer,
ghcr.io/kubeflow/trainer/dataset-initializer, and
ghcr.io/kubeflow/trainer/model-initializer, ensuring you pick the
tested/released versions used in your CI or release notes and update any
deployment docs or CI that relied on "latest".

Annotations map[string]string `json:"annotations,omitempty"`

// Custom overrides for the training runtime.
// +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable"
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Note: PodSpecOverrides is now immutable.

The addition of the XValidation rule makes PodSpecOverrides immutable after creation. This is a breaking change that prevents users from modifying pod spec overrides on existing TrainJobs. Ensure this is the intended behavior and that users are properly notified through release notes or migration guides.

}

// +kubebuilder:rbac:groups=trainer.kubeflow.org,resources=clustertrainingruntimes,verbs=get;list;watch;update;patch
// +kubebuilder:rbac:groups=trainer.kubeflow.org,resources=clustertrainingruntimes,verbs=get;list;watch
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Critical: RBAC verbs don't match controller operations.

The RBAC annotation removes update and patch verbs, but the controller calls r.client.Update() on lines 85 and 88 to add or remove finalizers. Without these permissions, the controller will fail at runtime with permission denied errors.

Apply this diff to restore the required RBAC verbs:

-// +kubebuilder:rbac:groups=trainer.kubeflow.org,resources=clustertrainingruntimes,verbs=get;list;watch
+// +kubebuilder:rbac:groups=trainer.kubeflow.org,resources=clustertrainingruntimes,verbs=get;list;watch;update;patch
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// +kubebuilder:rbac:groups=trainer.kubeflow.org,resources=clustertrainingruntimes,verbs=get;list;watch
// +kubebuilder:rbac:groups=trainer.kubeflow.org,resources=clustertrainingruntimes,verbs=get;list;watch;update;patch
🤖 Prompt for AI Agents
In pkg/controller/clustertrainingruntime_controller.go around line 65 the RBAC
kubebuilder annotation only grants get;list;watch but the controller calls
r.client.Update() on lines ~85-88 to add/remove finalizers; update the
annotation to include update and patch (e.g., verbs=get;list;watch;update;patch)
so the controller has permission to update ClusterTrainingRuntime objects, then
regenerate manifests if your project uses controller-gen.

Comment on lines 199 to 211
newCommand = append(newCommand,
fmt.Sprintf("%s %s-%s-0-0.%s:%d",
constants.TorchTuneArgRdzvEndpoint,
trainJob.Name, constants.Node, trainJob.Name, constants.ContainerTrainerPort,
),
)

// 2. Get the recipe and config from old args and append them to newCommand.
recipe, config := getRecipeAndConfig(
numNodes,
numProcPerNode,
getModelFromRuntimeRef(trainJob.Spec.RuntimeRef.Name),
trainJob.Spec.Trainer.Args,
)
newCommand = append(newCommand, recipe, constants.TorchTuneArgConfig, config)
numNodes := ptr.Deref(ptr.Deref(trainerPS, runtime.PodSet{}).Count, 1)
model := getModelFromRuntimeRef(trainJob.Spec.RuntimeRef.Name)
recipe, config := getRecipeAndConfig(numNodes, numProcPerNode, model, trainJob.Spec.Trainer.Args)
newCommand = append(newCommand, recipe, fmt.Sprintf("--config %s", config))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical

Split rendezvous/config flags into separate argv entries. fmt.Sprintf("%s %s-...") and fmt.Sprintf("--config %s", config) each collapse a flag-plus-argument into one token, so torchtune never sees the expected separate --rdzv_endpoint and --config parameters. The command will fail to start rendezvous. Please append flags and values as distinct arguments.

            newCommand = append(newCommand,
-               fmt.Sprintf("%s %s-%s-0-0.%s:%d",
-                   constants.TorchTuneArgRdzvEndpoint,
-                   trainJob.Name, constants.Node, trainJob.Name, constants.ContainerTrainerPort,
-               ),
-           )
+               constants.TorchTuneArgRdzvEndpoint,
+               fmt.Sprintf("%s-%s-0-0.%s:%d",
+                   trainJob.Name, constants.Node, trainJob.Name, constants.ContainerTrainerPort,
+               ),
+           )
@@
-           newCommand = append(newCommand, recipe, fmt.Sprintf("--config %s", config))
+           newCommand = append(newCommand, recipe, "--config", config)

@sutaakar sutaakar closed this Nov 11, 2025
@sutaakar sutaakar deleted the odh-rebase branch November 11, 2025 08:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.