Conversation
* fix(docs): convert commits to list in changelog.py for compatibility Signed-off-by: kramaranya <kramaranya15@gmail.com> * chore(docs): add Changelog for Trainer v2.0.0-rc.0 Signed-off-by: kramaranya <kramaranya15@gmail.com> --------- Signed-off-by: kramaranya <kramaranya15@gmail.com>
…nShift (kubeflow#2682) Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
…#2685) * chore(runtime): Bump Torch to 2.7.1 and DeepSpeed to 0.17.1 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update cuda to 12.8 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…w#2382) * Add the manifests overlay for Kubeflow Training V2 Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com> Signed-off-by: Xinmin Du <2812493086@qq.com> * Update manifest: adjust permissions, and format changes Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com> Signed-off-by: Xinmin Du <2812493086@qq.com> * Update manifest: rename overlay, adjust event permissions Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com> Signed-off-by: Xinmin Du <2812493086@qq.com> * Update manifest: make namespace configurable Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com> Signed-off-by: Xinmin Du <2812493086@qq.com> * Update manifest: move standalone, only-manager installation in namespace: kubeflow-system Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com> Signed-off-by: Xinmin Du <2812493086@qq.com> * Update manifest: add overlay for Kubeflow Platform installation Signed-off-by: Xinmin Du <2812493086@qq.com> * add permission for pods log read & rm persistentvolumeclaims Signed-off-by: Xinmin Du <2812493086@qq.com> * create the runtimes before the webhooks Signed-off-by: Xinmin Du <2812493086@qq.com> * Specify sorting order: fifo Signed-off-by: Xinmin Du <2812493086@qq.com> * Deploy jobset first Signed-off-by: Xinmin Du <2812493086@qq.com> * remove edit permissions to runtimes; install runtimes after crds Signed-off-by: Xinmin Du <2812493086@qq.com> * remove pretraining directory Signed-off-by: Xinmin Du <2812493086@qq.com> * patch runtimes images Signed-off-by: Xinmin Du <2812493086@qq.com> * fix: correct image Signed-off-by: Xinmin Du <2812493086@qq.com> * add image patch for more runtimes Signed-off-by: Xinmin Du <2812493086@qq.com> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Du Xinmin <2812493086@qq.com> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Du Xinmin <2812493086@qq.com> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Du Xinmin <2812493086@qq.com> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Du Xinmin <2812493086@qq.com> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Du Xinmin <2812493086@qq.com> * role_bind for notebook & profile Signed-off-by: Xinmin Du <2812493086@qq.com> * fix: reorder images Signed-off-by: Xinmin Du <2812493086@qq.com> * fix: reuse overlay/manager & runtimes Signed-off-by: Xinmin Du <2812493086@qq.com> * fix: remove namespace with patch Signed-off-by: Xinmin Du <2812493086@qq.com> --------- Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com> Signed-off-by: Xinmin Du <2812493086@qq.com> Signed-off-by: Du Xinmin <2812493086@qq.com> Co-authored-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…ith CTR and TrainJob yaml files (kubeflow#2669) * chore(mainfests): include torchtune runtimes. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(manifests): Update torchtune runtimes.: Signed-off-by: Electronic-Waste <2690692950@qq.com> * chore(manifests): Update mounting path in CTRs. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(manifests): Update output_dir. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(manifests): Update numProcPerNode to auto. Signed-off-by: Electronic-Waste <2690692950@qq.com> --------- Signed-off-by: Electronic-Waste <2690692950@qq.com>
…w#2675) * fix(plugins): fix errors in trainer command mutation of torchtune. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(plugins): remove config file format suffix. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(test): update UTs. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(initializer): Update the workspace of dataset/model initializer. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(plugins): parse nproc_per_node from GPU resource. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(torchtune): Add bitsandbytes dependency in requirements.txt Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(lint): fix lint error. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(torchtune): Remove unnecessary num_proc_per_node calculation. Signed-off-by: Electronic-Waste <2690692950@qq.com> * test(torch): Update invalid parameters. Signed-off-by: Electronic-Waste <2690692950@qq.com> --------- Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
…ubeflow#2695) Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* feat: Mutable PodSpecOverrides for suspended TrainJob Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Include @tenzen-y review Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Add unit tests Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> --------- Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* feat(example): Add alpaca-trianjob-yaml.ipynb. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(example): Update the overview of the torchtune llama3_2 example. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(example): Update the pvc description. Signed-off-by: Electronic-Waste <2690692950@qq.com> * chore(example): Add the get the fine-tuned model section. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(example): Fix some errors. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(example): fix some errors. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(manifests): Fix debug tag. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(example): Change PVC creation method to Python SDK. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(example): Remove config load. Signed-off-by: Electronic-Waste <2690692950@qq.com> --------- Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* feat: Add schedulingGates to PodSpecOverrides Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Change desired job to target job in PodSpecOverrides comments Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> --------- Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* fix(module): Change Go module name to v2 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Bump x/net to v0.38.0 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* chore(docs): Add Changelog for v2.0.0-rc.1 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Move example to misc Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Add Red Hat to ADOPTERS.md Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> * Update ADOPTERS.md Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> --------- Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
…d to job (kubeflow#2719) Signed-off-by: rudeigerc <rudeigerc@gmail.com>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…ow#2731) Signed-off-by: rudeigerc <rudeigerc@gmail.com>
* chore(ci): Add GitHub action to verify PR titles Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Use operator scope Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add examples scope Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add scripts to scope Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add exporter Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * add wip ignore label Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add PR title to the contrib guide Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Ignore dependencies label Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix text Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Use action only on master branch Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…ue template (kubeflow#2732) Signed-off-by: rudeigerc <rudeigerc@gmail.com>
… jobset (kubeflow#2734) Signed-off-by: rudeigerc <rudeigerc@gmail.com>
Signed-off-by: Koray Oksay <koray.oksay@gmail.com>
* chore(docs): Add Changelog for Kubeflow Trainer v2.0.0 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add links for blog post and migration guide Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add links for blog post and website Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* feat(docs): Kubeflow Trainer ROADMAP 2025 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update roadmap Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add issue for Trainer UI Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add issues for MPI and plugin extension Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add issues for builtin trainers Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
…ubeflow#2754) Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* chore(ci): Enable Kubernetes API Linter Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * lint files Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Generate Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Review Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Fix golangci-lint target Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Fix commands max length Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Re: Fix golangci-lint target Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Enable the integers linter Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Disable maxlength linter Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * make generate Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Make fields optional Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Fix Suspended field in JobStatus Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> --------- Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* fix(api): Keep mpiImplementation field a pointer Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Fix integration test Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> --------- Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
…s for data_cache (kubeflow#2890) Signed-off-by: Akshay Chitneni <achitneni@apple.com> Co-authored-by: Akshay Chitneni <achitneni@apple.com>
kubeflow#2898) Signed-off-by: Xinmin Du <2812493086@qq.com>
…obs (kubeflow#2653) * feat(runtime): add support for launcher resource allocation in MPI jobs Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add unit tests Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Set numProcPerNode for MPI plugin Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Move util func to runtime package Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix torchtune plugin Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Inline if for GPU check Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Assign container resources once Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add todo for test wrappers Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…obs (kubeflow#2722) * feat(webhook): Add validation for required containers in replicatedJobs. Signed-off-by: Electronic-Waste <2690692950@qq.com> * test(webhook): Add UTs for validation in required containers. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(lint): fix lint error. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(webhook): add global map & remove launcher check. Signed-off-by: Electronic-Waste <2690692950@qq.com> --------- Signed-off-by: Electronic-Waste <2690692950@qq.com>
* feat(manager): add controller manager configuration and configmap support Signed-off-by: kapil27 <knema@redhat.com> * refactor: update configmap naming and leader election configuration Signed-off-by: kapil27 <knema@redhat.com> * chore: clean up unused lines in configmap and test files Signed-off-by: kapil27 <knema@redhat.com> --------- Signed-off-by: kapil27 <knema@redhat.com>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
kubeflow#2911) * feat(initializer): add s3 model and dataset initializers Signed-off-by: rudeigerc <rudeigerc@gmail.com> * chore: refactor with opendal Signed-off-by: rudeigerc <rudeigerc@gmail.com> * chore: support `role_arn` and add `ignore_patterns` field in the Initializers configs Signed-off-by: rudeigerc <rudeigerc@gmail.com> --------- Signed-off-by: rudeigerc <rudeigerc@gmail.com> Co-authored-by: rudeigerc <rudeigerc@gmail.com>
…ubeflow#2912) * chore(operator): Use SSA throughout runtime framework Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Fix lint error Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Update go.mod file Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> --------- Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> Co-authored-by: Antonin Stefanutti <antonin@stefanutti.fr>
…harts (kubeflow#2914) Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> Co-authored-by: Antonin Stefanutti <antonin@stefanutti.fr>
…branch (kubeflow#2917) * feat(manifests): Publish Trainer Helm Charts (kubeflow#2906) * Solve Remaining Error and bugs Signed-off-by: adity1raut <araut7798@gmail.com> * Solve the confige Signed-off-by: adity1raut <araut7798@gmail.com> * Update The Suggest Change Signed-off-by: adity1raut <araut7798@gmail.com> * Update After REview Signed-off-by: adity1raut <araut7798@gmail.com> * Update the Helm publish action Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update release doc Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Use 0.0.0 version for master branch Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update release doc Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: adity1raut <araut7798@gmail.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * fix(manifests): Fix Helm charts image name (kubeflow#2915) * fix(manifests): Fix Helm charts image name Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Always insert appVersion to the Chart.yaml file Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix comment Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Simplify action Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * fix(manifests): Remove the default tag from the controller image (kubeflow#2916) * fix(manifests): Remove the default tag from the controller image Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix README template Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: adity1raut <araut7798@gmail.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Aditya Raut <159172287+adity1raut@users.noreply.github.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…cache nodes (kubeflow#2920) Signed-off-by: Akshay Chitneni <achitneni@apple.com> Co-authored-by: Akshay Chitneni <achitneni@apple.com>
…#2924) * add local docker training example Signed-off-by: Brian Gallagher <briangal@gmail.com> * feat: Adding local execution example notebook Co-authored-by Brian Gallagher <bgallagh@redhat.com> Signed-off-by: Fiona Waters <fiwaters6@gmail.com> --------- Signed-off-by: Brian Gallagher <briangal@gmail.com> Signed-off-by: Fiona Waters <fiwaters6@gmail.com> Co-authored-by: Brian Gallagher <briangal@gmail.com> Co-authored-by: Fiona Waters <fiwaters6@gmail.com>
…ubeflow#2927) * fix(ci): Fix the Kubeflow SDK installation with Docker Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Uncomment delete job in local Notebooks Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update .github/workflows/test-e2e.yaml Co-authored-by: Anya Kramar <akramar@redhat.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Anya Kramar <akramar@redhat.com>
…e and example (kubeflow#2928) Signed-off-by: Akshay Chitneni <achitneni@apple.com> Co-authored-by: Akshay Chitneni <achitneni@apple.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
WalkthroughThis pull request performs a major module path migration from Changes
Sequence Diagram(s)sequenceDiagram
participant User as User/Webhook
participant Framework as Framework
participant Plugin as Validation Plugin
participant Validator as Kubernetes Validator
rect rgb(200, 220, 255)
Note over Framework,Plugin: Old Flow (with context)
User->>Framework: Validate(ctx, info, oldObj, newObj)
Framework->>Plugin: RunCustomValidationPlugins(ctx, info, oldObj, newObj)
Plugin->>Plugin: Validate(ctx, info, oldObj, newObj)
Plugin-->>Framework: warnings, errors
end
rect rgb(220, 240, 200)
Note over Framework,Plugin: New Flow (context removed)
User->>Framework: Validate(info, oldObj, newObj)
Framework->>Plugin: RunCustomValidationPlugins(info, oldObj, newObj)
Plugin->>Plugin: Validate(info, oldObj, newObj)
Plugin-->>Framework: warnings, errors
end
rect rgb(240, 220, 200)
Note over User,Validator: CRD Validation Enhancement
Validator->>Validator: Check dependsOn on replicatedJobs[0]
alt First job has dependsOn
Validator-->>User: ✗ Reject (DependsOn can't be set for first ReplicatedJob)
else Only subsequent jobs have dependsOn
Validator-->>User: ✓ Accept
end
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Areas requiring extra attention:
Poem
Pre-merge checks and finishing touches❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
.golangci.yaml (1)
1-18: Missing PR description: Cannot properly review this migration without context.This PR appears to perform a significant module path migration (based on the summary and the new
prefix(github.com/kubeflow/trainer)section), yet the PR description is completely empty—only a template stub remains. Without a clear explanation of what this PR does and why, reviewers cannot assess:
- The scope and scale of the migration
- The rationale for reverting commits (as the title suggests)
- Breaking changes or migration impacts
- Whether all dependent code has been updated consistently
This is a blocker for review and merge.
Please fill in the PR description with:
- What this PR does (module path migration details)
- Why it's needed
- Any breaking changes
- Migration guidance for users/developers
- Issues being fixed (reference the tracking issue)
pkg/controller/trainjob_controller.go (1)
94-144: Restore events RBAC to keep EventRecorder working.Dropping the kubebuilder RBAC rule for core/v1 events leaves the controller service account without permission to create/update/patch events, so every
r.recorder.Event(...)call will start failing withForbiddenonce deployed. Please add the events rule back.// +kubebuilder:rbac:groups=trainer.kubeflow.org,resources=trainjobs/status,verbs=get;update;patch // +kubebuilder:rbac:groups=trainer.kubeflow.org,resources=trainjobs/finalizers,verbs=get;update;patch +// +kubebuilder:rbac:groups="",resources=events,verbs=create;watch;update;patchcharts/kubeflow-trainer/crds/trainer.kubeflow.org_trainjobs.yaml (1)
1094-1099: Fix unclear description for nodeSelector override.Line 1098 description states: "place Pod on the specific mode" which is semantically unclear. This should be "place Pod on specific nodes" to accurately describe the nodeSelector purpose.
- description: Override for the node selector to place Pod on - the specific mode. + description: Override for the node selector to place Pod on specific nodes.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
go.sumis excluded by!**/*.sum
📒 Files selected for processing (107)
.github/workflows/build-and-push-images.yaml(1 hunks).github/workflows/test-e2e.yaml(1 hunks).gitignore(1 hunks).golangci.yaml(1 hunks)CHANGELOG.md(0 hunks)Makefile(0 hunks)api/openapi-spec/swagger.json(2 hunks)api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_pod_spec_override.py(1 hunks)api/python_api/pyproject.toml(1 hunks)charts/kubeflow-trainer/Chart.yaml(1 hunks)charts/kubeflow-trainer/crds/trainer.kubeflow.org_clustertrainingruntimes.yaml(1 hunks)charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainingruntimes.yaml(1 hunks)charts/kubeflow-trainer/crds/trainer.kubeflow.org_trainjobs.yaml(11 hunks)charts/kubeflow-trainer/templates/rbac/clusterrole.yaml(1 hunks)cmd/initializers/dataset/Dockerfile(1 hunks)cmd/initializers/model/Dockerfile(1 hunks)cmd/runtimes/deepspeed/Dockerfile(1 hunks)cmd/runtimes/deepspeed/requirements.txt(1 hunks)cmd/trainer-controller-manager/main.go(1 hunks)cmd/trainers/torchtune/Dockerfile(1 hunks)cmd/trainers/torchtune/requirements.txt(0 hunks)docs/release/changelog.py(1 hunks)examples/pytorch/image-classification/mnist.ipynb(4 hunks)examples/torchtune/llama3_2/alpaca-trainjob-yaml.ipynb(0 hunks)go.mod(2 hunks)hack/e2e-setup-cluster.sh(2 hunks)hack/swagger/main.go(2 hunks)hack/update-codegen.sh(1 hunks)manifests/base/crds/trainer.kubeflow.org_clustertrainingruntimes.yaml(1 hunks)manifests/base/crds/trainer.kubeflow.org_trainingruntimes.yaml(1 hunks)manifests/base/crds/trainer.kubeflow.org_trainjobs.yaml(3 hunks)manifests/base/rbac/role.yaml(1 hunks)manifests/base/runtimes/kustomization.yaml(0 hunks)manifests/base/runtimes/torch_distributed.yaml(1 hunks)manifests/base/runtimes/torchtune/llama3_2/llama3_2_1B.yaml(4 hunks)manifests/base/runtimes/torchtune/llama3_2/llama3_2_3B.yaml(4 hunks)manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml(0 hunks)manifests/overlays/kubeflow-platform/kustomization.yaml(0 hunks)manifests/overlays/kubeflow-platform/patches/remove-namespace.yaml(0 hunks)manifests/overlays/kubeflow-platform/role_binding.yaml(0 hunks)manifests/overlays/manager/kustomization.yaml(1 hunks)manifests/overlays/runtimes/kustomization.yaml(1 hunks)manifests/third-party/jobset/kustomization.yaml(1 hunks)pkg/apis/trainer/v1alpha1/trainjob_types.go(2 hunks)pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go(0 hunks)pkg/client/applyconfiguration/trainer/v1alpha1/mpimlpolicysource.go(1 hunks)pkg/client/applyconfiguration/trainer/v1alpha1/podspecoverride.go(0 hunks)pkg/client/applyconfiguration/utils.go(1 hunks)pkg/client/clientset/versioned/clientset.go(1 hunks)pkg/client/clientset/versioned/fake/clientset_generated.go(1 hunks)pkg/client/clientset/versioned/fake/register.go(1 hunks)pkg/client/clientset/versioned/scheme/register.go(1 hunks)pkg/client/clientset/versioned/typed/trainer/v1alpha1/clustertrainingruntime.go(1 hunks)pkg/client/clientset/versioned/typed/trainer/v1alpha1/fake/fake_clustertrainingruntime.go(1 hunks)pkg/client/clientset/versioned/typed/trainer/v1alpha1/fake/fake_trainer_client.go(1 hunks)pkg/client/clientset/versioned/typed/trainer/v1alpha1/fake/fake_trainingruntime.go(1 hunks)pkg/client/clientset/versioned/typed/trainer/v1alpha1/fake/fake_trainjob.go(1 hunks)pkg/client/clientset/versioned/typed/trainer/v1alpha1/trainer_client.go(1 hunks)pkg/client/clientset/versioned/typed/trainer/v1alpha1/trainingruntime.go(1 hunks)pkg/client/clientset/versioned/typed/trainer/v1alpha1/trainjob.go(1 hunks)pkg/client/informers/externalversions/factory.go(1 hunks)pkg/client/informers/externalversions/generic.go(1 hunks)pkg/client/informers/externalversions/internalinterfaces/factory_interfaces.go(1 hunks)pkg/client/informers/externalversions/trainer/interface.go(1 hunks)pkg/client/informers/externalversions/trainer/v1alpha1/clustertrainingruntime.go(1 hunks)pkg/client/informers/externalversions/trainer/v1alpha1/interface.go(1 hunks)pkg/client/informers/externalversions/trainer/v1alpha1/trainingruntime.go(1 hunks)pkg/client/informers/externalversions/trainer/v1alpha1/trainjob.go(1 hunks)pkg/client/listers/trainer/v1alpha1/clustertrainingruntime.go(1 hunks)pkg/client/listers/trainer/v1alpha1/trainingruntime.go(1 hunks)pkg/client/listers/trainer/v1alpha1/trainjob.go(1 hunks)pkg/constants/constants.go(1 hunks)pkg/controller/clustertrainingruntime_controller.go(2 hunks)pkg/controller/clustertrainingruntime_controller_test.go(1 hunks)pkg/controller/setup.go(1 hunks)pkg/controller/trainingruntime_controller.go(2 hunks)pkg/controller/trainingruntime_controller_test.go(1 hunks)pkg/controller/trainjob_controller.go(1 hunks)pkg/runtime/core/clustertrainingruntime.go(2 hunks)pkg/runtime/core/clustertrainingruntime_test.go(1 hunks)pkg/runtime/core/core.go(1 hunks)pkg/runtime/core/registry.go(1 hunks)pkg/runtime/core/trainingruntime.go(2 hunks)pkg/runtime/core/trainingruntime_test.go(6 hunks)pkg/runtime/framework/core/framework.go(2 hunks)pkg/runtime/framework/core/framework_test.go(2 hunks)pkg/runtime/framework/interface.go(2 hunks)pkg/runtime/framework/plugins/coscheduling/coscheduling.go(1 hunks)pkg/runtime/framework/plugins/coscheduling/indexer.go(1 hunks)pkg/runtime/framework/plugins/coscheduling/indexer_test.go(1 hunks)pkg/runtime/framework/plugins/jobset/builder.go(1 hunks)pkg/runtime/framework/plugins/jobset/jobset.go(2 hunks)pkg/runtime/framework/plugins/jobset/jobset_test.go(2 hunks)pkg/runtime/framework/plugins/mpi/mpi.go(2 hunks)pkg/runtime/framework/plugins/mpi/mpi_test.go(2 hunks)pkg/runtime/framework/plugins/plainml/plainml.go(1 hunks)pkg/runtime/framework/plugins/plainml/plainml_test.go(1 hunks)pkg/runtime/framework/plugins/registry.go(1 hunks)pkg/runtime/framework/plugins/torch/torch.go(4 hunks)pkg/runtime/framework/plugins/torch/torch_test.go(11 hunks)pkg/runtime/indexer/indexer.go(1 hunks)pkg/runtime/indexer/indexer_test.go(1 hunks)pkg/runtime/interface.go(1 hunks)pkg/runtime/runtime.go(1 hunks)pkg/runtime/runtime_test.go(1 hunks)pkg/util/testing/client.go(1 hunks)pkg/util/testing/compare.go(1 hunks)
⛔ Files not processed due to max files limit (20)
- pkg/util/testing/wrapper.go
- pkg/util/trainjob/trainjob.go
- pkg/util/trainjob/trainjob_test.go
- pkg/webhooks/clustertrainingruntime_webhook.go
- pkg/webhooks/setup.go
- pkg/webhooks/trainingruntime_webhook.go
- pkg/webhooks/trainingruntime_webhook_test.go
- pkg/webhooks/trainjob_webhook.go
- pkg/webhooks/trainjob_webhook_test.go
- test/e2e/e2e_test.go
- test/e2e/suite_test.go
- test/integration/controller/clustertrainingruntime_controller_test.go
- test/integration/controller/suite_test.go
- test/integration/controller/trainingruntime_controller_test.go
- test/integration/controller/trainjob_controller_test.go
- test/integration/framework/framework.go
- test/integration/webhooks/clustertrainingruntime_webhook_test.go
- test/integration/webhooks/suite_test.go
- test/integration/webhooks/trainingruntime_webhook_test.go
- test/integration/webhooks/trainjob_test.go
💤 Files with no reviewable changes (11)
- pkg/apis/trainer/v1alpha1/zz_generated.deepcopy.go
- Makefile
- pkg/client/applyconfiguration/trainer/v1alpha1/podspecoverride.go
- manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml
- examples/torchtune/llama3_2/alpaca-trainjob-yaml.ipynb
- cmd/trainers/torchtune/requirements.txt
- manifests/overlays/kubeflow-platform/kustomization.yaml
- manifests/base/runtimes/kustomization.yaml
- CHANGELOG.md
- manifests/overlays/kubeflow-platform/patches/remove-namespace.yaml
- manifests/overlays/kubeflow-platform/role_binding.yaml
🧰 Additional context used
🧬 Code graph analysis (10)
pkg/runtime/core/trainingruntime_test.go (1)
pkg/constants/constants.go (3)
Node(44-44)TorchTuneArgRdzvEndpoint(129-129)TorchTuneFullFinetuneDistributed(138-138)
pkg/runtime/framework/plugins/torch/torch_test.go (3)
pkg/runtime/runtime.go (1)
WithPodSet(138-153)pkg/constants/constants.go (2)
Node(44-44)AncestorTrainer(41-41)pkg/runtime/framework/interface.go (1)
CustomValidationPlugin(34-37)
pkg/runtime/framework/plugins/mpi/mpi_test.go (1)
pkg/runtime/framework/interface.go (1)
CustomValidationPlugin(34-37)
pkg/runtime/framework/core/framework.go (2)
pkg/runtime/runtime.go (1)
Info(38-49)pkg/apis/trainer/v1alpha1/trainjob_types.go (1)
TrainJob(39-50)
pkg/runtime/framework/plugins/mpi/mpi.go (2)
pkg/runtime/runtime.go (1)
Info(38-49)pkg/apis/trainer/v1alpha1/trainjob_types.go (1)
TrainJob(39-50)
pkg/runtime/framework/interface.go (2)
pkg/runtime/runtime.go (1)
Info(38-49)pkg/apis/trainer/v1alpha1/trainjob_types.go (1)
TrainJob(39-50)
pkg/runtime/framework/plugins/torch/torch.go (3)
pkg/runtime/runtime.go (2)
Info(38-49)PodSet(64-79)pkg/apis/trainer/v1alpha1/trainjob_types.go (3)
TrainJob(39-50)RuntimeRef(137-152)Trainer(203-232)pkg/constants/constants.go (3)
TorchTuneArgRdzvEndpoint(129-129)Node(44-44)ContainerTrainerPort(47-47)
api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_pod_spec_override.py (1)
api/python_api/kubeflow_trainer_api/models/trainer_v1alpha1_container_override.py (1)
TrainerV1alpha1ContainerOverride(27-105)
pkg/runtime/framework/plugins/jobset/jobset_test.go (3)
pkg/util/testing/client.go (1)
NewClientBuilder(34-45)pkg/runtime/framework/plugins/jobset/jobset.go (1)
New(71-78)pkg/runtime/framework/interface.go (1)
CustomValidationPlugin(34-37)
pkg/runtime/framework/plugins/jobset/jobset.go (2)
pkg/runtime/runtime.go (1)
Info(38-49)pkg/apis/trainer/v1alpha1/trainjob_types.go (1)
TrainJob(39-50)
🪛 Checkov (3.2.334)
manifests/base/rbac/role.yaml
[high] 2-79: Minimize ClusterRoles that grant control over validating or mutating admission webhook configurations
(CKV_K8S_155)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: Test
- GitHub Check: Generate
| "Intended Audience :: Developers", | ||
| "Intended Audience :: Education", | ||
| "Intended Audience :: Science/Research", | ||
| "Programming Language :: Python :: 3.8", |
There was a problem hiding this comment.
🧩 Analysis chain
Fix Python version support inconsistency.
Python 3.8 has been added to classifiers (line 19), but the AI summary indicates that requires-python = ">=3.9" was removed. This creates a critical inconsistency:
- The classifiers declare support for Python 3.8, but without a
requires-pythonconstraint, users can install the package on unsupported versions. - The dependency
pydantic>=2.10.0must be verified to support Python 3.8; if it doesn't, the package will fail at runtime on Python 3.8.
Either:
- Restore
requires-python = ">=3.9"and remove Python 3.8 from classifiers, or - Add
requires-python = ">=3.8"and verify all dependencies (especially pydantic) support Python 3.8.
Also applies to: 31-31
🌐 Web query:
Does pydantic version 2.10.0 or later support Python 3.8?
💡 Result:
Short answer: Yes for 2.10.x, no for later majors — Pydantic 2.10.x supports Python 3.8, but Python 3.8 support was dropped starting in v2.11 (released Mar 27, 2025). [1][2]
Sources:
- Pydantic changelog (v2.10.2 notes referencing Python 3.8 compatibility). [1]
- Pydantic v2.11 release announcement (notes that Python 3.8 support was dropped). [2]
🏁 Script executed:
#!/bin/bash
# Extract relevant sections from pyproject.toml
cat api/python_api/pyproject.toml | grep -A 20 "requires-python"
echo "---"
cat api/python_api/pyproject.toml | grep -A 5 "pydantic"
echo "---"
cat api/python_api/pyproject.toml | grep -B 5 -A 5 "3.8"Length of output: 698
Fix Python version support inconsistency.
Python 3.8 is listed in classifiers but the configuration has critical issues:
- No
requires-pythonconstraint exists, allowing installation on Python 3.8 pydantic>=2.10.0permits v2.11+ installation, which dropped Python 3.8 support (Mar 27, 2025)- Result: Python 3.8 users can install the package but will face runtime failures with pydantic 2.11+
Fix by either:
- Restore
requires-python = ">=3.9"and remove Python 3.8 from classifiers, or - Add
requires-python = ">=3.8"and pin pydantic topydantic>=2.10.0,<2.11
Also applies to: 31-31
🤖 Prompt for AI Agents
In api/python_api/pyproject.toml around lines 19 and 31, the package classifiers
and dependency bounds are inconsistent with pydantic's dropping of Python 3.8
support; add a requires-python constraint and fix pydantic pin to prevent 3.8
users from getting incompatible versions: either set requires-python = ">=3.9"
and remove the "Programming Language :: Python :: 3.8" classifier, or set
requires-python = ">=3.8" and tighten the dependency to pydantic>=2.10.0,<2.11
(apply the same change where the pydantic dependency appears at line 31).
| - message: DependsOn can't be set for the first ReplicatedJob | ||
| rule: '!(has(self.replicatedJobs[0].dependsOn))' |
There was a problem hiding this comment.
PR lacks explanation and user-facing documentation updates.
This validation rule change affects the allowed configurations for ClusterTrainingRuntime. The PR description is empty, and the documentation checklist item remains unchecked. For a user-facing schema change that restricts valid configurations, both the PR description and user documentation should be updated to explain the rationale and impact.
Provide:
- A clear explanation in the PR description of why this restriction is being added
- Updates to user-facing documentation explaining the breaking change (if applicable)
| - secrets | ||
| verbs: | ||
| - get | ||
| - list | ||
| - watch | ||
| - create | ||
| - patch | ||
| - update | ||
| - watch | ||
| - patch |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major
🧩 Analysis chain
Security review needed: Secrets access added.
The RBAC rules now grant get, list, watch, create, update, and patch permissions on secrets in the core API group. This is a security-sensitive permission expansion.
Please provide justification for why the trainer controller needs to manage secrets. If this is for webhook certificates or similar functionality, consider:
- Limiting to a specific namespace if possible
- Using more restrictive verbs (e.g., only
getif writing isn't required) - Documenting the security implications
Run the following script to check how secrets are used in the controller code:
🏁 Script executed:
#!/bin/bash
# Description: Search for Secret usage in controller code
# Find Secret creation/updates in controller
ast-grep --pattern 'Secret{$$$}'
rg -n "Kind.*Secret|\.Secrets\(\)" pkg/controller/ pkg/runtime/ cmd/Length of output: 0
🏁 Script executed:
#!/bin/bash
# Broader search for secret-related code (case-insensitive)
# Search for lowercase references, webhook cert handling, and client patterns
echo "=== Searching for lowercase 'secret' references ==="
rg -in "secret" pkg/controller/ pkg/runtime/ cmd/ | head -30
echo -e "\n=== Searching for webhook/certificate related code ==="
rg -in "webhook|certificate|cert|ca\b" pkg/controller/ pkg/runtime/ cmd/ | head -20
echo -e "\n=== Searching for kubernetes client secret patterns ==="
rg -in "\.secrets\(|secretclient|secretinterface" pkg/controller/ pkg/runtime/ cmd/
echo -e "\n=== Looking for RBAC or permission justification docs ==="
find . -type f \( -name "*.md" -o -name "*.txt" -o -name "CHANGELOG*" -o -name "*RBAC*" \) 2>/dev/null | xargs grep -l "secret\|permission" 2>/dev/null | head -10
echo -e "\n=== Checking clusterrole.yaml for context around secrets rule ==="
head -100 charts/kubeflow-trainer/templates/rbac/clusterrole.yamlLength of output: 7003
🏁 Script executed:
#!/bin/bash
# Verify how cert.ManageCerts uses the Secret (does it write?)
echo "=== Checking cert.ManageCerts implementation ==="
find . -name "cert.go" -o -name "cert*.go" | xargs grep -l "ManageCerts" 2>/dev/null
echo -e "\n=== Looking at cert package implementation ==="
fd "cert" pkg/util/ -type f | head -5
cat pkg/util/cert/cert.go 2>/dev/null | head -80
echo -e "\n=== Checking if ManageCerts performs write operations ==="
rg -A 20 "func ManageCerts" pkg/util/cert/Length of output: 3866
Refine Secrets RBAC to match code requirements.
The Secrets permission is justified for webhook certificate management (via pkg/util/cert/cert.go), but the ClusterRole is overly permissive. The kubebuilder:rbac annotation specifies only get;list;watch;update, yet the ClusterRole grants get, list, watch, create, update, patch.
Reduce the Secrets rule to match the code's actual requirements:
- apiGroups:
- ""
resources:
- secrets
verbs:
- get
- list
- watch
- updateThe create and patch verbs are not used by the cert-controller library and should be removed.
🤖 Prompt for AI Agents
In charts/kubeflow-trainer/templates/rbac/clusterrole.yaml around lines 38 to
45, the ClusterRole grants secrets verbs "create" and "patch" in addition to
"get,list,watch,update" which is more permissive than the kubebuilder annotation
and the cert controller uses; remove the create and patch verbs so the secrets
rule only contains get, list, watch, and update to match the code requirements
and kubebuilder:rbac annotation.
| github_repo = Github(args.token).get_repo(REPO_NAME) | ||
| comparison = github_repo.compare(previous_release, current_release) | ||
| commits = list(comparison.commits) | ||
| commits = comparison.commits |
There was a problem hiding this comment.
🧩 Analysis chain
Verify that comparison.commits supports indexing and reversing.
The removal of the explicit list() conversion may cause runtime errors. The code accesses commits[-1] (Line 29) and uses reversed(commits) (Line 35), which require the object to support negative indexing and reversal. PyGithub's Comparison.commits returns a PaginatedList, which may not support these operations without explicit conversion to a list.
Run the following script to verify the type and capabilities of comparison.commits:
Alternatively, search the web for PyGithub PaginatedList documentation:
🏁 Script executed:
#!/bin/bash
# Description: Check PyGithub documentation for PaginatedList capabilities
# Search for PaginatedList implementation to verify indexing support
rg -n "class PaginatedList" --type py -A 20Length of output: 0
🌐 Web query:
Does PyGithub PaginatedList support negative indexing and reversed() function?
💡 Result:
Short answer:
- Negative indexing: No — PaginatedList does not support Python-style negative indexes.
- reversed(): Not via the built-in reversed() — use the PaginatedList.reversed attribute to iterate in reverse.
Sources: PyGithub docs showing indexing/slicing and the .reversed attribute, and changelog noting reversed iteration support. [1][2][3]
References
[1] PyGithub utilities (PaginatedList) — docs showing indexing/slicing and .reversed. (pygithub.readthedocs.io)
[2] Older PyGithub utilities docs (same behavior). (pygithub.readthedocs.io)
[3] Changelog noting addition of reversed iteration. (pygithub.readthedocs.io)
Citations:
- 1: https://pygithub.readthedocs.io/en/v2.5.0/utilities.html?utm_source=openai
- 2: https://pygithub.readthedocs.io/en/v2.3.0/utilities.html?utm_source=openai
- 3: https://pygithub.readthedocs.io/en/v2.8.1/changes.html?utm_source=openai
🏁 Script executed:
#!/bin/bash
# Verify the actual operations at lines 29 and 35 in the file
sed -n '25,40p' docs/release/changelog.pyLength of output: 589
Restore list() conversion — code will crash on negative indexing and reversed().
PyGithub's PaginatedList does not support negative indexing or the built-in reversed() function. The code accesses commits[-1] at line 29 and calls reversed(commits) at line 35. Removing the list() conversion will cause runtime errors.
Restore line 26 to: commits = list(comparison.commits)
🤖 Prompt for AI Agents
In docs/release/changelog.py around line 26, the assignment removed the list
conversion so commits = comparison.commits returns a PyGithub PaginatedList
which does not support negative indexing or reversed(); restore the conversion
by making commits a concrete list (i.e., reintroduce list(...) around
comparison.commits) so subsequent uses like commits[-1] and reversed(commits)
work without runtime errors.
| "!pip install torch==2.5.0\n", | ||
| "!pip install torchvision==0.20.0" |
There was a problem hiding this comment.
🧩 Analysis chain
Verify PyTorch version compatibility and security.
The pinned versions appear reasonable, but please confirm that torch 2.5.0 and torchvision 0.20.0 are compatible, currently supported, and free from known vulnerabilities.
🌐 Web query:
PyTorch 2.5.0 and torchvision 0.20.0 compatibility
💡 Result:
Yes — torchvision 0.20.0 is the supported companion for PyTorch 2.5.0. The official compatibility/install docs show torch==2.5.0 with torchvision==0.20.0 (conda and pip install commands are provided). [1][2]
References:
- PyTorch previous versions / install instructions (lists torch==2.5.0 + torchvision==0.20.0). [1]
- PyTorch versions compatibility matrix (shows 2.5.0 ↔ 0.20.0). [2]
[1] https://pytorch.org/get-started/previous-versions/
[2] https://github.com/pytorch/pytorch/wiki/PyTorch-Versions
🌐 Web query:
PyTorch 2.5.0 security vulnerabilities CVE
💡 Result:
Short answer: Yes — PyTorch 2.5.0 is affected by a critical RCE CVE: CVE-2025-32434 (torch.load with weights_only=True). It affects versions before 2.6.0 (i.e., 2.5.1 and earlier, including 2.5.0); the fix was released in 2.6.0. [1][2]
References
- NVD entry for CVE-2025-32434. [1]
- PyTorch security advisory / GitHub advisory (GHSA-53q9-r3pm-6pq6) describing the issue and the 2.6.0 patch. [2]
Would you like upgrade/mitigation steps or a short detection checklist?
Upgrade PyTorch to patch critical security vulnerability.
While torch 2.5.0 and torchvision 0.20.0 are verified as officially compatible, PyTorch 2.5.0 is affected by a critical RCE CVE (CVE-2025-32434) in torch.load with weights_only=True, fixed in version 2.6.0. Upgrade to torch 2.6.0+ (with torchvision 0.21.0+) to patch this vulnerability.
🤖 Prompt for AI Agents
In examples/pytorch/image-classification/mnist.ipynb around lines 51-52, the
notebook pins torch==2.5.0 and torchvision==0.20.0 which is vulnerable
(CVE-2025-32434); update the pip install lines to install torch>=2.6.0 and
torchvision>=0.21.0 (e.g., torch==2.6.0 and torchvision==0.21.0 or use >= pins),
run a quick compatibility test (import torch, torchvision, and load a model) to
confirm nothing breaks, and update any related docs/requirements to reflect the
new minimum versions.
| - name: ghcr.io/kubeflow/trainer/torchtune-trainer | ||
| newTag: v2.0.0 | ||
| newTag: latest | ||
| - name: ghcr.io/kubeflow/trainer/dataset-initializer | ||
| newTag: v2.0.0 | ||
| newTag: latest | ||
| - name: ghcr.io/kubeflow/trainer/model-initializer | ||
| newTag: v2.0.0 | ||
| - name: ghcr.io/kubeflow/trainer/mlx-runtime | ||
| newTag: v2.0.0 | ||
| - name: ghcr.io/kubeflow/trainer/deepspeed-runtime | ||
| newTag: v2.0.0 | ||
| newTag: latest |
There was a problem hiding this comment.
Reconsider using latest image tags for production trainer runtimes.
Same concern as the controller manager: using latest without version pinning introduces reproducibility and security risks for production deployments. Replace with explicit semantic versions.
🤖 Prompt for AI Agents
In manifests/overlays/runtimes/kustomization.yaml around lines 8 to 13, the
image patches use the floating tag "latest" for trainer/runtime images which is
unsafe for production; replace each "newTag: latest" with explicit semantic
version tags (e.g., vX.Y.Z) for ghcr.io/kubeflow/trainer/torchtune-trainer,
ghcr.io/kubeflow/trainer/dataset-initializer, and
ghcr.io/kubeflow/trainer/model-initializer, ensuring you pick the
tested/released versions used in your CI or release notes and update any
deployment docs or CI that relied on "latest".
| Annotations map[string]string `json:"annotations,omitempty"` | ||
|
|
||
| // Custom overrides for the training runtime. | ||
| // +kubebuilder:validation:XValidation:rule="self == oldSelf",message="Value is immutable" |
There was a problem hiding this comment.
Note: PodSpecOverrides is now immutable.
The addition of the XValidation rule makes PodSpecOverrides immutable after creation. This is a breaking change that prevents users from modifying pod spec overrides on existing TrainJobs. Ensure this is the intended behavior and that users are properly notified through release notes or migration guides.
| } | ||
|
|
||
| // +kubebuilder:rbac:groups=trainer.kubeflow.org,resources=clustertrainingruntimes,verbs=get;list;watch;update;patch | ||
| // +kubebuilder:rbac:groups=trainer.kubeflow.org,resources=clustertrainingruntimes,verbs=get;list;watch |
There was a problem hiding this comment.
Critical: RBAC verbs don't match controller operations.
The RBAC annotation removes update and patch verbs, but the controller calls r.client.Update() on lines 85 and 88 to add or remove finalizers. Without these permissions, the controller will fail at runtime with permission denied errors.
Apply this diff to restore the required RBAC verbs:
-// +kubebuilder:rbac:groups=trainer.kubeflow.org,resources=clustertrainingruntimes,verbs=get;list;watch
+// +kubebuilder:rbac:groups=trainer.kubeflow.org,resources=clustertrainingruntimes,verbs=get;list;watch;update;patch📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| // +kubebuilder:rbac:groups=trainer.kubeflow.org,resources=clustertrainingruntimes,verbs=get;list;watch | |
| // +kubebuilder:rbac:groups=trainer.kubeflow.org,resources=clustertrainingruntimes,verbs=get;list;watch;update;patch |
🤖 Prompt for AI Agents
In pkg/controller/clustertrainingruntime_controller.go around line 65 the RBAC
kubebuilder annotation only grants get;list;watch but the controller calls
r.client.Update() on lines ~85-88 to add/remove finalizers; update the
annotation to include update and patch (e.g., verbs=get;list;watch;update;patch)
so the controller has permission to update ClusterTrainingRuntime objects, then
regenerate manifests if your project uses controller-gen.
| newCommand = append(newCommand, | ||
| fmt.Sprintf("%s %s-%s-0-0.%s:%d", | ||
| constants.TorchTuneArgRdzvEndpoint, | ||
| trainJob.Name, constants.Node, trainJob.Name, constants.ContainerTrainerPort, | ||
| ), | ||
| ) | ||
|
|
||
| // 2. Get the recipe and config from old args and append them to newCommand. | ||
| recipe, config := getRecipeAndConfig( | ||
| numNodes, | ||
| numProcPerNode, | ||
| getModelFromRuntimeRef(trainJob.Spec.RuntimeRef.Name), | ||
| trainJob.Spec.Trainer.Args, | ||
| ) | ||
| newCommand = append(newCommand, recipe, constants.TorchTuneArgConfig, config) | ||
| numNodes := ptr.Deref(ptr.Deref(trainerPS, runtime.PodSet{}).Count, 1) | ||
| model := getModelFromRuntimeRef(trainJob.Spec.RuntimeRef.Name) | ||
| recipe, config := getRecipeAndConfig(numNodes, numProcPerNode, model, trainJob.Spec.Trainer.Args) | ||
| newCommand = append(newCommand, recipe, fmt.Sprintf("--config %s", config)) | ||
|
|
There was a problem hiding this comment.
Split rendezvous/config flags into separate argv entries. fmt.Sprintf("%s %s-...") and fmt.Sprintf("--config %s", config) each collapse a flag-plus-argument into one token, so torchtune never sees the expected separate --rdzv_endpoint and --config parameters. The command will fail to start rendezvous. Please append flags and values as distinct arguments.
newCommand = append(newCommand,
- fmt.Sprintf("%s %s-%s-0-0.%s:%d",
- constants.TorchTuneArgRdzvEndpoint,
- trainJob.Name, constants.Node, trainJob.Name, constants.ContainerTrainerPort,
- ),
- )
+ constants.TorchTuneArgRdzvEndpoint,
+ fmt.Sprintf("%s-%s-0-0.%s:%d",
+ trainJob.Name, constants.Node, trainJob.Name, constants.ContainerTrainerPort,
+ ),
+ )
@@
- newCommand = append(newCommand, recipe, fmt.Sprintf("--config %s", config))
+ newCommand = append(newCommand, recipe, "--config", config)
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...format, will close the issue(s) when PR gets merged):Fixes #
Checklist:
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Breaking Changes
Chores