[WIP] feat: Implement controller support for polling and tracking training job progression #17
Closed
abhijeet-dhumal wants to merge 122 commits into opendatahub-io:main from
* fix(docs): convert commits to list in changelog.py for compatibility Signed-off-by: kramaranya <kramaranya15@gmail.com> * chore(docs): add Changelog for Trainer v2.0.0-rc.0 Signed-off-by: kramaranya <kramaranya15@gmail.com> --------- Signed-off-by: kramaranya <kramaranya15@gmail.com>
…nShift (kubeflow#2682) Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
…#2685) * chore(runtime): Bump Torch to 2.7.1 and DeepSpeed to 0.17.1 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update cuda to 12.8 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…w#2382) * Add the manifests overlay for Kubeflow Training V2 Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com> Signed-off-by: Xinmin Du <2812493086@qq.com> * Update manifest: adjust permissions, and format changes Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com> Signed-off-by: Xinmin Du <2812493086@qq.com> * Update manifest: rename overlay, adjust event permissions Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com> Signed-off-by: Xinmin Du <2812493086@qq.com> * Update manifest: make namespace configurable Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com> Signed-off-by: Xinmin Du <2812493086@qq.com> * Update manifest: move standalone, only-manager installation in namespace: kubeflow-system Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com> Signed-off-by: Xinmin Du <2812493086@qq.com> * Update manifest: add overlay for Kubeflow Platform installation Signed-off-by: Xinmin Du <2812493086@qq.com> * add permission for pods log read & rm persistentvolumeclaims Signed-off-by: Xinmin Du <2812493086@qq.com> * create the runtimes before the webhooks Signed-off-by: Xinmin Du <2812493086@qq.com> * Specify sorting order: fifo Signed-off-by: Xinmin Du <2812493086@qq.com> * Deploy jobset first Signed-off-by: Xinmin Du <2812493086@qq.com> * remove edit permissions to runtimes; install runtimes after crds Signed-off-by: Xinmin Du <2812493086@qq.com> * remove pretraining directory Signed-off-by: Xinmin Du <2812493086@qq.com> * patch runtimes images Signed-off-by: Xinmin Du <2812493086@qq.com> * fix: correct image Signed-off-by: Xinmin Du <2812493086@qq.com> * add image patch for more runtimes Signed-off-by: Xinmin Du <2812493086@qq.com> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Du Xinmin <2812493086@qq.com> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml 
Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Du Xinmin <2812493086@qq.com> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Du Xinmin <2812493086@qq.com> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Du Xinmin <2812493086@qq.com> * Update manifests/overlays/kubeflow-platform/kubeflow-trainer-roles.yaml Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Signed-off-by: Du Xinmin <2812493086@qq.com> * role_bind for notebook & profile Signed-off-by: Xinmin Du <2812493086@qq.com> * fix: reorder images Signed-off-by: Xinmin Du <2812493086@qq.com> * fix: reuse overlay/manager & runtimes Signed-off-by: Xinmin Du <2812493086@qq.com> * fix: remove namespace with patch Signed-off-by: Xinmin Du <2812493086@qq.com> --------- Signed-off-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com> Signed-off-by: Xinmin Du <2812493086@qq.com> Signed-off-by: Du Xinmin <2812493086@qq.com> Co-authored-by: Xinmin Du <10803082+doris-xm@user.noreply.gitee.com> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…ith CTR and TrainJob yaml files (kubeflow#2669) * chore(mainfests): include torchtune runtimes. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(manifests): Update torchtune runtimes.: Signed-off-by: Electronic-Waste <2690692950@qq.com> * chore(manifests): Update mounting path in CTRs. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(manifests): Update output_dir. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(manifests): Update numProcPerNode to auto. Signed-off-by: Electronic-Waste <2690692950@qq.com> --------- Signed-off-by: Electronic-Waste <2690692950@qq.com>
…w#2675) * fix(plugins): fix errors in trainer command mutation of torchtune. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(plugins): remove config file format suffix. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(test): update UTs. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(initializer): Update the workspace of dataset/model initializer. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(plugins): parse nproc_per_node from GPU resource. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(torchtune): Add bitsandbytes dependency in requirements.txt Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(lint): fix lint error. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(torchtune): Remove unnecessary num_proc_per_node calculation. Signed-off-by: Electronic-Waste <2690692950@qq.com> * test(torch): Update invalid parameters. Signed-off-by: Electronic-Waste <2690692950@qq.com> --------- Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
…ubeflow#2695) Signed-off-by: Yuki Iwai <yuki.iwai.tz@gmail.com>
* feat: Mutable PodSpecOverrides for suspended TrainJob Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Include @tenzen-y review Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Add unit tests Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> --------- Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* feat(example): Add alpaca-trianjob-yaml.ipynb. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(example): Update the overview of the torchtune llama3_2 example. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(example): Update the pvc description. Signed-off-by: Electronic-Waste <2690692950@qq.com> * chore(example): Add the get the fine-tuned model section. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(example): Fix some errors. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(example): fix some errors. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(manifests): Fix debug tag. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(example): Change PVC creation method to Python SDK. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(example): Remove config load. Signed-off-by: Electronic-Waste <2690692950@qq.com> --------- Signed-off-by: Electronic-Waste <2690692950@qq.com>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* feat: Add schedulingGates to PodSpecOverrides Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Change desired job to target job in PodSpecOverrides comments Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> --------- Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
* fix(module): Change Go module name to v2 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Bump x/net to v0.38.0 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* chore(docs): Add Changelog for v2.0.0-rc.1 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Move example to misc Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* Add Red Hat to ADOPTERS.md Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> * Update ADOPTERS.md Signed-off-by: Yuan Tang <terrytangyuan@gmail.com> --------- Signed-off-by: Yuan Tang <terrytangyuan@gmail.com>
…d to job (kubeflow#2719) Signed-off-by: rudeigerc <rudeigerc@gmail.com>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…ow#2731) Signed-off-by: rudeigerc <rudeigerc@gmail.com>
* chore(ci): Add GitHub action to verify PR titles Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Use operator scope Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add examples scope Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add scripts to scope Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add exporter Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * add wip ignore label Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add PR title to the contrib guide Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Ignore dependencies label Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix text Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Use action only on master branch Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…ue template (kubeflow#2732) Signed-off-by: rudeigerc <rudeigerc@gmail.com>
… jobset (kubeflow#2734) Signed-off-by: rudeigerc <rudeigerc@gmail.com>
Signed-off-by: Koray Oksay <koray.oksay@gmail.com>
* chore(docs): Add Changelog for Kubeflow Trainer v2.0.0 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add links for blog post and migration guide Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add links for blog post and website Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* feat(docs): Kubeflow Trainer ROADMAP 2025 Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update roadmap Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add issue for Trainer UI Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add issues for MPI and plugin extension Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add issues for builtin trainers Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
…ubeflow#2754) Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…s for data_cache (kubeflow#2890) Signed-off-by: Akshay Chitneni <achitneni@apple.com> Co-authored-by: Akshay Chitneni <achitneni@apple.com>
kubeflow#2898) Signed-off-by: Xinmin Du <2812493086@qq.com>
…obs (kubeflow#2653) * feat(runtime): add support for launcher resource allocation in MPI jobs Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add unit tests Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Set numProcPerNode for MPI plugin Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Move util func to runtime package Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix torchtune plugin Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Inline if for GPU check Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Assign container resources once Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Add todo for test wrappers Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…obs (kubeflow#2722) * feat(webhook): Add validation for required containers in replicatedJobs. Signed-off-by: Electronic-Waste <2690692950@qq.com> * test(webhook): Add UTs for validation in required containers. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(lint): fix lint error. Signed-off-by: Electronic-Waste <2690692950@qq.com> * fix(webhook): add global map & remove launcher check. Signed-off-by: Electronic-Waste <2690692950@qq.com> --------- Signed-off-by: Electronic-Waste <2690692950@qq.com>
* feat(manager): add controller manager configuration and configmap support Signed-off-by: kapil27 <knema@redhat.com> * refactor: update configmap naming and leader election configuration Signed-off-by: kapil27 <knema@redhat.com> * chore: clean up unused lines in configmap and test files Signed-off-by: kapil27 <knema@redhat.com> --------- Signed-off-by: kapil27 <knema@redhat.com>
Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
kubeflow#2911) * feat(initializer): add s3 model and dataset initializers Signed-off-by: rudeigerc <rudeigerc@gmail.com> * chore: refactor with opendal Signed-off-by: rudeigerc <rudeigerc@gmail.com> * chore: support `role_arn` and add `ignore_patterns` field in the Initializers configs Signed-off-by: rudeigerc <rudeigerc@gmail.com> --------- Signed-off-by: rudeigerc <rudeigerc@gmail.com> Co-authored-by: rudeigerc <rudeigerc@gmail.com>
…ubeflow#2912) * chore(operator): Use SSA throughout runtime framework Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Fix lint error Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> * Update go.mod file Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> --------- Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> Co-authored-by: Antonin Stefanutti <antonin@stefanutti.fr>
…harts (kubeflow#2914) Signed-off-by: Antonin Stefanutti <antonin@stefanutti.fr> Co-authored-by: Antonin Stefanutti <antonin@stefanutti.fr>
…branch (kubeflow#2917) * feat(manifests): Publish Trainer Helm Charts (kubeflow#2906) * Solve Remaining Error and bugs Signed-off-by: adity1raut <araut7798@gmail.com> * Solve the confige Signed-off-by: adity1raut <araut7798@gmail.com> * Update The Suggest Change Signed-off-by: adity1raut <araut7798@gmail.com> * Update After REview Signed-off-by: adity1raut <araut7798@gmail.com> * Update the Helm publish action Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update release doc Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Use 0.0.0 version for master branch Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update release doc Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: adity1raut <araut7798@gmail.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * fix(manifests): Fix Helm charts image name (kubeflow#2915) * fix(manifests): Fix Helm charts image name Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Always insert appVersion to the Chart.yaml file Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix comment Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Simplify action Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * fix(manifests): Remove the default tag from the controller image (kubeflow#2916) * fix(manifests): Remove the default tag from the controller image Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Fix README template Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: adity1raut <araut7798@gmail.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> 
Co-authored-by: Aditya Raut <159172287+adity1raut@users.noreply.github.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
…cache nodes (kubeflow#2920) Signed-off-by: Akshay Chitneni <achitneni@apple.com> Co-authored-by: Akshay Chitneni <achitneni@apple.com>
…#2924) * add local docker training example Signed-off-by: Brian Gallagher <briangal@gmail.com> * feat: Adding local execution example notebook Co-authored-by Brian Gallagher <bgallagh@redhat.com> Signed-off-by: Fiona Waters <fiwaters6@gmail.com> --------- Signed-off-by: Brian Gallagher <briangal@gmail.com> Signed-off-by: Fiona Waters <fiwaters6@gmail.com> Co-authored-by: Brian Gallagher <briangal@gmail.com> Co-authored-by: Fiona Waters <fiwaters6@gmail.com>
…ubeflow#2927) * fix(ci): Fix the Kubeflow SDK installation with Docker Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Uncomment delete job in local Notebooks Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> * Update .github/workflows/test-e2e.yaml Co-authored-by: Anya Kramar <akramar@redhat.com> Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> --------- Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Andrey Velichkevich <andrey.velichkevich@gmail.com> Co-authored-by: Anya Kramar <akramar@redhat.com>
…e and example (kubeflow#2928) Signed-off-by: Akshay Chitneni <achitneni@apple.com> Co-authored-by: Akshay Chitneni <achitneni@apple.com>
Signed-off-by: Andrey Velichkevich <andrey.velichkevich@gmail.com>
* created github workflow for trainer * added workflow dispatcher * updating temp quay token in github * Remove odh-kfto-sdk-notebooks-sync workflow * updated build pipeline to use rhoai docker file * removed pre-build commands from build and publish * added multiarch docker file * fixed typo for multiarch * fixed multiarch file * temporary quay push * reverted local build image testing creds * Update Dockerfile.rhoai * update dockerfile.rhoai to dockerfile.odh * fixed nitpick comments * removed odh-release.yaml
- Add RHOAI specific Dockerfile for Trainer V2 controller image - Add RHOAI overlay manifests for Trainer V2 - Add custom training runtimes in rhoai overlay
e0a5baf to 2542c69
…ementation Signed-off-by: abhijeet-dhumal <abhijeetdhumal652@gmail.com>
2542c69 to 6be6a80
What this PR does / why we need it:
RHOAIENG-38273
Implement controller support for polling and tracking training job progression from HTTP metrics endpoints exposed by experimental trainers (e.g., TransformersTrainer).
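The polling behaviour described above can be illustrated with a minimal sketch. It is not the PR's implementation: Python is used only for illustration, and the JSON field names (`current_step`, `total_steps`), the function names, and the polling interval are all assumptions, since the description does not specify the endpoint's format.

```python
import json
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Progress:
    """Hypothetical progress snapshot; the field names are assumptions."""
    current_step: int
    total_steps: int

    @property
    def percent(self) -> float:
        # Guard against a trainer that has not reported its total yet.
        if self.total_steps <= 0:
            return 0.0
        return min(100.0, 100.0 * self.current_step / self.total_steps)


def parse_progress(body: bytes) -> Progress:
    """Parse an assumed JSON payload such as {"current_step": 5, "total_steps": 10}."""
    data = json.loads(body)
    return Progress(int(data["current_step"]), int(data["total_steps"]))


def fetch_http(url: str, timeout: float = 5.0) -> bytes:
    """Fetch the raw response body from the metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def poll_progress(
    url: str,
    fetch: Callable[[str], bytes] = fetch_http,
    interval: float = 10.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[Progress]:
    """Poll until the job reports completion or max_polls is reached.

    Returns the last successfully parsed snapshot, or None if none was read.
    """
    last: Optional[Progress] = None
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        try:
            last = parse_progress(fetch(url))
        except (OSError, ValueError, KeyError, TypeError):
            # Endpoint unreachable or payload malformed: keep the last
            # known snapshot and try again on the next tick.
            pass
        else:
            if last.total_steps > 0 and last.current_step >= last.total_steps:
                break
        if max_polls is None or polls < max_polls:
            sleep(interval)
    return last
```

Passing `fetch` and `sleep` as parameters keeps the loop testable without a live endpoint. A real controller would more likely requeue the reconcile request at the polling interval than block in a loop.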
Related to:
opendatahub-io/kubeflow-sdk#21
Blocked by:
#16
Checklist: