Import Trainer SDK #4
base: main
Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
… (#1875) Signed-off-by: Yuki Iwai <[email protected]>
* Removing dead code * Update monitoring guide
Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
* Removing dead code * Update monitoring guide * Changelog changes * Update CHANGELOG.md
Signed-off-by: lowang-bh <[email protected]>
* update full change list in changelog Signed-off-by: lowang-bh <[email protected]> * Update CHANGELOG.md Co-authored-by: Yuki Iwai <[email protected]> --------- Signed-off-by: lowang-bh <[email protected]> Co-authored-by: Yuki Iwai <[email protected]>
* Removing dead code * Update monitoring guide * Changelog changes * Adding tenzen to Approvers list * Merge changes * Merge changes * Sort alphabetically * Sort alphabetically
* Create Dockerfile * Update Dockerfile * Create deliver-kubectl.sh * Update publish-core-images.yaml * Using kubeflow kubectl-delivery * Delete scripts/kubectl-delivery/deliver-kubectl.sh * Refactor Dockerfile * Create Dockerfile * Update Dockerfile * Create deliver-kubectl.sh * Update publish-core-images.yaml * Using kubeflow kubectl-delivery * Delete scripts/kubectl-delivery/deliver-kubectl.sh * Refactor Dockerfile --------- Co-authored-by: ULBRICR <[email protected]>
* Build XGBoostJob example images in CI Signed-off-by: Yuki Iwai <[email protected]> * Organize example manifests Signed-off-by: Yuki Iwai <[email protected]> * Fix action files Signed-off-by: Yuki Iwai <[email protected]> * Replace image names Signed-off-by: Yuki Iwai <[email protected]> --------- Signed-off-by: Yuki Iwai <[email protected]>
* Add Flake and Black Lint * Change SDK APIs * Update E2E tests * Fix a few function parameters * Fix black format * Fix a few comments * Fix conftest location * Fix Job kind in tests * Fix client creation in test * Fix namespace arg in get_job_conditions * Update SDK examples with the latest changes * Rename SDK examples * Fix black action * Update checkout action version Co-authored-by: Yuki Iwai <[email protected]> * Use Black 23.9.1 version * Fix GitHub Action for Black * Add unit test to create PyTorchJob from func * Rename timeout to wait_timeout * Validate that Job is not set with other input parameters * Update black in developer guide * Remove pip_index_url validation * Use locals to verify input * Print Job info when E2E fails * Remove duplicated delete --------- Co-authored-by: Yuki Iwai <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
* Bump k8s.io/* deps to 1.28 - Bump k8s.io/* deps to 1.28 - Fix metrics bind address assignment in manager setup - Rename metrics-port flag to webhook-server-port as it was wrongly used * Revert envtest 1.27 and use generate-groups.sh - Revert envtest to 1.27 - Use generate-groups.sh instead of kube_codegen.sh * Revert monitoring port flag
* Fixing issues with providing existing service account * Removing test case * Update pkg/controller.v1/mpi/mpijob_controller.go Co-authored-by: Andrey Velichkevich <[email protected]> * Adding test again for service account * Fixed testcase * Fix further test issue * Finally fixing test * Continue test fix * Further bugfix * Formatting code * Removing event logging if service account is not owned by MPI operator * Update pkg/controller.v1/mpi/mpijob_controller_test.go Co-authored-by: Yuki Iwai <[email protected]> --------- Co-authored-by: Andrey Velichkevich <[email protected]> Co-authored-by: Yuki Iwai <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
…port` (#1925) - Change the flag name for the webhook server port from `monitoringPort` to `webhookServerPort` Signed-off-by: Andreas Fritzler <[email protected]>
…lt. (#1929) Signed-off-by: Syulin7 <[email protected]>
…ainer images (#1931) Signed-off-by: Yuki Iwai <[email protected]>
Bumps [golang.org/x/net](https://github.com/golang/net) from 0.13.0 to 0.17.0. - [Commits](golang/net@v0.13.0...v0.17.0) --- updated-dependencies: - dependency-name: golang.org/x/net dependency-type: indirect ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
… version of object (#2414) Signed-off-by: Yuki Iwai <[email protected]>
* fix(apis): change the group of API to trainer.kubeflow.org. Signed-off-by: Electronic-Waste <[email protected]> * chore(manifests): update crds in manifests using make manifests. Signed-off-by: Electronic-Waste <[email protected]> * chore: change the apis dir name to trainer.kubeflow.org and update reference. Signed-off-by: Electronic-Waste <[email protected]> * chore: execute make generate. Signed-off-by: Electronic-Waste <[email protected]> * fix: remove remaining kubeflow.org dirs. Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): remove outdated docs & update models reference. Signed-off-by: Electronic-Waste <[email protected]> * fix: rename apis dir to trainer. Signed-off-by: Electronic-Waste <[email protected]> * chore: execute make generate. Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): remove outdated docs & update models reference. Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): update model reference in code. Signed-off-by: Electronic-Waste <[email protected]> * fix(doc): update api group in KEP-2170. Signed-off-by: Electronic-Waste <[email protected]> --------- Signed-off-by: Electronic-Waste <[email protected]>
* Update the naming conventions for Kubeflow Trainer Signed-off-by: Andrey Velichkevich <[email protected]> * Fix webhooks Signed-off-by: Andrey Velichkevich <[email protected]> * Fix paths for webhooks Signed-off-by: Andrey Velichkevich <[email protected]> * Update go test cmd Signed-off-by: Andrey Velichkevich <[email protected]> * Rename kubeflowv1 to trainer pkg Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
* Implement MPI Plugin for Kubeflow Trainer Signed-off-by: Andrey Velichkevich <[email protected]> * Update RBAC Signed-off-by: Andrey Velichkevich <[email protected]> * Remove old manifests Signed-off-by: Andrey Velichkevich <[email protected]> * Fix unit test Signed-off-by: Andrey Velichkevich <[email protected]> * Fix comments Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Antonin Stefanutti <[email protected]>
Signed-off-by: Electronic-Waste <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
* fix(sdk): import kubernetes.client & make type conversion in swagger.json. Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): change Union[int, str] to object. Signed-off-by: Electronic-Waste <[email protected]> --------- Signed-off-by: Electronic-Waste <[email protected]>
* Add e2e tests for Kubeflow Trainer Signed-off-by: Andrey Velichkevich <[email protected]> * Add timeout for papermill Signed-off-by: Andrey Velichkevich <[email protected]> * Add output as part of make command Signed-off-by: Andrey Velichkevich <[email protected]> * Add k8s version to setup cluster Signed-off-by: Andrey Velichkevich <[email protected]> * Fix Kind k8s version Signed-off-by: Andrey Velichkevich <[email protected]> * Fix 1.29 version Signed-off-by: Andrey Velichkevich <[email protected]> * Create script to run Notebook Signed-off-by: Andrey Velichkevich <[email protected]> * Download dataset when local_rank=0 Signed-off-by: Andrey Velichkevich <[email protected]> * Update test/e2e/e2e_test.go Co-authored-by: Yuki Iwai <[email protected]> Signed-off-by: Andrey Velichkevich <[email protected]> * Refactor Go e2e tests Signed-off-by: Andrey Velichkevich <[email protected]> * Bump k8s to 1.29.14 Signed-off-by: Andrey Velichkevich <[email protected]> * Install Kind from go mod Signed-off-by: Andrey Velichkevich <[email protected]> * Fix path for Kind package Signed-off-by: Andrey Velichkevich <[email protected]> * Fix Go e2e Signed-off-by: Andrey Velichkevich <[email protected]> * Reduce number of CPUs Export Notebook as artifact Signed-off-by: Andrey Velichkevich <[email protected]> * Print logs due to flaky test Signed-off-by: Andrey Velichkevich <[email protected]> * Fix artifact path Signed-off-by: Andrey Velichkevich <[email protected]> * docker pull image Signed-off-by: Andrey Velichkevich <[email protected]> * Fix path Signed-off-by: Andrey Velichkevich <[email protected]> * Add k8s version to output name Signed-off-by: Andrey Velichkevich <[email protected]> * Remove install Kind cmd Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]> Co-authored-by: Yuki Iwai <[email protected]>
* feat(sdk): Generate external Kubernetes and JobSet models Signed-off-by: Andrey Velichkevich <[email protected]> * Update JobSet commit in Swagger Signed-off-by: Andrey Velichkevich <[email protected]> * Fix formats for resources and num proc Signed-off-by: Andrey Velichkevich <[email protected]> * Add custom Kind config Signed-off-by: Andrey Velichkevich <[email protected]> * Use ginkgo to run e2e Signed-off-by: Andrey Velichkevich <[email protected]> * Revert Kind config change Signed-off-by: Andrey Velichkevich <[email protected]> * Fix ginkgo binary Signed-off-by: Andrey Velichkevich <[email protected]> * Print controller logs in case of failure Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Yuki Iwai <[email protected]>
* feat(controller): Integrate DependsOn API Signed-off-by: Andrey Velichkevich <[email protected]> * Use go for unit test Signed-off-by: Andrey Velichkevich <[email protected]> * Update Makefile Co-authored-by: Yuki Iwai <[email protected]> Signed-off-by: Andrey Velichkevich <[email protected]> * Update Makefile Co-authored-by: Yuki Iwai <[email protected]> Signed-off-by: Andrey Velichkevich <[email protected]> * Fix integration test Signed-off-by: Andrey Velichkevich <[email protected]> * Fix e2e Signed-off-by: Andrey Velichkevich <[email protected]> * Exit 1 if e2e fails Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]> Co-authored-by: Yuki Iwai <[email protected]>
* feat(sdk): Migrate to OpenAPI V3 Signed-off-by: Andrey Velichkevich <[email protected]> * Update SDK to support OpenAPI V3 Signed-off-by: Andrey Velichkevich <[email protected]> * Remove metadata check Signed-off-by: Andrey Velichkevich <[email protected]> * Assign nil for empty default value Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
* fix(sdk): rename Trainer to CustomTrainer. Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): remove validate_trainer(). Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): remove lora related code. Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): remove get_lora_config() Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): fix import error in __init__.py Signed-off-by: Electronic-Waste <[email protected]> * chore(example): update the image-classification example. Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): delete remaining lora related code. Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): modify args description in CustomTrainer. Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): add parameter type in CustomTrainer dataclass. Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): update args in CustomTrainer. Signed-off-by: Electronic-Waste <[email protected]> --------- Signed-off-by: Electronic-Waste <[email protected]>
…nJob (#2492) * Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob Signed-off-by: Diasker <[email protected]> * Refactor the fix using resource.MustParse and RoundUp Signed-off-by: Diasker <[email protected]> * remove unecessary log message Signed-off-by: Diasker <[email protected]> * update test case to expect nproc_per_node=1 without GPU Signed-off-by: Diasker <[email protected]> * Implement numProcPerNode=cpu option and refactor tests. Signed-off-by: Diasker <[email protected]> * refactor: improve torch test cases Signed-off-by: Diasker <[email protected]> * remove the redundant logic Signed-off-by: Diasker <[email protected]> * Update pkg/runtime/framework/plugins/torch/torch.go Co-authored-by: Yuki Iwai <[email protected]> Signed-off-by: Diasker <[email protected]> * Update pkg/runtime/framework/plugins/torch/torch_test.go Co-authored-by: Yuki Iwai <[email protected]> Signed-off-by: Diasker <[email protected]> * simply code and remove redundant comments Signed-off-by: Diasker <[email protected]> * Update pkg/runtime/framework/plugins/torch/torch.go Co-authored-by: Yuki Iwai <[email protected]> Signed-off-by: Diasker <[email protected]> * Update pkg/runtime/framework/plugins/torch/torch_test.go Co-authored-by: Yuki Iwai <[email protected]> Signed-off-by: Diasker <[email protected]> * Update pkg/runtime/framework/plugins/torch/torch.go Co-authored-by: Yuki Iwai <[email protected]> Signed-off-by: Diasker <[email protected]> * remove get_num_proc_per_node for sdk and simply torch.go Signed-off-by: Diasker <[email protected]> * Update expected nproc_per_node value in trainjob_controller_test.go Signed-off-by: Diasker <[email protected]> * Update expected nproc_per_node value in trainjob_controller_test.go and remove redundant code of sdk Signed-off-by: Diasker <[email protected]> * update trainjob_controller_test.go Signed-off-by: Diasker <[email protected]> * update torch.go Signed-off-by: Diasker <[email protected]> --------- Signed-off-by: 
Diasker <[email protected]> Signed-off-by: Diasker <[email protected]> Co-authored-by: Yuki Iwai <[email protected]>
* feat(controller): Refactor the Initializer APIs of TrainJob Signed-off-by: Andrey Velichkevich <[email protected]> * Fix go unit test Signed-off-by: Andrey Velichkevich <[email protected]> * Fix integration test Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Electronic-Waste <[email protected]>
* feat(sdk): Support MPI-based TrainJobs Signed-off-by: Andrey Velichkevich <[email protected]> * Refactor list_runtimes Signed-off-by: Andrey Velichkevich <[email protected]> * Fix example Signed-off-by: Andrey Velichkevich <[email protected]> * Add Runtime Trainer object Signed-off-by: Andrey Velichkevich <[email protected]> * Update for new Runtime object Signed-off-by: Andrey Velichkevich <[email protected]> * Implement get_runtime API Signed-off-by: Andrey Velichkevich <[email protected]> * Fix Torch example Signed-off-by: Andrey Velichkevich <[email protected]> * Remove un-unsed consts Signed-off-by: Andrey Velichkevich <[email protected]> * Update func args Signed-off-by: Andrey Velichkevich <[email protected]> * Update SDK constants Signed-off-by: Andrey Velichkevich <[email protected]> * Change to 16Gi Signed-off-by: Andrey Velichkevich <[email protected]> * Fix container name for MPI Signed-off-by: Andrey Velichkevich <[email protected]> * Keep launcher container for MPI Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
* fix(sdk): Using correct entrypoint for mpirun Signed-off-by: Andrey Velichkevich <[email protected]> * Fix torchrun entrypoint Signed-off-by: Andrey Velichkevich <[email protected]> * Allow to configure mpiuser home dir Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Electronic-Waste <[email protected]>
* feat(runtimes): Support DeepSpeed Runtime Signed-off-by: Andrey Velichkevich <[email protected]> * Downgrade OpenMPI to 4.0 version Signed-off-by: Andrey Velichkevich <[email protected]> * Fix the runtime spec Signed-off-by: Andrey Velichkevich <[email protected]> * Reuse sshd config from MPI operator Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
* feat(sdk): add TorchTuneConfig. Signed-off-by: Electronic-Waste <[email protected]> * rebase(sdk): rebase on the newest master branch. Signed-off-by: Electronic-Waste <[email protected]> * chore(sdk): add args description in train(). Signed-off-by: Electronic-Waste <[email protected]> * chore(sdk): add description for train() func. Signed-off-by: Electronic-Waste <[email protected]> * chore(sdk): split train() according to trainer and fine_tuning_config Signed-off-by: Electronic-Waste <[email protected]> * chore(sdk): update the launching command and args. Signed-off-by: Electronic-Waste <[email protected]> * chore(sdk): add get_args_using_torchtune_config. Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): fix some wrong description in train() Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): update the description of fine_tuning_config. Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): remove numProcPernode in train(). Signed-off-by: Electronic-Waste <[email protected]> * chore(sdk): add TorchTuneConfig in train() Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): add torchtune logic in train() Signed-off-by: Electronic-Waste <[email protected]> * feat(sdk): add BuiltinTrainer. Signed-off-by: Electronic-Waste <[email protected]> * feat(sdk): add BuiltinTrainer logic. Signed-off-by: Electronic-Waste <[email protected]> * chore(sdk): add description for initializer. Signed-off-by: Electronic-Waste <[email protected]> * chore(sdk): update description of runtime Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): address unresolved merge error. Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): remove duplicated fields. Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): fix train() description according to the review. 
Signed-off-by: Electronic-Waste <[email protected]> * chore(sdk): add get_trainer_crd_from_custom_trainer Signed-off-by: Electronic-Waste <[email protected]> * chore(sdk): add get_trainer_crd_from_builtin_trainer and refactor train() API. Signed-off-by: Electronic-Waste <[email protected]> * chore(sdk): add Loss enum class. Signed-off-by: Electronic-Waste <[email protected]> * fix(doc): update loss type in KEP-2401 Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): remove BuiltinTrainer in train(). Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): add enum type for dtype. Signed-off-by: Electronic-Waste <[email protected]> * fix(doc): update KEP according to the type update of dtype. Signed-off-by: Electronic-Waste <[email protected]> * fix(doc): update the type description in the table. Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): update dtype validation in utils. Signed-off-by: Electronic-Waste <[email protected]> * fix(sdk): update dtype override according to the review. Signed-off-by: Electronic-Waste <[email protected]> --------- Signed-off-by: Electronic-Waste <[email protected]>
* feat(runtimes): Support MLX Distributed Runtime with OpenMPI Signed-off-by: Andrey Velichkevich <[email protected]> * Remove sshd config Signed-off-by: Andrey Velichkevich <[email protected]> * Modify make install for OpenMPI Signed-off-by: Andrey Velichkevich <[email protected]> * Use Debian for MLX Runtime Signed-off-by: Andrey Velichkevich <[email protected]> * Install Python package for mpiuser Signed-off-by: Andrey Velichkevich <[email protected]> * Remove generator Signed-off-by: Andrey Velichkevich <[email protected]> * Fix mpiuser permission Signed-off-by: Andrey Velichkevich <[email protected]> * Create .cache Signed-off-by: Andrey Velichkevich <[email protected]> * Update MLX example Signed-off-by: Andrey Velichkevich <[email protected]> --------- Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Imported Trainer SDK with history from kubeflow/trainer.
Fixes #1