Import Trainer SDK #4

Open
wants to merge 1,156 commits into
base: main

Conversation

@szaher szaher commented Apr 23, 2025

Imported Trainer SDK with history from kubeflow/trainer
Fixes #1

tenzen-y and others added 30 commits July 24, 2023 09:33
* Removing dead code

* Update monitoring guide
Signed-off-by: Yuki Iwai <[email protected]>
* Removing dead code

* Update monitoring guide

* Changelog changes

* Update CHANGELOG.md
* update full change list in changelog

Signed-off-by: lowang-bh <[email protected]>

* Update CHANGELOG.md

Co-authored-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: lowang-bh <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
* Removing dead code

* Update monitoring guide

* Changelog changes

* Adding tenzen to Approvers list

* Merge changes

* Merge changes

* Sort alphabetically

* Sort alphabetically
* Create Dockerfile

* Update Dockerfile

* Create deliver-kubectl.sh

* Update publish-core-images.yaml

* Using kubeflow kubectl-delivery

* Delete scripts/kubectl-delivery/deliver-kubectl.sh

* Refactor Dockerfile

* Create Dockerfile

* Update Dockerfile

* Create deliver-kubectl.sh

* Update publish-core-images.yaml

* Using kubeflow kubectl-delivery

* Delete scripts/kubectl-delivery/deliver-kubectl.sh

* Refactor Dockerfile

---------

Co-authored-by: ULBRICR <[email protected]>
* Build XGBoostJob example images in CI

Signed-off-by: Yuki Iwai <[email protected]>

* Organize example manifests

Signed-off-by: Yuki Iwai <[email protected]>

* Fix action files

Signed-off-by: Yuki Iwai <[email protected]>

* Replace image names

Signed-off-by: Yuki Iwai <[email protected]>

---------

Signed-off-by: Yuki Iwai <[email protected]>
* Add Flake and Black Lint

* Change SDK APIs

* Update E2E tests

* Fix a few function parameters

* Fix black format

* Fix a few comments

* Fix conftest location

* Fix Job kind in tests

* Fix client creation in test

* Fix namespace arg in get_job_conditions

* Update SDK examples with the latest changes

* Rename SDK examples

* Fix black action

* Update checkout action version

Co-authored-by: Yuki Iwai <[email protected]>

* Use Black 23.9.1 version

* Fix GitHub Action for Black

* Add unit test to create PyTorchJob from func

* Rename timeout to wait_timeout

* Validate that Job is not set with other input parameters

* Update black in developer guide

* Remove pip_index_url validation

* Use locals to verify input

* Print Job info when E2E fails

* Remove duplicated delete

---------

Co-authored-by: Yuki Iwai <[email protected]>
* Bump k8s.io/* deps to 1.28

- Bump k8s.io/* deps to 1.28
- Fix metrics bind address assignment in manager setup
- Rename metrics-port flag to webhook-server-port as it was wrongly used

* Revert envtest 1.27 and use generate-groups.sh

- Revert envtest to 1.27
- Use generate-groups.sh instead of kube_codegen.sh

* Revert monitoring port flag
* Fixing issues with providing existing service account

* Removing test case

* Update pkg/controller.v1/mpi/mpijob_controller.go

Co-authored-by: Andrey Velichkevich <[email protected]>

* Adding test again for service account

* Fixed testcase

* Fix further test issue

* Finally fixing test

* Continue test fix

* Further bugfix

* Formatting code

* Removing event logging if service account is not owned by MPI operator

* Update pkg/controller.v1/mpi/mpijob_controller_test.go

Co-authored-by: Yuki Iwai <[email protected]>

---------

Co-authored-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
…port` (#1925)

- Change the flag name for the webhook server port from `monitoringPort` to `webhookServerPort`

Signed-off-by: Andreas Fritzler <[email protected]>
Bumps [golang.org/x/net](https://github.com/golang/net) from 0.13.0 to 0.17.0.
- [Commits](golang/net@v0.13.0...v0.17.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Bumps [golang.org/x/net](https://github.com/golang/net) from 0.13.0 to 0.17.0.
- [Commits](golang/net@v0.13.0...v0.17.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
tenzen-y and others added 25 commits February 5, 2025 03:16
* fix(apis): change the group of API to trainer.kubeflow.org.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(manifests): update crds in manifests using make manifests.

Signed-off-by: Electronic-Waste <[email protected]>

* chore: change the apis dir name to trainer.kubeflow.org and update reference.

Signed-off-by: Electronic-Waste <[email protected]>

* chore: execute make generate.

Signed-off-by: Electronic-Waste <[email protected]>

* fix: remove remaining kubeflow.org dirs.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): remove outdated docs & update models reference.

Signed-off-by: Electronic-Waste <[email protected]>

* fix: rename apis dir to trainer.

Signed-off-by: Electronic-Waste <[email protected]>

* chore: execute make generate.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): remove outdated docs & update models reference.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): update model reference in code.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(doc): update api group in KEP-2170.

Signed-off-by: Electronic-Waste <[email protected]>

---------

Signed-off-by: Electronic-Waste <[email protected]>
* Update the naming conventions for Kubeflow Trainer

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix webhooks

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix paths for webhooks

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update go test cmd

Signed-off-by: Andrey Velichkevich <[email protected]>

* Rename kubeflowv1 to trainer pkg

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
* Implement MPI Plugin for Kubeflow Trainer

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update RBAC

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove old manifests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix unit test

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix comments

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>
* fix(sdk): import kubernetes.client & make type conversion in swagger.json.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): change Union[int, str] to object.

Signed-off-by: Electronic-Waste <[email protected]>

---------

Signed-off-by: Electronic-Waste <[email protected]>
* Add e2e tests for Kubeflow Trainer

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add timeout for papermill

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add output as part of make command

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add k8s version to setup cluster

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix Kind k8s version

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix 1.29 version

Signed-off-by: Andrey Velichkevich <[email protected]>

* Create script to run Notebook

Signed-off-by: Andrey Velichkevich <[email protected]>

* Download dataset when local_rank=0

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update test/e2e/e2e_test.go

Co-authored-by: Yuki Iwai <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>

* Refactor Go e2e tests

Signed-off-by: Andrey Velichkevich <[email protected]>

* Bump k8s to 1.29.14

Signed-off-by: Andrey Velichkevich <[email protected]>

* Install Kind from go mod

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix path for Kind package

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix Go e2e

Signed-off-by: Andrey Velichkevich <[email protected]>

* Reduce number of CPUs
Export Notebook as artifact

Signed-off-by: Andrey Velichkevich <[email protected]>

* Print logs due to flaky test

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix artifact path

Signed-off-by: Andrey Velichkevich <[email protected]>

* docker pull image

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix path

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add k8s version to output name

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove install Kind cmd

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
* feat(sdk): Generate external Kubernetes and JobSet models

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update JobSet commit in Swagger

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix formats for resources and num proc

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add custom Kind config

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use ginkgo to run e2e

Signed-off-by: Andrey Velichkevich <[email protected]>

* Revert Kind config change

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix ginkgo binary

Signed-off-by: Andrey Velichkevich <[email protected]>

* Print controller logs in case of failure

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
* feat(controller): Integrate DependsOn API

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use go for unit test

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update Makefile

Co-authored-by: Yuki Iwai <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>

* Update Makefile

Co-authored-by: Yuki Iwai <[email protected]>
Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix integration test

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix e2e

Signed-off-by: Andrey Velichkevich <[email protected]>

* Exit 1 if e2e fails

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
* feat(sdk): Migrate to OpenAPI V3

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update SDK to support OpenAPI V3

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove metadata check

Signed-off-by: Andrey Velichkevich <[email protected]>

* Assign nil for empty default value

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
* fix(sdk): rename Trainer to CustomTrainer.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): remove validate_trainer().

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): remove lora related code.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): remove get_lora_config()

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): fix import error in __init__.py

Signed-off-by: Electronic-Waste <[email protected]>

* chore(example): update the image-classification example.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): delete remaining lora related code.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): modify args description in CustomTrainer.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): add parameter type in CustomTrainer dataclass.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): update args in CustomTrainer.

Signed-off-by: Electronic-Waste <[email protected]>

---------

Signed-off-by: Electronic-Waste <[email protected]>
…nJob (#2492)

* Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob

Signed-off-by: Diasker <[email protected]>

* Refactor the fix using resource.MustParse and RoundUp

Signed-off-by: Diasker <[email protected]>

* remove unnecessary log message

Signed-off-by: Diasker <[email protected]>

* update test case to expect nproc_per_node=1 without GPU

Signed-off-by: Diasker <[email protected]>

* Implement numProcPerNode=cpu option and refactor tests.

Signed-off-by: Diasker <[email protected]>

* refactor: improve torch test cases

Signed-off-by: Diasker <[email protected]>

* remove the redundant logic

Signed-off-by: Diasker <[email protected]>

* Update pkg/runtime/framework/plugins/torch/torch.go

Co-authored-by: Yuki Iwai <[email protected]>
Signed-off-by: Diasker <[email protected]>

* Update pkg/runtime/framework/plugins/torch/torch_test.go

Co-authored-by: Yuki Iwai <[email protected]>
Signed-off-by: Diasker <[email protected]>

* simplify code and remove redundant comments

Signed-off-by: Diasker <[email protected]>

* Update pkg/runtime/framework/plugins/torch/torch.go

Co-authored-by: Yuki Iwai <[email protected]>
Signed-off-by: Diasker <[email protected]>

* Update pkg/runtime/framework/plugins/torch/torch_test.go

Co-authored-by: Yuki Iwai <[email protected]>
Signed-off-by: Diasker <[email protected]>

* Update pkg/runtime/framework/plugins/torch/torch.go

Co-authored-by: Yuki Iwai <[email protected]>
Signed-off-by: Diasker <[email protected]>

* remove get_num_proc_per_node for sdk and simplify torch.go

Signed-off-by: Diasker <[email protected]>

* Update expected nproc_per_node value in trainjob_controller_test.go

Signed-off-by: Diasker <[email protected]>

* Update expected nproc_per_node value in trainjob_controller_test.go and remove redundant code of sdk

Signed-off-by: Diasker <[email protected]>

* update trainjob_controller_test.go

Signed-off-by: Diasker <[email protected]>

* update torch.go

Signed-off-by: Diasker <[email protected]>

---------

Signed-off-by: Diasker <[email protected]>
Signed-off-by: Diasker <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>
* feat(controller): Refactor the Initializer APIs of TrainJob

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix go unit test

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix integration test

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
* feat(sdk): Support MPI-based TrainJobs

Signed-off-by: Andrey Velichkevich <[email protected]>

* Refactor list_runtimes

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix example

Signed-off-by: Andrey Velichkevich <[email protected]>

* Add Runtime Trainer object

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update for new Runtime object

Signed-off-by: Andrey Velichkevich <[email protected]>

* Implement get_runtime API

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix Torch example

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove unused consts

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update func args

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update SDK constants

Signed-off-by: Andrey Velichkevich <[email protected]>

* Change to 16Gi

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix container name for MPI

Signed-off-by: Andrey Velichkevich <[email protected]>

* Keep launcher container for MPI

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
* fix(sdk): Using correct entrypoint for mpirun

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix torchrun entrypoint

Signed-off-by: Andrey Velichkevich <[email protected]>

* Allow to configure mpiuser home dir

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
* feat(runtimes): Support DeepSpeed Runtime

Signed-off-by: Andrey Velichkevich <[email protected]>

* Downgrade OpenMPI to 4.0 version

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix the runtime spec

Signed-off-by: Andrey Velichkevich <[email protected]>

* Reuse sshd config from MPI operator

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>
* feat(sdk): add TorchTuneConfig.

Signed-off-by: Electronic-Waste <[email protected]>

* rebase(sdk): rebase on the newest master branch.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(sdk): add args description in train().

Signed-off-by: Electronic-Waste <[email protected]>

* chore(sdk): add description for train() func.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(sdk): split train() according to trainer and fine_tuning_config

Signed-off-by: Electronic-Waste <[email protected]>

* chore(sdk): update the launching command and args.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(sdk): add get_args_using_torchtune_config.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): fix some wrong description in train()

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): update the description of fine_tuning_config.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): remove numProcPernode in train().

Signed-off-by: Electronic-Waste <[email protected]>

* chore(sdk): add TorchTuneConfig in train()

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): add torchtune logic in train()

Signed-off-by: Electronic-Waste <[email protected]>

* feat(sdk): add BuiltinTrainer.

Signed-off-by: Electronic-Waste <[email protected]>

* feat(sdk): add BuiltinTrainer logic.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(sdk): add description for initializer.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(sdk): update description of runtime

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): address unresolved merge error.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): remove duplicated fields.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): fix train() description according to the review.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(sdk): add get_trainer_crd_from_custom_trainer

Signed-off-by: Electronic-Waste <[email protected]>

* chore(sdk): add get_trainer_crd_from_builtin_trainer and refactor train() API.

Signed-off-by: Electronic-Waste <[email protected]>

* chore(sdk): add Loss enum class.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(doc): update loss type in KEP-2401

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): remove BuiltinTrainer in train().

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): add enum type for dtype.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(doc): update KEP according to the type update of dtype.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(doc): update the type description in the table.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): update dtype validation in utils.

Signed-off-by: Electronic-Waste <[email protected]>

* fix(sdk): update dtype override according to the review.

Signed-off-by: Electronic-Waste <[email protected]>

---------

Signed-off-by: Electronic-Waste <[email protected]>
* feat(runtimes): Support MLX Distributed Runtime with OpenMPI

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove sshd config

Signed-off-by: Andrey Velichkevich <[email protected]>

* Modify make install for OpenMPI

Signed-off-by: Andrey Velichkevich <[email protected]>

* Use Debian for MLX Runtime

Signed-off-by: Andrey Velichkevich <[email protected]>

* Install Python package for mpiuser

Signed-off-by: Andrey Velichkevich <[email protected]>

* Remove generator

Signed-off-by: Andrey Velichkevich <[email protected]>

* Fix mpiuser permission

Signed-off-by: Andrey Velichkevich <[email protected]>

* Create .cache

Signed-off-by: Andrey Velichkevich <[email protected]>

* Update MLX example

Signed-off-by: Andrey Velichkevich <[email protected]>

---------

Signed-off-by: Andrey Velichkevich <[email protected]>

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Successfully merging this pull request may close these issues.

Setup the Kubeflow SDK repository