Import Trainer SDK #4

szaher · 2025-04-23T18:55:29Z

Imported Trainer SDK with history from kubeflow/trainer
Fixes #1


* update full change list in changelog Signed-off-by: lowang-bh <[email protected]>
* Update CHANGELOG.md Co-authored-by: Yuki Iwai <[email protected]>
---------
Signed-off-by: lowang-bh <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>


* Create Dockerfile
* Update Dockerfile
* Create deliver-kubectl.sh
* Update publish-core-images.yaml
* Using kubeflow kubectl-delivery
* Delete scripts/kubectl-delivery/deliver-kubectl.sh
* Refactor Dockerfile
* Create Dockerfile
* Update Dockerfile
* Create deliver-kubectl.sh
* Update publish-core-images.yaml
* Using kubeflow kubectl-delivery
* Delete scripts/kubectl-delivery/deliver-kubectl.sh
* Refactor Dockerfile
---------
Co-authored-by: ULBRICR <[email protected]>

* Build XGBoostJob example images in CI Signed-off-by: Yuki Iwai <[email protected]>
* Organize example manifests Signed-off-by: Yuki Iwai <[email protected]>
* Fix action files Signed-off-by: Yuki Iwai <[email protected]>
* Replace image names Signed-off-by: Yuki Iwai <[email protected]>
---------
Signed-off-by: Yuki Iwai <[email protected]>

* Add Flake and Black Lint
* Change SDK APIs
* Update E2E tests
* Fix a few function parameters
* Fix black format
* Fix a few comments
* Fix conftest location
* Fix Job kind in tests
* Fix client creation in test
* Fix namespace arg in get_job_conditions
* Update SDK examples with the latest changes
* Rename SDK examples
* Fix black action
* Update checkout action version Co-authored-by: Yuki Iwai <[email protected]>
* Use Black 23.9.1 version
* Fix GitHub Action for Black
* Add unit test to create PyTorchJob from func
* Rename timeout to wait_timeout
* Validate that Job is not set with other input parameters
* Update black in developer guide
* Remove pip_index_url validation
* Use locals to verify input
* Print Job info when E2E fails
* Remove duplicated delete
---------
Co-authored-by: Yuki Iwai <[email protected]>

Signed-off-by: Yuki Iwai <[email protected]>

* Bump k8s.io/* deps to 1.28
  - Bump k8s.io/* deps to 1.28
  - Fix metrics bind address assignment in manager setup
  - Rename metrics-port flag to webhook-server-port as it was wrongly used
* Revert envtest 1.27 and use generate-groups.sh
  - Revert envtest to 1.27
  - Use generate-groups.sh instead of kube_codegen.sh
* Revert monitoring port flag

* Fixing issues with providing existing service account
* Removing test case
* Update pkg/controller.v1/mpi/mpijob_controller.go Co-authored-by: Andrey Velichkevich <[email protected]>
* Adding test again for service account
* Fixed testcase
* Fix further test issue
* Finally fixing test
* Continue test fix
* Further bugfix
* Formatting code
* Removing event logging if service account is not owned by MPI operator
* Update pkg/controller.v1/mpi/mpijob_controller_test.go Co-authored-by: Yuki Iwai <[email protected]>
---------
Co-authored-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>

Signed-off-by: Yuki Iwai <[email protected]>


Bumps [golang.org/x/net](https://github.com/golang/net) from 0.13.0 to 0.17.0.
- [Commits](golang/net@v0.13.0...v0.17.0)

---
updated-dependencies:
- dependency-name: golang.org/x/net
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>


* fix(apis): change the group of API to trainer.kubeflow.org. Signed-off-by: Electronic-Waste <[email protected]>
* chore(manifests): update crds in manifests using make manifests. Signed-off-by: Electronic-Waste <[email protected]>
* chore: change the apis dir name to trainer.kubeflow.org and update reference. Signed-off-by: Electronic-Waste <[email protected]>
* chore: execute make generate. Signed-off-by: Electronic-Waste <[email protected]>
* fix: remove remaining kubeflow.org dirs. Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): remove outdated docs & update models reference. Signed-off-by: Electronic-Waste <[email protected]>
* fix: rename apis dir to trainer. Signed-off-by: Electronic-Waste <[email protected]>
* chore: execute make generate. Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): remove outdated docs & update models reference. Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): update model reference in code. Signed-off-by: Electronic-Waste <[email protected]>
* fix(doc): update api group in KEP-2170. Signed-off-by: Electronic-Waste <[email protected]>
---------
Signed-off-by: Electronic-Waste <[email protected]>

* Update the naming conventions for Kubeflow Trainer Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix webhooks Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix paths for webhooks Signed-off-by: Andrey Velichkevich <[email protected]>
* Update go test cmd Signed-off-by: Andrey Velichkevich <[email protected]>
* Rename kubeflowv1 to trainer pkg Signed-off-by: Andrey Velichkevich <[email protected]>
---------
Signed-off-by: Andrey Velichkevich <[email protected]>

* Implement MPI Plugin for Kubeflow Trainer Signed-off-by: Andrey Velichkevich <[email protected]>
* Update RBAC Signed-off-by: Andrey Velichkevich <[email protected]>
* Remove old manifests Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix unit test Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix comments Signed-off-by: Andrey Velichkevich <[email protected]>
---------
Signed-off-by: Andrey Velichkevich <[email protected]>

Signed-off-by: Antonin Stefanutti <[email protected]>

Signed-off-by: Electronic-Waste <[email protected]>

Signed-off-by: Andrey Velichkevich <[email protected]>

* fix(sdk): import kubernetes.client & make type conversion in swagger.json. Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): change Union[int, str] to object. Signed-off-by: Electronic-Waste <[email protected]>
---------
Signed-off-by: Electronic-Waste <[email protected]>

* Add e2e tests for Kubeflow Trainer Signed-off-by: Andrey Velichkevich <[email protected]>
* Add timeout for papermill Signed-off-by: Andrey Velichkevich <[email protected]>
* Add output as part of make command Signed-off-by: Andrey Velichkevich <[email protected]>
* Add k8s version to setup cluster Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix Kind k8s version Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix 1.29 version Signed-off-by: Andrey Velichkevich <[email protected]>
* Create script to run Notebook Signed-off-by: Andrey Velichkevich <[email protected]>
* Download dataset when local_rank=0 Signed-off-by: Andrey Velichkevich <[email protected]>
* Update test/e2e/e2e_test.go Co-authored-by: Yuki Iwai <[email protected]> Signed-off-by: Andrey Velichkevich <[email protected]>
* Refactor Go e2e tests Signed-off-by: Andrey Velichkevich <[email protected]>
* Bump k8s to 1.29.14 Signed-off-by: Andrey Velichkevich <[email protected]>
* Install Kind from go mod Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix path for Kind package Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix Go e2e Signed-off-by: Andrey Velichkevich <[email protected]>
* Reduce number of CPUs Export Notebook as artifact Signed-off-by: Andrey Velichkevich <[email protected]>
* Print logs due to flaky test Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix artifact path Signed-off-by: Andrey Velichkevich <[email protected]>
* docker pull image Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix path Signed-off-by: Andrey Velichkevich <[email protected]>
* Add k8s version to output name Signed-off-by: Andrey Velichkevich <[email protected]>
* Remove install Kind cmd Signed-off-by: Andrey Velichkevich <[email protected]>
---------
Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>

* feat(sdk): Generate external Kubernetes and JobSet models Signed-off-by: Andrey Velichkevich <[email protected]>
* Update JobSet commit in Swagger Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix formats for resources and num proc Signed-off-by: Andrey Velichkevich <[email protected]>
* Add custom Kind config Signed-off-by: Andrey Velichkevich <[email protected]>
* Use ginkgo to run e2e Signed-off-by: Andrey Velichkevich <[email protected]>
* Revert Kind config change Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix ginkgo binary Signed-off-by: Andrey Velichkevich <[email protected]>
* Print controller logs in case of failure Signed-off-by: Andrey Velichkevich <[email protected]>
---------
Signed-off-by: Andrey Velichkevich <[email protected]>

Signed-off-by: Yuki Iwai <[email protected]>

* feat(controller): Integrate DependsOn API Signed-off-by: Andrey Velichkevich <[email protected]>
* Use go for unit test Signed-off-by: Andrey Velichkevich <[email protected]>
* Update Makefile Co-authored-by: Yuki Iwai <[email protected]> Signed-off-by: Andrey Velichkevich <[email protected]>
* Update Makefile Co-authored-by: Yuki Iwai <[email protected]> Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix integration test Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix e2e Signed-off-by: Andrey Velichkevich <[email protected]>
* Exit 1 if e2e fails Signed-off-by: Andrey Velichkevich <[email protected]>
---------
Signed-off-by: Andrey Velichkevich <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>

* feat(sdk): Migrate to OpenAPI V3 Signed-off-by: Andrey Velichkevich <[email protected]>
* Update SDK to support OpenAPI V3 Signed-off-by: Andrey Velichkevich <[email protected]>
* Remove metadata check Signed-off-by: Andrey Velichkevich <[email protected]>
* Assign nil for empty default value Signed-off-by: Andrey Velichkevich <[email protected]>
---------
Signed-off-by: Andrey Velichkevich <[email protected]>

* fix(sdk): rename Trainer to CustomTrainer. Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): remove validate_trainer(). Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): remove lora related code. Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): remove get_lora_config() Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): fix import error in __init__.py Signed-off-by: Electronic-Waste <[email protected]>
* chore(example): update the image-classification example. Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): delete remaining lora related code. Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): modify args description in CustomTrainer. Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): add parameter type in CustomTrainer dataclass. Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): update args in CustomTrainer. Signed-off-by: Electronic-Waste <[email protected]>
---------
Signed-off-by: Electronic-Waste <[email protected]>

…nJob (#2492)
* Fix #2407: Cap nproc_per_node based on CPU resources for PyTorch TrainJob Signed-off-by: Diasker <[email protected]>
* Refactor the fix using resource.MustParse and RoundUp Signed-off-by: Diasker <[email protected]>
* remove unnecessary log message Signed-off-by: Diasker <[email protected]>
* update test case to expect nproc_per_node=1 without GPU Signed-off-by: Diasker <[email protected]>
* Implement numProcPerNode=cpu option and refactor tests. Signed-off-by: Diasker <[email protected]>
* refactor: improve torch test cases Signed-off-by: Diasker <[email protected]>
* remove the redundant logic Signed-off-by: Diasker <[email protected]>
* Update pkg/runtime/framework/plugins/torch/torch.go Co-authored-by: Yuki Iwai <[email protected]> Signed-off-by: Diasker <[email protected]>
* Update pkg/runtime/framework/plugins/torch/torch_test.go Co-authored-by: Yuki Iwai <[email protected]> Signed-off-by: Diasker <[email protected]>
* simplify code and remove redundant comments Signed-off-by: Diasker <[email protected]>
* Update pkg/runtime/framework/plugins/torch/torch.go Co-authored-by: Yuki Iwai <[email protected]> Signed-off-by: Diasker <[email protected]>
* Update pkg/runtime/framework/plugins/torch/torch_test.go Co-authored-by: Yuki Iwai <[email protected]> Signed-off-by: Diasker <[email protected]>
* Update pkg/runtime/framework/plugins/torch/torch.go Co-authored-by: Yuki Iwai <[email protected]> Signed-off-by: Diasker <[email protected]>
* remove get_num_proc_per_node for sdk and simplify torch.go Signed-off-by: Diasker <[email protected]>
* Update expected nproc_per_node value in trainjob_controller_test.go Signed-off-by: Diasker <[email protected]>
* Update expected nproc_per_node value in trainjob_controller_test.go and remove redundant code of sdk Signed-off-by: Diasker <[email protected]>
* update trainjob_controller_test.go Signed-off-by: Diasker <[email protected]>
* update torch.go Signed-off-by: Diasker <[email protected]>
---------
Signed-off-by: Diasker <[email protected]>
Signed-off-by: Diasker <[email protected]>
Co-authored-by: Yuki Iwai <[email protected]>

* feat(controller): Refactor the Initializer APIs of TrainJob Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix go unit test Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix integration test Signed-off-by: Andrey Velichkevich <[email protected]>
---------
Signed-off-by: Andrey Velichkevich <[email protected]>

Signed-off-by: Electronic-Waste <[email protected]>

* feat(sdk): Support MPI-based TrainJobs Signed-off-by: Andrey Velichkevich <[email protected]>
* Refactor list_runtimes Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix example Signed-off-by: Andrey Velichkevich <[email protected]>
* Add Runtime Trainer object Signed-off-by: Andrey Velichkevich <[email protected]>
* Update for new Runtime object Signed-off-by: Andrey Velichkevich <[email protected]>
* Implement get_runtime API Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix Torch example Signed-off-by: Andrey Velichkevich <[email protected]>
* Remove unused consts Signed-off-by: Andrey Velichkevich <[email protected]>
* Update func args Signed-off-by: Andrey Velichkevich <[email protected]>
* Update SDK constants Signed-off-by: Andrey Velichkevich <[email protected]>
* Change to 16Gi Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix container name for MPI Signed-off-by: Andrey Velichkevich <[email protected]>
* Keep launcher container for MPI Signed-off-by: Andrey Velichkevich <[email protected]>
---------
Signed-off-by: Andrey Velichkevich <[email protected]>

* fix(sdk): Using correct entrypoint for mpirun Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix torchrun entrypoint Signed-off-by: Andrey Velichkevich <[email protected]>
* Allow to configure mpiuser home dir Signed-off-by: Andrey Velichkevich <[email protected]>
---------
Signed-off-by: Andrey Velichkevich <[email protected]>

Signed-off-by: Electronic-Waste <[email protected]>

* feat(runtimes): Support DeepSpeed Runtime Signed-off-by: Andrey Velichkevich <[email protected]>
* Downgrade OpenMPI to 4.0 version Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix the runtime spec Signed-off-by: Andrey Velichkevich <[email protected]>
* Reuse sshd config from MPI operator Signed-off-by: Andrey Velichkevich <[email protected]>
---------
Signed-off-by: Andrey Velichkevich <[email protected]>

* feat(sdk): add TorchTuneConfig. Signed-off-by: Electronic-Waste <[email protected]>
* rebase(sdk): rebase on the newest master branch. Signed-off-by: Electronic-Waste <[email protected]>
* chore(sdk): add args description in train(). Signed-off-by: Electronic-Waste <[email protected]>
* chore(sdk): add description for train() func. Signed-off-by: Electronic-Waste <[email protected]>
* chore(sdk): split train() according to trainer and fine_tuning_config Signed-off-by: Electronic-Waste <[email protected]>
* chore(sdk): update the launching command and args. Signed-off-by: Electronic-Waste <[email protected]>
* chore(sdk): add get_args_using_torchtune_config. Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): fix some wrong description in train() Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): update the description of fine_tuning_config. Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): remove numProcPernode in train(). Signed-off-by: Electronic-Waste <[email protected]>
* chore(sdk): add TorchTuneConfig in train() Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): add torchtune logic in train() Signed-off-by: Electronic-Waste <[email protected]>
* feat(sdk): add BuiltinTrainer. Signed-off-by: Electronic-Waste <[email protected]>
* feat(sdk): add BuiltinTrainer logic. Signed-off-by: Electronic-Waste <[email protected]>
* chore(sdk): add description for initializer. Signed-off-by: Electronic-Waste <[email protected]>
* chore(sdk): update description of runtime Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): address unresolved merge error. Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): remove duplicated fields. Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): fix train() description according to the review. Signed-off-by: Electronic-Waste <[email protected]>
* chore(sdk): add get_trainer_crd_from_custom_trainer Signed-off-by: Electronic-Waste <[email protected]>
* chore(sdk): add get_trainer_crd_from_builtin_trainer and refactor train() API. Signed-off-by: Electronic-Waste <[email protected]>
* chore(sdk): add Loss enum class. Signed-off-by: Electronic-Waste <[email protected]>
* fix(doc): update loss type in KEP-2401 Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): remove BuiltinTrainer in train(). Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): add enum type for dtype. Signed-off-by: Electronic-Waste <[email protected]>
* fix(doc): update KEP according to the type update of dtype. Signed-off-by: Electronic-Waste <[email protected]>
* fix(doc): update the type description in the table. Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): update dtype validation in utils. Signed-off-by: Electronic-Waste <[email protected]>
* fix(sdk): update dtype override according to the review. Signed-off-by: Electronic-Waste <[email protected]>
---------
Signed-off-by: Electronic-Waste <[email protected]>

* feat(runtimes): Support MLX Distributed Runtime with OpenMPI Signed-off-by: Andrey Velichkevich <[email protected]>
* Remove sshd config Signed-off-by: Andrey Velichkevich <[email protected]>
* Modify make install for OpenMPI Signed-off-by: Andrey Velichkevich <[email protected]>
* Use Debian for MLX Runtime Signed-off-by: Andrey Velichkevich <[email protected]>
* Install Python package for mpiuser Signed-off-by: Andrey Velichkevich <[email protected]>
* Remove generator Signed-off-by: Andrey Velichkevich <[email protected]>
* Fix mpiuser permission Signed-off-by: Andrey Velichkevich <[email protected]>
* Create .cache Signed-off-by: Andrey Velichkevich <[email protected]>
* Update MLX example Signed-off-by: Andrey Velichkevich <[email protected]>
---------
Signed-off-by: Andrey Velichkevich <[email protected]>

Signed-off-by: Andrey Velichkevich <[email protected]>

google-oss-prow · 2025-04-23T18:55:36Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

tenzen-y and others added 30 commits July 24, 2023 09:33

Set up controllers using goroutines to start the manager quickly (#1869)

fdf6b83

Signed-off-by: Yuki Iwai <[email protected]>

Upgrade scheduler-plugins version to v0.26.0 (#1871)

886698d

Signed-off-by: Yuki Iwai <[email protected]>

Upgrade Go version to v1.20 (#1873)

de7c73e

Signed-off-by: Yuki Iwai <[email protected]>

Replace Pytorch with PyTorch (#1874)

e208389

Signed-off-by: Yuki Iwai <[email protected]>

Implement integration test for MPIJob v1 related to suspend semantics…

4dd0d09

… (#1875) Signed-off-by: Yuki Iwai <[email protected]>

Removing reconciler code (#1879)

3e08340

Update Prometheus metrics in doc (#1880)

855e096

* Removing dead code
* Update monitoring guide

Remove klog v1 (#1886)

48dbbf0

Signed-off-by: Yuki Iwai <[email protected]>

Refactor core/pod tests (#1890)

11b7a11

Signed-off-by: Yuki Iwai <[email protected]>

Add Stale GitHub Action (#1893)

17f8ff2

Changelog updated for 1.7.0 rc0 release (#1892)

3107ab7

* Removing dead code
* Update monitoring guide
* Changelog changes
* Update CHANGELOG.md

update volcano scheduler to 1.8.0 (#1894)

a8bd3a5

Signed-off-by: lowang-bh <[email protected]>

Add Training WG Community Call (#1900)

e6a3c70

docs: Remove reference to tf-operator specific design doc (#1903)

834a221

Adding Yuki to Approvers (#1901)

12eefea

* Removing dead code
* Update monitoring guide
* Changelog changes
* Adding tenzen to Approvers list
* Merge changes
* Merge changes
* Sort alphabetically
* Sort alphabetically

Creating service account where appropriate for MPI Job (#1917)

288d680

Replace XGBoost image for E2E with community hosted (#1922)

04936a0

Signed-off-by: Yuki Iwai <[email protected]>

Build MXJob examples in CI (#1927)

7183081

Signed-off-by: Yuki Iwai <[email protected]>

Use a community hosted image in MXJob E2E (#1928)

95f2553

Signed-off-by: Yuki Iwai <[email protected]>

⚠️ Breaking Changes: Rename monitoring-port flag to `webhook-server-…

a4c0cec

…port` (#1925) - Change the flag name for the webhook server port from `monitoringPort` to `webhookServerPort` Signed-off-by: Andreas Fritzler <[email protected]>

Check podGroup CRD for the volcano and the scheduler-plugins as defau…

f5f4717

…lt. (#1929) Signed-off-by: Syulin7 <[email protected]>

Increase the root volume size on the github runner when building cont…

4f1d3fa

…ainer images (#1931) Signed-off-by: Yuki Iwai <[email protected]>

tenzen-y and others added 25 commits February 5, 2025 03:16

ControlPlane: Fix flaky integration tests due to missing the latest…

562cf97

… version of object (#2414) Signed-off-by: Yuki Iwai <[email protected]>

KEP-2170: Add validation to Torch numProcPerNode field (#2409)

47225bf

Signed-off-by: Antonin Stefanutti <[email protected]>

Upgrade jobset SDK version to v0.7.3 (#2445)

382529b

Signed-off-by: Electronic-Waste <[email protected]>

Bump JobSet to v0.8.0 (#2463)

3c91592

Signed-off-by: Andrey Velichkevich <[email protected]>

Implement MPI numProcPerNode defaulter (#2483)

629080c

Signed-off-by: Yuki Iwai <[email protected]>

fix(sdk): add missing import type Initializer. (#2541)

4130632

Signed-off-by: Electronic-Waste <[email protected]>

fix(sdk): Add missing import types. (#2566)

671d756

Signed-off-by: Electronic-Waste <[email protected]>

feat(sdk): Get namespace from the provided context (#2593)

fedee49

Signed-off-by: Andrey Velichkevich <[email protected]>

Merge branch 'import-trainer-sdk' into imported-trainer-sdk

4d262c7

google-oss-prow bot added the size/XXL label Apr 23, 2025

google-oss-prow bot requested review from andreyvelich, Electronic-Waste and tenzen-y April 23, 2025 18:56
