@aybchan
Member

@aybchan aybchan commented May 30, 2025

Add GKE MaxText train (example run) and NCCL test (example run) workflows, with a reusable composite action that manages the xpk job lifecycle (launch, log streaming, cleanup, artifact upload).

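Patches on xpk address the following identified issues:

- AI-Hypercomputer/xpk#476
- AI-Hypercomputer/xpk#488
- AI-Hypercomputer/xpk#490
- AI-Hypercomputer/xpk#491
- AI-Hypercomputer/xpk#492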

Cluster creation with xpk (example run) is added as a separate workflow for demonstration purposes; it will not be operational in the CI.
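
For illustration, the job lifecycle that the composite action manages (launch, log streaming, clean up, artifact upload) could be sketched in Python as below. This is not the action's actual implementation: `run_job_lifecycle` and its four callables are hypothetical stand-ins, and no real xpk commands are shown. The sketch also assumes that clean-up and artifact upload should still run when log streaming fails.

```python
from typing import Callable, Iterable, List


def run_job_lifecycle(
    launch: Callable[[], str],
    stream_logs: Callable[[str], Iterable[str]],
    clean_up: Callable[[str], None],
    upload_artifacts: Callable[[str], None],
) -> List[str]:
    """Run the four lifecycle stages in order and return the streamed log lines.

    All four callables are hypothetical placeholders for whatever actually
    launches the job, reads its logs, deletes it, and uploads its artifacts.
    """
    job_id = launch()
    lines: List[str] = []
    try:
        # Stream logs until the job's log source is exhausted.
        for line in stream_logs(job_id):
            print(line)
            lines.append(line)
    finally:
        # Assumption: clean-up and upload run even if streaming raised,
        # so a failed job does not leave resources behind.
        try:
            clean_up(job_id)
        finally:
            upload_artifacts(job_id)
    return lines
```

Nesting the two `finally` blocks means a failure during clean-up does not prevent the artifact upload from being attempted.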

@aybchan aybchan changed the title Test GKE runner Add GKE runner May 30, 2025
@aybchan aybchan changed the title Add GKE runner Add GKE example May 30, 2025
@aybchan aybchan requested a review from olupton June 19, 2025 07:54
Collaborator

@olupton olupton left a comment

Looks good. The MaxText job hasn't run yet in the current pipeline, so I haven't checked the logs there, but posting a few comments.

Generally I'm in favour of merging this and refining later - it seems to carry little risk of breaking anything else.

@aybchan aybchan requested a review from olupton July 2, 2025 17:54
@aybchan aybchan merged commit efb11b7 into main Jul 3, 2025
95 of 101 checks passed
@aybchan aybchan deleted the alechan/gke-runner branch July 3, 2025 14:15
aybchan added a commit that referenced this pull request Jul 15, 2025
Add `GKE` `MaxText` train ([example
run](https://github.com/NVIDIA/JAX-Toolbox/actions/runs/15744603099/job/44379358307))
and `NCCL` test ([example
run](https://github.com/NVIDIA/JAX-Toolbox/actions/runs/15744603099/job/44378422712))
workflows with a reusable composite action for managing the `xpk` job
lifecycle (launch, log streaming, cleanup, artifact upload).

Patches on `xpk` address the following identified issues:
- AI-Hypercomputer/xpk#476
- AI-Hypercomputer/xpk#488
- AI-Hypercomputer/xpk#490
- AI-Hypercomputer/xpk#491
- AI-Hypercomputer/xpk#492

Cluster create with `xpk` ([example
run](https://github.com/NVIDIA/JAX-Toolbox/actions/runs/15591134618/job/43910254644#step:5:1))
was added as a separate
[workflow](https://github.com/NVIDIA/JAX-Toolbox/pull/1481/files#diff-801fc28cafbf1e0fa0ea521355fa8a1c9e6c01dcb8b1083c47f66e2ead4d560a)
for demonstration purposes (will not be operational in the CI)

---------

Co-authored-by: Olli Lupton <[email protected]>
aybchan added a commit that referenced this pull request Aug 11, 2025
Upgrades the xpk version used in the GKE-xpk action for running workloads
from `v0.8.0` to `v0.10.1`.

This latest release includes fixes and features that were
[requested](https://github.com/AI-Hypercomputer/xpk/issues?q=is%3Aissue%20state%3Aclosed%20author%3Aaybchan)
based on issues with xpk found in the previous work in
#1481.

While the fixes address most of the issues found previously, we still
need to add the following patches to use `v0.10.1` for running cluster
workloads:
- workload patch due to AI-Hypercomputer/xpk#577
- tcpxo_decorator patch: the container's NCCL is unavailable
  ([example](https://github.com/NVIDIA/JAX-Toolbox/actions/runs/16827982684/job/47671112135#step:7:1119))
  unless its path is explicitly prepended to `LD_LIBRARY_PATH` to
  override the default NCCL library from the host-mounted directory
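
As a rough illustration of the `LD_LIBRARY_PATH` point above: the dynamic linker searches the listed directories from left to right, so a library directory placed first takes precedence over later ones. The Python sketch below shows only that prepend step; `prepend_library_path` and the example directories are hypothetical and are not taken from the actual tcpxo_decorator patch.

```python
from typing import Dict, Mapping


def prepend_library_path(env: Mapping[str, str], directory: str) -> Dict[str, str]:
    """Return a copy of env with directory placed first in LD_LIBRARY_PATH.

    Hypothetical helper: it drops empty entries and any existing occurrence
    of the directory so the directory appears exactly once, at the front.
    """
    new_env = dict(env)
    current = new_env.get("LD_LIBRARY_PATH", "")
    # LD_LIBRARY_PATH is a colon-separated list on Linux.
    rest = [p for p in current.split(":") if p and p != directory]
    new_env["LD_LIBRARY_PATH"] = ":".join([directory, *rest])
    return new_env


# Example with made-up paths: the container's library directory now precedes
# the host-mounted one, so it is searched first.
example = prepend_library_path(
    {"LD_LIBRARY_PATH": "/hypothetical/host/lib"},
    "/hypothetical/container/nccl/lib",
)
print(example["LD_LIBRARY_PATH"])
```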
3 participants