Commit 1fcbde0

Upgrade xpk version from v0.8.0 to v0.10.1 (#1608)
Upgrades the xpk version used by the GKE-xpk action for running workloads from `v0.8.0` to `v0.10.1`. This release includes fixes and feature requests that were [made](https://github.com/AI-Hypercomputer/xpk/issues?q=is%3Aissue%20state%3Aclosed%20author%3Aaybchan) based on issues with xpk found in previous work (#1481).

While the fixes address most of the issues found previously, the following patches are still needed to use `v0.10.1` for running cluster workloads:

- workload patch, due to AI-Hypercomputer/xpk#577
- tcpxo_decorator patch, because the container's NCCL is unavailable ([example](https://github.com/NVIDIA/JAX-Toolbox/actions/runs/16827982684/job/47671112135#step:7:1119)) unless its path is explicitly prepended to `LD_LIBRARY_PATH` to override the default NCCL library from the host-mounted directory
1 parent 7b74227 commit 1fcbde0

3 files changed, 56 additions and 15 deletions

.github/actions/gke-xpk/action.yml

Lines changed: 27 additions & 15 deletions
@@ -70,7 +70,7 @@ inputs:
   XPK_VERSION:
     description: 'XPK release tag'
     required: false
-    default: 'v0.8.0'
+    default: 'v0.10.1'
     type: string
   XPK_PYTHON:
     description: 'Python version for XPK'
@@ -119,9 +119,8 @@ runs:
       shell: bash -x -e -u {0}
       run: |
         sed -i 's/{{ IMAGE_PULL_SECRET_NAME }}/${{ inputs.IMAGE_PULL_SECRET_NAME }}/g' .github/gke-workflow/xpk/${{ inputs.XPK_VERSION}}/workload.patch
-        git apply --unsafe-paths .github/gke-workflow/xpk/${{ inputs.XPK_VERSION}}/tcpxo_decorator.patch --directory ${WORKLOAD_NAME}/xpk
-        git apply --unsafe-paths .github/gke-workflow/xpk/${{ inputs.XPK_VERSION}}/docker_resources.patch --directory ${WORKLOAD_NAME}/xpk
-        git apply --unsafe-paths .github/gke-workflow/xpk/${{ inputs.XPK_VERSION}}/workload.patch --directory ${WORKLOAD_NAME}/xpk
+        PATCH_PATH=.github/gke-workflow/xpk/${{ inputs.XPK_VERSION}}
+        ls ${PATCH_PATH} | xargs -I {} git apply --unsafe-paths ${PATCH_PATH}/{} --directory ${WORKLOAD_NAME}/xpk
 
     - name: Set workload commands
       shell: bash -x -e -u {0}
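
The replacement lines in this hunk apply every file found in the version-specific patch directory instead of naming each patch. A minimal Python sketch of the same loop, handy for previewing which commands would run: `build_patch_commands` is a hypothetical helper that is not part of the action, and it only builds the `git apply` invocations rather than executing them. It visits files in sorted name order, which is also what `ls` does by default.

```python
from __future__ import annotations

from pathlib import Path


def build_patch_commands(patch_dir: str, target_dir: str) -> list[list[str]]:
    """Return one `git apply` argv per file in patch_dir, in sorted name order.

    Hypothetical helper mirroring the `ls | xargs git apply` line; it does
    not run git itself.
    """
    commands = []
    for patch in sorted(Path(patch_dir).iterdir()):
        if not patch.is_file():
            continue  # skip subdirectories; the shell pipeline does not filter these
        commands.append([
            "git", "apply", "--unsafe-paths",
            str(patch),
            "--directory", target_dir,
        ])
    return commands
```

With a directory holding `tcpxo_decorator.patch` and `workload.patch`, the helper yields the decorator patch first and the workload patch second.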
@@ -158,18 +157,31 @@ runs:
       run: |
         source ${WORKLOAD_NAME}/.venv/bin/activate
         cd ${WORKLOAD_NAME}/xpk
+
+        args=(
+          --project=${{ inputs.GCP_PROJECT }}
+          --cluster=${{ inputs.GKE_CLUSTER }}
+          --zone=${{ inputs.GCP_ZONE }}
+          --workload=${WORKLOAD_NAME}
+          --docker-image=${{ inputs.IMAGE }}
+          --device-type=${{ inputs.CLUSTER_DEVICE }}
+          --num-nodes=${{ inputs.NUM_NODES }}
+          --num-slices=${{ inputs.NUM_NODES }}
+          --priority=high
+          --scheduler=gke.io/topology-aware-auto
+        )
+
+        if [[ "${{ inputs.XPK_VERSION }}" == "v0.10.1" ]]; then
+          args+=(
+            --docker-image-pull-secret=${{ inputs.IMAGE_PULL_SECRET_NAME }}
+            --env="JAX_COORDINATOR_PORT=3389"
+            --env="JAX_COORDINATOR_ADDRESS=\$(JOBSET_NAME)-\$(REPLICATED_JOB_NAME)-0-0.\$(JOBSET_NAME):3389"
+          )
+        fi
+
         python xpk.py workload create \
-          --project ${{ inputs.GCP_PROJECT }} \
-          --cluster ${{ inputs.GKE_CLUSTER }} \
-          --zone ${{ inputs.GCP_ZONE }} \
-          --workload ${WORKLOAD_NAME} \
-          --docker-image ${{ inputs.IMAGE }} \
-          --device-type ${{ inputs.CLUSTER_DEVICE }} \
-          --num-nodes ${{ inputs.NUM_NODES }} \
-          --num-slices ${{ inputs.NUM_NODES }} \
-          --priority=high \
-          --scheduler=gke.io/topology-aware-auto \
-          --command "${PRELUDE} ${CMD} ${POSTLUDE}"
+          ${args[@]} \
+          --command="${PRELUDE} ${CMD} ${POSTLUDE}"
 
     - name: Wait for JobSet to unsuspend on cluster
       shell: bash -u {0}
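
The last hunk of `action.yml` collects the common flags in a bash array and appends the pull-secret and JAX coordinator settings only when the version is `v0.10.1`. The same branching can be sketched in Python to make the resulting argument list easy to inspect. `build_workload_args` and its parameter names are hypothetical; the flag names and values are copied from the hunk and are not checked against the xpk CLI here.

```python
from __future__ import annotations


def build_workload_args(
    xpk_version: str,
    project: str,
    cluster: str,
    zone: str,
    workload: str,
    image: str,
    device_type: str,
    num_nodes: int,
    pull_secret: str,
) -> list[str]:
    """Assemble the `xpk.py workload create` flags as the action does."""
    args = [
        f"--project={project}",
        f"--cluster={cluster}",
        f"--zone={zone}",
        f"--workload={workload}",
        f"--docker-image={image}",
        f"--device-type={device_type}",
        f"--num-nodes={num_nodes}",
        f"--num-slices={num_nodes}",  # the action passes NUM_NODES to both flags
        "--priority=high",
        "--scheduler=gke.io/topology-aware-auto",
    ]
    if xpk_version == "v0.10.1":
        # Extra flags the action adds only for this release.
        args += [
            f"--docker-image-pull-secret={pull_secret}",
            "--env=JAX_COORDINATOR_PORT=3389",
            # $(...) is left literal here, matching the escaped \$(...) in the action.
            "--env=JAX_COORDINATOR_ADDRESS="
            "$(JOBSET_NAME)-$(REPLICATED_JOB_NAME)-0-0.$(JOBSET_NAME):3389",
        ]
    return args
```

Handing such a list to `subprocess.run` without a shell keeps each element as exactly one argument; the bash counterpart of that guarantee is the quoted expansion `"${args[@]}"`.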
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+diff --git a/src/xpk/core/workload_decorators/tcpxo_decorator.py b/src/xpk/core/workload_decorators/tcpxo_decorator.py
+index 771cafe..e455f2a 100644
+--- a/src/xpk/core/workload_decorators/tcpxo_decorator.py
++++ b/src/xpk/core/workload_decorators/tcpxo_decorator.py
+@@ -180,7 +180,7 @@ def update_gpu_containers(job_manifest):
+     if 'nvidia.com/gpu' in container.get('resources', {}).get('limits', {}):
+       container.setdefault('env', [])
+       container['env'].append(
+-          {'name': 'LD_LIBRARY_PATH', 'value': '/usr/local/nvidia/lib64'}
++          {'name': 'LD_LIBRARY_PATH', 'value': '/opt/nvidia/nccl/lib:/usr/local/nvidia/lib64'}
+       )
+       container['env'].append({
+           'name': 'NCCL_FASTRAK_LLCM_DEVICE_DIRECTORY',
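
This patch puts the container's own NCCL directory ahead of the host-mounted one in `LD_LIBRARY_PATH`, so the dynamic linker searches it first. A self-contained sketch of the patched behaviour on a toy container list follows. The function is a simplified stand-in for xpk's `update_gpu_containers`, and both its name and the flat list it walks are assumptions made for illustration, not the xpk source.

```python
from __future__ import annotations

# Value taken from the patch: container NCCL first, host-mounted libraries second.
NCCL_FIRST = "/opt/nvidia/nccl/lib:/usr/local/nvidia/lib64"


def add_ld_library_path(containers: list[dict]) -> None:
    """Append LD_LIBRARY_PATH to every container that requests GPUs (in place)."""
    for container in containers:
        limits = container.get("resources", {}).get("limits", {})
        if "nvidia.com/gpu" in limits:
            container.setdefault("env", [])
            container["env"].append(
                {"name": "LD_LIBRARY_PATH", "value": NCCL_FIRST}
            )


# Toy input: one GPU container and one sidecar without GPU limits.
containers = [
    {"name": "gpu-worker", "resources": {"limits": {"nvidia.com/gpu": 8}}},
    {"name": "sidecar"},
]
add_ld_library_path(containers)

# Directories are searched left to right, so the first entry takes precedence.
search_order = containers[0]["env"][0]["value"].split(":")
```

Only the GPU container receives the variable, and the first entry of `search_order` is the container's NCCL directory.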
Lines changed: 16 additions & 0 deletions
@@ -0,0 +1,16 @@
+diff --git a/src/xpk/commands/workload.py b/src/xpk/commands/workload.py
+index 0231bab..25e34eb 100644
+--- a/src/xpk/commands/workload.py
++++ b/src/xpk/commands/workload.py
+@@ -482,8 +482,9 @@ def workload_create(args) -> None:
+               flex=True if capacity_type == CapacityType.FLEX_START else False,
+           )
+           else (
+-              'kueue.x-k8s.io/podset-preferred-topology:'
+-              ' "cloud.google.com/gce-topology-host"'
++              """
++              kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-host"
++              """
+           )
+       )
+ 
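
The workload patch leaves the annotation text unchanged and only alters the whitespace around it: two implicitly concatenated literals become one triple-quoted string that starts and ends with a line break. A small sketch of that difference is below. The indentation inside the triple-quoted string is an assumption, since the page does not preserve it, and the reason the line breaks matter is tracked in AI-Hypercomputer/xpk#577 rather than restated here.

```python
# Before the patch: adjacent string literals are joined into one line.
old = (
    'kueue.x-k8s.io/podset-preferred-topology:'
    ' "cloud.google.com/gce-topology-host"'
)

# After the patch: the same text wrapped in line breaks (indentation assumed).
new = """
    kueue.x-k8s.io/podset-preferred-topology: "cloud.google.com/gce-topology-host"
    """

same_text = old == new.strip()          # identical once whitespace is trimmed
starts_on_new_line = new.startswith("\n")
old_is_single_line = "\n" not in old
```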
