ci: onboard GB200 testing#1893

Open
ko3n1g wants to merge 6 commits into `main` from `ko3n1g/feat/gb200-functional-tests`

@ko3n1g ko3n1g commented Apr 17, 2026

Summary

Adds GB200 (ARM64/GCP) support to the container build and test pipeline.

Container build — matrix pattern (same as Megatron-Bridge / Megatron-LM):

  • Adds `container-registry-gb200` env var pointing to the GAR repository for ARM64 images
  • Adds `cicd-compute-build-matrix` job that outputs a two-entry matrix: AMD64 → Azure ACR on `{runner_prefix}-gpu-x2`, ARM64 → GAR on `nemo-ci-gcp-gpu-x2`
  • Refactors `cicd-container-build` to use the matrix strategy — inlines steps previously delegated to `.github/actions/build-container`; Azure login steps are gated on `contains(matrix.registry, 'azure')` so the GCP runner never touches ACR
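
The matrix pattern above could look roughly like the following workflow fragment. This is a minimal sketch, not the PR's actual diff: the job names, the `container-registry-gb200` env var, the runner labels, and the `contains(matrix.registry, 'azure')` gate come from the description, while the output name `matrix`, the matrix keys (`runner`, `registry`, `platform`), `vars.RUNNER_PREFIX`, the `az acr login` invocation, and the image tag are assumptions made for illustration. The GAR path is left elided exactly as in the description.

```yaml
env:
  # GAR path is elided in the PR description and kept elided here.
  container-registry-gb200: us-east4-docker.pkg.dev/…/automodel

jobs:
  cicd-compute-build-matrix:
    runs-on: ubuntu-latest
    outputs:
      matrix: ${{ steps.compute.outputs.matrix }}  # output name is hypothetical
    steps:
      - id: compute
        env:
          RUNNER_PREFIX: ${{ vars.RUNNER_PREFIX }}  # hypothetical source of {runner_prefix}
          REGISTRY_AMD64: nemoci.azurecr.io
          REGISTRY_ARM64: ${{ env.container-registry-gb200 }}
        run: |
          # Emit a two-entry matrix: AMD64 on ACR, ARM64 on GAR.
          matrix=$(jq -nc \
            --arg prefix "$RUNNER_PREFIX" \
            --arg acr "$REGISTRY_AMD64" \
            --arg gar "$REGISTRY_ARM64" \
            '{include: [
               {runner: ($prefix + "-gpu-x2"), registry: $acr, platform: "linux/amd64"},
               {runner: "nemo-ci-gcp-gpu-x2", registry: $gar, platform: "linux/arm64"}
             ]}')
          echo "matrix=$matrix" >> "$GITHUB_OUTPUT"

  cicd-container-build:
    needs: cicd-compute-build-matrix
    strategy:
      fail-fast: false
      matrix: ${{ fromJSON(needs.cicd-compute-build-matrix.outputs.matrix) }}
    runs-on: ${{ matrix.runner }}
    steps:
      - uses: actions/checkout@v4
      - name: Azure ACR login
        # True only for the nemoci.azurecr.io entry, so the GCP runner skips it.
        if: contains(matrix.registry, 'azure')
        run: az acr login --name nemoci  # registry name inferred from the hostname (assumption)
      - name: Build and push
        env:
          IMAGE: ${{ matrix.registry }}/automodel:${{ github.sha }}  # hypothetical tag scheme
        run: |
          docker build --platform "${{ matrix.platform }}" -t "$IMAGE" .
          docker push "$IMAGE"
```

Computing the matrix in a separate job keeps the runner label and registry paired per entry, so the build job itself needs no per-architecture branching beyond the Azure gate.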

Test registration and discovery:
Extends Automodel's existing static-matrix approach with a parallel `cicd-e2e-tests-gb200` job. Same shape as `cicd-e2e-tests` (explicit `test-name` / `test-folder` / `timeout` per entry), fixed to `nemo-ci-gcp-gpu-x2` and the GAR image. Engineers add or remove GB200 test entries directly in the matrix — no scripts, no scanning.
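
As an illustration of the static-matrix shape described above, a GB200 job might be laid out as follows. This is a sketch under assumptions: the job name, the runner label, the matrix keys (`test-name`, `test-folder`, `timeout`), and the `has-azure-credentials` input are from the description; the example entry, the action path `./.github/actions/test-template`, the `image` input, and the image tag are hypothetical.

```yaml
jobs:
  cicd-e2e-tests-gb200:
    needs: cicd-container-build
    runs-on: nemo-ci-gcp-gpu-x2
    strategy:
      fail-fast: false
      matrix:
        include:
          # One explicit entry per GB200 test; add or remove entries here.
          - test-name: example_gb200_test   # hypothetical entry
            test-folder: example_folder     # hypothetical
            timeout: 30                     # hypothetical value
    name: gb200-${{ matrix.test-name }}
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/test-template  # action location is an assumption
        with:
          test-name: ${{ matrix.test-name }}
          test-folder: ${{ matrix.test-folder }}
          timeout: ${{ matrix.timeout }}
          has-azure-credentials: "false"  # GCP runner, no ACR login
          # Hypothetical input and tag scheme; pulls the ARM64 image from GAR.
          image: ${{ env.container-registry-gb200 }}/automodel:${{ github.sha }}
```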

Bug fix:
`Azure ACR Login` in `test-template/action.yml` was unconditional — added `if: inputs.has-azure-credentials == 'true'` guard to prevent `az` failures on GCP runners.
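
The guard might look like this inside the composite action. Only the step name, the `has-azure-credentials` input, and the `if:` condition come from the description; the input's default, its description text, and the `az acr login` command are assumptions for illustration.

```yaml
# test-template/action.yml (excerpt, sketch only)
inputs:
  has-azure-credentials:
    description: Whether the calling job has Azure credentials available  # wording is hypothetical
    required: false
    default: "false"  # default is an assumption

runs:
  using: composite
  steps:
    - name: Azure ACR Login
      # Composite-action inputs arrive as strings, hence the comparison with 'true'.
      if: inputs.has-azure-credentials == 'true'
      shell: bash
      run: az acr login --name nemoci  # registry name inferred from nemoci.azurecr.io (assumption)
```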

Build matrix

| Entry | Runner | Registry | Platform |
| --- | --- | --- | --- |
| aws | `{runner_prefix}-gpu-x2` | `nemoci.azurecr.io` | linux/amd64 |
| gcp | `nemo-ci-gcp-gpu-x2` | GAR (`us-east4-docker.pkg.dev/…/automodel`) | linux/arm64 |

Prerequisite: NeMo-CI-TF-States#72 must be merged first to provision the `automodel` GAR repository.

Test plan

  • Trigger `workflow_dispatch` — confirm AMD64 build pushes to Azure ACR and ARM64 build pushes to GAR on the GCP runner
  • `cicd-e2e-tests-gb200` jobs appear and run on `nemo-ci-gcp-gpu-x2` pulling the GAR image
  • H100 test jobs unaffected (still use `nemoci.azurecr.io`)

🤖 Generated with Claude Code

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g ko3n1g changed the title ci: add matrix container build for AMD64 and ARM64 (GB200) ci: add GB200 (ARM64) container build and e2e test track Apr 17, 2026
@ko3n1g ko3n1g changed the title ci: add GB200 (ARM64) container build and e2e test track ci: onboard GB200 testing Apr 17, 2026