feat(ci): migrate from Azure to AWS ephemeral runners #15620

Draft
ko3n1g wants to merge 8 commits into main from ko3n1g/ci/aws-ephemeral-runners

ko3n1g commented Apr 17, 2026

Summary

  • Adds is-not-external-contributor job: NVIDIA SSO members get nemo-ci-aws-gpu-x2 runners; external contributors get nemo-ci-aws-gpu-x2-ephemeral isolated runners
  • Migrates all runners from self-hosted-azure* to nemo-ci-aws-gpu-x2 AWS runners
  • Switches container registry from nemoci.azurecr.io (Azure ACR) to 766267172432.dkr.ecr.us-east-1.amazonaws.com/nemo-speech (AWS ECR)
  • Rebuilds _build_container.yml as an inline docker/build-push-action job (removes FW-CI-templates delegation) so ECR auth works on AWS runners
  • Propagates built image URLs through workflow outputs so all downstream jobs reference the correct ECR image tag (format: nemo-speech:<image-name>-<run_id>)
  • Updates test-template action to accept a full image URL instead of constructing nemoci.azurecr.io/<name>:<run_id> internally
  • Aligns PR-info resolution with NeMo's pull_request event model (mirrors Megatron-Bridge#3370): uses github.event.pull_request.user.login / .number directly instead of Bors-style nv-gha-runners/get-pr-info action
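
The runner split described in the first bullet can be sketched roughly as below. This is a minimal Python illustration rather than the workflow logic itself: the helper name `select_runner` and its boolean argument are hypothetical, and only the two runner labels are taken from the summary.

```python
# Runner labels as named in the summary above.
INTERNAL_RUNNER = "nemo-ci-aws-gpu-x2"
EXTERNAL_RUNNER = "nemo-ci-aws-gpu-x2-ephemeral"


def select_runner(is_nvidia_sso_member: bool) -> str:
    """Pick a runner label for a PR author (hypothetical helper).

    NVIDIA SSO members get the shared runners; everyone else is routed
    to the isolated ephemeral runners.
    """
    return INTERNAL_RUNNER if is_nvidia_sso_member else EXTERNAL_RUNNER


if __name__ == "__main__":
    print(select_runner(True))
    print(select_runner(False))
```

In the actual workflows this decision would presumably be exposed as a job output that later jobs consume in `runs-on`; that wiring is not shown here.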

Example image references

| Image | Tag pattern |
| --- | --- |
| Main CI | `766267172432.dkr.ecr.us-east-1.amazonaws.com/nemo-speech:nemo_container-<run_id>` |
| Speech CI | `766267172432.dkr.ecr.us-east-1.amazonaws.com/nemo-speech:nemo_container_speech-<run_id>` |
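
As a rough sketch of how these references are composed, the snippet below builds a full image URL from an image name and a run ID. The function name `image_ref` is hypothetical; the registry host, repository, and `<image-name>-<run_id>` tag format are the ones listed above.

```python
# Registry host and repository as listed in the PR description.
REGISTRY = "766267172432.dkr.ecr.us-east-1.amazonaws.com"
REPOSITORY = "nemo-speech"


def image_ref(image_name: str, run_id: int) -> str:
    """Build a full ECR image reference (hypothetical helper).

    All images share one repository, so the image name is folded into
    the tag: <registry>/<repository>:<image-name>-<run_id>.
    """
    return f"{REGISTRY}/{REPOSITORY}:{image_name}-{run_id}"


if __name__ == "__main__":
    # 123456 is a made-up run ID used only for illustration.
    print(image_ref("nemo_container", 123456))
    print(image_ref("nemo_container_speech", 123456))
```

Passing the full URL downstream, instead of rebuilding it from a name and run ID in each consumer, keeps the registry host defined in one place.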

Test plan

  • Verify is-not-external-contributor job passes for internal contributor PRs
  • Verify cicd-test-container-build builds and pushes to ECR successfully
  • Verify cicd-import-tests pulls the correct ECR image
  • Verify unit tests in cicd-main-unit-tests run with the ECR image
  • Verify speech tests in cicd-main-speech build and run the speech ECR image

🤖 Generated with Claude Code

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

Replace nemoci.azurecr.io (Azure ACR) with
766267172432.dkr.ecr.us-east-1.amazonaws.com/nemo-speech (AWS ECR)
across all CI workflows. Rebuild _build_container.yml as an inline
docker/build-push-action job so ECR registry access works on AWS
runners. Image tags embed the image-name prefix (e.g.
nemo_container-<run_id>) since all images share one ECR repository.

Propagate image URLs through workflow outputs so downstream jobs
reference the correct ECR tag. Update test-template action to accept
a full image URL. Add root checkout to each job using the local
action so ./.github/actions/ resolves correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

Replace ubuntu-latest with inputs.runner in all CPU-only matrix entries
in cicd-main-unit-tests.yml and cicd-main-speech.yml.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

Add test-data-path input to test-template action and wire
vars.DEFAULT_TEST_DATA_PATH through all callers. Defaults to
/mnt/datadrive/TestData when the variable is unset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
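
The fallback described in the commit above can be sketched as follows. This is an illustrative Python equivalent, not the action's code: the helper name `resolve_test_data_path` is hypothetical, and treating both `None` and an empty string as "unset" is an assumption about how the variable reaches the action.

```python
from typing import Optional

# Default path named in the commit message.
DEFAULT_TEST_DATA_PATH = "/mnt/datadrive/TestData"


def resolve_test_data_path(configured: Optional[str]) -> str:
    """Return the configured test data path, or the default if unset.

    Assumption: an unset variable shows up as None or an empty string.
    """
    return configured if configured else DEFAULT_TEST_DATA_PATH


if __name__ == "__main__":
    print(resolve_test_data_path(None))
    # "/data/custom" is a made-up override used only for illustration.
    print(resolve_test_data_path("/data/custom"))
```
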
Replace Bors-style `nv-gha-runners/get-pr-info` action with direct
`github.event.pull_request.*` context, following the pattern from
Megatron-Bridge/pull/3370 adapted for NeMo's standard PR events:

- cicd-main.yml: remove `get-pr-info` step; use
  `github.event.pull_request.user.login` for SSO username lookup;
  bump checkout action from v4 to v6
- _build_container.yml: remove Bors-style `get-pr-info` step (condition
  `startsWith(refs/heads/pull-request/)` never fires for pull_request
  events); use `github.event.pull_request.number` for cache keys

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
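
To illustrate the direct lookup described in the commit above, here is a small Python sketch that reads the same two fields from a `pull_request` event payload. The helper name `pr_author_and_number` and the sample payload values are hypothetical; in a real run the payload is available as JSON at the path given by the `GITHUB_EVENT_PATH` environment variable.

```python
from typing import Any, Dict, Tuple


def pr_author_and_number(event: Dict[str, Any]) -> Tuple[str, int]:
    """Extract the PR author's login and the PR number (hypothetical helper).

    Mirrors github.event.pull_request.user.login and
    github.event.pull_request.number from the workflow context.
    """
    pr = event["pull_request"]
    return pr["user"]["login"], int(pr["number"])


if __name__ == "__main__":
    # Made-up payload with only the fields used here.
    sample_event = {"pull_request": {"number": 42, "user": {"login": "octocat"}}}
    login, number = pr_author_and_number(sample_event)
    print(login, number)
```

Because both fields are present on every `pull_request` event, no extra lookup step is needed, which matches the commit's reason for dropping the `get-pr-info` action.
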
uuidgen (uuid-runtime) is not installed on nemo-ci-aws-gpu-x2 runners.
Use python3 -c 'import uuid; print(uuid.uuid4())' which is always
available in the NeMo container environment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
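
The replacement one-liner in the commit above corresponds to the following standard-library call. The wrapper name `new_uuid` is hypothetical and exists only to make the snippet testable.

```python
import uuid


def new_uuid() -> str:
    """Return a random (version 4) UUID as a string, like uuidgen would."""
    return str(uuid.uuid4())


if __name__ == "__main__":
    print(new_uuid())
```

Since `uuid` ships with the Python standard library, this avoids depending on the `uuid-runtime` package being installed on the runner.
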
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
ko3n1g added the CI and Run CICD labels and removed the CI label on Apr 18, 2026
github-actions bot removed the Run CICD label on Apr 18, 2026
ko3n1g marked this pull request as draft on April 18, 2026 at 16:54

2 participants