feat(ci): migrate from Azure to AWS ephemeral runners #15620
Draft
Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Replace nemoci.azurecr.io (Azure ACR) with 766267172432.dkr.ecr.us-east-1.amazonaws.com/nemo-speech (AWS ECR) across all CI workflows.

Rebuild _build_container.yml as an inline docker/build-push-action job so ECR registry access works on AWS runners. Image tags embed the image-name prefix (e.g. nemo_container-<run_id>) since all images share one ECR repository. Propagate image URLs through workflow outputs so downstream jobs reference the correct ECR tag. Update the test-template action to accept a full image URL. Add a root checkout to each job that uses the local action so that ./.github/actions/ resolves correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
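The tag scheme described in this commit can be illustrated with a short Python sketch. The helper name `ecr_image_url` is hypothetical and not part of the PR; only the registry, the repository, and the `<image-name>-<run_id>` tag layout come from the commit message, and the run ID used below is a placeholder.

```python
# Registry and repository as named in the commit message.
ECR_REGISTRY = "766267172432.dkr.ecr.us-east-1.amazonaws.com"
ECR_REPOSITORY = "nemo-speech"


def ecr_image_url(image_name: str, run_id: str) -> str:
    """Build a full image URL whose tag embeds the image name.

    Hypothetical helper: because all images share one ECR repository,
    the image name moves from the repository path into the tag.
    """
    if not image_name:
        raise ValueError("image_name must be non-empty")
    return f"{ECR_REGISTRY}/{ECR_REPOSITORY}:{image_name}-{run_id}"


if __name__ == "__main__":
    # "123456789" is a placeholder, not a real workflow run ID.
    print(ecr_image_url("nemo_container", "123456789"))
    print(ecr_image_url("nemo_container_speech", "123456789"))
```

Passing the resulting full URL between jobs, rather than rebuilding it in each consumer, keeps the tag layout defined in one place.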
9d11be8 to 7e9c29f
Replace ubuntu-latest with inputs.runner in all CPU-only matrix entries in cicd-main-unit-tests.yml and cicd-main-speech.yml.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Add test-data-path input to test-template action and wire vars.DEFAULT_TEST_DATA_PATH through all callers. Defaults to /mnt/datadrive/TestData when the variable is unset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
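The fallback behaviour can be sketched in Python as follows. The function name `resolve_test_data_path` is hypothetical, and treating an empty string the same as an unset variable is an assumption made here, not something the commit message states.

```python
from typing import Optional

# Fallback path named in the commit message.
FALLBACK_TEST_DATA_PATH = "/mnt/datadrive/TestData"


def resolve_test_data_path(configured: Optional[str]) -> str:
    """Return the configured path, or the fallback when it is unset.

    Assumption: an empty or whitespace-only value counts as unset.
    """
    if configured is None or configured.strip() == "":
        return FALLBACK_TEST_DATA_PATH
    return configured


if __name__ == "__main__":
    print(resolve_test_data_path(None))
    # "/data/TestData" is a made-up override for illustration.
    print(resolve_test_data_path("/data/TestData"))
```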
Replace Bors-style `nv-gha-runners/get-pr-info` action with direct `github.event.pull_request.*` context, following the pattern from Megatron-Bridge/pull/3370 adapted for NeMo's standard PR events:

- cicd-main.yml: remove `get-pr-info` step; use `github.event.pull_request.user.login` for SSO username lookup; bump checkout action from v4 to v6
- _build_container.yml: remove Bors-style `get-pr-info` step (condition `startsWith(refs/heads/pull-request/)` never fires for pull_request events); use `github.event.pull_request.number` for cache keys

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
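For readers less familiar with the event payload, here is a minimal Python sketch of reading the same two fields from a `pull_request` event. It is an illustration only: the workflows in this PR read these values through the `github.event` expression context, and the function names below are hypothetical. `GITHUB_EVENT_PATH` is the environment variable GitHub Actions sets to the location of the payload file.

```python
import json
import os
from typing import Optional, Tuple


def pr_author_and_number(event: dict) -> Tuple[str, int]:
    """Extract the PR author's login and the PR number from a payload."""
    pr = event["pull_request"]
    return pr["user"]["login"], pr["number"]


def load_event(path: Optional[str] = None) -> dict:
    """Load the event payload from disk (defaults to GITHUB_EVENT_PATH)."""
    path = path or os.environ["GITHUB_EVENT_PATH"]
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    # Trimmed, made-up payload; "some-user" is a placeholder login.
    sample = {"pull_request": {"number": 15620, "user": {"login": "some-user"}}}
    print(pr_author_and_number(sample))
```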
uuidgen (uuid-runtime) is not installed on nemo-ci-aws-gpu-x2 runners. Use python3 -c 'import uuid; print(uuid.uuid4())', which is always available in the NeMo container environment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
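Expanded into a small script, the one-liner from this commit looks like the sketch below. The function name `new_random_id` is hypothetical; `uuid.uuid4()` is the standard-library call the commit message uses, and it yields a random version-4 UUID, the same kind of identifier that `uuidgen` produces with its random option.

```python
import uuid


def new_random_id() -> str:
    """Return a random (version 4) UUID in canonical 36-character form."""
    return str(uuid.uuid4())


if __name__ == "__main__":
    print(new_random_id())
```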
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Summary
- `is-not-external-contributor` job: NVIDIA SSO members get `nemo-ci-aws-gpu-x2` runners; external contributors get isolated `nemo-ci-aws-gpu-x2-ephemeral` runners
- Migrate runners from `self-hosted-azure*` to `nemo-ci-aws-gpu-x2` AWS runners
- Migrate the container registry from `nemoci.azurecr.io` (Azure ACR) to `766267172432.dkr.ecr.us-east-1.amazonaws.com/nemo-speech` (AWS ECR)
- Rebuild `_build_container.yml` as an inline `docker/build-push-action` job (removes FW-CI-templates delegation) so ECR auth works on AWS runners
- Embed the image name in the tag, since all images share one ECR repository (`nemo-speech:<image-name>-<run_id>`)
- Update the `test-template` action to accept a full image URL instead of constructing `nemoci.azurecr.io/<name>:<run_id>` internally
- Adopt the standard `pull_request` event model (mirrors Megatron-Bridge#3370): use `github.event.pull_request.user.login` and `.number` directly instead of the Bors-style `nv-gha-runners/get-pr-info` action

Example image references

- `766267172432.dkr.ecr.us-east-1.amazonaws.com/nemo-speech:nemo_container-<run_id>`
- `766267172432.dkr.ecr.us-east-1.amazonaws.com/nemo-speech:nemo_container_speech-<run_id>`

Test plan

- `is-not-external-contributor` job passes for internal contributor PRs
- `cicd-test-container-build` builds and pushes to ECR successfully
- `cicd-import-tests` pulls the correct ECR image
- `cicd-main-unit-tests` run with the ECR image
- `cicd-main-speech` build and run the speech ECR image

🤖 Generated with Claude Code
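The runner split listed in the summary can be sketched in Python as below. The function name `select_runner` and the boolean flag are hypothetical; how SSO membership is actually determined is not shown on this page, so the sketch takes it as an input.

```python
# Runner labels as named in the PR summary.
INTERNAL_RUNNER = "nemo-ci-aws-gpu-x2"
EXTERNAL_RUNNER = "nemo-ci-aws-gpu-x2-ephemeral"


def select_runner(is_sso_member: bool) -> str:
    """Pick the runner label for a PR author.

    Assumption: the membership check happens elsewhere and its result
    is passed in as a plain boolean.
    """
    return INTERNAL_RUNNER if is_sso_member else EXTERNAL_RUNNER


if __name__ == "__main__":
    print(select_runner(True))
    print(select_runner(False))
```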