Enable ROCm CI support #1786
base: main
…g ubuntu folder for cuda Dockerfile.
…Fixed error in integration_tests.py. Fixed lint errors.
…_job_v2.yml for integration_test_8gpu.yaml.
…ily available to run the workflow.
…n in linux_job_v2.yml.
… and move_aws_steps_inside_setup_rocm branch.
…tures tests inside integration_test_8gpu_features.yaml. Using linux_job_v2.yml from the main branch. Rolled back to using 8 GPU runner for ROCm.
jobs:
  build-test:
    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
Instead of using the main branch here, could you try the branch 7311 from pytorch/test-infra#7329 to test this out:
- uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
+ uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@7311
We can revert it back to main if the test works.
@akashveramd We merged pytorch/pytorch#164769 so we can switch back to main branch now. @huydhn also confirmed that the AWS role was updated, and the latest retriggered job passed after that: https://github.com/pytorch/torchtitan/actions/runs/18391595425/job/52493188786?pr=1786
Please retrigger after switching back to main branch. It seems there'll be a new error related to artifact directory creation.
@jithunnair-amd: Switched back to main. I see a failure in the ROCm workflow:
mkdir: cannot create directory 'artifacts-to-be-uploaded': Permission denied
https://github.com/pytorch/torchtitan/actions/runs/18424663754/job/52504224020?pr=1786
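A minimal diagnostic sketch for the `Permission denied` failure above, assuming a POSIX shell on the runner. The helper name `check_writable` is hypothetical and is not part of this PR or of linux_job_v2.yml. It only reports whether the current user can write to a given directory, which is one of the conditions `mkdir artifacts-to-be-uploaded` needs to hold for its parent directory.

```shell
# Hypothetical helper (not from the PR): report whether the current
# user has write permission on the directory passed as $1.
check_writable() {
  dir="$1"
  if [ -w "$dir" ]; then
    echo "writable"
  else
    echo "not-writable"
  fi
}
```

Calling `check_writable "$PWD"` in the failing step, alongside `id` and `ls -ld "$PWD"`, would show whether the job user and the owner of the working directory differ. The thread does not establish the actual cause, so this is only a starting point for debugging.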
This PR is based on the original PR #1260.
The original PR was created from a different fork and had issues setting up AWS inside the workflow, because the workflow was running from a forked PR.