akashveramd
Collaborator

This PR is based on the original PR #1260.
The original PR was created from a different fork and had trouble setting up AWS inside the workflow, because the workflow was running from a forked PR.

…Fixed error in integration_tests.py. Fixed lint errors.
@akashveramd akashveramd self-assigned this Oct 2, 2025
@akashveramd akashveramd requested a review from wconstab as a code owner October 2, 2025 18:42
@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Oct 2, 2025

pytorch-bot bot commented Oct 2, 2025

No ciflow labels are configured for this repo.
For information on how to enable CIFlow bot see this wiki

… and move_aws_steps_inside_setup_rocm branch.
…tures tests inside integration_test_8gpu_features.yaml. Using linux_job_v2.yml from the main branch. Rolled back to using 8 GPU runner for ROCm.

jobs:
  build-test:
    uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
Contributor

Instead of using the main branch here, could you try to use the branch 7311 from pytorch/test-infra#7329 to test this out:

Suggested change
- uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@main
+ uses: pytorch/test-infra/.github/workflows/linux_job_v2.yml@7311

We can revert to main if the test works.

@akashveramd We merged pytorch/pytorch#164769, so we can switch back to the main branch now. @huydhn also confirmed that the AWS role was updated, and the latest retriggered job passed after that: https://github.com/pytorch/torchtitan/actions/runs/18391595425/job/52493188786?pr=1786

Please retrigger after switching back to the main branch. It looks like there will be a new error related to artifact directory creation.

Collaborator Author

@jithunnair-amd: Switched back to main. I see failures in the ROCm workflow:
mkdir: cannot create directory 'artifacts-to-be-uploaded': Permission denied
https://github.com/pytorch/torchtitan/actions/runs/18424663754/job/52504224020?pr=1786
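
For anyone debugging the `mkdir` failure above, a minimal diagnostic sketch follows. It is not taken from the workflow: `check_artifact_dir` is a hypothetical helper, and the assumption that the directory is created relative to the current working directory is mine. It prints which user the step runs as, shows the owner and mode of the parent directory, and attempts the `mkdir` only when the parent is writable.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of linux_job_v2.yml): report who is running
# and whether the artifact directory can be created under a parent directory.
check_artifact_dir() {
  local parent="${1:-.}"                        # assumed parent: current directory
  local name="${2:-artifacts-to-be-uploaded}"   # directory name from the error message

  echo "user: $(id -un) (uid=$(id -u))"
  ls -ld "$parent"                              # owner and permission bits of the parent

  if [ -w "$parent" ]; then
    echo "writable: $parent"
    mkdir -p "$parent/$name"                    # returns mkdir's exit status
  else
    echo "not writable by current user: $parent"
    return 1
  fi
}
```

Calling `check_artifact_dir .` in a step just before the failing one should show whether the job user and the owner of the working directory differ, which is one possible cause of `Permission denied`.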

Labels

ciflow/rocm CLA Signed This label is managed by the Meta Open Source bot. module: rocm
