CI: Add ROCm nightly docker workflow#3115
| working-directory: .ci/docker | ||
| run: | | ||
| export PYTORCH_ROCM_ARCH="${{ env.PYTORCH_ROCM_ARCH }}" | ||
| ./build.sh pytorch-linux-jammy-rocm${{ env.ROCM_VERSION }}-py${{ env.PYTHON_VERSION }} \ |
| ./build.sh pytorch-linux-jammy-rocm${{ env.ROCM_VERSION }}-py${{ env.PYTHON_VERSION }} \ | |
| ./build.sh pytorch-linux-jammy-rocm-n-py3 \ |
Along with a sed in build.sh to replace the version with the ROCM_VERSION we want to build in this workflow: https://github.com/pytorch/pytorch/blob/5e73467572a7b3a5508f1216cedf5fe6f1ac5ce0/.ci/docker/build.sh#L162
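A minimal sketch of what such a sed could look like, assuming GNU sed and assuming build.sh sets the version through a plain `ROCM_VERSION=<value>` assignment inside the `case` branch for the image. Neither detail is confirmed in this thread, and `pin_rocm_version` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the PR: rewrite the ROCM_VERSION assignment
# only inside the case branch that handles the given image name.
# Assumes GNU sed and a "ROCM_VERSION=<value>" line within that branch.
pin_rocm_version() {
  local script="$1" image="$2" version="$3"
  # The address range starts at the line naming the image and ends at the
  # branch terminator ";;", so other branches keep their own versions.
  sed -i -E "/${image}/,/;;/ s|(ROCM_VERSION=)[^[:space:]]+|\1${version}|" "$script"
}
```

In the workflow this would be called before `./build.sh pytorch-linux-jammy-rocm-n-py3`, for example `pin_rocm_version build.sh pytorch-linux-jammy-rocm-n-py3 "$ROCM_VERSION"`.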
| - .github/scripts/rocm_nightly_debug_build.sh | ||
|
|
||
| env: | ||
| ROCM_VERSION: '7.2.1' |
| ROCM_VERSION: '7.2.1' | |
| ROCM_VERSION: '7.2.2' |
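The PR description mentions an optional rocm_version override on top of this default. One way that fallback could be expressed in shell is sketched below; the function names are hypothetical and the image naming simply mirrors the `build.sh` invocation in this diff.

```shell
#!/usr/bin/env bash
# Sketch only: resolve the ROCm version for a run and derive the image name.
# DEFAULT_ROCM_VERSION matches the value suggested in this review.
DEFAULT_ROCM_VERSION="7.2.2"

resolve_rocm_version() {
  # $1: optional override; an unset or empty value falls back to the default.
  printf '%s\n' "${1:-$DEFAULT_ROCM_VERSION}"
}

image_name() {
  # $1: optional ROCm override, $2: Python version.
  # Produces pytorch-linux-jammy-rocm<rocm>-py<python>, as in the build step.
  printf 'pytorch-linux-jammy-rocm%s-py%s\n' "$(resolve_rocm_version "$1")" "$2"
}
```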
|
@leo-automation to check if increasing MTU size on runner nodes might resolve network issues such as https://github.com/ROCm/pytorch/actions/runs/24400425702/job/71269225000 |
|
Jenkins build for 807c7a11d7e2982b3c5e653b91aa4b61b61879a2 commit finished as NOT_BUILT |
| ENV CI=1 | ||
| ENV PYTORCH_TEST_WITH_ROCM=1 | ||
| ENV PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" | ||
| ENV USE_NVSHMEM=0 |
@leo-automation Please add a comment stating that this is a TODO and TEMPORARY, along with the reason why it's there.
| RUN git clone https://github.com/pytorch/pytorch --recursive \ | ||
| && cd pytorch \ | ||
| # Bypass sccache on torch_rocshmem: its -fgpu-rdc + mixed xnack± offload-arch flags break sccache's argv parser. | ||
| && sed -i 's|set_target_properties(torch_rocshmem PROPERTIES LINKER_LANGUAGE HIP)|set_target_properties(torch_rocshmem PROPERTIES LINKER_LANGUAGE HIP CXX_COMPILER_LAUNCHER "" HIP_COMPILER_LAUNCHER "")|' caffe2/CMakeLists.txt \ |
@leo-automation Do we still need this if we have the USE_NVSHMEM=0 above? If not, we could leave this in a comment if you don't want to lose it.
| && git config --local user.email "amd@amd.com" \ | ||
| && git remote add rocm https://github.com/ROCm/pytorch.git \ | ||
| && git fetch rocm \ | ||
| && git cherry-pick 519160d466782f5a62365be051fcb3ef90fa0b00 \ |
| env: | ||
| ROCM_VERSION: '7.2.2' | ||
| PYTHON_VERSION: '3.10' | ||
| PYTORCH_ROCM_ARCH: 'gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1102;gfx1150;gfx1151;gfx1200;gfx1201' |
@pruthvistony @jeffdaily @pragupta @jerrymannil @jataylo Since these images are primarily for dev use, do we really need to build for all archs by default? We can always build for a specific arch if needed by manually triggering the workflow (we would need to ensure that manual triggers use a different docker image tag so they don't overwrite the cron-based ones). I propose we build only for the most commonly required GFX archs for dev use.
My proposal: "gfx90a gfx942 gfx950"
Any Radeon ones needed?
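To make the proposal concrete, here is a rough sketch of how a workflow step could pick the arch list and image tag depending on how the run was triggered. The `manual-` tag prefix and both function names are hypothetical; `schedule` and `workflow_dispatch` are the GitHub event names for cron and manual runs, and the semicolon separator follows the existing PYTORCH_ROCM_ARCH value.

```shell
#!/usr/bin/env bash
# Sketch only: defaults for cron runs, overrides and a separate tag for manual runs.
# DEFAULT_ARCHS is the reduced list proposed in this comment.
DEFAULT_ARCHS="gfx90a;gfx942;gfx950"

resolve_archs() {
  # $1: optional arch list supplied on a manual trigger; empty means "use defaults".
  printf '%s\n' "${1:-$DEFAULT_ARCHS}"
}

resolve_tag() {
  # $1: event name ("schedule" or "workflow_dispatch"), $2: base image tag.
  local event="$1" base="$2"
  if [ "$event" = "schedule" ]; then
    # Cron builds keep the canonical tag.
    printf '%s\n' "$base"
  else
    # Anything else gets a prefixed tag so it cannot overwrite the cron image.
    printf 'manual-%s\n' "$base"
  fi
}
```

With this split, a manual run for a Radeon arch (for example `resolve_archs "gfx1100"`) would publish under a `manual-` tag and leave the scheduled image untouched.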
|
Jenkins build for 807c7a11d7e2982b3c5e653b91aa4b61b61879a2 commit finished as FAILURE |
|
Jenkins build for fb1c009dec41f8dd2083d9e28be2fb7714fe464e commit finished as NOT_BUILT |
|
Jenkins build for 20df855ecfe6114da39c5403157eba72e5f49853 commit finished as NOT_BUILT |
|
Jenkins build for 20df855ecfe6114da39c5403157eba72e5f49853 commit finished as FAILURE |
|
Jenkins build for 077f47cf15cb4d2fb62b9f3fb8f7de17520b5650 commit finished as FAILURE |
|
Jenkins build for 012f03505ff5cafb0aceb085dcc30230540c51a9 commit finished as FAILURE |
|
Jenkins build for 88ec33078145d1c899001ecbc60a0bce195af050 commit finished as FAILURE Detected error during Pytorch building: |
|
Jenkins build for 93b2d31c392008b0cfe4788a1a332f67553d4996 commit finished as NOT_BUILT |
|
Jenkins build for 93b2d31c392008b0cfe4788a1a332f67553d4996 commit finished as FAILURE |
Force-pushed from 93b2d31 to 7b8dd18.
|
Jenkins build for 7b8dd188ab294f8c1d36cb5b5b898fdadc06f7f9 commit finished as FAILURE |
Motivation
Migrating from Jenkins. This PR adds ROCm-specific nightly docker automation so we can build and validate a dedicated ROCm nightly image outside the standard PyTorch docker release flow.
Technical Details
- Add .github/workflows/pytorch-nightly-docker.yml to build and push a ROCm nightly image, support an optional rocm_version override, and run a ROCm smoke-test job against the produced image.
- Add .ci/docker/pytorch-nightly-docker.Dockerfile to build PyTorch, torchvision, and torchaudio on top of the ROCm nightly base image.
Test Result