CI: Add ROCm nightly docker workflow#3115

Open
leo-automation wants to merge 51 commits into develop from rocm-nightly-gha

@leo-automation
Collaborator

Motivation

Migrating from Jenkins. This PR adds ROCm-specific nightly Docker automation so we can build and validate a dedicated ROCm nightly image outside the standard PyTorch Docker release flow.

Technical Details

  • Adds .github/workflows/pytorch-nightly-docker.yml to build and push a ROCm nightly image, support an optional rocm_version override, and run a ROCm smoke-test job against the produced image.
  • Adds .ci/docker/pytorch-nightly-docker.Dockerfile to build PyTorch, torchvision, and torchaudio on top of the ROCm nightly base image.

Test Result

leo-automation removed the request for review from jeffdaily on April 1, 2026 13:31
working-directory: .ci/docker
run: |
export PYTORCH_ROCM_ARCH="${{ env.PYTORCH_ROCM_ARCH }}"
./build.sh pytorch-linux-jammy-rocm${{ env.ROCM_VERSION }}-py${{ env.PYTHON_VERSION }} \
Collaborator

Suggested change
- ./build.sh pytorch-linux-jammy-rocm${{ env.ROCM_VERSION }}-py${{ env.PYTHON_VERSION }} \
+ ./build.sh pytorch-linux-jammy-rocm-n-py3 \

This would go along with a sed on build.sh that substitutes the ROCM_VERSION we want to build in this workflow: https://github.com/pytorch/pytorch/blob/5e73467572a7b3a5508f1216cedf5fe6f1ac5ce0/.ci/docker/build.sh#L162
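
A rough sketch of what that sed step might look like. It runs against a stand-in file, not the real .ci/docker/build.sh: the case-block layout, the `ROCM_VERSION=` assignment line, and the helper name `override_rocm_version` are all assumptions for illustration, and `sed -i` without a suffix assumes GNU sed as found on Linux runners.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: inside the case block for one image tag, rewrite the
# ROCM_VERSION assignment. The block layout (tag line ... ;;) is an assumption
# about build.sh, not something confirmed in this PR.
override_rocm_version() {
  local file="$1" tag="$2" version="$3"
  # The address range limits the substitution to the lines between the
  # "<tag>)" line and the next ";;", so other image entries stay untouched.
  sed -i -E "/^[[:space:]]*${tag}\)/,/;;/ s|^([[:space:]]*)ROCM_VERSION=.*|\1ROCM_VERSION=${version}|" "$file"
}

# Demo on a stand-in file with two made-up entries.
demo_file="$(mktemp)"
cat > "$demo_file" <<'EOF'
  pytorch-linux-jammy-rocm-n-py3)
    ROCM_VERSION=7.1
    ;;
  some-other-image)
    ROCM_VERSION=6.4
    ;;
EOF

override_rocm_version "$demo_file" "pytorch-linux-jammy-rocm-n-py3" "7.2.2"

target_line="$(sed -n '2p' "$demo_file")"
other_line="$(sed -n '5p' "$demo_file")"
echo "$target_line"
echo "$other_line"
rm -f "$demo_file"
```

In the workflow this would presumably run just before `./build.sh pytorch-linux-jammy-rocm-n-py3`, with the version taken from `env.ROCM_VERSION`.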

- .github/scripts/rocm_nightly_debug_build.sh

env:
ROCM_VERSION: '7.2.1'
Collaborator

Suggested change
- ROCM_VERSION: '7.2.1'
+ ROCM_VERSION: '7.2.2'

@jithunnair-amd
Collaborator

@leo-automation to check if increasing MTU size on runner nodes might resolve network issues such as https://github.com/ROCm/pytorch/actions/runs/24400425702/job/71269225000
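
As a starting point for that check, a small sketch that reports the current MTU per interface. On a Linux runner the directory to pass would be /sys/class/net; the demo below uses a fake tree so it runs anywhere, and the helper name `list_mtus` and the sample values are made up for illustration.

```shell
#!/bin/sh
set -eu

# Print "<interface> <mtu>" for every interface under a sysfs-style directory.
list_mtus() {
  local net_root="$1" iface_dir
  for iface_dir in "$net_root"/*; do
    # Skip entries that have no readable mtu file.
    [ -r "$iface_dir/mtu" ] || continue
    echo "$(basename "$iface_dir") $(cat "$iface_dir/mtu")"
  done
}

# Demo against a fake tree; on a real runner: list_mtus /sys/class/net
fake_root="$(mktemp -d)"
mkdir -p "$fake_root/eth0" "$fake_root/lo"
echo 1500 > "$fake_root/eth0/mtu"
echo 65536 > "$fake_root/lo/mtu"

mtu_report="$(list_mtus "$fake_root")"
echo "$mtu_report"
rm -rf "$fake_root"
```

Comparing this output across runner nodes before and after any change would show whether the MTU actually differs on the nodes that hit the network failures.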

ROCm deleted a comment from rocm-repo-management-api Bot on Apr 22, 2026
ROCm deleted a comment from rocm-repo-management-api Bot on Apr 22, 2026

rocm-repo-management-api Bot commented Apr 23, 2026

Jenkins build for 807c7a11d7e2982b3c5e653b91aa4b61b61879a2 commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results

ENV CI=1
ENV PYTORCH_TEST_WITH_ROCM=1
ENV PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda"
ENV USE_NVSHMEM=0
Collaborator

@leo-automation Please add a comment stating that this is TODO and TEMPORARY and a reason why it's there

RUN git clone https://github.com/pytorch/pytorch --recursive \
&& cd pytorch \
# Bypass sccache on torch_rocshmem: its -fgpu-rdc + mixed xnack± offload-arch flags break sccache's argv parser.
&& sed -i 's|set_target_properties(torch_rocshmem PROPERTIES LINKER_LANGUAGE HIP)|set_target_properties(torch_rocshmem PROPERTIES LINKER_LANGUAGE HIP CXX_COMPILER_LAUNCHER "" HIP_COMPILER_LAUNCHER "")|' caffe2/CMakeLists.txt \
Collaborator

@leo-automation Do we still need this if we have the USE_NVSHMEM=0 above? If not, we could leave this in a comment if you don't want to lose it
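
If the workaround does stay, one caveat is that `sed -i` exits 0 even when the target line is no longer present upstream, so the patch could stop applying without anyone noticing. A minimal sketch of a post-check, run here against a stand-in file that contains only the targeted line (the real file would be caffe2/CMakeLists.txt; the `patch_status` variable is purely illustrative):

```shell
#!/bin/sh
set -eu

# Stand-in for caffe2/CMakeLists.txt holding only the line the patch targets.
cmake_file="$(mktemp)"
echo 'set_target_properties(torch_rocshmem PROPERTIES LINKER_LANGUAGE HIP)' > "$cmake_file"

# Same substitution as in the Dockerfile hunk above.
sed -i 's|set_target_properties(torch_rocshmem PROPERTIES LINKER_LANGUAGE HIP)|set_target_properties(torch_rocshmem PROPERTIES LINKER_LANGUAGE HIP CXX_COMPILER_LAUNCHER "" HIP_COMPILER_LAUNCHER "")|' "$cmake_file"

# sed succeeds even when nothing matched, so verify the result explicitly.
if grep -qF 'CXX_COMPILER_LAUNCHER "" HIP_COMPILER_LAUNCHER ""' "$cmake_file"; then
  patch_status="applied"
else
  patch_status="missing"
fi

patched_line="$(cat "$cmake_file")"
echo "$patch_status: $patched_line"
rm -f "$cmake_file"
```

In the Dockerfile the same idea would reduce to chaining a `grep -qF ... caffe2/CMakeLists.txt` after the sed with `&&`, so the image build fails when the line has drifted.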

&& git config --local user.email "amd@amd.com" \
&& git remote add rocm https://github.com/ROCm/pytorch.git \
&& git fetch rocm \
&& git cherry-pick 519160d466782f5a62365be051fcb3ef90fa0b00 \
Collaborator

@leo-automation Do we need this as well?

Comment thread .ci/docker/pytorch-nightly-docker.Dockerfile
env:
ROCM_VERSION: '7.2.2'
PYTHON_VERSION: '3.10'
PYTORCH_ROCM_ARCH: 'gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1102;gfx1150;gfx1151;gfx1200;gfx1201'
Collaborator

jithunnair-amd commented Apr 23, 2026

@pruthvistony @jeffdaily @pragupta @jerrymannil @jataylo Since these images are primarily for dev use, do we really need to build for all archs by default? We can always build for a specific arch if needed, by manually triggering the workflow (would need to ensure that manual triggers use a different docker image tag so it doesn't overwrite the cron-based ones). I propose we build only for the most-commonly required GFX archs for dev use.
My proposal: "gfx90a gfx942 gfx950"
Any Radeon ones needed?
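
To make the proposal concrete, a sketch of how the workflow could pick the arch list and keep manual runs from overwriting the cron image. `schedule` and `workflow_dispatch` are the GitHub Actions event names; everything else here (the helper names, the `nightly-<date>` tag scheme, the `-manual` suffix) is a placeholder assumption, not what the workflow currently does.

```shell
#!/bin/sh
set -eu

# Default arch list from the proposal above, semicolon-separated to match
# the PYTORCH_ROCM_ARCH format used in the workflow env.
DEFAULT_ARCHS='gfx90a;gfx942;gfx950'

# A non-empty override (space- or comma-separated) wins over the default.
resolve_archs() {
  local override="${1:-}"
  if [ -n "$override" ]; then
    echo "$override" | tr ' ,' ';;'
  else
    echo "$DEFAULT_ARCHS"
  fi
}

# Scheduled runs get the plain nightly tag; any other trigger gets a suffix
# so it cannot overwrite the cron-built image. Tag scheme is hypothetical.
image_tag() {
  local event="$1" date_stamp="$2"
  if [ "$event" = "schedule" ]; then
    echo "nightly-${date_stamp}"
  else
    echo "nightly-${date_stamp}-manual"
  fi
}

cron_tag="$(image_tag schedule 20260423)"
manual_tag="$(image_tag workflow_dispatch 20260423)"
default_archs="$(resolve_archs '')"
manual_archs="$(resolve_archs 'gfx1100 gfx1201')"

echo "cron:   $cron_tag  archs=$default_archs"
echo "manual: $manual_tag  archs=$manual_archs"
```

In the workflow, the event would come from `GITHUB_EVENT_NAME` and the override from a `workflow_dispatch` input, which would need to be added alongside the existing rocm_version override.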


rocm-repo-management-api Bot commented Apr 23, 2026

Jenkins build for 807c7a11d7e2982b3c5e653b91aa4b61b61879a2 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results


rocm-repo-management-api Bot commented Apr 24, 2026

Jenkins build for fb1c009dec41f8dd2083d9e28be2fb7714fe464e commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results


rocm-repo-management-api Bot commented Apr 24, 2026

Jenkins build for 20df855ecfe6114da39c5403157eba72e5f49853 commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results


rocm-repo-management-api Bot commented Apr 24, 2026

Jenkins build for 20df855ecfe6114da39c5403157eba72e5f49853 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results


rocm-repo-management-api Bot commented Apr 27, 2026

Jenkins build for 077f47cf15cb4d2fb62b9f3fb8f7de17520b5650 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results


rocm-repo-management-api Bot commented Apr 27, 2026

Jenkins build for 012f03505ff5cafb0aceb085dcc30230540c51a9 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results


rocm-repo-management-api Bot commented Apr 28, 2026

Jenkins build for 88ec33078145d1c899001ecbc60a0bce195af050 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results

Detected error during Pytorch building:

[7710/8176] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/distributed/c10d/FlightRecorderCuda.cpp.o
cc1plus: warning: command-line option ‘-Wno-duplicate-decl-specifier’ is valid for C/ObjC but not for C++
[7711/8176] Building CXX object caffe2/CMakeFiles/stride_properties_test.dir/__/aten/src/ATen/test/stride_properties_test.cpp.o
[7712/8176] Building CXX object caffe2/CMakeFiles/cpu_profiling_allocator_test.dir/__/aten/src/ATen/test/cpu_profiling_allocator_test.cpp.o
[7713/8176] Building CXX object caffe2/CMakeFiles/memory_overlapping_test.dir/__/aten/src/ATen/test/memory_overlapping_test.cpp.o
FAILED: caffe2/CMakeFiles/memory_overlapping_test.dir/__/aten/src/ATen/test/memory_overlapping_test.cpp.o 
/opt/cache/bin/sccache /opt/cache/bin/c++ -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_POSIX_FALLOCATE=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DHIPBLASLT_USE_ROCROLLER -DIDEEP_USE_MKL -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DROCM_VERSION=70202 -DTORCH_HIP_VERSION=702 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_LAYERNORM_FAST_RECIPROCAL -DUSE_PROF_API=1 -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_AMD__=1 -I/var/lib/jenkins/pytorch/build/aten/src -I/var/lib/jenkins/pytorch/aten/src -I/var/lib/jenkins/pytorch/build -I/var/lib/jenkins/pytorch -I/var/lib/jenkins/pytorch/nlohmann -I/var/lib/jenkins/pytorch/moodycamel -I/var/lib/jenkins/pytorch/build/include -I/var/lib/jenkins/pytorch/build/caffe2/aten/src -I/var/lib/jenkins/pytorch/aten/src/ATen/.. -I/var/lib/jenkins/pytorch/third_party/miniz-3.0.2 -I/var/lib/jenkins/pytorch/torch/csrc/api -I/var/lib/jenkins/pytorch/torch/csrc/api/include -I/var/lib/jenkins/pytorch/c10/.. -I/var/lib/jenkins/pytorch/c10/hip/../.. 
-isystem /opt/rocm-7.2.2/include -isystem /var/lib/jenkins/pytorch/build/third_party/gloo -isystem /var/lib/jenkins/pytorch/cmake/../third_party/gloo -isystem /var/lib/jenkins/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/googletest/googletest/include -isystem /var/lib/jenkins/pytorch/third_party/protobuf/src -isystem /opt/conda/envs/py_3.12/include -isystem /var/lib/jenkins/pytorch/third_party/XNNPACK/include -isystem /var/lib/jenkins/pytorch/third_party/ittapi/include -isystem /var/lib/jenkins/pytorch/cmake/../third_party/eigen -isystem /opt/rocm/include -isystem /var/lib/jenkins/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /var/lib/jenkins/pytorch/third_party/ideep/include -isystem /var/lib/jenkins/pytorch/INTERFACE -isystem /var/lib/jenkins/pytorch/third_party/nlohmann/include -isystem /var/lib/jenkins/pytorch/third_party/concurrentqueue -isystem /opt/rocm-7.2.2/include/hiprand -isystem /opt/rocm-7.2.2/include/rocrand -isystem /var/lib/jenkins/pytorch/third_party/googletest/googletest/include -isystem /var/lib/jenkins/pytorch/third_party/googletest/googletest -isystem /var/lib/jenkins/pytorch/third_party/googletest/googlemock/include -isystem /var/lib/jenkins/pytorch/third_party/googletest/googlemock -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOXPUPTI=ON -DUSE_FBGEMM -DUSE_MSLK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -DC10_NODEPRECATED -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-unused-parameter -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=old-style-cast -faligned-new -Wno-maybe-uninitialized 
-fno-math-errno -fno-trapping-math -Werror=format -Wno-dangling-reference -Wno-error=dangling-reference -Wno-stringop-overflow -DHAVE_AVX512_CPU_DEFINITION -DHAVE_AVX2_CPU_DEFINITION -O3 -DNDEBUG -DNDEBUG -std=gnu++20 -fPIE -fdiagnostics-color=always -DMKL_HAS_SBGEMM -DMKL_HAS_SHGEMM -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -fPIC -D__HIP_PLATFORM_AMD__=1 -DCUDA_HAS_FP16=1 -DUSE_ROCM -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DTORCH_HIP_VERSION=702 -Wno-shift-count-negative -Wno-shift-count-overflow -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -DHIPBLAS_V2 -DHIP_ENABLE_WARP_SYNC_BUILTINS -DHIPBLASLT_OUTER_VEC -DUSE_ROCM_CK_GEMM -MD -MT caffe2/CMakeFiles/memory_overlapping_test.dir/__/aten/src/ATen/test/memory_overlapping_test.cpp.o -MF caffe2/CMakeFiles/memory_overlapping_test.dir/__/aten/src/ATen/test/memory_overlapping_test.cpp.o.d -o caffe2/CMakeFiles/memory_overlapping_test.dir/__/aten/src/ATen/test/memory_overlapping_test.cpp.o -c /var/lib/jenkins/pytorch/aten/src/ATen/test/memory_overlapping_test.cpp
thread 'main' panicked at 'failed to shut down worker thread', /root/.cargo/registry/src/github.com-1ecc6299db9ec823/jobserver-0.1.9/src/lib.rs:650:16
note: Run with `RUST_BACKTRACE=1` for a backtrace.
[7714/8176] Building CXX object caffe2/CMakeFiles/native_test.dir/__/aten/src/ATen/test/native_test.cpp.o
[7715/8176] Building CXX object caffe2/CMakeFiles/torch_hip.dir/__/torch/csrc/jit/tensorexpr/cuda_codegen.cpp.o


rocm-repo-management-api Bot commented Apr 28, 2026

Jenkins build for 93b2d31c392008b0cfe4788a1a332f67553d4996 commit finished as NOT_BUILT
Links: Pipeline Overview / Build artifacts / Test Results


rocm-repo-management-api Bot commented Apr 28, 2026

Jenkins build for 93b2d31c392008b0cfe4788a1a332f67553d4996 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results


rocm-repo-management-api Bot commented Apr 29, 2026

Jenkins build for 7b8dd188ab294f8c1d36cb5b5b898fdadc06f7f9 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results
