Open
Changes from 41 commits
Commits
53 commits
507fd86
First draft
leo-automation Mar 31, 2026
59590af
Pass artifact and test before pushing image
leo-automation Mar 31, 2026
ef2aefd
Matrix
leo-automation Apr 1, 2026
fc2c16d
Remove notifications for now
leo-automation Apr 1, 2026
2a6a9e6
Temporary trigger
leo-automation Apr 1, 2026
a7223b2
Self hosted runners smoke test
leo-automation Apr 1, 2026
c818f73
Delete smokes
leo-automation Apr 1, 2026
454a5fa
Verbose and runner change
leo-automation Apr 2, 2026
03393e9
Update dockerfile
leo-automation Apr 2, 2026
551548d
Remove tty
leo-automation Apr 2, 2026
e53f83c
Use older buildx with better build logging
leo-automation Apr 8, 2026
2e68950
Verbose image build troubleshooting
leo-automation Apr 8, 2026
1784746
Debug
leo-automation Apr 9, 2026
28cbb19
More logging
leo-automation Apr 9, 2026
ffdf09a
Fix permissions and have main job disable sccache
leo-automation Apr 9, 2026
865f60d
Debug
leo-automation Apr 9, 2026
8711232
Fix debug script
leo-automation Apr 14, 2026
9b82418
Debug script fix
leo-automation Apr 14, 2026
1fbfc09
Implement Jithun's suggestions
leo-automation Apr 16, 2026
d74afa3
Updated timeout
leo-automation Apr 16, 2026
717a478
Remove debug
leo-automation Apr 17, 2026
2cd758b
pin sccache version
leo-automation Apr 17, 2026
c28afaa
Debug
leo-automation Apr 17, 2026
66ffb00
buildx fix
leo-automation Apr 17, 2026
554e8f5
Debug buildx
leo-automation Apr 20, 2026
bb8a72d
sccache version change
leo-automation Apr 20, 2026
a3fb579
Pin upstream commit
leo-automation Apr 20, 2026
cd7374c
sed on build and docker commit fix
leo-automation Apr 21, 2026
3b901b6
cmake deps
leo-automation Apr 21, 2026
f9c83ca
Disable rocSHMEM
leo-automation Apr 21, 2026
2c7f9b9
Remove push
leo-automation Apr 22, 2026
3525232
Remove some debugging
leo-automation Apr 22, 2026
8fc34b8
Enable for debug
leo-automation Apr 22, 2026
1324872
Disable USE_NVSHMEM
leo-automation Apr 22, 2026
eb32e63
Enable image push
leo-automation Apr 22, 2026
cd940fe
failed to read dockerfile
leo-automation Apr 22, 2026
6df0761
path fix
leo-automation Apr 22, 2026
cd81668
path fix
leo-automation Apr 22, 2026
7fd94cb
Bypass sccache on torch_rocshmem
leo-automation Apr 22, 2026
e74bf12
Upgrade actions versions
leo-automation Apr 22, 2026
8c25b4c
Trivy vuln image scan
leo-automation Apr 22, 2026
758f32b
All in one job
leo-automation Apr 23, 2026
af18af1
try 7.2.0
leo-automation Apr 23, 2026
807c7a1
7.2
leo-automation Apr 23, 2026
391c1d3
Bypass sccache on rocshmem torch target
leo-automation Apr 24, 2026
fb1c009
Remove cherry pick
leo-automation Apr 24, 2026
20df855
sccache workaround
leo-automation Apr 24, 2026
077f47c
Address comments
leo-automation Apr 27, 2026
012f035
Trivy increase context size
leo-automation Apr 27, 2026
88ec330
Try removing use_preprocessor_cache_mode from sccache
leo-automation Apr 28, 2026
7b8dd18
Cleanup
leo-automation Apr 28, 2026
0fe733e
Add a FIXME
jithunnair-amd May 5, 2026
202d0c5
Address comments
leo-automation May 6, 2026
30 changes: 30 additions & 0 deletions .ci/docker/pytorch-nightly-docker.Dockerfile
@@ -0,0 +1,30 @@
ARG BASE_IMAGE=rocm/pytorch-autobuild:base-latest
FROM ${BASE_IMAGE}
WORKDIR /tmp
USER root

ENV CI=1
ENV PYTORCH_TEST_WITH_ROCM=1
ENV PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda"
ENV USE_NVSHMEM=0
@leo-automation Please add a comment stating that this is TODO and TEMPORARY and a reason why it's there


RUN git clone https://github.com/pytorch/pytorch --recursive \
&& cd pytorch \
# Bypass sccache on torch_rocshmem: its -fgpu-rdc + mixed xnack± offload-arch flags break sccache's argv parser.
&& sed -i 's|set_target_properties(torch_rocshmem PROPERTIES LINKER_LANGUAGE HIP)|set_target_properties(torch_rocshmem PROPERTIES LINKER_LANGUAGE HIP CXX_COMPILER_LAUNCHER "" HIP_COMPILER_LAUNCHER "")|' caffe2/CMakeLists.txt \
@leo-automation Do we still need this if we have the USE_NVSHMEM=0 above? If not, we could leave this in a comment if you don't want to lose it

&& pip install -r requirements.txt \
&& git config --local user.name "AMD AMD" \
&& git config --local user.email "amd@amd.com" \
&& git remote add rocm https://github.com/ROCm/pytorch.git \
&& git fetch rocm \
&& git cherry-pick 519160d466782f5a62365be051fcb3ef90fa0b00 \
@leo-automation Do we need this as well?

&& (.ci/pytorch/build.sh > /tmp/build.log 2>&1 || (tail -300 /tmp/build.log; exit 1)) \
&& rm -rf /tmp/pytorch/.git
RUN git clone https://github.com/pytorch/vision \
&& cd vision \
&& FORCE_CUDA=1 python setup.py install \
&& rm -rf /tmp/vision/.git
RUN git clone https://github.com/pytorch/audio \
&& cd audio \
&& python setup.py install \
&& rm -rf /tmp/audio/.git
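The Dockerfile above redirects the PyTorch build output to `/tmp/build.log` and prints only the last 300 lines when the build fails. A minimal bash sketch of that pattern as a reusable function might look like the following. `run_quiet` and its argument order are hypothetical names chosen for illustration and are not part of this PR.

```shell
# Hypothetical helper, not part of the PR: run a command with all of its
# output captured in a log file, and print only the tail of the log on failure.
run_quiet() {
  local log="$1" lines="$2" status=0
  shift 2
  # Capture stdout and stderr; record the exit status without tripping `set -e`.
  "$@" > "$log" 2>&1 || status=$?
  if [ "$status" -ne 0 ]; then
    tail -n "$lines" "$log"
  fi
  return "$status"
}

# Usage mirroring the Dockerfile step (log path and line count taken from the diff):
# run_quiet /tmp/build.log 300 .ci/pytorch/build.sh
```

Unlike the inline form in the Dockerfile, which always exits with status 1, this sketch returns the command's own exit status. Either behavior fails the `RUN` step.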
170 changes: 170 additions & 0 deletions .github/workflows/pytorch-nightly-docker.yml
@@ -0,0 +1,170 @@
name: ROCm Nightly Build and Test

on:
  schedule:
    # Run nightly at 2 AM UTC
    - cron: '0 2 * * *'
  workflow_dispatch:
    inputs:
      rocm_version:
        description: ROCm version to build
        required: false
        type: string
  workflow_call:
    inputs:
      rocm_version:
        required: false
        type: string
  push:
    branches:
      - rocm-nightly-gha

env:
  ROCM_VERSION: '7.2.2'
  PYTHON_VERSION: '3.10'
  PYTORCH_ROCM_ARCH: 'gfx906;gfx908;gfx90a;gfx942;gfx950;gfx1030;gfx1100;gfx1101;gfx1102;gfx1150;gfx1151;gfx1200;gfx1201'
  DOCKER_REGISTRY: rocm/pytorch-nightly

jobs:
  build:
    name: Build ROCm Nightly Image
    runs-on: linux-pytorch-mi325-1
    timeout-minutes: 720
    outputs:
      full-image: ${{ steps.meta.outputs.full-image }}
    steps:
      - name: Resolve ROCm version
        if: ${{ inputs.rocm_version != '' }}
        run: echo "ROCM_VERSION=${{ inputs.rocm_version }}" >> "$GITHUB_ENV"

      - name: Checkout pytorch
        uses: actions/checkout@v6
        with:
          repository: pytorch/pytorch
          ref: main

      - name: Checkout nightly workflow files
        uses: actions/checkout@v6
        with:
          path: rocm-nightly-workflow

      - name: Patch rocm-n build.sh version
        run: |
          sed -i '/pytorch-linux-jammy-rocm-n-py3 | pytorch-linux-jammy-rocm-n-py3-benchmarks | pytorch-linux-noble-rocm-n-py3)/,/;;/ s/ROCM_VERSION=7\.2/ROCM_VERSION=${{ env.ROCM_VERSION }}/' .ci/docker/build.sh
          sed -n '/pytorch-linux-jammy-rocm-n-py3 | pytorch-linux-jammy-rocm-n-py3-benchmarks | pytorch-linux-noble-rocm-n-py3)/,/;;/p' .ci/docker/build.sh

      - name: Generate image tag
        id: meta
        run: |
          tag="$(date +%Y%m%d%H%M%S)-rocm${{ env.ROCM_VERSION }}"
          echo "full-image=${{ env.DOCKER_REGISTRY }}:${tag}" >> "$GITHUB_OUTPUT"

      - name: Build base image
        working-directory: .ci/docker
        run: |
          export SKIP_SCCACHE_INSTALL=1
          export PYTORCH_ROCM_ARCH="${{ env.PYTORCH_ROCM_ARCH }}"
          ./build.sh pytorch-linux-jammy-rocm-n-py3 \
            -t rocm/pytorch-autobuild:base-latest

      - name: Build ROCm Nightly Image
        env:
          FULL_IMAGE: ${{ steps.meta.outputs.full-image }}
        run: |
          docker build \
            --build-arg BASE_IMAGE=rocm/pytorch-autobuild:base-latest \
            -t "$FULL_IMAGE" \
            - < rocm-nightly-workflow/.ci/docker/pytorch-nightly-docker.Dockerfile

      - name: Save nightly image artifact
        env:
          FULL_IMAGE: ${{ steps.meta.outputs.full-image }}
        run: |
          docker save -o nightly-image.tar "$FULL_IMAGE"

      - name: Upload nightly image artifact
        uses: actions/upload-artifact@v7
        with:
          name: rocm-nightly-image
          path: nightly-image.tar
          retention-days: 1
          compression-level: 0

  test-push:
    name: ${{ matrix.target.name }}
    needs: build
    strategy:
      fail-fast: false
      matrix:
        target:
          - name: Test and Push ROCm Nightly Image on MI325
            runner: linux-pytorch-mi325-1
            push_image: true
          - name: Test ROCm Nightly Image on MI250
            runner: linux-pytorch-mi250-1
            push_image: false
    runs-on: ${{ matrix.target.runner }}
    timeout-minutes: 720
    env:
      NIGHTLY_IMAGE: ${{ needs.build.outputs.full-image }}
    steps:
      - name: Resolve ROCm version
        if: ${{ inputs.rocm_version != '' }}
        run: echo "ROCM_VERSION=${{ inputs.rocm_version }}" >> "$GITHUB_ENV"

      - name: Docker cleanup
        run: |
          docker container prune -f
          docker image prune -f

      - name: Download nightly image artifact
        uses: actions/download-artifact@v8
        with:
          name: rocm-nightly-image
          path: nightly-image-artifact

      - name: Load nightly image
        run: docker load -i nightly-image-artifact/nightly-image.tar

      - name: Run unit tests
        run: |
          docker run --rm \
            --device=/dev/kfd \
            --device=/dev/dri \
            --group-add video \
            --network host \
            --cap-add=SYS_PTRACE \
            --security-opt seccomp=unconfined \
            "$NIGHTLY_IMAGE" \
            bash -c "
              git clone https://github.com/ROCm/pytorch-micro-benchmarking.git /tmp/pytorch-micro-benchmarking
              cd /tmp/pytorch-micro-benchmarking
              python3 micro_benchmarking_pytorch.py --network resnet50
            "

      - name: Scan image for vulnerabilities
@ethanwee1 Can you please add this to our theRock docker image build workflow? cc @okakarpa

        if: ${{ matrix.target.push_image }}
        uses: aquasecurity/trivy-action@v0.36.0
        with:
          image-ref: ${{ env.NIGHTLY_IMAGE }}
          format: table
          severity: CRITICAL
          ignore-unfixed: true
          exit-code: '1'

      - name: Log in to Docker Hub
        if: ${{ matrix.target.push_image }}
        uses: docker/login-action@v4
        with:
          username: ${{ secrets.DOCKER_USERNAME }}
          password: ${{ secrets.DOCKER_PASSWORD }}

      - name: Push validated image
        if: ${{ matrix.target.push_image }}
        env:
          FINAL_IMAGE: ${{ needs.build.outputs.full-image }}
          LATEST_IMAGE: ${{ env.DOCKER_REGISTRY }}:latest
        run: |
          docker tag "$FINAL_IMAGE" "$LATEST_IMAGE"
          docker push "$FINAL_IMAGE"
          docker push "$LATEST_IMAGE"
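
The "Patch rocm-n build.sh version" step in the workflow above relies on a sed address range (`/start/,/;;/`) so that the `ROCM_VERSION` substitution applies only inside one `case` arm of `.ci/docker/build.sh`. The bash sketch below illustrates that technique on a small stand-in file. The function name `patch_case_arm`, the single-name arm pattern, and the sample file contents are assumptions made for illustration; the real `build.sh` and the three-name pattern used in the workflow are not reproduced here.

```shell
# Hypothetical helper, not part of the PR: rewrite ROCM_VERSION only inside
# the case arm whose label line contains "<arm>)", up to the next ";;".
patch_case_arm() {
  local file="$1" arm="$2" version="$3" tmp
  tmp="$(mktemp)"
  # Address range: from the arm's label line to the first line containing ";;".
  # Writing through a temp file avoids relying on GNU-specific `sed -i`,
  # and `cat` back into the file keeps its original permissions.
  sed "/${arm})/,/;;/ s/ROCM_VERSION=[0-9.][0-9.]*/ROCM_VERSION=${version}/" "$file" > "$tmp" \
    && cat "$tmp" > "$file"
  local status=$?
  rm -f "$tmp"
  return "$status"
}

# Example with a stand-in file (contents are illustrative only):
# patch_case_arm sample-build.sh pytorch-linux-jammy-rocm-n-py3 7.2.2
```

Arms outside the matched range keep their original `ROCM_VERSION`, which is why the workflow follows the substitution with `sed -n '...p'` to print the patched arm for inspection.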