[ci] Remove ROCm from docker image#976

Open
mgehre-amd wants to merge 2 commits into
gfx11from
matthias.gfx11-ci-image-no-rocm-sdk

@mgehre-amd mgehre-amd commented May 27, 2026

The ghcr.io/rocm/vllm/gfx11-ci image includes ROCm and PyTorch, but the CI jobs upgrade to the latest versions anyway.

Move ROCm SDK and PyTorch install out of the Dockerfile and into the CI workflows. The build job already installed rocm[devel,libraries] + ran rocm-sdk init at runtime; the test job now installs rocm[libraries] from the same nightly index (no hipcc, no init needed).

To avoid drift between the two jobs' torch/torchvision/torchaudio +rocm pins, the build job now stashes its resolved constraints.txt into the wheel artifact and the test job consumes it directly.
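
The constraints hand-off described above could look roughly like the sketch below. It is an illustration rather than the workflow code from this PR: the function names and their arguments are hypothetical, and only `dist/constraints.txt` and pip's `--constraint` option are taken as given.

```shell
# Sketch only: function names and arguments are hypothetical.

# Build job: place the resolved pins next to the wheel so that a single
# uploaded artifact carries both.
stash_constraints() {
  src="$1"; dist_dir="$2"
  mkdir -p "$dist_dir"
  cp "$src" "$dist_dir/constraints.txt"
}

# Test job: refuse to proceed without the stashed pins, then print the
# pip arguments that apply them.
constraint_args() {
  dist_dir="$1"
  if [ ! -s "$dist_dir/constraints.txt" ]; then
    echo "error: $dist_dir/constraints.txt is missing or empty" >&2
    return 1
  fi
  printf '%s %s\n' "--constraint" "$dist_dir/constraints.txt"
}
```

In the test job this might be used as `pip install $(constraint_args dist) dist/*.whl "rocm[libraries]"`, with the nightly index passed separately; the index URL is left out here because the description does not state it.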

The ghcr.io/rocm/vllm/gfx11-ci image bundled _rocm_sdk_devel (including
rocsolver/rocprofiler test fixtures) and PyTorch, ballooning past the
GitHub-hosted runner's free disk (~14 GB). docker pull failed with
"no space left on device" mid-extract, breaking build-wheel.

Move ROCm SDK and PyTorch install out of the Dockerfile and into the CI
workflows. The build job already installed rocm[devel,libraries] + ran
rocm-sdk init at runtime; the test job now installs rocm[libraries] from
the same nightly index (no hipcc, no init needed).

To avoid drift between the two jobs' torch/torchvision/torchaudio +rocm
pins, the build job now stashes its resolved constraints.txt into the
wheel artifact and the test job consumes it directly.

Changes:
- docker/Dockerfile.gfx11-ci: drop rocm[devel,libraries] + rocm-sdk init,
  drop torch pre-install, drop now-unused ARGs and hipcc verification.
  Keep _rocm_sdk_devel env-var pointers so runtime install lands where
  downstream expects.
- .github/workflows/build-gfx11-ci-image.yml: drop dead workflow inputs
  and build-args.
- .github/workflows/build-rocm-wheels.yml: copy constraints.txt into
  dist/ and include it in the uploaded wheel artifact.
- .github/workflows/test-rocm-kernels.yml: drop the duplicate
  constraints.txt derivation, consume dist/constraints.txt from the
  artifact, install rocm[libraries] alongside the wheel.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>

The variable pointed into _rocm_sdk_core/lib/llvm/amdgcn/bitcode which
isn't populated until rocm-sdk init runs. hipcc / clang find the device
libs on their own once ROCM_PATH is set, so this export is dead weight.

Signed-off-by: Matthias Gehre <matthias.gehre@amd.com>
@mgehre-amd changed the title from "[ci] Fix out-of-disk failure by reducing docker image size" to "[ci] Remove ROCm from docker image" May 27, 2026
# runtime; these paths point to where the pip wheel will land so downstream
# steps see hipcc/amd_smi without an extra export.
ENV ROCM_PATH=/usr/local/lib/python3.12/site-packages/_rocm_sdk_devel
ENV HIP_DEVICE_LIB_PATH=/usr/local/lib/python3.12/site-packages/_rocm_sdk_core/lib/llvm/amdgcn/bitcode
Author


This was a workaround for an issue that has since been fixed in ROCm.

Comment on lines 63 to 64
- name: GPU sanity check
  run: |

Some of the diagnostic info in this section will not be printed before ROCm is installed.
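
One way to keep the sanity check readable before ROCm is installed is to guard each diagnostic so that a missing tool is reported rather than silently skipped or treated as a failure. The helper below is a hypothetical sketch, not code from this PR, and which tools the step actually calls is an assumption.

```shell
# Sketch only: run_if_available is a hypothetical helper.
# Runs the given command if its executable is on PATH; otherwise prints a
# note so the log shows why the diagnostic is absent.
run_if_available() {
  tool="$1"
  if command -v "$tool" >/dev/null 2>&1; then
    "$@"
  else
    echo "skip: $tool not found (ROCm not installed yet?)"
  fi
}
```

A step could then call, for example, `run_if_available rocminfo` both before and after the ROCm install, assuming `rocminfo` is one of the tools the check relies on.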
