[ROCm][CI] Add ROCm Docker Hub registry cache and weekly cleanup pipeline#307

Open
AndreasKaratzas wants to merge 29 commits into main from akaratza_optimize_docker_build

@AndreasKaratzas AndreasKaratzas commented Mar 13, 2026

CI-infra-side changes to support the three-tier ROCm Docker build (vllm-project/vllm#36949). Wires the new ci_base Tier-1 image into the bake HCL targets and Jinja pipeline template, and removes the now-unnecessary Docker Hub cache cleanup scripts.


docker/ci-rocm.hcl — new variable CI_BASE_IMAGE
+variable "CI_BASE_IMAGE" {
+  default = "rocm/vllm-dev:ci_base"
+}

Bake overrides an HCL variable's default with the value of any environment variable of the same name, so a CI_BASE_IMAGE env var set by the Buildkite step replaces this default at build time.

The default rocm/vllm-dev:ci_base is the stable Docker Hub tag produced by the weekly scheduled pipeline. When building the ci_base stage itself (via the ci-base-rocm-ci target), this variable is passed as a --build-arg but has no effect — the ci_base stage in Dockerfile.rocm uses FROM base, not FROM ${CI_BASE_IMAGE}.
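
The override behaviour can be pictured with a small shell analogue. This is only a sketch: resolve_ci_base_image is a hypothetical helper, not part of ci-bake.sh, and it treats an empty value as unset for simplicity, which may differ from how Bake handles an empty variable.

```shell
# Shell analogue of the CI_BASE_IMAGE default in docker/ci-rocm.hcl.
# The fallback string is the HCL default; the environment wins when it is set.
resolve_ci_base_image() {
  printf '%s\n' "${CI_BASE_IMAGE:-rocm/vllm-dev:ci_base}"
}

# No override: the stable weekly tag is used.
( unset CI_BASE_IMAGE; resolve_ci_base_image )

# Override for a single invocation: pin to a dated snapshot tag.
CI_BASE_IMAGE="rocm/vllm-dev:ci_base-20250330" resolve_ci_base_image

# With Docker available, the resolved build args can be inspected without building
# (assumes ci-rocm.hcl can be evaluated on its own, which may not hold in this repo):
#   CI_BASE_IMAGE=rocm/vllm-dev:ci_base-20250330 \
#     docker buildx bake -f docker/ci-rocm.hcl --print test-rocm-ci
```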

docker/ci-rocm.hcl — CI_BASE_IMAGE added to _ci-rocm args
 target "_ci-rocm" {
   args = {
     ARG_PYTORCH_ROCM_ARCH = PYTORCH_ROCM_ARCH
     USE_SCCACHE           = 0
+    CI_BASE_IMAGE         = CI_BASE_IMAGE
   }
 }

_ci-rocm is the shared base target inherited by all CI bake targets (test-rocm-ci, test-rocm-gfx90a-ci, etc.). Adding CI_BASE_IMAGE here means it is forwarded as --build-arg CI_BASE_IMAGE=<value> to Docker for every CI build automatically, without repeating it in each per-arch target.

docker/ci-rocm.hcl — new ci-base-rocm-ci bake target
variable "CI_BASE_IMAGE_TAG" {
  default = "rocm/vllm-dev:ci_base"
}

variable "CI_BASE_IMAGE_TAG_DATED" {
  default = ""
}

target "ci-base-rocm-ci" {
  inherits   = ["_common-rocm", "_ci-rocm", "_labels"]
  target     = "ci_base"
  cache-from = get_cache_from_rocm()
  cache-to   = get_cache_to_rocm()
  tags = compact([CI_BASE_IMAGE_TAG, CI_BASE_IMAGE_TAG_DATED])
  output = ["type=registry"]
}

CI_BASE_IMAGE_TAG
The stable tag pushed after every weekly build: rocm/vllm-dev:ci_base. Per-PR builds pull this tag as their CI_BASE_IMAGE. The default here is the authoritative definition of the Tier-1 image name across the whole system.

CI_BASE_IMAGE_TAG_DATED
An optional dated snapshot tag (e.g. rocm/vllm-dev:ci_base-20250330) set by the amd-ci-base.yaml pipeline at runtime via shell export. Empty by default so compact([...]) drops it cleanly when running locally or in other contexts. Used for rollback: if a weekly build introduces a regression you can pin CI_BASE_IMAGE to a specific dated tag in amd.yaml.
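
As a rough illustration of how the two tag variables combine, the sketch below mimics compact([...]) in shell and builds a dated tag. compact_tags and dated_tag are hypothetical helpers, and the exact export used in amd-ci-base.yaml is not shown in this PR, so the date format here is an assumption based on the example tag above.

```shell
# Mimic HCL compact([...]): print only the non-empty arguments, one per line.
compact_tags() {
  for tag in "$@"; do
    if [ -n "$tag" ]; then
      printf '%s\n' "$tag"
    fi
  done
}

# Hypothetical helper: build a dated snapshot tag from a YYYYMMDD stamp.
dated_tag() {
  printf 'rocm/vllm-dev:ci_base-%s\n' "$1"
}

# Local run: the dated variable is empty, so only the stable tag survives.
compact_tags "rocm/vllm-dev:ci_base" ""

# Weekly pipeline (assumed): export a UTC date-stamped tag before invoking bake.
CI_BASE_IMAGE_TAG_DATED="$(dated_tag "$(date -u +%Y%m%d)")"
export CI_BASE_IMAGE_TAG_DATED
compact_tags "rocm/vllm-dev:ci_base" "$CI_BASE_IMAGE_TAG_DATED"
```

Keeping the dated variable empty by default means a local build pushes no stray snapshot tag.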

ci-base-rocm-ci target
Invoked by bash .buildkite/scripts/ci-bake.sh ci-base-rocm-ci in the weekly pipeline.
Inherits from _common-rocm (repo context, Dockerfile, build args), _ci-rocm (PYTORCH_ROCM_ARCH, USE_SCCACHE, CI_BASE_IMAGE arg), and _labels (OCI image labels).
Uses the same get_cache_from_rocm() / get_cache_to_rocm() functions as the per-PR test targets, so BuildKit reuses intermediate layer cache from the most recent main-branch build.

buildkite/test-template-amd.j2 — docker pull + new env vars in all 4 AMD build steps
+  - docker pull rocm/vllm-dev:ci_base
   - bash .buildkite/scripts/ci-bake.sh test-rocm-*-ci
 env:
+  CI_BASE_IMAGE: "rocm/vllm-dev:ci_base"
+  REMOTE_VLLM: "1"
+  VLLM_BRANCH: "$BUILDKITE_COMMIT"

Applied identically to all four Jinja-generated build steps (all-archs, gfx90a, gfx942, gfx950). The rendered pipeline YAML is what Buildkite actually executes for every PR.

docker pull rocm/vllm-dev:ci_base
Eagerly fetches the Tier-1 image before docker buildx bake runs. This makes the pull step visible in build logs with its own timing, and surfaces a missing Tier-1 image as an immediate and clear failure rather than a silent timeout inside the bake build.
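
A minimal sketch of the fail-fast idea, assuming a wrapper around the pull. pull_tier1 and the DOCKER_BIN override are hypothetical and exist only for illustration; the template itself just runs docker pull directly.

```shell
# Hypothetical helper: pull the Tier-1 image up front and fail loudly if it is missing.
# DOCKER_BIN is an assumption of this sketch; it defaults to the real docker CLI.
pull_tier1() {
  image="$1"
  start="$(date +%s)"
  if ! "${DOCKER_BIN:-docker}" pull "$image"; then
    echo "ERROR: Tier-1 image '$image' is missing or unreachable" >&2
    return 1
  fi
  end="$(date +%s)"
  echo "pulled $image in $((end - start))s"
}

# Intended use in a build step (requires Docker and registry access):
#   set -e
#   pull_tier1 rocm/vllm-dev:ci_base
#   bash .buildkite/scripts/ci-bake.sh test-rocm-ci
```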

CI_BASE_IMAGE: "rocm/vllm-dev:ci_base"
Picked up by ci-rocm.hcl as the HCL CI_BASE_IMAGE variable (HCL auto-reads matching env vars). Forwarded to Docker as --build-arg CI_BASE_IMAGE=rocm/vllm-dev:ci_base. Dockerfile.rocm uses it in FROM ${CI_BASE_IMAGE} AS test so the test stage inherits from the pre-built Tier-1 registry image instead of rebuilding everything from base.

REMOTE_VLLM: "1"
Dockerfile.rocm has two source-fetching modes:

  • 0 (default): COPY . /app from the local build context. Used for local development.
  • 1 (CI): clones $VLLM_REPO and checks out $VLLM_BRANCH inside the Docker build. Used in CI, where the build agent does not pass the full repo as build context to docker buildx bake.

Without REMOTE_VLLM=1, docker buildx bake in CI would COPY the repo from the bake build context, that is, the agent's checkout. That checkout is used for running scripts, but it is not guaranteed to match the exact PR commit that should be compiled into the image.
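
The two modes above can be sketched as a simple switch. source_mode and the labels "copy" and "clone" are names invented for this sketch, and rejecting other values is an assumption; Dockerfile.rocm may handle unexpected values differently.

```shell
# Map REMOTE_VLLM to the source-fetching mode described above.
source_mode() {
  case "${REMOTE_VLLM:-0}" in
    0) echo "copy" ;;   # COPY . /app from the local build context
    1) echo "clone" ;;  # fetch the repo inside the Docker build
    *) echo "unsupported REMOTE_VLLM value: ${REMOTE_VLLM}" >&2; return 1 ;;
  esac
}

( unset REMOTE_VLLM; source_mode )   # local development default
REMOTE_VLLM=1 source_mode            # CI build steps
```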

VLLM_BRANCH: "$BUILDKITE_COMMIT"
When REMOTE_VLLM=1, the Dockerfile clones $VLLM_REPO and checks out $VLLM_BRANCH. Setting this to the exact commit SHA ($BUILDKITE_COMMIT, expanded by Buildkite at step evaluation time) ensures the image contains the exact state of the PR commit being tested — not main and not a branch tip that may have moved since the build was triggered.
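
One way to pin a build to an exact commit is a plain clone followed by a detached checkout, since git clone --branch accepts branch and tag names but not commit SHAs. The sketch below assumes the Dockerfile does something equivalent; is_full_sha and clone_at_commit are hypothetical helpers, and the check assumes 40-character SHA-1 object names.

```shell
# Return success only for a full 40-character lowercase hexadecimal commit SHA.
is_full_sha() {
  case "$1" in
    *[!0-9a-f]*) return 1 ;;
  esac
  [ "${#1}" -eq 40 ]
}

# Hypothetical helper: clone a repo and detach HEAD at an exact commit.
clone_at_commit() {
  repo="$1"; sha="$2"; dest="$3"
  is_full_sha "$sha" || { echo "not a full commit SHA: $sha" >&2; return 1; }
  git clone "$repo" "$dest" && git -C "$dest" checkout --detach "$sha"
}

# Example (needs git and network access; the destination path is a placeholder):
#   clone_at_commit "$VLLM_REPO" "$BUILDKITE_COMMIT" /tmp/vllm-src
```

Validating the value first turns a branch name passed by mistake into an explicit error instead of a build of whatever that branch currently points at.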


Variables documented in ci-bake.sh header comments but NOT set explicitly in any ROCm step:

VLLM_USE_PRECOMPILED: CUDA-only. In the CUDA pipeline, vLLM's compiled extensions can be taken from a pre-built wheel cached by commit SHA instead of being recompiled on every PR. ROCm has no equivalent wheel cache (nothing like wheels.vllm.ai for ROCm builds), so this variable is never set in any AMD build step and has no effect here.

VLLM_MERGE_BASE_COMMIT: the git merge-base of the PR branch with main. Used as an additional cache-from fallback inside get_cache_from_rocm() in ci-rocm.hcl. For long-lived PRs that have diverged far from main, the parent-commit cache may be cold (the parent was never built), but the merge-base is always a real main-branch commit whose layers are kept warm by main-branch builds. ci-bake.sh computes it via git merge-base HEAD origin/main if the env var is not already set. For ROCm it feeds into four ordered cache-from entries: exact commit → parent commit → merge-base → :rocm-latest.
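
The fallback order can be sketched as follows. This is an illustration only: resolve_merge_base and cache_from_refs are hypothetical helpers, the rocm-<sha> per-commit tag format is invented for the sketch, and placing all four refs in rocm/vllm-ci-cache is an assumption. The real logic lives in ci-bake.sh and get_cache_from_rocm().

```shell
# Resolve the merge-base: honour a preset VLLM_MERGE_BASE_COMMIT, else ask git.
resolve_merge_base() {
  if [ -n "${VLLM_MERGE_BASE_COMMIT:-}" ]; then
    printf '%s\n' "$VLLM_MERGE_BASE_COMMIT"
  else
    git merge-base HEAD origin/main
  fi
}

# Print cache-from entries in lookup order:
# exact commit, parent commit, merge-base, then the rocm-latest fallback.
cache_from_refs() {
  repo="rocm/vllm-ci-cache"   # assumed cache repository
  for sha in "$1" "$2" "$3"; do
    printf 'type=registry,ref=%s:rocm-%s\n' "$repo" "$sha"
  done
  printf 'type=registry,ref=%s:rocm-latest\n' "$repo"
}

# Placeholder SHAs, for illustration only.
cache_from_refs aaa111 bbb222 ccc333
```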

Deleted: buildkite/scripts/cleanup-dockerhub-rocm-cache.sh and buildkite/pipelines/cleanup-dockerhub-rocm-cache.yaml

These scripts periodically deleted old per-commit cache tags from rocm/vllm-ci-cache on Docker Hub to stay within storage limits. They are removed for two reasons:

  1. Docker Hub storage is no longer a constraint for this project.
  2. The deletion logic operated on tag prefixes and age thresholds that could in edge cases delete cache tags still referenced by an in-flight build, causing spurious cache misses or build failures.

BuildKit's own cache eviction handles layer garbage collection at the storage backend level without the risks of tag-level deletion scripts.


This PR is connected to: vllm-project/vllm#36949
These two PRs should likely be merged simultaneously.

cc @kenroche @okakarpa @tjtanaa @khluu

Co-authored-by: Claude <claude@anthropic.com>

…line

Signed-off-by: Andreas Karatzas <akaratza@amd.com>
@AndreasKaratzas AndreasKaratzas marked this pull request as draft March 13, 2026 18:27