[ROCm][CI] Add ROCm Docker Hub registry cache and weekly cleanup pipeline #307
Open
AndreasKaratzas wants to merge 29 commits into main
…line Signed-off-by: Andreas Karatzas <akaratza@amd.com>
CI-infra-side changes to support the three-tier ROCm Docker build (vllm-project/vllm#36949). This PR wires the new `ci_base` Tier-1 image into the bake HCL targets and the Jinja pipeline template, and removes the now-unnecessary Docker Hub cache cleanup scripts.

docker/ci-rocm.hcl: new variable
`CI_BASE_IMAGE`

HCL variables are automatically read from the environment, so any env var named `CI_BASE_IMAGE` set by the Buildkite step overrides this default at build time.

The default `rocm/vllm-dev:ci_base` is the stable Docker Hub tag produced by the weekly scheduled pipeline. When building the `ci_base` stage itself (via the `ci-base-rocm-ci` target), this variable is passed as a `--build-arg` but has no effect, because the `ci_base` stage in `Dockerfile.rocm` uses `FROM base`, not `FROM ${CI_BASE_IMAGE}`.

docker/ci-rocm.hcl:
`CI_BASE_IMAGE` added to `_ci-rocm` args

     target "_ci-rocm" {
       args = {
         ARG_PYTORCH_ROCM_ARCH = PYTORCH_ROCM_ARCH
         USE_SCCACHE = 0
    +    CI_BASE_IMAGE = CI_BASE_IMAGE
       }
     }

`_ci-rocm` is the shared base target inherited by all CI bake targets (`test-rocm-ci`, `test-rocm-gfx90a-ci`, etc.). Adding `CI_BASE_IMAGE` here means it is forwarded as `--build-arg CI_BASE_IMAGE=<value>` to Docker for every CI build automatically, without repeating it in each per-arch target.

docker/ci-rocm.hcl: new
`ci-base-rocm-ci` bake target

`CI_BASE_IMAGE_TAG`

The stable tag pushed after every weekly build: `rocm/vllm-dev:ci_base`. Per-PR builds pull this tag as their `CI_BASE_IMAGE`. The default here is the authoritative definition of the Tier-1 image name across the whole system.

`CI_BASE_IMAGE_TAG_DATED`

An optional dated snapshot tag (e.g. `rocm/vllm-dev:ci_base-20250330`) set by the `amd-ci-base.yaml` pipeline at runtime via shell export. It is empty by default, so `compact([...])` drops it cleanly when running locally or in other contexts. It is used for rollback: if a weekly build introduces a regression, you can pin `CI_BASE_IMAGE` to a specific dated tag in `amd.yaml`.

`ci-base-rocm-ci` target

- Invoked by `bash .buildkite/scripts/ci-bake.sh ci-base-rocm-ci` in the weekly pipeline.
- Inherits from `_common-rocm` (repo context, Dockerfile, build args), `_ci-rocm` (`PYTORCH_ROCM_ARCH`, `USE_SCCACHE`, `CI_BASE_IMAGE` arg), and `_labels` (OCI image labels).
- Uses the same `get_cache_from_rocm()` / `get_cache_to_rocm()` functions as the per-PR test targets, so BuildKit reuses intermediate layer cache from the most recent main-branch build.

buildkite/test-template-amd.j2: docker pull + new env vars in all 4 AMD build steps
Applied identically to all four Jinja-generated build steps (all-archs, gfx90a, gfx942, gfx950). The rendered pipeline YAML is what Buildkite actually executes for every PR.
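As a rough sketch of the sequence each rendered build step is described as following (detailed item by item in the text after this block), the function below prints the commands rather than running them. The helper name `print_rocm_build_step` and the dry-run form are illustrative assumptions, and so is invoking the per-PR target through `ci-bake.sh`. The image name, variable names, and script path are taken from this description, but the real template may order or wrap them differently.

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the commands a per-PR AMD build step is described as
# running. Nothing is executed against Docker.
# Assumption: print_rocm_build_step is a hypothetical helper, not part of the repo.
print_rocm_build_step() {
  local target="$1"   # bake target, e.g. test-rocm-ci
  local commit="$2"   # exact PR commit SHA (BUILDKITE_COMMIT in CI)

  # 1. Pull the Tier-1 image first so a missing image fails fast and visibly.
  echo "docker pull rocm/vllm-dev:ci_base"

  # 2. Environment read by ci-rocm.hcl and forwarded to Dockerfile.rocm.
  echo "export CI_BASE_IMAGE=rocm/vllm-dev:ci_base"
  echo "export REMOTE_VLLM=1"
  echo "export VLLM_BRANCH=${commit}"

  # 3. Run the bake wrapper for the requested target (assumed invocation).
  echo "bash .buildkite/scripts/ci-bake.sh ${target}"
}

# Placeholder SHA used when BUILDKITE_COMMIT is not set (e.g. outside CI).
print_rocm_build_step test-rocm-ci "${BUILDKITE_COMMIT:-0123abc}"
```

Printing the pull as its own first step mirrors the stated intent: a missing `rocm/vllm-dev:ci_base` shows up as a distinct failure before any bake work starts.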
`docker pull rocm/vllm-dev:ci_base`

Eagerly fetches the Tier-1 image before `docker buildx bake` runs. This makes the pull step visible in build logs with its own timing, and surfaces a missing Tier-1 image as an immediate, clear failure rather than a silent timeout inside the bake build.

`CI_BASE_IMAGE: "rocm/vllm-dev:ci_base"`

Picked up by `ci-rocm.hcl` as the HCL `CI_BASE_IMAGE` variable (HCL auto-reads matching env vars). Forwarded to Docker as `--build-arg CI_BASE_IMAGE=rocm/vllm-dev:ci_base`. `Dockerfile.rocm` uses it in `FROM ${CI_BASE_IMAGE} AS test`, so the test stage inherits from the pre-built Tier-1 registry image instead of rebuilding everything from `base`.

`REMOTE_VLLM: "1"`

`Dockerfile.rocm` has two source-fetching modes:

- `0` (default): `COPY . /app` from the local build context; used for local development.
- `1` (CI): `git clone $VLLM_REPO --branch $VLLM_BRANCH` inside the Docker build; used in CI, where the build agent does not pass the full repo as build context to `docker buildx bake`.

Without `REMOTE_VLLM=1`, `docker buildx bake` in CI would try to COPY the repo from the bake build context. That checkout is used for running scripts but is not guaranteed to be the exact PR commit state that should be compiled into the image.

`VLLM_BRANCH: "$BUILDKITE_COMMIT"`

When `REMOTE_VLLM=1`, the Dockerfile clones `$VLLM_REPO` and checks out `$VLLM_BRANCH`. Setting this to the exact commit SHA (`$BUILDKITE_COMMIT`, expanded by Buildkite at step evaluation time) ensures the image contains the exact state of the PR commit being tested, not `main` and not a branch tip that may have moved since the build was triggered.

Variables present in `ci-bake.sh` header comments but NOT used in any ROCm step:

- `VLLM_USE_PRECOMPILED`: CUDA-only. In the CUDA pipeline, PyTorch and other heavy wheels are sometimes pre-compiled and cached by commit SHA to skip rebuilding on every PR. ROCm does not have this wheel cache infrastructure (no equivalent of `wheels.vllm.ai` for ROCm builds), so this variable is never set in any AMD build step and has no effect here.
- `VLLM_MERGE_BASE_COMMIT`: The git merge-base of the PR branch with `main`. Used as an additional `cache-from` fallback inside `get_cache_from_rocm()` in `ci-rocm.hcl`. For long-lived PRs that have diverged far from `main`, the parent-commit cache layer may be cold (the parent was never built on this agent), but the merge-base IS a real `main`-branch commit whose layers are kept warm by main-branch builds. `ci-bake.sh` auto-computes this via `git merge-base HEAD origin/main` if the env var is not already set. For ROCm it feeds into four ordered cache-from entries: exact commit → parent commit → merge-base → `:rocm-latest`.

Deleted: buildkite/scripts/cleanup-dockerhub-rocm-cache.sh and buildkite/pipelines/cleanup-dockerhub-rocm-cache.yaml
These scripts periodically deleted old per-commit cache tags from `rocm/vllm-ci-cache` on Docker Hub to stay within storage limits. They are removed for two reasons:

- BuildKit's own cache eviction handles layer garbage collection at the storage backend level without the risks of tag-level deletion scripts.
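For reference, the four-entry cache-from order described earlier for `VLLM_MERGE_BASE_COMMIT` (exact commit → parent commit → merge-base → `:rocm-latest`) could be sketched in shell as below. This is not the HCL in `get_cache_from_rocm()`. The helper name `rocm_cache_from_refs` and the per-commit tag format `rocm-<sha>` are assumptions made for illustration, as is placing `:rocm-latest` in the `rocm/vllm-ci-cache` repository. Only the repository name, the `:rocm-latest` tag, and the ordering come from this description.

```shell
#!/usr/bin/env bash
# Sketch of the ordered cache-from fallback list described for ROCm builds.
# Assumptions (not confirmed by the PR): the helper name, the "rocm-<sha>" tag
# format, and that :rocm-latest lives in the same repository.
rocm_cache_from_refs() {
  local commit="$1" parent="$2" merge_base="$3"
  local repo="rocm/vllm-ci-cache"
  local sha

  # Most specific first: exact commit, then parent, then merge-base.
  for sha in "$commit" "$parent" "$merge_base"; do
    # Skip entries that could not be resolved (empty string).
    if [ -n "$sha" ]; then
      echo "type=registry,ref=${repo}:rocm-${sha}"
    fi
  done

  # Final fallback: the rolling tag.
  echo "type=registry,ref=${repo}:rocm-latest"
}

# Example with placeholder SHAs. In a real checkout the third argument could be
# resolved the way ci-bake.sh is described as doing it:
#   merge_base="${VLLM_MERGE_BASE_COMMIT:-$(git merge-base HEAD origin/main)}"
rocm_cache_from_refs aaa111 bbb222 ccc333
```

Listing the entries from most to least specific matches the rationale given for the merge-base fallback: a cold parent-commit cache still leaves a `main`-branch commit to fall back on before the rolling tag.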
This PR is connected to: vllm-project/vllm#36949
These two PRs should likely be merged simultaneously.
cc @kenroche @okakarpa @tjtanaa @khluu
Co-authored-by: Claude <claude@anthropic.com>