[ROCM] [CI] [ROCm] [CI] Gate the changes to Dockerfile.rocm_base #297

Open: tjtanaa wants to merge 30 commits into main from rocmnightly

@tjtanaa (Collaborator) commented Feb 28, 2026

Purpose

Current state:
Changes to Dockerfile.rocm_base are not immediately propagated to rocm/vllm-dev:base. AMD has to run tests internally before updating rocm/vllm-dev:base.

This PR ensures that changes to Dockerfile.rocm_base can be validated in real time, before the PR is merged.

Before the PR

Two disjoint steps

AMD Internal Pipeline (manual trigger):

  • Build Dockerfile.rocm_base → Push to rocm/vllm-dev:base

vLLM AMD CI:

  1. Pull base image from rocm/vllm-dev:base
  2. Build Dockerfile.rocm on top of base image
  3. Push to rocm/vllm-ci:<buildkite-commit>
  4. AMD CI pulls rocm/vllm-ci:<buildkite-commit> to run tests

Cons:

  • When Dockerfile.rocm_base is updated in a PR, the vLLM AMD CI does not pick up the change, so we cannot validate changes to Dockerfile.rocm_base with the vLLM AMD CI before merging.
  • There is a time lag between a Dockerfile.rocm_base change landing and the vLLM AMD CI testing against it.

After this PR:

PRs that change Dockerfile.rocm_base have the following property:

  • Slow changing: the file changes only once every 1 to 2 months

Based on this property, we will use a CI action to make sure that the git history of Dockerfile.rocm_base in the PR contains the history of Dockerfile.rocm_base on the main branch before the PR can be merged.

If we build Dockerfile.rocm_base in the PR, we can pre-populate the sccache and the S3 cache (the S3 cache is keyed by a hash generated from the Dockerfile.rocm_base content and the build arguments). This way, the per-commit release build on the main branch does not need to rebuild Dockerfile.rocm_base. On the main branch, the Dockerfile.rocm_base build will always take under 2 minutes, because the cache has already been pre-populated in the PR.
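To illustrate how a cache key "generated from Dockerfile.rocm_base content and build arguments" could behave, here is a minimal sketch. The function name, the use of SHA-256, the separator byte, and the 16-character truncation are all assumptions made for illustration; the actual key scheme used in this PR may differ.

```python
import hashlib


def base_image_cache_key(dockerfile_bytes: bytes, build_args: dict[str, str]) -> str:
    """Return a deterministic key for a base-image build (hypothetical scheme)."""
    digest = hashlib.sha256()
    digest.update(dockerfile_bytes)
    # Sort the build args so the key does not depend on the order
    # in which they were passed.
    for name in sorted(build_args):
        digest.update(b"\0")  # separator between fields
        digest.update(f"{name}={build_args[name]}".encode("utf-8"))
    # A short prefix keeps the key usable as an image tag or object name.
    return digest.hexdigest()[:16]


# Example inputs (placeholders, not the real Dockerfile or arguments).
key = base_image_cache_key(b"FROM scratch\n", {"PYTHON_VERSION": "3.12"})
print(key)
```

Because the key depends only on the file content and the build arguments, a PR build and a later main-branch build of the same Dockerfile.rocm_base resolve to the same key, which is what lets the main-branch build reuse the cache populated in the PR.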

The Dockerfile.rocm build always takes under 1 hour.

Single Pipeline

vLLM AMD CI:

  1. Check freshness of Dockerfile.rocm_base in the PR
  2. Check cache status for sccache and S3 cache (keyed by Dockerfile.rocm_base content hash + build args)
  3. Build base image if needed to pre-populate caches
  4. Build Dockerfile.rocm using cached base image from rocm/vllm-dev:base
  5. Push to rocm/vllm-ci:<buildkite-commit>
  6. AMD CI pulls rocm/vllm-ci:<buildkite-commit> to run tests

Technical details

Buildkite step: Check Dockerfile.rocm_base freshness

  • Uses small_cpu_queue_premerge: this step only does a git clone and checks the freshness of Dockerfile.rocm_base, so I picked a less occupied queue.
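One way to picture the freshness check is as a containment test on commit lists: every commit that touched Dockerfile.rocm_base on main should also appear in the PR branch's history of that file. The sketch below is an assumption about how such a check might look, not the script in this PR. The function names are hypothetical, and comparing raw commit hashes is only one possible approach.

```python
import subprocess


def commits_touching(ref: str, path: str) -> list[str]:
    """List the commit hashes reachable from `ref` that modified `path`."""
    result = subprocess.run(
        ["git", "log", "--format=%H", ref, "--", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return result.stdout.split()


def missing_from_pr(main_commits: list[str], pr_commits: list[str]) -> list[str]:
    """Return the commits on main that the PR history does not contain."""
    seen = set(pr_commits)
    return [commit for commit in main_commits if commit not in seen]


def is_fresh(main_commits: list[str], pr_commits: list[str]) -> bool:
    """The PR is fresh when it contains every main commit for the file."""
    return not missing_from_pr(main_commits, pr_commits)
```

In a CI step, the two lists would come from calls such as `commits_touching("origin/main", path)` and `commits_touching("HEAD", path)`, and the step would fail when `is_fresh` returns False. The ref names and the path are placeholders.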

Buildkite step: AMD: :docker: Build/Reuse ROCm base image

  • Uses cpu_queue_postmerge because we need a CPU machine that has read/write access to the ECR repository, the S3 sccache, and the S3 bucket.
  • On a cache hit:
    docker buildx imagetools create --tag "$${ECR_COMMIT_TAG}" "$${ECR_CACHE_TAG}" creates a new ECR tag without pulling the Docker image from the ECR repository.
  • On a cache miss: rebuild Dockerfile.rocm_base, then push the Docker image and the dependency wheels to the S3 cache and the ECR repository.
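The hit/miss branching above can be sketched as a small planning function that returns the commands to run. This is a simplified illustration under stated assumptions: the function name, the Dockerfile path `docker/Dockerfile.rocm_base`, and the exact sequence on a cache miss (build, push, then retag) are hypothetical, and the upload of dependency wheels to S3 is left out.

```python
# Hypothetical path; the real location in the repository may differ.
BASE_DOCKERFILE = "docker/Dockerfile.rocm_base"


def plan_base_image_step(cache_hit: bool, commit_tag: str, cache_tag: str) -> list[list[str]]:
    """Return the docker commands for the Build/Reuse step (illustrative only)."""
    # Retagging works on the registry manifest, so no image pull is needed.
    retag = ["docker", "buildx", "imagetools", "create", "--tag", commit_tag, cache_tag]
    if cache_hit:
        return [retag]
    # Cache miss: build the base image, publish it under the cache tag,
    # then point the per-commit tag at it. Pushing wheels to S3 is omitted.
    return [
        ["docker", "build", "--file", BASE_DOCKERFILE, "--tag", cache_tag, "."],
        ["docker", "push", cache_tag],
        retag,
    ]


for command in plan_base_image_step(True, "repo:commit", "repo:cache"):
    print(" ".join(command))
```

Returning the commands as data instead of running them keeps the decision logic easy to test without a Docker daemon or registry access.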

Buildkite step: AMD: :docker: build image

  • Uses amd-cpu because we want to be able to push the AMD CI docker image to rocm/vllm-ci.

Test Plan

Test run the gating feature

Test Result

Failure case (validated): https://buildkite.com/vllm/amd-ci/builds/5307/steps/canvas?sid=019c8f0c-57c7-4e14-8f86-42c041629d55&tab=output

Success case: https://buildkite.com/vllm/amd-ci/builds/5608/steps/canvas

Follow up PR

  • Automatically release the base docker image to vllm/vllm-openai-rocm:base

This PR is part of the larger plan to ship nightly Docker images and ROCm wheels:

Docker image and rocm wheel nightly plan

The benefits are:

  • Achieve the same releases as on CUDA: nightly Docker images and nightly developer wheels. (Nightly wheels can be used for the Python-only installation.)
  • A PR that changes Dockerfile.rocm_base will also use the new base image to build Dockerfile.rocm in our CI. This way we can test Dockerfile.rocm_base when the community tries to add support for a new arch to it.
  • Since the build time has greatly decreased, we can also consider building Docker images and wheels for different Python and ROCm versions.

Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
…base

…load docker image to speed things up

@tjtanaa tjtanaa marked this pull request as ready for review February 28, 2026 07:58
@AndreasKaratzas (Collaborator)

cc @gshtras

@gshtras (Contributor) left a comment

I'm not entirely familiar with the build dispatching, but we need to make absolutely sure that this doesn't trigger more than is necessary, i.e. it should run only on PRs with the ready label, so that the CPU machines don't get accidentally DDoSed.

@tjtanaa (Collaborator, Author) commented Mar 10, 2026

@gshtras @Alexei-V-Ivanov-AMD could you take another look? Thanks

@gshtras (Contributor) commented Mar 10, 2026

From what I see the nightly base build is still restricted to only 2 archs. Am I looking in the wrong place?

@tjtanaa (Collaborator, Author) commented Mar 11, 2026

From what I see the nightly base build is still restricted to only 2 archs. Am I looking in the wrong place?

The - label: "AMD: :docker: build image" step only builds the Docker image used by AMD CI. We will build the nightly release Docker image in the Release Pipeline (a separate Buildkite pipeline).

docker build
--build-arg max_jobs=16
--build-arg REMOTE_VLLM=1
--build-arg ARG_PYTORCH_ROCM_ARCH='gfx90a;gfx942;gfx950'
@tjtanaa (Collaborator, Author)

The archs are still restricted to these because this image is for AMD CI.

tjtanaa added 2 commits March 23, 2026 15:45
…m_base_ecr_commit_tag

4 participants