[ROCm] [CI] Gate the changes to Dockerfile.rocm_base #297
Signed-off-by: tjtanaa <tunjian.tan@embeddedllm.com>
cc @gshtras
gshtras left a comment:
I'm not entirely familiar with the build dispatching, but we need to be absolutely sure that this doesn't trigger more than is necessary, i.e. only PRs with the ready label should trigger it, so the CPU machines don't accidentally get DDoSed.
@gshtras @Alexei-V-Ivanov-AMD could you take another look? Thanks
From what I see the nightly base build is still restricted to only 2 archs. Am I looking in the wrong place?
```shell
docker build \
  --build-arg max_jobs=16 \
  --build-arg REMOTE_VLLM=1 \
  --build-arg ARG_PYTORCH_ROCM_ARCH='gfx90a;gfx942;gfx950'
```
The arch is still restricted to these as it is an image for AMD CI.
Purpose
Current state:

Changes to `Dockerfile.rocm_base` are not immediately propagated to `rocm/vllm-dev:base`; AMD has to test them internally before updating `rocm/vllm-dev:base`. This PR ensures that we can validate changes to `Dockerfile.rocm_base` in real time, before a PR is merged.

Before the PR
Two disjoint steps
AMD Internal Pipeline (manual trigger):
- Build `Dockerfile.rocm_base` → push to `rocm/vllm-dev:base`

vLLM AMD CI:
- Pull `rocm/vllm-dev:base`
- Build `Dockerfile.rocm` on top of the base image
- Tag the result as `rocm/vllm-ci:<buildkite-commit>`
- Use `rocm/vllm-ci:<buildkite-commit>` to run tests

Cons:
- When `Dockerfile.rocm_base` is updated in a PR, the vLLM AMD CI does not pick up the changes, so we are not able to validate changes to `Dockerfile.rocm_base` with the vLLM AMD CI before merging.
- There is a time lag between updating `Dockerfile.rocm_base` and the vLLM AMD CI test.

After this PR:
We know that a PR that changes `Dockerfile.rocm_base` has the following property:

Based on this property, we will use a CI action to make sure that the git history of `Dockerfile.rocm_base` in the PR contains the history of `Dockerfile.rocm_base` on the main branch before the PR can be merged.

If we build `Dockerfile.rocm_base` in the PR, we can pre-populate the sccache and the S3 cache (the S3 cache key is a hash generated from the `Dockerfile.rocm_base` content and the build arguments). This way we avoid rebuilding `Dockerfile.rocm_base` in the per-commit release build on the main branch: on main, the `Dockerfile.rocm_base` step will always take under 2 minutes, as the cache has already been populated in the PR.

The build of `Dockerfile.rocm` is always under 1 hr.

Single Pipeline
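The single pipeline leans on the content-addressed cache described above. A minimal sketch of deriving such a key, assuming it hashes the Dockerfile content together with the build arguments (the file name and `BUILD_ARGS` below are illustrative stand-ins, not the actual pipeline values):

```shell
# Sketch: derive a cache key from Dockerfile content plus build args,
# so any change to either produces a new key. All values are stand-ins.
printf 'FROM ubuntu:22.04\n' > Dockerfile.rocm_base.example
BUILD_ARGS="REMOTE_VLLM=1 ARG_PYTORCH_ROCM_ARCH=gfx90a;gfx942;gfx950"
CACHE_KEY=$({ cat Dockerfile.rocm_base.example; printf '%s' "$BUILD_ARGS"; } | sha256sum | cut -d' ' -f1)
echo "cache key: $CACHE_KEY"
```

Any edit to the Dockerfile or to a build arg then misses the cache and forces a rebuild, which is exactly the gating behavior the PR relies on.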
vLLM AMD CI:
- Build `Dockerfile.rocm_base` in the PR (cached by `Dockerfile.rocm_base` content hash + build args)
- Build `Dockerfile.rocm` using the cached base image from `rocm/vllm-dev:base`
- Tag the result as `rocm/vllm-ci:<buildkite-commit>`
- Use `rocm/vllm-ci:<buildkite-commit>` to run tests

Technical details
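The history-containment gate described above can be sketched as a git ancestry check: the most recent main-branch commit that touched `Dockerfile.rocm_base` must be an ancestor of the PR head. This demo builds a throwaway repo to keep it self-contained; the real CI step presumably compares against `origin/main`:

```shell
# Hypothetical freshness check, demonstrated on a throwaway repo.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q -b main
git -c user.email=ci@example.com -c user.name=ci commit -q --allow-empty -m 'init'
echo 'FROM ubuntu' > Dockerfile.rocm_base
git add Dockerfile.rocm_base
git -c user.email=ci@example.com -c user.name=ci commit -q -m 'update base dockerfile'
git checkout -q -b pr-branch   # branched after the update, so it is fresh

# Newest main commit touching the file must be contained in the PR branch.
LATEST=$(git log -1 --format=%H main -- Dockerfile.rocm_base)
if git merge-base --is-ancestor "$LATEST" HEAD; then
  RESULT=fresh
else
  RESULT=stale
fi
echo "Dockerfile.rocm_base is $RESULT"
```

If the PR branch were created before that commit and never rebased, `merge-base --is-ancestor` would fail and the gate would report the file as stale.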
Buildkite step: Check `Dockerfile.rocm_base` freshness
- Runs on `small_cpu_queue_premerge`, as this step only involves a git clone and a freshness check of `Dockerfile.rocm_base`, so I picked a less occupied queue.

Buildkite step: AMD: :docker: Build/Reuse ROCm base image
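Roughly, this step's build-or-reuse decision amounts to the following. Note this is a stubbed sketch: `image_exists` stands in for a real registry lookup, and both tags are made up; the commands are only assembled, not executed:

```shell
# Stubbed sketch of the Build/Reuse decision: on a cache hit, retag the
# cached image via imagetools (no pull needed); on a miss, rebuild and push.
ECR_CACHE_TAG="vllm-rocm-base:cache-abc123"
ECR_COMMIT_TAG="vllm-rocm-base:commit-def456"
image_exists() {            # stub: pretend the cache tag already exists in ECR
  [ "$1" = "vllm-rocm-base:cache-abc123" ]
}
if image_exists "$ECR_CACHE_TAG"; then
  CMD="docker buildx imagetools create --tag $ECR_COMMIT_TAG $ECR_CACHE_TAG"
else
  CMD="docker build -f Dockerfile.rocm_base -t $ECR_COMMIT_TAG . && docker push $ECR_COMMIT_TAG"
fi
echo "$CMD"   # the command the step would run (not executed here)
```

The `imagetools create` path is what makes the cache-hit case cheap: it manipulates the registry manifest directly instead of pulling and re-pushing image layers.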
- Runs on `cpu_queue_postmerge` because we need a CPU machine that has read/write access to the ECR repository, the S3 sccache, and the S3 bucket.
- On a cache hit, runs `docker buildx imagetools create --tag "$${ECR_COMMIT_TAG}" "$${ECR_CACHE_TAG}"`, which creates a new ECR tag without needing to pull the docker image from the ECR repository.
- On a cache miss, builds `Dockerfile.rocm_base` and then pushes the docker image and dependency wheels to the S3 cache and the ECR repository.

Buildkite step: AMD: :docker: build image
- Runs on `amd-cpu` because we want to be able to push the AMD CI docker image to `rocm/vllm-ci`.

Test Plan
Test run the gating feature
Test Result
Failure case (validated): https://buildkite.com/vllm/amd-ci/builds/5307/steps/canvas?sid=019c8f0c-57c7-4e14-8f86-42c041629d55&tab=output
Success case: https://buildkite.com/vllm/amd-ci/builds/5608/steps/canvas
Follow-up PR
- `vllm/vllm-openai-rocm:base`

This PR is part of a larger plan to ship nightly docker images and ROCm wheels:
Docker image and rocm wheel nightly plan
The benefits are:
- Achieve the same releases as on CUDA: nightly docker and nightly developer wheel releases (nightly wheels can be used for a Python-only installation).
- A PR that changes `Dockerfile.rocm_base` will also use the new docker base to build `Dockerfile.rocm` in our CI, so we can test `Dockerfile.rocm_base` whenever the community tries to add new arch support to it.
- Since the build time has greatly decreased, we can also consider building docker images and wheels for different Python and ROCm versions.