Optimize dockerfile #28823
Code Review
This pull request aims to optimize the Dockerfile by leveraging build cache mounts, combining RUN layers, and reordering instructions to improve layer caching. While most of the changes are beneficial for build performance and image size, I've identified two critical issues that could break the Docker build. One issue involves moving a COPY instruction for the source code to a point after it's needed by a build step, which will cause the build to fail. The second issue is an incorrect environment variable setup for a Python virtual environment in a stage where no such environment is created. My review includes specific comments and suggestions to address these critical problems.
docker/Dockerfile
Outdated
    COPY . .
    ARG GIT_REPO_CHECK=0
    RUN --mount=type=bind,source=.git,target=.git \
        if [ "$GIT_REPO_CHECK" != "0" ]; then bash tools/check_repo.sh ; fi
The wheel build step that uses sccache (lines 186-206) requires the project's source code to run python3 setup.py bdist_wheel. By moving COPY . . to after this step, the build will fail when USE_SCCACHE=1 because essential files like setup.py will be missing. To fix this, the source code must be copied before both wheel build steps (the one with sccache and the one without). Please move these lines to before line 186.
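The ordering the review asks for can be sketched roughly as follows. This is a simplified illustration, not the PR's actual Dockerfile: the base image, stage name, and the `echo` placeholder for the sccache setup are assumptions made for the example.

```dockerfile
# syntax=docker/dockerfile:1
# Hypothetical, simplified build stage (not the real vLLM Dockerfile).
FROM python:3.12-slim AS build
WORKDIR /workspace
RUN python3 -m pip install setuptools wheel

# Copy the source first so setup.py exists for BOTH wheel build steps below.
COPY . .

ARG USE_SCCACHE
# Wheel build with sccache (the echo stands in for the real sccache setup).
RUN if [ "$USE_SCCACHE" = "1" ]; then \
        echo "sccache would be configured here" && \
        python3 setup.py bdist_wheel --dist-dir=dist; \
    fi
# Wheel build without sccache.
RUN if [ "$USE_SCCACHE" != "1" ]; then \
        python3 setup.py bdist_wheel --dist-dir=dist; \
    fi
```

If `COPY . .` sat below either `RUN` step, that step would run in a directory without `setup.py` and fail.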
💡 Codex Review: Lines 204 to 208 in 491dbd5
@wzshiming @chaunceyjiang @rzabarazesh @amrmahdi - any follow-up on this PR? Do you have test results?
This PR relies on vllm-project/ci-infra#212, which was merged just yesterday. I will continue to work on this.
Documentation preview: https://vllm--28823.org.readthedocs.build/en/28823/
wzshiming
left a comment
This is almost done. I'll check how much faster the build is when the cache is hit.
    - RUN --mount=type=cache,target=/root/.cache/uv \
    + RUN --mount=type=cache,target=/root/.cache/uv,sharing=locked \
          --mount=type=bind,source=.git,target=.git \
The Docker build cache misses everything because the .git mount is present.
The mount is there because the setup derives the current version automatically via setuptools_scm.get_version.
The version can be overridden with SETUPTOOLS_SCM_PRETEND_VERSION, but the actual version number is needed here.
I suggest setting SETUPTOOLS_SCM_PRETEND_VERSION through --build-arg on CI and removing the .git mount. If this suggestion is accepted, I will follow up on it in #30686.
Please keep in mind that these images are used in both the CI pipeline and the release pipeline. If you change the auto-detection to a build argument, you need to do it in both pipelines' config files / generators, e.g.:
docker build --build-arg WHEEL_VERSION=$(python3 -m setuptools_scm -f plain) ...
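On the Dockerfile side, the build argument could be consumed roughly like this. This is a sketch under assumptions: the base image and stage name are illustrative, `WHEEL_VERSION` is the argument name from the command above, and the pipeline is assumed to always pass a non-empty value.

```dockerfile
# syntax=docker/dockerfile:1
# Hypothetical fragment: take the version from a build argument instead of .git.
FROM python:3.12-slim AS build
WORKDIR /workspace

# Supplied by the pipeline via --build-arg WHEEL_VERSION=...
# (assumed to be non-empty; an empty value is not handled here).
ARG WHEEL_VERSION
# setuptools_scm reads this variable instead of inspecting git metadata.
ENV SETUPTOOLS_SCM_PRETEND_VERSION=${WHEEL_VERSION}

COPY . .
# No .git bind mount is needed, so git history no longer feeds into this step's cache key.
RUN --mount=type=cache,target=/root/.cache/pip \
    python3 -m pip wheel --no-deps --wheel-dir dist .
```

The trade-off is that every pipeline that builds the image has to compute and pass the version itself.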
    fi
    #################### WHEEL BUILD IMAGE ####################

    #################### DEEPGEMM BUILD IMAGE ####################
Build DeepGEMM as a separate stage to avoid cache misses
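A minimal sketch of the separate-stage pattern, assuming illustrative stage names and base images; the `RUN` line is only a stand-in for the real DeepGEMM build and reuses the output path shown in the diff.

```dockerfile
# syntax=docker/dockerfile:1
# Hypothetical stage names and base images, for illustration only.
FROM python:3.12-slim AS deepgemm-build
# Stand-in for the real build; its output lands in /tmp/deepgemm/dist.
RUN mkdir -p /tmp/deepgemm/dist && touch /tmp/deepgemm/dist/.deepgemm_skipped

FROM python:3.12-slim AS final
# Only the output directory is copied in, so edits to unrelated source files
# in other stages do not invalidate the DeepGEMM build.
COPY --from=deepgemm-build /tmp/deepgemm/dist /tmp/deepgemm/dist
```

Because the stage has no dependency on the main source `COPY`, its cached result can be reused across builds that only touch Python code.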
    RUN mkdir -p /tmp/deepgemm/dist && touch /tmp/deepgemm/dist/.deepgemm_skipped
    #################### DEEPGEMM BUILD IMAGE ####################

    #################### EXTENSION BUILD IMAGE ####################
ditto
    # note that this uses vllm installed by `pip`
    FROM vllm-base AS test

    ADD . /vllm-workspace/
Moved to end of this stage
    COPY examples examples
    COPY benchmarks benchmarks
    COPY ./vllm/collect_env.py .
This doesn't seem to be used in this stage, so I moved it to vllm-openai-base.
      # Install system dependencies and uv, then create Python virtual environment
    - RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
    + RUN --mount=type=cache,id=apt-build,target=/var/lib/apt/lists \
base uses id=apt-build and vllm-base uses id=apt-final, since they are based on different Ubuntu versions.
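The separate cache ids might look like the sketch below. The Ubuntu releases and the installed package are assumptions chosen for illustration; only the `id` values and the target path come from the discussion.

```dockerfile
# syntax=docker/dockerfile:1
# Hypothetical Ubuntu versions; the real stages may use different ones.
FROM ubuntu:20.04 AS base
# Package lists for the build-time Ubuntu release.
RUN --mount=type=cache,id=apt-build,target=/var/lib/apt/lists,sharing=locked \
    apt-get update -y && \
    apt-get install -y --no-install-recommends ca-certificates

FROM ubuntu:22.04 AS vllm-base
# A different id keeps this release's package lists apart from the ones above.
RUN --mount=type=cache,id=apt-final,target=/var/lib/apt/lists,sharing=locked \
    apt-get update -y && \
    apt-get install -y --no-install-recommends ca-certificates
```

Without an explicit `id`, a cache mount is identified by its `target`, so both stages would share one `/var/lib/apt/lists` cache even though their package indexes differ.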
    - RUN --mount=type=cache,target=/root/.cache/uv \
    + RUN --mount=type=cache,target=/root/.cache/pip \
          python3 -m pip install uv
This step installs uv with pip, so the uv cache mount has no effect here; the pip cache is mounted instead.
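As a rough sketch of matching the cache mount to the tool that actually runs (the base image and the `requests` package are placeholders, not taken from the PR):

```dockerfile
# syntax=docker/dockerfile:1
# Hypothetical base image and package, for illustration only.
FROM python:3.12-slim
# pip performs this install, so pip's cache directory is the one to mount.
RUN --mount=type=cache,target=/root/.cache/pip \
    python3 -m pip install uv
# Later installs go through uv, so they mount uv's cache instead.
RUN --mount=type=cache,target=/root/.cache/uv \
    uv pip install --system requests
```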
Signed-off-by: Shiming Zhang <[email protected]>
@chaunceyjiang @rzabarazesh @amrmahdi - Please take a look if you have the time
    COPY requirements/common.txt requirements/common.txt
    COPY requirements/cuda.txt requirements/cuda.txt
    RUN --mount=type=cache,target=/root/.cache/uv \
        --mount=type=bind,source=requirements/cuda.txt,target=requirements/cuda.txt,ro \
@wzshiming how much improvement do you get from switching the requirements/*.txt files to read-only bind mounts? I'm checking because I have a parallel set of changes that rewrite these files for PyTorch nightlies (removing torch, torchaudio, torchtext) and must also work for regular builds: https://github.com/vllm-project/vllm/pull/30443/changes. Seems like we'll have some conflict, but I love the performance improvements here. cc @atalman @huydhn
This optimization removes two layers from the image, which is useful for the vllm-base stage — the final artifacts.
However, the effect is not significant for the base stage.
I don't mind resolving the conflict, so feel free to merge them first.
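The layer saving described above can be sketched as follows, assuming an illustrative base image and a simplified install command; the bind-mount options are the ones shown in the diff.

```dockerfile
# syntax=docker/dockerfile:1
# Hypothetical base image; the install command is simplified.
FROM python:3.12-slim
WORKDIR /workspace
RUN --mount=type=cache,target=/root/.cache/pip \
    python3 -m pip install uv
# The bind mount exposes the file only while this RUN step executes,
# so no separate COPY layer for it ends up in the image.
RUN --mount=type=cache,target=/root/.cache/uv \
    --mount=type=bind,source=requirements/cuda.txt,target=requirements/cuda.txt,ro \
    uv pip install --system -r requirements/cuda.txt
```

A step that needs to rewrite the requirements file in place would not work with this read-only mount, which is the source of the potential conflict raised above.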
Hi @wzshiming, thanks for this great work on optimizing the Dockerfile! I've been working on a similar optimization in a different PR: #30626. While you've been focusing on cache mounts and bind mounts to avoid layer invalidation, I focused more on rearranging the layers to pre-install slow-changing dependencies in vllm-base before the vLLM wheel installation. This way, incremental builds with Python-only changes can skip these expensive layers entirely. Combining both approaches should give us the maximum benefit, so let's coordinate the merge to avoid conflicts.
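The layer ordering described in this comment might look roughly like the sketch below. It is not taken from #30626: the base image, the use of pip, and the `dist/` wheel location are assumptions for illustration.

```dockerfile
# syntax=docker/dockerfile:1
# Hypothetical base image and wheel location, for illustration only.
FROM python:3.12-slim AS vllm-base
WORKDIR /workspace
# Slow-changing dependencies go first; this layer is reused until the file changes.
COPY requirements/common.txt requirements/common.txt
RUN --mount=type=cache,target=/root/.cache/pip \
    python3 -m pip install -r requirements/common.txt
# The wheel changes with every Python edit, so it is installed last.
COPY dist/*.whl /tmp/dist/
RUN --mount=type=cache,target=/root/.cache/pip \
    python3 -m pip install /tmp/dist/*.whl
```

With this ordering, a Python-only change invalidates only the last two layers.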
I've had a look. The stage's dependencies have been reworked nicely and are very clear, but the position of the stage has also changed, so a conflict can't be avoided. However, I don't mind resolving the conflict, so feel free to merge yours first.
Thanks @wzshiming. FWIW, with my PR, rebuilds with Python-only changes affecting the leaf stage now take ~16 minutes; see https://buildkite.com/vllm/ci/builds/43561/steps/canvas?sid=019b207f-7cc2-4327-8035-07fa0e925428
This pull request has merge conflicts that must be resolved before it can be merged.
Fixes #28641
Docker build cache miss https://buildkite.com/vllm/ci/builds/42823/steps/canvas?sid=019b07e5-06f8-41b0-ba20-c96aaccbc5f2
Docker build cache hit https://buildkite.com/vllm/ci/builds/42827/steps/canvas?sid=019b0829-a02b-4fa7-81bf-f1fbed1da5b9
Compare the main branch
It looks like the size of the cache has increased a lot, and the network bandwidth seems to be insufficient.