Optimize dockerfile #28823

wzshiming · 2025-11-17T01:35:01Z

Fixes #28641

Docker build cache miss https://buildkite.com/vllm/ci/builds/42823/steps/canvas?sid=019b07e5-06f8-41b0-ba20-c96aaccbc5f2
- pull cache/image and build (about 29min)
- push cache/image (about 12min)
Docker build cache hit https://buildkite.com/vllm/ci/builds/42827/steps/canvas?sid=019b0829-a02b-4fa7-81bf-f1fbed1da5b9
- pull cache/image and build (about 15min)
- push cache/image (about 13min)
- As mentioned by Optimize dockerfile #28823 (comment) , the speed can be further improved.
Compare the main branch
- pull cache/image and build (about 28min)
- push cache/image (about 4min)
It looks like the size of the cache has increased a lot, and the network bandwidth seems to be insufficient.
- Currently, it is cached in public.ecr.aws. it's possible to switch to another internal network storage

Purpose

Test Plan

Test Result

Essential Elements of an Effective PR Description Checklist

The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
The test plan, such as providing test command.
The test results, such as pasting the results comparison before and after, or e2e results
(Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
(Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

gemini-code-assist

Code Review

This pull request aims to optimize the Dockerfile by leveraging build cache mounts, combining RUN layers, and reordering instructions to improve layer caching. While most of the changes are beneficial for build performance and image size, I've identified two critical issues that could break the Docker build. One issue involves moving a COPY instruction for the source code to a point after it's needed by a build step, which will cause the build to fail. The second issue is an incorrect environment variable setup for a Python virtual environment in a stage where no such environment is created. My review includes specific comments and suggestions to address these critical problems.

gemini-code-assist · 2025-11-17T01:36:48Z

docker/Dockerfile

+COPY . .
+ARG GIT_REPO_CHECK=0
+RUN --mount=type=bind,source=.git,target=.git \
+    if [ "$GIT_REPO_CHECK" != "0" ]; then bash tools/check_repo.sh ; fi


The wheel build step that uses sccache (lines 186-206) requires the project's source code to run python3 setup.py bdist_wheel. By moving COPY . . to after this step, the build will fail when USE_SCCACHE=1 because essential files like setup.py will be missing. To fix this, the source code must be copied before both wheel build steps (the one with sccache and the one without). Please move these lines to before line 186.

docker/Dockerfile

chatgpt-codex-connector · 2025-11-17T01:37:28Z

💡 Codex Review

vllm/docker/Dockerfile

Lines 204 to 208 in 491dbd5

    
                   && python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38 \ 
        
                   && sccache --show-stats; \ 
        
               fi 
        
           COPY . .

Copy sources before sccache build

When USE_SCCACHE=1, the build stage now runs python3 setup.py bdist_wheel before any project files are copied into the image. With no setup.py or source tree present, that branch will fail immediately, so sccache-accelerated builds can no longer succeed. Copying the workspace has to happen before invoking the wheel build in this branch.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

yeqcharlotte · 2025-12-05T05:54:13Z

@wzshiming @chaunceyjiang @rzabarazesh @amrmahdi - any followup on this pr? do you have test results?

cc: @robertgshaw2-redhat

wzshiming · 2025-12-05T07:58:39Z

This PR relies on vllm-project/ci-infra#212 which was just merged yesterday, I will continue to work on this

mergify · 2025-12-10T02:42:06Z

Documentation preview: https://vllm--28823.org.readthedocs.build/en/28823/

wzshiming

This is almost done. I'll check how much the speed has increased in the case of cached

wzshiming · 2025-12-10T10:10:38Z

docker/Dockerfile


-RUN --mount=type=cache,target=/root/.cache/uv \
+RUN --mount=type=cache,target=/root/.cache/uv,sharing=locked \
    --mount=type=bind,source=.git,target=.git \


The docker build cache misses everything since the presence of the .git causes it to do so.

This depends on the setup feature that automatically acquires the current version via the setuptools_scm.get_version

It can be specified using SETUPTOOLS_SCM_PRETEND_VERSION, but the actual version number is necessary here

I suggest using the --build-arg to set SETUPTOOLS_SCM_PRETEND_VERSION on CI, and remove the .git mount, and If accepted. If this suggestion is accepted, I will follow up on it #30686.

Please keep in mind that these images are used in both the CI pipeline and release pipeline. If you change the auto-detection to a build argument, you need to do it in the both pipeline config file / generator, e.g.:

docker build --build-arg WHEEL_VERSION=$(python3 -m setuptools_scm -f plain) ...

wzshiming · 2025-12-10T10:19:36Z

docker/Dockerfile

+    fi
+#################### WHEEL BUILD IMAGE ####################
+
+#################### DEEPGEMM BUILD IMAGE ####################


Build DeepDEEM as a separate stage to avoid cache misses

wzshiming · 2025-12-10T10:19:59Z

docker/Dockerfile

 RUN mkdir -p /tmp/deepgemm/dist && touch /tmp/deepgemm/dist/.deepgemm_skipped
+#################### DEEPGEMM BUILD IMAGE ####################
+
+#################### EXTENSION BUILD IMAGE ####################


wzshiming · 2025-12-10T10:24:09Z

docker/Dockerfile

 # note that this uses vllm installed by `pip`
 FROM vllm-base AS test

-ADD . /vllm-workspace/


Moved to end of this stage

wzshiming · 2025-12-10T10:53:36Z

docker/Dockerfile

-COPY examples examples
-COPY benchmarks benchmarks
-COPY ./vllm/collect_env.py .
-


It doesn't seem to be used at this stage, so moved to vllm-openai-base

wzshiming · 2025-12-10T11:12:16Z

docker/Dockerfile


 # Install system dependencies and uv, then create Python virtual environment
-RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
+RUN --mount=type=cache,id=apt-build,target=/var/lib/apt/lists \


base uses id=apt-build, and vllm-base uses id=apt-final, since they are based on different ubuntu versions.

wzshiming · 2025-12-10T11:18:02Z

docker/Dockerfile

-RUN --mount=type=cache,target=/root/.cache/uv \
+RUN --mount=type=cache,target=/root/.cache/pip \
    python3 -m pip install uv


This is installing uv with pip, so the uv cache is invalid.

Signed-off-by: Shiming Zhang <[email protected]>

wzshiming · 2025-12-10T12:53:19Z

@chaunceyjiang @rzabarazesh @amrmahdi - Please take a look if you have the time

orionr · 2025-12-12T17:36:03Z

docker/Dockerfile

-COPY requirements/common.txt requirements/common.txt
-COPY requirements/cuda.txt requirements/cuda.txt
 RUN --mount=type=cache,target=/root/.cache/uv \
+    --mount=type=bind,source=requirements/cuda.txt,target=requirements/cuda.txt,ro \


@wzshiming how much optimization do you get from these changes to reference and read-only for requirements/*.txt files? I'm checking because I have a parallel set of changes that need to rewrite these files for PyTorch nightlies (remove torch, torchaudio, torchtext) and also be used for regular builds at https://github.com/vllm-project/vllm/pull/30443/changes. Seems like we'll have some conflict, but I love the performance improvements here. cc @atalman @huydhn

This optimization removes two layers from the image, which is useful for the vllm-base stage — the final artifacts.

However, the effect is not significant for the base stage.

I don't mind resolving the conflict, so feel free to merge them first.

amrmahdi · 2025-12-13T21:15:05Z

Hi @wzshiming, thanks for this great work on optimizing the Dockerfile!

I've been working on a similar optimization in a different PR: #30626

Since you've been focusing on cache mounts and bind mounts to avoid layer invalidation, I focused more on rearranging the layers to pre-install slow-changing dependencies in vllm-base before the vLLM wheel installation. This way, incremental builds with Python-only changes can skip these expensive layers entirely.

Combining both approaches should give us the maximum benefit, lets coordinate the merge to avoid conflicts.

wzshiming · 2025-12-15T03:33:40Z

Hi @wzshiming, thanks for this great work on optimizing the Dockerfile!

I've been working on a similar optimization in a different PR: #30626

Since you've been focusing on cache mounts and bind mounts to avoid layer invalidation, I focused more on rearranging the layers to pre-install slow-changing dependencies in vllm-base before the vLLM wheel installation. This way, incremental builds with Python-only changes can skip these expensive layers entirely.

Combining both approaches should give us the maximum benefit, lets coordinate the merge to avoid conflicts.

I've had a look. The stage's dependencies have been changed nicely — it's very clear, but it seems that the position of the stage has been adjusted, so a conflict can't be avoided.

However, I don't mind resolving the conflict, so feel free to merge them first.

amrmahdi · 2025-12-15T06:00:35Z

Hi @wzshiming, thanks for this great work on optimizing the Dockerfile!
I've been working on a similar optimization in a different PR: #30626
Since you've been focusing on cache mounts and bind mounts to avoid layer invalidation, I focused more on rearranging the layers to pre-install slow-changing dependencies in vllm-base before the vLLM wheel installation. This way, incremental builds with Python-only changes can skip these expensive layers entirely.
Combining both approaches should give us the maximum benefit, lets coordinate the merge to avoid conflicts.

I've had a look. The stage's dependencies have been changed nicely — it's very clear, but it seems that the position of the stage has been adjusted, so a conflict can't be avoided.

However, I don't mind resolving the conflict, so feel free to merge them first.

Thanks @wzshiming, FWIW with my PR the rebuilds with python only changes affecting the leaf stage now takes ~16 minutes, see https://buildkite.com/vllm/ci/builds/43561/steps/canvas?sid=019b207f-7cc2-4327-8035-07fa0e925428

mergify · 2025-12-16T06:22:55Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wzshiming.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

wzshiming force-pushed the optimize/dockerfile branch from 344ea47 to 28a3425 Compare November 17, 2025 01:35

mergify bot added the ci/build label Nov 17, 2025

wzshiming force-pushed the optimize/dockerfile branch 2 times, most recently from 28a3425 to b89d1f6 Compare November 17, 2025 01:36

gemini-code-assist bot reviewed Nov 17, 2025

View reviewed changes

chaunceyjiang self-assigned this Nov 17, 2025

chaunceyjiang added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 17, 2025

wzshiming force-pushed the optimize/dockerfile branch 11 times, most recently from 777fa4f to af6226f Compare November 17, 2025 09:56

pacoxu mentioned this pull request Nov 28, 2025

[CI/Build] Replace COPY scripts with bind mounts to reduce layers #29201

Open

5 tasks

wzshiming closed this Dec 5, 2025

wzshiming reopened this Dec 5, 2025

wzshiming force-pushed the optimize/dockerfile branch from af6226f to d8b8db2 Compare December 5, 2025 08:13

Copilot AI review requested due to automatic review settings December 5, 2025 08:13

wzshiming force-pushed the optimize/dockerfile branch 4 times, most recently from a88493d to 977bde2 Compare December 5, 2025 08:31

mergify bot added the documentation Improvements or additions to documentation label Dec 10, 2025

wzshiming force-pushed the optimize/dockerfile branch 16 times, most recently from 7518e43 to 31ea3a8 Compare December 10, 2025 10:52

wzshiming commented Dec 10, 2025

View reviewed changes

wzshiming force-pushed the optimize/dockerfile branch from 31ea3a8 to 1a52d5e Compare December 10, 2025 12:07

wzshiming changed the title ~~[WIP] optimize dockerfile~~ Optimize dockerfile Dec 10, 2025

Optimize dockerfile

70eefee

Signed-off-by: Shiming Zhang <[email protected]>

wzshiming force-pushed the optimize/dockerfile branch from 1a52d5e to 70eefee Compare December 10, 2025 12:44

orionr reviewed Dec 12, 2025

View reviewed changes

mergify bot added the needs-rebase label Dec 16, 2025

Uh oh!

Optimize dockerfile #28823

Are you sure you want to change the base?

Optimize dockerfile #28823

Conversation

wzshiming commented Nov 17, 2025 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Test Plan

Test Result

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Nov 17, 2025

Choose a reason for hiding this comment

Uh oh!

Uh oh!

chatgpt-codex-connector bot commented Nov 17, 2025

💡 Codex Review

Uh oh!

yeqcharlotte commented Dec 5, 2025

Uh oh!

wzshiming commented Dec 5, 2025

Uh oh!

mergify bot commented Dec 10, 2025

Uh oh!

wzshiming left a comment

Choose a reason for hiding this comment

Uh oh!

wzshiming Dec 10, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Harry-Chen Dec 16, 2025

Choose a reason for hiding this comment

Uh oh!

wzshiming Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

wzshiming Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

wzshiming Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

wzshiming Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

wzshiming Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

wzshiming Dec 10, 2025

Choose a reason for hiding this comment

Uh oh!

wzshiming commented Dec 10, 2025

Uh oh!

orionr Dec 12, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wzshiming Dec 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

amrmahdi commented Dec 13, 2025

Uh oh!

wzshiming commented Dec 15, 2025

Uh oh!

amrmahdi commented Dec 15, 2025

Uh oh!

mergify bot commented Dec 16, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

wzshiming commented Nov 17, 2025 •

edited by github-actions bot

Loading

wzshiming Dec 10, 2025 •

edited

Loading

orionr Dec 12, 2025 •

edited

Loading

wzshiming Dec 15, 2025 •

edited

Loading