Skip to content

Conversation

@wzshiming
Copy link
Contributor

@wzshiming wzshiming commented Nov 17, 2025

Fixes #28641

Purpose

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added the ci/build label Nov 17, 2025
@wzshiming wzshiming force-pushed the optimize/dockerfile branch 2 times, most recently from 28a3425 to b89d1f6 Compare November 17, 2025 01:36
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to optimize the Dockerfile by leveraging build cache mounts, combining RUN layers, and reordering instructions to improve layer caching. While most of the changes are beneficial for build performance and image size, I've identified two critical issues that could break the Docker build. One issue involves moving a COPY instruction for the source code to a point after it's needed by a build step, which will cause the build to fail. The second issue is an incorrect environment variable setup for a Python virtual environment in a stage where no such environment is created. My review includes specific comments and suggestions to address these critical problems.

Comment on lines 208 to 212
COPY . .
ARG GIT_REPO_CHECK=0
RUN --mount=type=bind,source=.git,target=.git \
if [ "$GIT_REPO_CHECK" != "0" ]; then bash tools/check_repo.sh ; fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The wheel build step that uses sccache (lines 186-206) requires the project's source code to run python3 setup.py bdist_wheel. By moving COPY . . to after this step, the build will fail when USE_SCCACHE=1 because essential files like setup.py will be missing. To fix this, the source code must be copied before both wheel build steps (the one with sccache and the one without). Please move these lines to before line 186.

@chatgpt-codex-connector
Copy link

💡 Codex Review

vllm/docker/Dockerfile

Lines 204 to 208 in 491dbd5

&& python3 setup.py bdist_wheel --dist-dir=dist --py-limited-api=cp38 \
&& sccache --show-stats; \
fi
COPY . .

P1 Badge Copy sources before sccache build

When USE_SCCACHE=1, the build stage now runs python3 setup.py bdist_wheel before any project files are copied into the image. With no setup.py or source tree present, that branch will fail immediately, so sccache-accelerated builds can no longer succeed. Copying the workspace has to happen before invoking the wheel build in this branch.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

@chaunceyjiang chaunceyjiang self-assigned this Nov 17, 2025
@chaunceyjiang chaunceyjiang added the ready ONLY add when PR is ready to merge/full CI is needed label Nov 17, 2025
@wzshiming wzshiming force-pushed the optimize/dockerfile branch 11 times, most recently from 777fa4f to af6226f Compare November 17, 2025 09:56
@yeqcharlotte
Copy link
Collaborator

@wzshiming @chaunceyjiang @rzabarazesh @amrmahdi - any followup on this pr? do you have test results?

cc: @robertgshaw2-redhat

@wzshiming wzshiming closed this Dec 5, 2025
@wzshiming
Copy link
Contributor Author

This PR relies on vllm-project/ci-infra#212 which was just merged yesterday, I will continue to work on this

@wzshiming wzshiming reopened this Dec 5, 2025
@wzshiming wzshiming force-pushed the optimize/dockerfile branch from af6226f to d8b8db2 Compare December 5, 2025 08:13
Copilot AI review requested due to automatic review settings December 5, 2025 08:13
@wzshiming wzshiming force-pushed the optimize/dockerfile branch 4 times, most recently from a88493d to 977bde2 Compare December 5, 2025 08:31
@mergify
Copy link

mergify bot commented Dec 10, 2025

Documentation preview: https://vllm--28823.org.readthedocs.build/en/28823/

@mergify mergify bot added the documentation Improvements or additions to documentation label Dec 10, 2025
@wzshiming wzshiming force-pushed the optimize/dockerfile branch 16 times, most recently from 7518e43 to 31ea3a8 Compare December 10, 2025 10:52
Copy link
Contributor Author

@wzshiming wzshiming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is almost done. I'll check how much the speed has increased in the case of cached


RUN --mount=type=cache,target=/root/.cache/uv \
RUN --mount=type=cache,target=/root/.cache/uv,sharing=locked \
--mount=type=bind,source=.git,target=.git \
Copy link
Contributor Author

@wzshiming wzshiming Dec 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The docker build cache misses everything since the presence of the .git causes it to do so.

This depends on the setup feature that automatically acquires the current version via the setuptools_scm.get_version

It can be specified using SETUPTOOLS_SCM_PRETEND_VERSION, but the actual version number is necessary here

I suggest using the --build-arg to set SETUPTOOLS_SCM_PRETEND_VERSION on CI, and remove the .git mount, and If accepted. If this suggestion is accepted, I will follow up on it #30686.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep in mind that these images are used in both the CI pipeline and release pipeline. If you change the auto-detection to a build argument, you need to do it in the both pipeline config file / generator, e.g.:

docker build --build-arg WHEEL_VERSION=$(python3 -m setuptools_scm -f plain) ...

fi
#################### WHEEL BUILD IMAGE ####################

#################### DEEPGEMM BUILD IMAGE ####################
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build DeepDEEM as a separate stage to avoid cache misses

RUN mkdir -p /tmp/deepgemm/dist && touch /tmp/deepgemm/dist/.deepgemm_skipped
#################### DEEPGEMM BUILD IMAGE ####################

#################### EXTENSION BUILD IMAGE ####################
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto

# note that this uses vllm installed by `pip`
FROM vllm-base AS test

ADD . /vllm-workspace/
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to end of this stage

Comment on lines -464 to -467
COPY examples examples
COPY benchmarks benchmarks
COPY ./vllm/collect_env.py .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It doesn't seem to be used at this stage, so moved to vllm-openai-base


# Install system dependencies and uv, then create Python virtual environment
RUN echo 'tzdata tzdata/Areas select America' | debconf-set-selections \
RUN --mount=type=cache,id=apt-build,target=/var/lib/apt/lists \
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

base uses id=apt-build, and vllm-base uses id=apt-final, since they are based on different ubuntu versions.

Comment on lines -434 to 455
RUN --mount=type=cache,target=/root/.cache/uv \
RUN --mount=type=cache,target=/root/.cache/pip \
python3 -m pip install uv
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is installing uv with pip, so the uv cache is invalid.

@wzshiming wzshiming changed the title [WIP] optimize dockerfile Optimize dockerfile Dec 10, 2025
Signed-off-by: Shiming Zhang <[email protected]>
@wzshiming
Copy link
Contributor Author

@chaunceyjiang @rzabarazesh @amrmahdi - Please take a look if you have the time

COPY requirements/common.txt requirements/common.txt
COPY requirements/cuda.txt requirements/cuda.txt
RUN --mount=type=cache,target=/root/.cache/uv \
--mount=type=bind,source=requirements/cuda.txt,target=requirements/cuda.txt,ro \
Copy link

@orionr orionr Dec 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wzshiming how much optimization do you get from these changes to reference and read-only for requirements/*.txt files? I'm checking because I have a parallel set of changes that need to rewrite these files for PyTorch nightlies (remove torch, torchaudio, torchtext) and also be used for regular builds at https://github.com/vllm-project/vllm/pull/30443/changes. Seems like we'll have some conflict, but I love the performance improvements here. cc @atalman @huydhn

Copy link
Contributor Author

@wzshiming wzshiming Dec 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This optimization removes two layers from the image, which is useful for the vllm-base stage — the final artifacts.

However, the effect is not significant for the base stage.

I don't mind resolving the conflict, so feel free to merge them first.

@amrmahdi
Copy link
Contributor

Hi @wzshiming, thanks for this great work on optimizing the Dockerfile!

I've been working on a similar optimization in a different PR: #30626

Since you've been focusing on cache mounts and bind mounts to avoid layer invalidation, I focused more on rearranging the layers to pre-install slow-changing dependencies in vllm-base before the vLLM wheel installation. This way, incremental builds with Python-only changes can skip these expensive layers entirely.

Combining both approaches should give us the maximum benefit, lets coordinate the merge to avoid conflicts.

@wzshiming
Copy link
Contributor Author

Hi @wzshiming, thanks for this great work on optimizing the Dockerfile!

I've been working on a similar optimization in a different PR: #30626

Since you've been focusing on cache mounts and bind mounts to avoid layer invalidation, I focused more on rearranging the layers to pre-install slow-changing dependencies in vllm-base before the vLLM wheel installation. This way, incremental builds with Python-only changes can skip these expensive layers entirely.

Combining both approaches should give us the maximum benefit, lets coordinate the merge to avoid conflicts.

I've had a look. The stage's dependencies have been changed nicely — it's very clear, but it seems that the position of the stage has been adjusted, so a conflict can't be avoided.

However, I don't mind resolving the conflict, so feel free to merge them first.

@amrmahdi
Copy link
Contributor

Hi @wzshiming, thanks for this great work on optimizing the Dockerfile!
I've been working on a similar optimization in a different PR: #30626
Since you've been focusing on cache mounts and bind mounts to avoid layer invalidation, I focused more on rearranging the layers to pre-install slow-changing dependencies in vllm-base before the vLLM wheel installation. This way, incremental builds with Python-only changes can skip these expensive layers entirely.
Combining both approaches should give us the maximum benefit, lets coordinate the merge to avoid conflicts.

I've had a look. The stage's dependencies have been changed nicely — it's very clear, but it seems that the position of the stage has been adjusted, so a conflict can't be avoided.

However, I don't mind resolving the conflict, so feel free to merge them first.

Thanks @wzshiming, FWIW with my PR the rebuilds with python only changes affecting the leaf stage now takes ~16 minutes, see https://buildkite.com/vllm/ci/builds/43561/steps/canvas?sid=019b207f-7cc2-4327-8035-07fa0e925428

@mergify
Copy link

mergify bot commented Dec 16, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @wzshiming.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 16, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci/build documentation Improvements or additions to documentation needs-rebase ready ONLY add when PR is ready to merge/full CI is needed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature][P0]: Optimize Dockerfile Layer Ordering

6 participants