
chore(docker): reduce size between docker builds #7571

Open · wants to merge 4 commits into main
Conversation

keturn (Contributor) commented Jan 18, 2025

…by adding a layer with all the PyTorch dependencies that rarely change.

Summary

Every time the main docker images rebuild and I pull main-cuda, it gets another 3+ GB, which seems like about a zillion times too much since most things don't change from one commit on main to the next.

This is an attempt to follow the guidance in Using uv in Docker: Intermediate Layers, so there's one layer that installs all the dependencies—including PyTorch with its bundled NVIDIA libraries—before the project's own frequently-changing files are copied into the image.
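The pattern from that uv guide can be sketched roughly like this (base image, paths, and flags here are illustrative, not the exact Dockerfile in this PR):

```dockerfile
# Hypothetical sketch of the intermediate-layers pattern from the uv docs.
FROM python:3.11-slim AS builder
COPY --from=ghcr.io/astral-sh/uv:latest /uv /usr/local/bin/uv

WORKDIR /app
# Layer 1: install only the dependencies, which rarely change.
# --no-install-project skips installing the project itself.
COPY pyproject.toml uv.lock ./
RUN uv sync --frozen --no-install-project

# Layer 2: copy the frequently-changing project sources and install them.
# Only this layer is invalidated by ordinary source edits.
COPY . .
RUN uv sync --frozen
```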

Related Issues / Discussions

QA Instructions

Hopefully the CI system building the docker images is sufficient.

But there is one change to pyproject.toml related to xformers, so it'd be worth checking that python -m xformers.info still says it has triton on the platforms that expect it.

Merge Plan

I don't expect this to be a disruptive merge.

(An earlier revision of this PR moved the venv, but I've reverted that change at ebr's recommendation.)

Checklist

  • The PR has a short but descriptive title, suitable for a changelog
  • Tests added / updated (if applicable)
  • Documentation added / updated (if applicable)
  • Updated What's New copy (if doing a release after this PR)

@github-actions bot added the docker, Root, and python-deps (PRs that change python dependencies) labels on Jan 18, 2025
Review comment on deleted lines 104–105:
# Auxiliary dependencies, pinned only if necessary.
"triton; sys_platform=='linux'",
keturn (author):

As far as I can tell, CUDA builds of torch 2.4.1 have a dependency on triton, which has two consequences:

  1. triton is installed without us declaring the dependency here.
  2. Some versions of triton conflict with that torch dependency; e.g. torch 2.4.1+cu124 requires triton 3.0.0 and will conflict with triton 3.1.0.

If we had an exact dependency on torch==2.4.1+cu124, uv's resolver would figure out which version of triton works, and it would be fine.

However, since the torch version is left ambiguous, it's possible for the resolver to decide that non-CUDA torch==2.4.1 is a better solution since that doesn't conflict with the latest version of xformers. 😖
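As a hedged illustration of the resolver point above (this is not the project's actual configuration, and it assumes a cu124 package index is already set up for uv):

```toml
# Hypothetical pyproject.toml fragment, for illustration only.
# An exact pin on the CUDA build lets the resolver pick the matching
# triton (3.0.0 for torch 2.4.1+cu124) instead of falling back to the
# non-CUDA torch 2.4.1 to satisfy the latest xformers.
dependencies = [
    "torch==2.4.1+cu124",
]
```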

Member:

We'll hopefully be bumping torch soon to support the new Nvidia GPUs, so we'll address it then. There's also an additional quirk of requiring a different torch pin for AMD, so the rabbit hole will go even deeper.

keturn (author):

yeah, and uv has a documentation page about torch specifically: https://docs.astral.sh/uv/guides/integration/pytorch/#configuring-accelerators-with-optional-dependencies

I did make an attempt at implementing that here, but its scope quickly grew far past what I wanted to deal with in this docker-focused PR.


keturn commented Jan 18, 2025

This Dockerfile is also quirky in that it separates builder and runtime stages, but then it puts all the build-deps in the runtime stage anyway (to build patchmatch?), which kinda defeats the purpose. But I think we can leave that alone for now as an independent concern.

There is one thing I haven't confirmed for the space savings: my test builds have been with podman (buildah), not buildkit. buildah doesn't support COPY --link, so the interaction between the stages and layers isn't exactly the same…
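For reference, the BuildKit feature in question looks roughly like this (the paths are illustrative):

```dockerfile
# Hypothetical fragment. With BuildKit, COPY --link creates the copied
# layer independently of the base image's earlier layers, so it can be
# reused even when those layers change. buildah does not support --link,
# so the caching interaction differs there.
COPY --link --from=builder /opt/venv /opt/venv
```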


ebr commented Feb 15, 2025

hey @keturn, thanks for this! We tried this some time ago with pip, but it caused more headaches than it was worth. Now with uv this seems like a very sound approach.

This will work well for someone building the images locally. But just to set expectations: we're not currently doing any caching of Docker layers in GHA runners (we might revisit that at some point), so the intermediate layers will be rebuilt anyway. So if the expectation is to not have to pull the 2.5 GB pytorch layer every time, that is unfortunately still going to keep happening, at least for now. But again, if you're building the image locally, it should help quite a bit on rebuilds.
That said, the GHA docker builds are failing right now, so once I fix that perhaps we can rebase this PR and I'll come back to re-reviewing it. Will keep you posted.


ebr commented Feb 15, 2025

> This Dockerfile is also quirky in that it separates builder and runtime stages, but then it puts all the build-deps in the runtime stage anyway (to build patchmatch?), which kinda defeats the purpose. But I think we can leave that alone for now as an independent concern.

Indeed, this is needed to build patchmatch. I hope at some point we no longer have to do this, but for now it's OK. There's still a small benefit to keeping the separate runtime image, because we just don't need to worry about any other cruft that may be left in the builder image, so I'd like to keep it this way for the time being.

ENV PYTHON_VERSION=3.11
ENV INVOKEAI_ROOT=/invokeai
ENV INVOKEAI_HOST=0.0.0.0
ENV INVOKEAI_PORT=9090
ENV PATH="$VIRTUAL_ENV/bin:$INVOKEAI_SRC:$PATH"
ENV PATH="$INVOKEAI_SRC/.venv/bin:$INVOKEAI_SRC:$PATH"
ebr (Member):

Keeping the .venv out of the source dir enables using this image for development on non-Linux systems. E.g. someone on a Mac can mount their INVOKEAI_SRC into the image while avoiding compatibility issues with their local macOS-specific .venv. We should keep it as-is.

keturn (author):

Alright, I believe I've resolved this, so it's back to using /opt/venv.

keturn and others added 3 commits on February 16, 2025:

> Including just invokeai/version seems sufficient to appease uv sync here. Including everything else would invalidate the cache we're trying to establish.

keturn commented Feb 16, 2025

> But just to set expectations: we're not currently doing any caching of Docker layers in GHA runners (we might revisit that at some point), so the intermediate layers will be rebuilt anyway. So IF the expectation is to not have to pull the 2.5GB pytorch layer every time - that is unfortunately still going to continue happening, at least for now.

This got me to revisit the docs on Docker cache invalidation. From what I gather, the upshot is:

  • no build-cache means those layers will be re-built every time, and
  • bit-for-bit reproducible docker builds are still an esoteric subject, so
  • a re-built layer gets published as a new thing, even if its contents are semantically the same?

well, that's a bit disappointing, but I guess this PR is still a step in the right direction if we are going all-in with uv.
