
Make a start at optimising the backend container image builds#402

Closed
spwoodcock wants to merge 10 commits into develop from build/multistage-image

Conversation

@spwoodcock
Member

@spwoodcock spwoodcock commented May 4, 2025

Issue

  • We had a user report on Slack that the image build took up all their disk space.
  • Upon investigation, I noticed the final build is 19 GB and takes well over an hour to complete.

This PR

  • Only have one build directive in the compose file. Currently we build the same image for the api and worker in parallel, meaning the build cache requirements are doubled from the start.
  • Optimise the backend dockerfile by making it multi-stage.
  • Push the final image to GitHub container registry, to avoid the lengthy build time for contributors (they can simply pull). EDIT: I found ghcr.io/hotosm/fair_api:develop, so just referenced that instead.
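The single-build-directive change in the first bullet could look roughly like this in the compose file (a sketch only; service names and the image tag are taken from this thread, not the repo's actual file):

```yaml
services:
  api:
    # Only the api service carries a build directive, so the image
    # is built exactly once instead of twice in parallel.
    build: .
    image: ghcr.io/hotosm/fair_api:develop
  worker:
    # The worker reuses the image produced by the api build rather
    # than triggering a second build of the same Dockerfile.
    image: ghcr.io/hotosm/fair_api:develop
```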

Using a squashed multi-stage build shaved off 3 GB, making it a total of 16 GB now... I didn't get a chance to test it, but I see no reason why things would change (we simply remove the cache/temp files dumped during install). It could probably be optimised further, but I'll need to come back to it in another PR (or someone else could look at it 🙏)

Key question: do we actually need all the deps installed here?

Not included

  • I wanted to also address Minimize Dockerfile, Separate Docker image for worker and API #171, but I noticed there are multiple Dockerfiles in the repo, plus requirements.txt imports api-requirements.txt.
  • I'm a bit time-constrained currently, so couldn't dig into what the implications of splitting out this Dockerfile would be & whether I need to update them all.
  • I think before addressing this we should consolidate and reorganise our docker setup.
  • The simplest solution would be to just add separate api-build and worker-build stages, where we install only the required dependencies for each.
  • Also, we could definitely shave a few GB by only including libgdalXX in the final build, and not libgdal-dev.
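The last two bullets combined could look roughly like this (a sketch only: the stage names, base images, and package names such as libgdal28 are illustrative assumptions, not the repo's actual Dockerfile):

```dockerfile
# Shared base with build tooling and the -dev headers needed to compile
FROM python:3.8 AS base
RUN apt-get update \
    && apt-get install -y --no-install-recommends libgdal-dev build-essential \
    && rm -rf /var/lib/apt/lists/*

# Install only what the API needs
FROM base AS api-build
COPY api-requirements.txt .
RUN pip install --no-cache-dir -r api-requirements.txt

# Install only what the worker needs
FROM base AS worker-build
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Final image: runtime GDAL only (libgdalXX), not libgdal-dev
FROM python:3.8-slim AS runtime
RUN apt-get update \
    && apt-get install -y --no-install-recommends libgdal28 \
    && rm -rf /var/lib/apt/lists/*
COPY --from=worker-build /usr/local/lib/python3.8/site-packages \
     /usr/local/lib/python3.8/site-packages
```

Each final image would then COPY from its own build stage, so neither carries the other's dependencies or the -dev packages.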

@spwoodcock
Member Author

I just broke this by splitting out the pip installs into multiple steps.

I think the issue is that the in-tree-build strategy is only available in more recent pip versions.
We are probably constrained by the pip version available to us on Python 3.8, due to #403.
That issue should probably be addressed in parallel.

@spwoodcock spwoodcock marked this pull request as draft May 5, 2025 00:42
@EmceeEscher

Sam, I took a look at this to see if I could use it to get my setup running locally. I was able to fix the errors in the build by removing the --use-feature=in-tree-build flag when installing solaris and scikit. The newest version of pip has in-tree-build as default behavior, so it's no longer a valid argument for --use-feature. That got me past that build step (although I still ran out of disk space 🥲).
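The fix described here amounts to just dropping the flag (the install commands are reconstructed from the error log later in this thread):

```shell
# Before: fails on modern pip, which has retired the flag
# pip install --use-feature=in-tree-build /tmp/solaris

# After: in-tree builds became the default around pip 21.3,
# so the plain invocation behaves the same way
pip install /tmp/solaris && pip install scikit-fmm
```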

@spwoodcock
Member Author

spwoodcock commented May 6, 2025

> Sam, I took a look at this to see if I could use it to get my setup running locally. I was able to fix the errors in the build by removing the --use-feature=in-tree-build flag when installing solaris and scikit. The newest version of pip has in-tree-build as default behavior, so it's no longer a valid argument for --use-feature. That got me past that build step (although I still ran out of disk space 🥲).

Oh nice, thanks for pointing that out!

Another thing I can think of: do you have the latest version of docker and compose installed? To ensure it's using a modern version of BuildKit. Also, if you set COMPOSE_BAKE=true in your environment, it will use a slightly more efficient builder.

But the key thing we need to do here is possibly be more selective with the dependencies we install. Not sure if it's possible to reduce down some of the ML libs or GPU drivers.

The quickest solution for you is for me to build and push the image to the registry, then you simply pull. Does that sound ok?

(Sorry I haven't got much time to dig into this further personally at the moment, but I think its on @kshitijrajsharma's todo list too) 😄
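The two suggestions above translate to roughly this (the image tag comes from the PR description; exact behaviour depends on your docker/compose versions):

```shell
# Use the bake-based builder for compose builds (Compose v2.33+)
export COMPOSE_BAKE=true
docker compose build

# Or skip building entirely and pull the prebuilt image
docker pull ghcr.io/hotosm/fair_api:develop
```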

@spwoodcock
Member Author

I removed the flags, but I still get an error on install:

```
321.9       File "/tmp/pip-build-env-splolahb/overlay/lib/python3.8/site-packages/pyproject_metadata/__init__.py", line 293, in __post_init__
321.9         self.validate()
321.9       File "/tmp/pip-build-env-splolahb/overlay/lib/python3.8/site-packages/pyproject_metadata/__init__.py", line 516, in validate
321.9         except packaging.utils.InvalidName:
321.9     AttributeError: module 'packaging.utils' has no attribute 'InvalidName'
321.9     ----------------------------------------
321.9 ERROR: Command errored out with exit status 1: /usr/bin/python3 /usr/local/lib/python3.8/dist-packages/pip/_vendor/pep517/_in_process.py prepare_metadata_for_build_wheel /tmp/tmpbdvn1osd Check the logs for full command output.
321.9 WARNING: You are using pip version 20.2.4; however, version 25.0.1 is available.
321.9 You should consider upgrading via the '/usr/bin/python3 -m pip install --upgrade pip' command.
------
failed to solve: process "/bin/bash -c pip install /tmp/solaris &&     pip install scikit-fmm" did not complete successfully: exit code: 1
```

I would assume that Python 3.8 doesn't support the latest version of scikit-fmm, as it's not pinned to any version.
Will have to come back to this!
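One hedged workaround, assuming the failure really is a too-new scikit-fmm build on Python 3.8, would be to constrain the version instead of installing the latest (the upper bound below is a guess and would need testing):

```shell
# Hypothetical pin; the right bound for Python 3.8 needs verifying
pip install /tmp/solaris && pip install "scikit-fmm<2023"
```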

@spwoodcock spwoodcock marked this pull request as ready for review May 7, 2025 00:24
@spwoodcock
Member Author

Fixed 👍

@kshitijrajsharma
Member

kshitijrajsharma commented May 8, 2025

Coming back to this :

> Key question: do we actually need all the deps installed here?

Perhaps not. The ideal solution would be to take the ramp dependencies out, build them in a separate image, and import that into our worker docker image. The tensorflow image size is around 3.6 GB (https://hub.docker.com/r/tensorflow/tensorflow/tags); I am assuming it will take up to 5 GB after GDAL and bindings!

This PR is still failing the worker build; I will investigate what happened!

@spwoodcock
Member Author

Lol no need to investigate 😆

[screenshot: CI build log showing the runner out of disk space]

The runner ran out of space building!

@kshitijrajsharma
Member

kshitijrajsharma commented May 9, 2025

Haha! I am reducing the size!

@spwoodcock
Member Author

Closing in favour of #405

@spwoodcock spwoodcock closed this May 28, 2025
