[WIP] Completely rework the build process #3
base: master
Conversation
@razorsedge I see you are also working on changes; perhaps we could discuss this further. (I have added you on LinkedIn.)
I have finally managed to give this PR a try. Interesting. The main thing I am confused about is the change to stop building inside the image. My limited knowledge of using Docker to build software tells me that the caching layers in a Dockerfile can speed up repeated builds by not re-running steps that have not changed, i.e. if I am only changing the …
@razorsedge You are correct that with this PR Python would need to be recompiled for any change. However, this is easily fixed by caching the compiled Python binaries on the Docker host filesystem. This PR also lets you build for only a specific distro: see here. Also note that the existing approach is likely to have similar issues, due to things like PARCEL_VERSION and AIRFLOW_VERSION changing between runs. Also note: I am still testing all of this in a lab environment, and I am about to propose a large change to the CSD as well.
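The host-side caching suggested above could be sketched roughly as follows. This is a hedged illustration, not the PR's actual code: the cache path, version variable, and `build_python` stand-in are all assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: cache the compiled Python build on the Docker host
# filesystem so repeated parcel builds skip the expensive compile step.
# CACHE_DIR, PYTHON_VERSION, and build_python are illustrative assumptions.
set -euo pipefail

PYTHON_VERSION="${PYTHON_VERSION:-3.6.8}"
CACHE_DIR="${CACHE_DIR:-$HOME/.cache/airflow-parcel}"

build_python() {
  # Stand-in for the real ./configure && make && make install steps.
  echo "compiling Python $PYTHON_VERSION"
}

ensure_python() {
  local tarball="$CACHE_DIR/python-$PYTHON_VERSION.tar.gz"
  if [ -f "$tarball" ]; then
    echo "cache hit: $tarball"
  else
    mkdir -p "$CACHE_DIR"
    build_python
    : > "$tarball"   # stand-in for tarring up the build output
    echo "cached: $tarball"
  fi
}
```

On the first run `ensure_python` compiles and caches; subsequent runs find the tarball and return immediately.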
The existing approach will reuse cached layers that have not changed. Changing the *_VERSION variables will of course trigger a rebuild of the affected layers, which is what I would expect it to do. What I do not want is to re-run everything just because I am iterating on fixing a problem further down in the Dockerfile (i.e. recompiling Python every time I make changes to …). I can see a use for building only a specific distro, and I am all for compilation optimisations that don't cause support headaches.
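The layer-caching behaviour described above can be sketched in a hypothetical Dockerfile (this is not the repo's actual Dockerfile; the base image, packages, and versions are illustrative): as long as the lines above an edit are unchanged, Docker reuses their cached layers.

```dockerfile
# Hypothetical layout, for illustration only.
FROM centos:7

# Rarely changes: cached after the first build.
RUN yum install -y gcc make openssl-devel

# Changing PYTHON_VERSION invalidates this layer and everything below it...
ARG PYTHON_VERSION=3.6.8
RUN curl -O https://www.python.org/ftp/python/${PYTHON_VERSION}/Python-${PYTHON_VERSION}.tgz \
    && tar xzf Python-${PYTHON_VERSION}.tgz \
    && cd Python-${PYTHON_VERSION} && ./configure && make && make install

# ...but iterating on later steps (e.g. packaging the parcel) reuses the
# expensive compile layer above without re-running it.
COPY build_airflow_parcel.sh /build/
RUN /build/build_airflow_parcel.sh
```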
@razorsedge The main issue with the existing approach is its extreme duplication of commands across the Dockerfiles; almost all of the build commands are the same for every version of Linux. I can make this process cache the Python build locally, as it is the longest part of the process, if that is a concern to you. In my testing the parcels created by this PR work well (with my yet-to-be-finalised changes to the CSD). There is only one change I still need to make: in the current state, the spark-submit commands from Airflow workers will be corrupted by the PYTHONPATH/PYTHONHOME variables in airflow-env.sh. (I will simply remove these.) PS: also note that I have removed much of the complexity with this PR; the two files you should look at are build_airflow_parcel.sh and docker_entrypoint.sh, as they have replaced all other .sh files.
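A minimal sketch of the fix described above: stripping PYTHONHOME and PYTHONPATH from the environment so they cannot corrupt a spark-submit invocation. The wrapper name is hypothetical; the PR's stated plan is simply to remove the variables from airflow-env.sh.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run spark-submit with the parcel's Python
# environment variables stripped, so PYTHONHOME/PYTHONPATH set in
# airflow-env.sh cannot leak into the submitted job.
run_without_python_env() {
  env -u PYTHONHOME -u PYTHONPATH "$@"
}

# Usage (illustrative):
#   run_without_python_env spark-submit --master yarn job.py
```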
@razorsedge To be honest, we should move to Anaconda/Miniconda, as this would remove the need for us to compile Python ourselves (which takes much more time); those binaries are already compiled with optimisations.
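For reference, swapping the compile step for a Miniconda install could look roughly like this. The installer URL, prefix, and surrounding script are illustrative assumptions, not part of this repo; `-b` and `-p` are Miniconda's batch-mode and prefix flags.

```shell
#!/usr/bin/env bash
# Hypothetical alternative to compiling Python: install Miniconda into the
# image instead. URL and prefix are illustrative.
set -euo pipefail

MINICONDA_URL="https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh"
PREFIX=/opt/airflow/miniconda

curl -fsSL -o /tmp/miniconda.sh "$MINICONDA_URL"
bash /tmp/miniconda.sh -b -p "$PREFIX"   # -b: batch (no prompts), -p: install prefix
"$PREFIX/bin/python" --version
```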
Force-pushed from c2ba78d to 29e975c
This is a rework of the build process to be more efficient and work properly.
Highlights:
To consider
--enable-optimizations when building Python, but this does massively increase build times.
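For context, that flag is CPython's profile-guided-optimisation switch, passed at configure time. A typical invocation might look like this (the install prefix is an assumption, and enabling the flag significantly lengthens the compile):

```shell
# Illustrative CPython build with optimisations enabled; prefix is an
# assumption. The PGO pass is what makes the build much slower.
./configure --enable-optimizations --prefix=/opt/python
make -j"$(nproc)"
make install
```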