
Conversation

@thesuperzapper

This PR reworks the build process to be more efficient and to work correctly.

Highlights:

  • Only use Docker for the build environment; don't build inside the image.
  • Move to Python 3 only (Python 2 support is being dropped by Airflow very soon).
  • Fix issues with incorrect metadata (e.g. airflow_env.sh never being used).
  • Work out of a tmp folder, rather than polluting the git root.
  • Allow the user to specify the distro they want with --dist.
  • Fix licence headers.
  • Remove Debian 7 (EOL).

To consider

  • We should probably pass --enable-optimizations when building Python, but this does massively increase build times.
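For reference, enabling that flag in a CPython source build looks roughly like this. The version, download URL, and install prefix below are illustrative placeholders, not values from this PR:

```shell
# Illustrative only: a typical CPython source build with profile-guided
# optimisations enabled. Version and prefix are placeholders.
wget https://www.python.org/ftp/python/3.7.4/Python-3.7.4.tgz
tar -xzf Python-3.7.4.tgz
cd Python-3.7.4
./configure --prefix=/opt/python3 --enable-optimizations
make -j"$(nproc)"   # PGO runs an instrumented profiling pass, hence the long build
make install
```

The extra time comes from the profiling pass, which is why it is a trade-off rather than an obvious default.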

@thesuperzapper
Author

@razorsedge I see you are also working on changes, perhaps we could discuss this more. (I have added you on LinkedIn)

@thesuperzapper thesuperzapper marked this pull request as ready for review July 28, 2019 22:11
@thesuperzapper thesuperzapper mentioned this pull request Aug 5, 2019
@razorsedge
Contributor

I have finally managed to give this PR a try. Interesting.

I think the main thing I am confused about is the change to not build inside the image. My limited knowledge of using Docker to build software tells me that the caching layers in a Dockerfile can speed up repeated builds by not re-running steps that have not changed. I.e. if I am only changing alternatives.json, why would I need to recompile Python? Every time this PR re-runs the build, it runs through all of the steps.

@thesuperzapper
Author

@razorsedge You are correct that, with this PR, Python would need to be recompiled for any change. However, this is easily fixed by caching the compiled Python binaries on the Docker host filesystem.
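A minimal sketch of what that host-side cache could look like. Everything here is hypothetical: `CACHE_DIR`, `build_python_cached`, and the stubbed `do_compile_python` are illustrative names, not code from this PR:

```shell
#!/usr/bin/env bash
# Sketch: cache the compiled Python tarball on the Docker host so that
# repeated parcel builds skip the slow compile step. All names are
# illustrative; do_compile_python is a stub standing in for the real build.
set -euo pipefail

CACHE_DIR="${CACHE_DIR:-/tmp/parcel-build-cache}"
PYTHON_VERSION="${PYTHON_VERSION:-3.7.4}"

do_compile_python() {
  # Placeholder for the real compile (./configure && make && make install);
  # here we just write a marker file so the sketch is runnable.
  mkdir -p "$1"
  echo "python ${PYTHON_VERSION}" > "$1/python-build.txt"
}

build_python_cached() {
  local build_dir="$1"
  local cache_tar="${CACHE_DIR}/python-${PYTHON_VERSION}.tar.gz"
  mkdir -p "${CACHE_DIR}" "${build_dir}"
  if [[ -f "${cache_tar}" ]]; then
    # Cache hit: unpack the previously compiled binaries.
    tar -xzf "${cache_tar}" -C "${build_dir}"
  else
    # Cache miss: compile once, then store the result for next time.
    do_compile_python "${build_dir}"
    tar -czf "${cache_tar}" -C "${build_dir}" .
  fi
}

build_python_cached /tmp/parcel-build-demo
```

The first run compiles and populates the cache; every later run unpacks the tarball instead, which addresses the iteration concern without relying on Docker layer caching.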

This PR also lets you build for only a specific distro: see here

Also note that the existing approach is likely to have similar issues, due to things like PARCEL_VERSION and AIRFLOW_VERSION changing between runs.

Also note: I am still testing this all out in a lab environment, and am about to propose a large change to the CSD as well.

@razorsedge
Contributor

The existing approach will re-use cached layers that have not changed. Changing the *_VERSION variables will of course trigger a rebuild of the layers that are affected, which is what I would expect it to do. What I do not want is to re-run everything just because I am iterating on a problem further down in the Dockerfile (i.e. recompiling Python every time I make changes to install_airflow.sh).
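The layer-caching behaviour described here depends on instruction order in the Dockerfile. A sketch of the idea (the base image and exact step contents are illustrative, not taken from this repo):

```dockerfile
# Illustrative ordering only: rarely-changing steps go first, so editing
# install_airflow.sh does not invalidate the cached Python compile layer.
FROM centos:7

# 1. Slow, rarely-changing step: compiling Python. Docker caches this layer
#    until install_python.sh itself (or anything above it) changes.
COPY install_python.sh /build/
RUN /build/install_python.sh

# 2. Fast, frequently-edited step: placed last, so iterating on it
#    only re-runs this layer.
COPY install_airflow.sh /build/
RUN /build/install_airflow.sh
```

Any change to a `COPY`'d file invalidates that layer and every layer after it, which is why putting the Python compile early preserves the fast iteration loop.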

I can see a use for building only a specific distro. And I am all for compilation optimizations that don't cause support headaches.

@thesuperzapper
Author

thesuperzapper commented Aug 12, 2019

@razorsedge The main issue with the existing approach is the extreme duplication of commands across the Dockerfiles; almost all of the build commands are the same for every version of Linux.

I can make this process cache the Python build locally, as it is the longest part of the process, if that is a concern for you.

In my testing, the parcels created by this PR work well (with my yet-to-be-finalised changes to the CSD). There is only one change I still need to make: in the current state, the spark-submit commands from Airflow workers will be corrupted by the PYTHONPATH/PYTHONHOME variables in airflow-env.sh. (I will simply remove these.)

PS: also note that I have removed much of the complexity with this PR; e.g. install_python.sh no longer exists.

The two files you should look at are build_airflow_parcel.sh and docker_entrypoint.sh as they have replaced all other .sh files.

@thesuperzapper
Author

@razorsedge To be honest, we should move to Anaconda/Miniconda, as this would remove the need for us to compile Python at all, and those binaries are already compiled with optimisations (which take much more time to build ourselves).
