Conversation
MichaelSimons
left a comment
I mainly looked at the dotnet-interactive/Dockerfile. I left a few suggestions/questions that may help simplify the Dockerfile.
What is the usage in which this ARG would be specified when building? I am not seeing how this would work if a different user were specified. By this I mean I don't see the user being defined/added. The base image defines the "jovyan" user, so why would you want to define something different? The same applies to NB_UID.
Is this necessary? The base image is already defining ENV DEBIAN_FRONTEND=noninteractive
Related to a previous comment, are these USER-related ENVs necessary, since they are already defined in the base image?
Since this is being proposed as part of the .NET project, telemetry should remain enabled.
What is requiring all of these dependencies? Ones that are already provided by the base image don't seem necessary to include.
Does this remove anything?
This shouldn't be necessary because of the next line.
This doesn't seem necessary.
Per the Dockerfile best practices, it helps readability to break this apart, one package per line, and alphabetize them.
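For illustration, the one-package-per-line layout suggested here would look something like this (the package set shown is abbreviated and purely illustrative):

```dockerfile
# Sketch only: packages sorted alphabetically, one per line,
# so additions and duplicates are easy to spot in a diff.
RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        ca-certificates \
        openjdk-8-jdk \
        supervisor \
        unzip \
        wget \
    && rm -rf /var/lib/apt/lists/*
```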
@MichaelSimons, thank you very much for reviewing the Dockerfile and your valuable comments. I've just updated the PR to reflect the changes required for dotnet.spark version 1.0.0. This should also include an updated version of the Dockerfile, addressing your comments.
MichaelSimons
left a comment
@indy-3rdman, Thanks for addressing my comments. I took another look and had a few more suggestions and questions.
&& apt-get install -y --no-install-recommends \
    dialog apt-utils wget ca-certificates openjdk-8-jdk bash software-properties-common supervisor unzip socat net-tools vim \
    libc6 libgcc1 libgssapi-krb5-2 libicu60 libssl1.1 libstdc++6 zlib1g \
    apt-utils \
What is requiring all of these native dependencies? Several are already provided by the base image, so they don't seem necessary to declare.
This should be cleaned up now. Java, obviously, is required by Spark.
Multiple packages are listed together; they should get split apart so that zlib1g is not overlooked.
Should be on a separate line as well now.
Consider: I've typically seen Dockerfiles avoid using --show-progress, as it does have a perf impact.
That raises an interesting point about the purpose of the Dockerfile(s). As far as I am aware, the focus at the moment is to enable a user to build the image(s) themselves, rather than automating the image build process. That's why I thought it would be useful to show the download progress. For small downloads it doesn't really matter, so I have removed it. However, I have added the following line to the dotnet-spark/Dockerfile
&& echo "\nDownloading spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz ..." \
as the spark download can take a while. Does that make sense?
This should at some point perform checksum validation. @JeremyLikness, are checksums published for the Spark artifacts?
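For reference, checksum validation could be sketched roughly like this (assuming a SHA-512 value is passed in as a build arg; Apache publishes .sha512 files alongside the Spark release artifacts on archive.apache.org):

```dockerfile
# Hypothetical sketch: SPARK_SHA512 must be supplied at build time,
# e.g. taken from the .sha512 file published next to the tarball.
ARG SPARK_SHA512
RUN wget -q https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
    && echo "${SPARK_SHA512}  spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" | sha512sum -c - \
    && tar -xzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
```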
Have you considered extracting the tarball to the /dotnet folder instead of extracting it to the working dir and then immediately mv it?
Should be changed in the latest commit.
I'm curious: why is this necessary?
Sorry, this is just a leftover from another file. Removed.
I see the HelloSpark project is used to install the correct microsoft-spark-*.jar version that is required to start a spark-submit session in debug mode. Is it necessary to have the project in the resulting image? It feels a little strange to have a sample project in a "base" image.
Moved from the dotnet-spark-base to the dotnet-spark Dockerfile. Additionally, it is now removed after the jar file has been copied.
Have you considered using the multi-line ENV format? I find that it can help the readability of the Dockerfiles by letting the reader easily scan them. It makes it more obvious that these are all ENV instructions.
ENV DAEMON_RUN=true \
DOTNETBACKEND_PORT=5567 \
HADOOP_VERSION=2.7 \
JUPYTER_ENABLE_LAB=true \
SPARK_VERSION=$SPARK_VERSION \
SPARK_HOME=/spark \
PATH="${SPARK_HOME}/bin:${DOTNET_WORKER_DIR}:${PATH}"
ARG DOTNET_SPARK_VERSION=0.12.1
ENV DOTNET_SPARK_VERSION=$DOTNET_SPARK_VERSION
ENV DOTNET_WORKER_DIR=/dotnet/Microsoft.Spark.Worker-${DOTNET_SPARK_VERSION}
ARG SPARK_VERSION=3.0.1
Having version numbers hard coded like this gives me pause. Is this done so that the Dockerfile, as checked in, is buildable without having to specify any args? The problem that introduces is the maintenance burden of keeping it up to date.
This is related to my earlier point about the purpose of the Dockerfile(s). The intention was to have a buildable Dockerfile even if the build script is not used. I agree with your observation about maintenance. Maybe @rapoth has a view on that.
readonly image_repository='3rdman'
readonly supported_apache_spark_versions=("2.3.3" "2.3.4" "2.4.0" "2.4.1" "2.4.3" "2.4.4" "2.4.5" "2.4.6")
readonly supported_dotnet_spark_versions=("0.9.0" "0.10.0" "0.11.0" "0.12.1")
readonly supported_apache_spark_versions=(
This may be a question for the Spark team. Thoughts on how to keep this version list, and the other versions included in this script, up to date? It feels like there should be long-term plans for getting this updated "automatically" as part of the release process. Without this, they will become stale and/or be a maintenance burden.
ENV PATH="${PATH}:${HOME}/.dotnet/tools"

ENV DOTNET_RUNNING_IN_CONTAINER=true \
ENV DOTNET_CORE_VERSION=$DOTNET_CORE_VERSION \
Per the Dockerfile Best Practices, sort multi-line instructions to improve readability where possible (e.g. barring cross dependencies).
&& cd / \
&& echo "\nDownloading spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz ..." \
&& wget -q https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
&& tar -xvzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
You can extract to the spark directory with a single instruction which would eliminate the need for mv
I assume you mean to use tar with --directory. But wouldn't that require that the directory already exists? In that case I'd have to add a mkdir first.
You're correct, I missed what was happening here. Please ignore my comment.
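For completeness, the mkdir-plus---directory variant touched on above might look roughly like this (a sketch; --strip-components=1 drops the versioned top-level folder so no mv is needed):

```dockerfile
# Sketch only: create the target dir, extract directly into it,
# then remove the tarball in the same layer.
RUN mkdir -p /spark \
    && tar -xzf spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz \
        --directory /spark --strip-components=1 \
    && rm spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz
```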
The unfortunate consequence of this pattern is that HelloSpark remains in the image as a result of obtaining it via COPY. This is not desirable. Is there a way this can be generated during the Docker build, or can it be a published tarball so that it can get copied and deleted within a single Dockerfile instruction?
Thanks again @MichaelSimons for your great feedback! Just creating a dummy project during the build process now.
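Generating a dummy project during the build, instead of COPYing HelloSpark in, could be sketched roughly like this (the project path and the location of the jar inside the restored NuGet package are assumptions, not verified against the actual Dockerfile):

```dockerfile
# Hypothetical sketch: create a throwaway project only to restore the
# Microsoft.Spark package, copy the worker jar out, then delete everything
# within the same RUN so nothing lingers in the image layer.
RUN dotnet new console -o /tmp/dummy \
    && cd /tmp/dummy \
    && dotnet add package Microsoft.Spark --version ${DOTNET_SPARK_VERSION} \
    && cp ${HOME}/.nuget/packages/microsoft.spark/${DOTNET_SPARK_VERSION}/jars/microsoft-spark-*.jar /dotnet/ \
    && rm -rf /tmp/dummy
```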
This PR contains a build script, Dockerfile(s), README.md and supporting files to create a docker image that can run .NET for Apache Spark in an interactive Jupyter notebook.
An initial description for the interactive image, along with the folder structure can be found here: https://github.com/indy-3rdman/spark/tree/interactive-docker-image/docker/images/interactive