Currently the Spark distribution / Hadoop libs in the image are installed using conda / pip, which has a few implications:
- Because pip is being used, some parts of the distribution are left out (such as the `start-thriftserver.sh` script).
- The distribution ends up in an odd location, inside the conda directory (`/opt/miniconda3/lib/python3.8/site-packages/pyspark`); see the sketch below.
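For reference, a minimal sketch of what installing the full distribution could look like, assuming Spark 3.1.2 built for Hadoop 3.2, a Debian-based Python base image, and `/opt/spark` as the install location (all of these are assumptions, not what the current Dockerfile does):

```dockerfile
# Install the full Apache Spark distribution instead of `pip install pyspark`,
# so sbin scripts such as start-thriftserver.sh are included and the install
# lands in a predictable location.
FROM python:3.8-slim

ARG SPARK_VERSION=3.1.2
ARG HADOOP_VERSION=3.2

# Java is required to run Spark; curl is only needed to fetch the tarball.
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl ca-certificates openjdk-11-jre-headless \
    && rm -rf /var/lib/apt/lists/*

# Download and unpack the official distribution into /opt/spark.
RUN curl -fsSL "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \
    | tar -xz -C /opt \
    && mv "/opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}" /opt/spark

ENV SPARK_HOME=/opt/spark
ENV PATH="${SPARK_HOME}/bin:${SPARK_HOME}/sbin:${PATH}"
# The PySpark sources ship with the distribution; the py4j zip name should match
# whatever the chosen release bundles.
ENV PYTHONPATH="${SPARK_HOME}/python:${SPARK_HOME}/python/lib/py4j-0.10.9-src.zip"
```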
Other findings:
- Environment variables like `SPARK_HOME` aren't set.
- The root user is being used.
- A multi-stage build could be used to reduce the image size and avoid uninstalling dependencies in the Dockerfile (see the sketch after this list).
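On the multi-stage point, a rough sketch of how the build could be split so that download tooling never reaches the final image, while also setting `SPARK_HOME` and dropping root (the base images, version, and UID are assumptions):

```dockerfile
# Builder stage: fetch and unpack Spark. Nothing from this stage ends up in the
# final image except the /opt/spark directory copied out below.
FROM debian:bullseye-slim AS builder
ARG SPARK_VERSION=3.1.2
ARG HADOOP_VERSION=3.2
RUN apt-get update \
    && apt-get install -y --no-install-recommends curl ca-certificates \
    && curl -fsSL "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \
        | tar -xz -C /opt \
    && mv "/opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}" /opt/spark

# Runtime stage: just a JRE plus the unpacked distribution, so there is nothing
# to uninstall afterwards.
FROM eclipse-temurin:11-jre
COPY --from=builder /opt/spark /opt/spark

# Set the variables the current image is missing.
ENV SPARK_HOME=/opt/spark
ENV PATH="${SPARK_HOME}/bin:${SPARK_HOME}/sbin:${PATH}"

# Run as a dedicated non-root user instead of root.
RUN useradd --uid 1001 --create-home spark
USER spark
WORKDIR /home/spark
```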
It might also be an idea to use a Spark base image, like https://github.com/bitnami/bitnami-docker-spark, which improves on all of these points.
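If that base image were adopted, the Dockerfile could shrink to roughly the following; the tag, the availability of pip inside the base image, and the requirements.txt step are assumptions about this project:

```dockerfile
# bitnami/spark already ships the full distribution, sets SPARK_HOME and runs as
# a non-root user (UID 1001) by default.
FROM bitnami/spark:3.1.2

# Temporarily become root to layer project-specific Python packages on top,
# then drop back to the image's default non-root user.
USER root
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt
USER 1001
```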