Minimal standalone PySpark portability example for:
- Docker
- Singularity/Apptainer
- HPC-style environments with remapped runtime UIDs
This repo is intentionally small and comes with a longer writeup on Medium: How You Can Make PySpark Work Across Docker, Singularity, and HPC.
- baseline/: intentionally minimal starting point that reproduces the failure modes
- fixed/: portability-oriented version with runtime user repair, writable runtime paths, and safer Spark defaults
If you want one version to actually reuse, start with fixed.
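To make "runtime user repair" and "writable runtime paths" concrete, here is a minimal Python sketch of the kind of check an entrypoint could run before starting Spark. It is not the code in fixed/. The function names, the candidate directory order, and the `uid-<n>` fallback name are all assumptions made for illustration. `SPARK_LOCAL_DIRS` is a standard Spark environment variable; whether setting these values is sufficient depends on the runtime.

```python
import os
import pwd


def has_passwd_entry(uid, lookup=pwd.getpwuid):
    """Return True if the UID resolves to a user in the passwd database."""
    try:
        lookup(uid)
    except KeyError:
        return False
    return True


def first_writable_dir(candidates):
    """Return the first existing, writable directory, or None if there is none."""
    for path in candidates:
        if path and os.path.isdir(path) and os.access(path, os.W_OK | os.X_OK):
            return path
    return None


def plan_runtime_overrides(uid, environ, lookup=pwd.getpwuid):
    """Compute environment overrides for a possibly remapped runtime UID.

    Hypothetical helper: the variable choices are illustrative assumptions,
    not the repo's actual repair logic.
    """
    overrides = {}
    # A remapped UID often has no passwd entry, so give it a fallback name.
    if not has_passwd_entry(uid, lookup):
        overrides["USER"] = environ.get("USER") or f"uid-{uid}"
    # Prefer node-local scratch, then TMPDIR, then /tmp.
    scratch = first_writable_dir(
        [environ.get("SLURM_TMPDIR"), environ.get("TMPDIR"), "/tmp"]
    )
    if scratch is not None:
        overrides["SPARK_LOCAL_DIRS"] = scratch
    # Point HOME at scratch when the current HOME is missing or read-only.
    home = environ.get("HOME")
    home_ok = bool(home) and os.path.isdir(home) and os.access(home, os.W_OK)
    if not home_ok and scratch is not None:
        overrides["HOME"] = scratch
    return overrides


if __name__ == "__main__":
    print(plan_runtime_overrides(os.getuid(), os.environ))
```

The helper returns a dict rather than mutating `os.environ`, so the decision can be logged or tested before it is applied.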
Build and run with make:
make build-baseline
make build-fixed
make run-baseline
make run-fixed
Local Singularity targets:
make build-singularity-baseline
make build-singularity-fixed
make run-singularity-baseline
make run-singularity-fixed
Useful extra targets:
make run-baseline-random-uid
make run-fixed-random-uid
make run-baseline-udf
make run-fixed-udf
make run-baseline-udtf
make run-fixed-udtf
Docker:
docker build -f baseline/Dockerfile -t simplified-pyspark:baseline .
docker build -f fixed/Dockerfile -t simplified-pyspark:fixed .
docker run --rm simplified-pyspark:baseline
docker run --rm simplified-pyspark:fixed
Local Singularity from local Docker images:
mkdir -p images
singularity build images/simplified-pyspark-baseline.sif \
docker-daemon://simplified-pyspark:baseline
singularity build images/simplified-pyspark-fixed.sif \
docker-daemon://simplified-pyspark:fixed
Run with the local/basic flags used in this repo:
singularity run --no-mount tmp --cleanenv --writable-tmpfs \
images/simplified-pyspark-baseline.sif
singularity run --no-mount tmp --cleanenv --writable-tmpfs \
--env HADOOP_CONF_DIR=/tmp \
--env HADOOP_HOME=/tmp \
--env "JAVA_TOOL_OPTIONS=-Djava.security.auth.login.config= -Dhadoop.security.authentication=simple -Dhadoop.security.authorization=false" \
--bind /etc/passwd:/etc/passwd:ro \
--bind /etc/group:/etc/group:ro \
images/simplified-pyspark-fixed.sif
If the multi-arch images have been published, you can also build from the registry instead of docker-daemon://....
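If you launch the container from a Python script rather than a shell, the second `singularity run` command above can be assembled as an argument list. This is a sketch only: `singularity_run_argv`, `FIXED_ENV`, and `FIXED_BINDS` are hypothetical names, and the flags are copied from the command shown here rather than from any wrapper in the repo. Passing a list avoids shell quoting problems with the space-separated `JAVA_TOOL_OPTIONS` value.

```python
import shlex


def singularity_run_argv(image, env=None, binds=None):
    """Build the argv for `singularity run` with the basic flags used above."""
    argv = ["singularity", "run", "--no-mount", "tmp", "--cleanenv", "--writable-tmpfs"]
    for key, value in (env or {}).items():
        # Each variable is a single argv element, so spaces in the value are safe.
        argv += ["--env", f"{key}={value}"]
    for bind in binds or []:
        argv += ["--bind", bind]
    argv.append(image)
    return argv


# Values copied from the fixed-image command in this README.
FIXED_ENV = {
    "HADOOP_CONF_DIR": "/tmp",
    "HADOOP_HOME": "/tmp",
    "JAVA_TOOL_OPTIONS": (
        "-Djava.security.auth.login.config= "
        "-Dhadoop.security.authentication=simple "
        "-Dhadoop.security.authorization=false"
    ),
}
FIXED_BINDS = ["/etc/passwd:/etc/passwd:ro", "/etc/group:/etc/group:ro"]


if __name__ == "__main__":
    argv = singularity_run_argv("images/simplified-pyspark-fixed.sif", FIXED_ENV, FIXED_BINDS)
    # Print a copy-pasteable shell form; run it with subprocess.run(argv, check=True).
    print(shlex.join(argv))
```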
HPC Singularity commands are intentionally not wrapped in make.
In practice they usually depend on site-specific details such as:
- account
- partition
- walltime
- srun vs sbatch
- cache and temp directory policy
The usual pattern is:
- use srun ... --pty bash for an interactive compute-node shell
- use sbatch for a batch job
- keep APPTAINER_CACHEDIR on shared storage such as $SCRATCH
- keep APPTAINER_TMPDIR on node-local storage such as $SLURM_TMPDIR
Example:
export APPTAINER_CACHEDIR=/scratch/$USER/.apptainer/cache
export APPTAINER_TMPDIR=${SLURM_TMPDIR:-/tmp/$USER-apptainer}
export TMPDIR=$APPTAINER_TMPDIR
For the full debugging story, exact failure messages, and the rationale behind baseline vs fixed, see the Medium article: How You Can Make PySpark Work Across Docker, Singularity, and HPC.
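The three exports in the HPC example can also be computed from Python, for instance in a job launcher script. The sketch below mirrors them one to one. `apptainer_dirs` is a hypothetical helper name, and the `/scratch/$USER` layout is taken from the example above, so adjust it to your site's storage policy.

```python
import os


def apptainer_dirs(environ):
    """Mirror the shell exports: cache on shared storage, temp on node-local storage."""
    user = environ.get("USER", "")
    cache = f"/scratch/{user}/.apptainer/cache"
    # Same as ${SLURM_TMPDIR:-/tmp/$USER-apptainer}: fall back when unset or empty.
    tmp = environ.get("SLURM_TMPDIR") or f"/tmp/{user}-apptainer"
    return {
        "APPTAINER_CACHEDIR": cache,
        "APPTAINER_TMPDIR": tmp,
        "TMPDIR": tmp,
    }


if __name__ == "__main__":
    for key, value in apptainer_dirs(os.environ).items():
        print(f"export {key}={value}")
```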