simplified-pyspark

Minimal standalone PySpark portability example for:

Docker
Singularity/Apptainer
HPC-style environments with remapped runtime UIDs

This repo is intentionally small and comes with a longer writeup on Medium: How You Can Make PySpark Work Across Docker, Singularity, and HPC.

Variants

baseline/: intentionally minimal starting point that reproduces failure modes
fixed/: portability-oriented version with runtime user repair, writable runtime paths, and safer Spark defaults

If you want one version to actually reuse, start with fixed/.
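Two of the repairs listed for fixed/ (runtime user repair and writable runtime paths) can be sketched as a small entrypoint fragment. This is a minimal sketch under stated assumptions, not the code that ships in fixed/: the helper names runtime_user_name and first_writable_dir are hypothetical, and the choice of candidate directories is a guess. SPARK_LOCAL_DIRS is the standard Spark environment variable for scratch space.

```shell
#!/bin/sh
# Hypothetical entrypoint helpers; function names are illustrative
# and not taken from fixed/.

# Print a usable name even when the current UID has no passwd entry.
runtime_user_name() {
    # id -un exits non-zero when the UID cannot be resolved to a name.
    name=$(id -un 2>/dev/null) || name=""
    [ -n "$name" ] || name="uid$(id -u)"
    printf '%s\n' "$name"
}

# Print the first candidate that exists (or can be created) and is writable.
first_writable_dir() {
    for dir in "$@"; do
        [ -n "$dir" ] || continue
        if mkdir -p "$dir" 2>/dev/null && [ -w "$dir" ]; then
            printf '%s\n' "$dir"
            return 0
        fi
    done
    return 1
}

run_user=$(runtime_user_name)
if scratch=$(first_writable_dir "${SLURM_TMPDIR:-}" "${TMPDIR:-}" /tmp); then
    # Point Spark's shuffle/spill space at a directory this UID can write to.
    export SPARK_LOCAL_DIRS="$scratch/spark-$run_user"
    mkdir -p "$SPARK_LOCAL_DIRS"
    # Give tools that write under $HOME somewhere usable (assumed fallback).
    if [ ! -w "${HOME:-/nonexistent}" ]; then
        export HOME="$SPARK_LOCAL_DIRS"
    fi
else
    echo "no writable scratch directory found" >&2
fi
```

Whether this is enough for a given Spark and Hadoop version is not guaranteed; the flags shown for the fixed image under Without Make suggest that user lookup inside the JVM can need additional handling.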

Quickstart

Build and run with make:

make build-baseline
make build-fixed

make run-baseline
make run-fixed

Local Singularity targets:

make build-singularity-baseline
make build-singularity-fixed

make run-singularity-baseline
make run-singularity-fixed

Useful extra targets:

make run-baseline-random-uid
make run-fixed-random-uid
make run-baseline-udf
make run-fixed-udf
make run-baseline-udtf
make run-fixed-udtf
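The random-uid targets presumably emulate the remapped runtime UIDs seen on HPC systems. The Makefile recipe itself is not shown here, so the following is only a sketch of one way to do it by hand, assuming Docker's --user flag and a UID range chosen to be unlikely to exist in the image's /etc/passwd. The random_uid helper and the IMAGE and DRY_RUN variables are hypothetical.

```shell
#!/bin/sh
# Sketch only: the Makefile's real run-*-random-uid recipe is not shown
# in this README.

# Pick a UID well above typical system and login accounts, so it is
# unlikely to have a passwd entry inside the image.
random_uid() {
    awk 'BEGIN { srand(); print 10000 + int(rand() * 50000) }'
}

image="${IMAGE:-simplified-pyspark:fixed}"
uid=$(random_uid)

# --user overrides the UID:GID the container process runs as.
set -- docker run --rm --user "$uid:$uid" "$image"

if [ "${DRY_RUN:-1}" = "1" ]; then
    # Default: print the command instead of running it.
    echo "$@"
else
    "$@"
fi
```

Set DRY_RUN=0 to actually start the container, and IMAGE=simplified-pyspark:baseline to compare against the baseline variant.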

Without Make

Docker:

docker build -f baseline/Dockerfile -t simplified-pyspark:baseline .
docker build -f fixed/Dockerfile -t simplified-pyspark:fixed .

docker run --rm simplified-pyspark:baseline
docker run --rm simplified-pyspark:fixed

Local Singularity from local Docker images:

mkdir -p images

singularity build images/simplified-pyspark-baseline.sif \
  docker-daemon://simplified-pyspark:baseline

singularity build images/simplified-pyspark-fixed.sif \
  docker-daemon://simplified-pyspark:fixed

Run with the local/basic flags used in this repo:

singularity run --no-mount tmp --cleanenv --writable-tmpfs \
  images/simplified-pyspark-baseline.sif

singularity run --no-mount tmp --cleanenv --writable-tmpfs \
  --env HADOOP_CONF_DIR=/tmp \
  --env HADOOP_HOME=/tmp \
  --env "JAVA_TOOL_OPTIONS=-Djava.security.auth.login.config= -Dhadoop.security.authentication=simple -Dhadoop.security.authorization=false" \
  --bind /etc/passwd:/etc/passwd:ro \
  --bind /etc/group:/etc/group:ro \
  images/simplified-pyspark-fixed.sif

If the multi-arch images have been published, you can also build from the registry instead of docker-daemon://....

HPC Note

HPC Singularity commands are intentionally not wrapped in make.

In practice they usually depend on site-specific details such as:

account
partition
walltime
srun vs sbatch
cache and temp directory policy

The usual pattern is:

use srun ... --pty bash for an interactive compute-node shell
use sbatch for a batch job
keep APPTAINER_CACHEDIR on shared storage such as $SCRATCH
keep APPTAINER_TMPDIR on node-local storage such as $SLURM_TMPDIR

Example:

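export APPTAINER_CACHEDIR="/scratch/$USER/.apptainer/cache"
export APPTAINER_TMPDIR="${SLURM_TMPDIR:-/tmp/$USER-apptainer}"
export TMPDIR="$APPTAINER_TMPDIR"
# Create the directories up front; the /tmp fallback in particular will not exist yet.
mkdir -p "$APPTAINER_CACHEDIR" "$APPTAINER_TMPDIR"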
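Put together, the pattern above might look like the following batch script. This is a sketch under stated assumptions rather than a tested site recipe: the account, partition, and walltime values are placeholders, the IMAGE variable is hypothetical, and the cache fallback used when $SCRATCH is unset exists only so the script can be tried off-cluster.

```shell
#!/bin/sh
# Placeholder values below; replace them with your site's account,
# partition, and walltime.
#SBATCH --account=my-account
#SBATCH --partition=my-partition
#SBATCH --time=00:30:00

# Node-local temp space, with a fallback for runs outside Slurm.
export APPTAINER_TMPDIR="${SLURM_TMPDIR:-/tmp/${USER:-uid$(id -u)}-apptainer}"
export TMPDIR="$APPTAINER_TMPDIR"

# Shared cache on $SCRATCH; the fallback is only for off-cluster testing.
export APPTAINER_CACHEDIR="${SCRATCH:-$APPTAINER_TMPDIR}/.apptainer/cache"

mkdir -p "$APPTAINER_TMPDIR" "$APPTAINER_CACHEDIR"

image="${IMAGE:-images/simplified-pyspark-fixed.sif}"

if command -v apptainer >/dev/null 2>&1 && [ -f "$image" ]; then
    # Same basic flags as the local Singularity run in this README.
    apptainer run --no-mount tmp --cleanenv --writable-tmpfs "$image"
else
    echo "apptainer or $image not available; skipping run" >&2
fi
```

Submit it with sbatch, or paste the body into a shell obtained through srun ... --pty bash. The extra --env and --bind flags shown for the fixed image under Without Make can be appended to the apptainer run line as needed.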

For the full debugging story, exact failure messages, and the rationale behind baseline vs fixed, see the Medium article: How You Can Make PySpark Work Across Docker, Singularity, and HPC.
