feat: opt-in Apptainer container support #74

Open
wietzesuijker wants to merge 3 commits into mila-iqia:master from wietzesuijker:feat/apptainer-build

@wietzesuijker

Why

cluv currently syncs a venv to the cluster and copies it into each job's $SLURM_TMPDIR at start. This works well on Mila, but on DRAC it means every dependency must be pre-staged (several clusters have no internet access on compute nodes), the copy adds minutes per job, and the environment can drift between submissions.

Containers flip this: build the environment once into an immutable .sif image, tagged by git SHA, and run jobs directly from it. No copy step, no drift, no internet needed at runtime.

Based on a setup that's been running on Rorqual and Narval in RolnickLab/radar-transferability for a few months.

What

Opt-in, per-cluster container support. Clusters without [container] config are unchanged.

[tool.cluv.clusters.rorqual.container]
deploy_path = "/project/acct/containers"

- `cluv build <cluster>` exports pinned deps from `uv.lock`, builds a .sif on the remote, and deploys it with a git-SHA tag and a `current.sif` symlink.
- `cluv submit` auto-injects `CONTAINER_PATH` when the cluster has container config.
- `scripts/container_job.sh` runs the job via `apptainer exec` instead of venv activation.
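
To illustrate the deploy step (a git-SHA-tagged image plus a `current.sif` symlink), here is a minimal sketch. It is not the code from this PR: `deploy_sif` and its parameters are hypothetical names, and it works on a local directory, whereas `cluv build` performs the deploy on the remote cluster. The `<project>-<sha>.sif` naming follows the second commit message.

```python
import os
import shutil
from pathlib import Path


def deploy_sif(built_sif: Path, deploy_path: Path, project: str, sha: str) -> Path:
    """Copy a built image to <deploy_path>/<project>-<sha>.sif and repoint current.sif.

    Hypothetical helper for illustration only; not part of cluv.
    """
    deploy_path.mkdir(parents=True, exist_ok=True)
    target = deploy_path / f"{project}-{sha}.sif"

    # An image is immutable per SHA, so an already-deployed file is left alone.
    if not target.exists():
        shutil.copy2(built_sif, target)

    # Create the new symlink under a temporary name, then rename it over the
    # old one so readers never observe a missing current.sif.
    tmp_link = deploy_path / f".current.sif.{os.getpid()}"
    if tmp_link.is_symlink():
        tmp_link.unlink()
    tmp_link.symlink_to(target.name)  # relative link, survives moving the directory
    os.replace(tmp_link, deploy_path / "current.sif")
    return target
```

Older tagged images stay in place after a new deploy, so a job pinned to a specific SHA keeps working while `current.sif` moves forward.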

The build includes the GOMAXPROCS=1 workaround for DRAC login nodes, where the pids.max=512 cgroup limit kills Go's default thread spawning.


AI (Claude) supported my development of this PR.

Adds opt-in container support for clusters where venv-copy is
impractical (offline compute nodes, cross-cluster reproducibility).

- `ContainerConfig` in `[tool.cluv.clusters.<name>.container]`
- `cluv build <cluster>` exports pinned deps, builds a .sif on the
  remote, deploys to `deploy_path` with git-SHA tagging
- `cluv submit` auto-injects `CONTAINER_PATH` when container config
  is present
- `container_job.sh` template using `apptainer exec --nv`
- GOMAXPROCS=1 guard for DRAC login-node pids.max cgroup limit
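
The `CONTAINER_PATH` injection described above could look roughly like the sketch below. The `ContainerConfig` here is a stand-in with only the `deploy_path` field shown in the PR description, and `inject_container_path` is a hypothetical name. Two behaviours are assumptions rather than facts from the PR: that the default points at `current.sif`, and that a `CONTAINER_PATH` supplied by the user takes precedence.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ContainerConfig:
    """Stand-in for the PR's ContainerConfig; only deploy_path is modelled."""

    deploy_path: str


def inject_container_path(
    env: dict[str, str], container: ContainerConfig | None
) -> dict[str, str]:
    """Return a copy of env with CONTAINER_PATH set when container config exists.

    Hypothetical helper: clusters without container config are left unchanged,
    and an explicit CONTAINER_PATH from the caller is assumed to win.
    """
    result = dict(env)
    if container is not None:
        default = f"{container.deploy_path.rstrip('/')}/current.sif"
        result.setdefault("CONTAINER_PATH", default)
    return result
```

The three cases line up with the submit tests listed in the follow-up commit: config present, config absent, and an override.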
… tests

- Verify container imports after build, before deploy
- Clean up /tmp/cluv-build on all exit paths (success and failure)
- Reject relative deploy_path at config parse time
- Add /dev/shm bind, MPLCONFIGDIR, TORCHDYNAMO_DISABLE to container_job.sh
- 3 tests for CONTAINER_PATH injection in submit (present, absent, override)
- Test that relative deploy_path raises ValueError
- SIF named <project>-<sha>.sif, not train-<sha>.sif
- container_job.sh: quote all variable paths, bind /project on DRAC
- Simplify verify command (no triple-escape)
- Hint 'uv lock' when uv export --locked fails due to stale lock
- Drop unnecessary TYPE_CHECKING import
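
As a rough illustration of rejecting a relative `deploy_path` at config parse time, a check along these lines would raise `ValueError` before any build is attempted. `validate_deploy_path` is a hypothetical name, and using `PurePosixPath` is an assumption made here because the path refers to the remote cluster's filesystem, not the local machine's.

```python
from pathlib import PurePosixPath


def validate_deploy_path(value: str) -> str:
    """Return value unchanged if it is an absolute POSIX path, else raise ValueError.

    Hypothetical helper for illustration; the PR's actual check may differ.
    """
    if not PurePosixPath(value).is_absolute():
        raise ValueError(f"deploy_path must be an absolute path, got {value!r}")
    return value
```

Failing at parse time means a typo such as `containers` instead of `/project/acct/containers` surfaces immediately, not after a remote build has already run.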