feat: opt-in Apptainer container support #74
Open
wietzesuijker wants to merge 3 commits into
Adds opt-in container support for clusters where venv-copy is impractical (offline compute nodes, cross-cluster reproducibility).

- `ContainerConfig` in `[tool.cluv.clusters.<name>.container]`
- `cluv build <cluster>` exports pinned deps, builds a .sif on the remote, deploys to `deploy_path` with git-SHA tagging
- `cluv submit` auto-injects `CONTAINER_PATH` when container config is present
- `container_job.sh` template using `apptainer exec --nv`
- GOMAXPROCS=1 guard for DRAC login-node pids.max cgroup limit

… tests

- Verify container imports after build, before deploy
- Clean up /tmp/cluv-build on all exit paths (success and failure)
- Reject relative deploy_path at config parse time
- Add /dev/shm bind, MPLCONFIGDIR, TORCHDYNAMO_DISABLE to container_job.sh
- 3 tests for CONTAINER_PATH injection in submit (present, absent, override)
- Test that relative deploy_path raises ValueError

- SIF named <project>-<sha>.sif, not train-<sha>.sif
- container_job.sh: quote all variable paths, bind /project on DRAC
- Simplify verify command (no triple-escape)
- Hint 'uv lock' when uv export --locked fails due to stale lock
- Drop unnecessary TYPE_CHECKING import
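The `deploy_path` validation and the `<project>-<sha>.sif` naming mentioned in the commits above can be illustrated with a minimal sketch. This is not the PR's code: the single-field dataclass and the `sif_name` helper are hypothetical stand-ins, and only the `ContainerConfig` name, the `deploy_path` key, the `ValueError`, and the naming pattern come from the commit messages.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ContainerConfig:
    """Hypothetical stand-in; the real class likely has more fields."""

    deploy_path: str

    def __post_init__(self) -> None:
        # Reject relative paths at parse time: a relative deploy_path would
        # resolve against whatever directory the remote shell starts in.
        if not PurePosixPath(self.deploy_path).is_absolute():
            raise ValueError(
                f"deploy_path must be absolute, got {self.deploy_path!r}"
            )


def sif_name(project: str, sha: str) -> str:
    """Build the image file name as <project>-<sha>.sif (hypothetical helper)."""
    return f"{project}-{sha}.sif"
```

Validating in `__post_init__` means a bad path fails when the config is loaded, before any remote build is started.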
Why
cluv currently syncs a venv to the cluster and copies it into each job's `$SLURM_TMPDIR` at start. This works well on Mila, but on DRAC it means every dependency must be pre-staged (several clusters have no internet on compute), the copy adds minutes per job, and the environment can drift between submissions.

Containers flip this: build the environment once into an immutable `.sif` image, tagged by git SHA, and run jobs directly from it. No copy step, no drift, no internet needed at runtime.

Based on a setup that's been running on Rorqual and Narval in RolnickLab/radar-transferability for a few months.
What
Opt-in, per-cluster container support. Clusters without `[container]` config are unchanged.

- `cluv build <cluster>` exports pinned deps from `uv.lock`, builds a `.sif` on the remote, deploys with a git-SHA tag and `current.sif` symlink.
- `cluv submit` auto-injects `CONTAINER_PATH` when the cluster has container config.
- `scripts/container_job.sh` runs via `apptainer exec` instead of venv activation.

Build includes the `GOMAXPROCS=1` workaround for DRAC login nodes (pids.max=512 cgroup limit kills Go's default thread spawning).

AI (Claude) supported my development of this PR.
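For reviewers, the `CONTAINER_PATH` injection in `cluv submit` can be pictured with the rough sketch below. It rests on assumptions: the function name `inject_container_path` is hypothetical, the default is assumed to point at the `current.sif` symlink under `deploy_path`, and a `CONTAINER_PATH` already set by the user is assumed to win. The actual behaviour is defined by the diff and its three tests (present, absent, override).

```python
from pathlib import PurePosixPath
from typing import Dict, Mapping, Optional


def inject_container_path(
    env: Mapping[str, str], deploy_path: Optional[str]
) -> Dict[str, str]:
    """Return a copy of the job environment with CONTAINER_PATH filled in.

    Hypothetical helper: deploy_path is None when the cluster has no
    container config, in which case the environment is left unchanged.
    """
    result = dict(env)
    if deploy_path is None:
        return result
    # Assumption: default to the current.sif symlink, and keep any
    # CONTAINER_PATH the user has already set (e.g. to pin an older SHA).
    default = str(PurePosixPath(deploy_path) / "current.sif")
    result.setdefault("CONTAINER_PATH", default)
    return result
```

Under these assumptions, pinning a job to a specific image only requires setting `CONTAINER_PATH` before submitting; clusters without container config see no new variable.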