fix(submit): correct the cluv/sbatch contract on DRAC by wietzesuijker · Pull Request #71 · mila-iqia/cluv

wietzesuijker · 2026-05-21T21:09:16Z

Today, if you set SBATCH_MEM = "2G" in your pyproject.toml and cluv submit to a DRAC cluster (rorqual, narval, fir, ...), SLURM quietly ignores you and gives the job whatever the cluster default is. sacct will show ReqMem=256M. There's no warning. Your job just runs with less memory than you asked for. Mila is fine. This is happening on every cluv submit to DRAC.

The reason turns out to be a small mismatch in how cluv hands resource requests to sbatch. cluv builds the remote command like this:

ssh rorqual 'bash --login -c "SBATCH_MEM=2G ... sbatch --parsable job.sh"'

The idea is that sbatch picks SBATCH_MEM out of its environment. That works on Mila. On DRAC, bash --login re-sources /etc/profile.d/*.sh, which resets SBATCH_* defaults a moment before sbatch starts reading. Your value is wiped before it ever gets a chance. Two control runs on rorqual make it concrete:

$ ssh rorqual 'bash --login -c "SBATCH_MEM=2G sbatch --wrap=\"sleep 30\""'  =>  ReqMem=256M
$ ssh rorqual 'bash --login -c "sbatch --mem=2G --wrap=\"sleep 30\""'        =>  ReqMem=2G

Same shell, same login, same partition. Only the channel changes. Env vars get clobbered; CLI flags don't.

The fix is to switch channels. Known SBATCH_* keys get translated into the matching sbatch CLI flags right before invoking sbatch. Flags are parsed by sbatch itself, so the login shell can't touch them. From your side as a user, nothing changes: you keep writing the same keys in your pyproject.toml. Only the wire format on the remote shell is different:

pyproject.toml (unchanged)	wire format
`SBATCH_MEM = "2G"`	`... --mem=2G ...`
`SBATCH_TIME = "00:05:00"`	`... --time=00:05:00 ...`
`SBATCH_ACCOUNT = "rrg-foo"`	`... --account=rrg-foo ...`

The full mapping lives in SBATCH_ENV_TO_FLAG in cluv/cli/submit.py. Anything not in that table (GIT_COMMIT, your own custom vars, unusual SBATCH_* keys) keeps going through as a plain env var, so your existing setups don't need to change.

Two adjacent bugs in the same file showed up while debugging this and are bundled in. The first: sacct formats its State column as fixed-width text with a 10-character cap, and OUT_OF_MEMORY overflows that to OUT_OF_ME+, so any code reading the full string silently misses OOMs. --parsable2 switches sacct to machine-readable output with no width cap. The second: the wait loop in submit_first treats anything outside ["PENDING", "RUNNING"] as terminal, so a poll that catches a job in COMPLETING (the post-script cleanup phase, which can last minutes on busy nodes) exits one step early. Inverted to an explicit terminal-state list, with a small guard for the empty-string case when sacct hasn't populated yet.

Verification

End-to-end smoke with SBATCH_MEM = "2G" and a numpy alloc that overshoots:

cluster	ReqMem (before)	ReqMem (after)	terminal state
Mila	2G	2G	`OUT_OF_MEMORY` (full)
Rorqual	256M	2G	`OUT_OF_MEMORY` (full)

Rorqual is where you can see the bug: pre-fix, the 2G ask got silently downgraded to the cluster default. Mila confirms the wire-format change doesn't break anything where the old env-var path already worked.

By default sacct formats the State column as fixed-width text with a 10-char cap and a trailing '+' on overflow. The only common state that overflows is OUT_OF_MEMORY (13 chars) → OUT_OF_ME+, so any consumer string-comparing against the full name silently fails. Latent on master because get_job_status is only called from submit_first, which inspects PENDING/RUNNING (both 7 chars). Fix it now anyway: --parsable2 switches sacct to pipe-separated machine output without column-width constraints, so state strings are always full.

…ansient Master had RUNNING_JOB_STATES = ["PENDING", "RUNNING"] and treated everything else as terminal. A poll that landed in COMPLETING (the post-script cleanup phase — cgroup teardown, log sync; can last seconds to minutes) would exit one step early. Same for less common transient states like SUSPENDED, REQUEUED, RESIZING, STAGE_OUT. Inverting the predicate is the more defensible shape: enumerate the terminal set (COMPLETED + failure modes) and treat anything else as "keep polling". An unknown state from a future SLURM release now defaults to safe-wait instead of silent-misclassify. The single call site in submit_first gets the inverted check with an empty-string guard so a sacct query that hasn't populated yet doesn't claim the cluster. Adds TestTerminalJobStates with parametrized terminal/transient/unknown cases as a guard against future state-set drift.

cluv passes resource requests as env-var prefixes onto the remote command, e.g. ssh rorqual 'bash --login -c "SBATCH_MEM=2G ... sbatch --parsable job.sh"' On Mila this works — sbatch reads SBATCH_MEM from its environment and honors it. On DRAC (rorqual, narval, fir, ...) the `bash --login` step re-sources /etc/profile.d/*.sh, which resets SBATCH_* defaults before sbatch ever sees them. The user's pyproject.toml ask is silently dropped and the job runs at the cluster default (e.g. ReqMem=256M). Fix: translate SBATCH_* keys with a known equivalent flag into CLI flags on the sbatch command line. Flags are parsed by sbatch directly and cannot be clobbered by anything the login shell sources afterward; they work identically on Mila and DRAC. User-facing pyproject.toml keys are unchanged — only the wire format on the remote shell changes. Any SBATCH_* key without a known flag (or any non-SBATCH env var like GIT_COMMIT) falls through as a plain env var. Verified end-to-end with SBATCH_MEM = "2G" and a 5 GiB numpy alloc: cluster ReqMem (pre) ReqMem (post) terminal state mila 2G 2G OUT_OF_MEMORY (full) rorqual 256M 2G OUT_OF_MEMORY (full)

lebrice · 2026-05-24T16:37:01Z

Thanks @wietzesuijker ! We'll take a look to confirm the issue, then review your PR. :)

codecov-commenter · 2026-05-24T16:44:39Z

Codecov Report

❌ Patch coverage is 87.50000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.73%. Comparing base (c62c3c3) to head (e898b0e).

Files with missing lines	Patch %	Lines
cluv/cli/submit.py	87.50%	2 Missing ⚠️

Additional details and impacted files

@@             Coverage Diff             @@
##           master      #71       +/-   ##
===========================================
- Coverage   60.01%   41.73%   -18.28%     
===========================================
  Files          14       14               
  Lines        1138     1150       +12     
===========================================
- Hits          683      480      -203     
- Misses        455      670      +215

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

lebrice · 2026-05-26T14:44:36Z

By the way, for context: An alternative could also be to not use the --login flag of bash when running the sbatch commands over SSH. The reason I used it was because on some of the clusters, the uv and sbatch commands were not available with non-interactive shells.

For example:

ssh killarney which uv does not work for me
ssh killarney 'bash --login -c "which uv"' works

Same for sbatch.

I suspect this is something to do with the .profile / .bash_profile / .bashrc setup on these clusters. I am not sure if this is specific to me, or if all researchers will have the same issue.

(btw, @obilaniu , I could use some advice on this if you have any).

wietzesuijker added 3 commits May 21, 2026 16:51

wietzesuijker mentioned this pull request May 22, 2026

feat(submit): OOM-aware resubmit + learned memory estimator #73

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(submit): correct the cluv/sbatch contract on DRAC#71

fix(submit): correct the cluv/sbatch contract on DRAC#71
wietzesuijker wants to merge 3 commits into
mila-iqia:masterfrom
wietzesuijker:fix/cluv-sbatch-contract

wietzesuijker commented May 21, 2026

Uh oh!

lebrice commented May 24, 2026

Uh oh!

codecov-commenter commented May 24, 2026

Uh oh!

lebrice commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

wietzesuijker commented May 21, 2026

Verification

Uh oh!

lebrice commented May 24, 2026

Uh oh!

codecov-commenter commented May 24, 2026

Codecov Report

Uh oh!

lebrice commented May 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants