fix(submit): correct the cluv/sbatch contract on DRAC #71
By default sacct formats the State column as fixed-width text with a 10-char cap and a trailing '+' on overflow. The only common state that overflows is OUT_OF_MEMORY (13 chars) → OUT_OF_ME+, so any consumer string-comparing against the full name silently fails. Latent on master because get_job_status is only called from submit_first, which inspects PENDING/RUNNING (both 7 chars). Fix it now anyway: --parsable2 switches sacct to pipe-separated machine output without column-width constraints, so state strings are always full.
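A minimal sketch of reading the state from `--parsable2` output, assuming the query asks for `JobID,State` only. The helper names `parse_sacct_state` and `get_job_state` are hypothetical and are not taken from cluv's `get_job_status`:

```python
import subprocess


def parse_sacct_state(output: str) -> str:
    """Return the State of the first record in `sacct --parsable2 --noheader` output.

    Each record is one pipe-separated line, assumed here to be JobID|State.
    Returns "" when sacct has no record for the job yet.
    """
    for line in output.splitlines():
        fields = line.strip().split("|")
        if len(fields) >= 2 and fields[1]:
            # sacct can report e.g. "CANCELLED by 1234"; keep only the state name.
            return fields[1].split()[0]
    return ""


def get_job_state(job_id: str) -> str:
    """Query sacct for one job (hypothetical helper, not cluv's actual code)."""
    result = subprocess.run(
        ["sacct", "-j", job_id, "--format=JobID,State", "--parsable2", "--noheader"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_sacct_state(result.stdout)
```

Because the fields are split on `|` rather than read from fixed-width columns, a 13-character state such as `OUT_OF_MEMORY` comes back whole.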
…ansient

Master had RUNNING_JOB_STATES = ["PENDING", "RUNNING"] and treated everything else as terminal. A poll that landed in COMPLETING (the post-script cleanup phase: cgroup teardown, log sync; can last seconds to minutes) would exit one step early. Same for less common transient states like SUSPENDED, REQUEUED, RESIZING, STAGE_OUT.

Inverting the predicate is the more defensible shape: enumerate the terminal set (COMPLETED + failure modes) and treat anything else as "keep polling". An unknown state from a future SLURM release now defaults to safe-wait instead of silent-misclassify. The single call site in submit_first gets the inverted check with an empty-string guard so a sacct query that hasn't populated yet doesn't claim the cluster.

Adds TestTerminalJobStates with parametrized terminal/transient/unknown cases as a guard against future state-set drift.
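The inverted predicate could look roughly like the sketch below. The contents of `TERMINAL_JOB_STATES` are an assumption (COMPLETED plus common SLURM failure states), and `is_terminal` and `is_active` are hypothetical names rather than the PR's actual functions:

```python
# Assumed terminal set: COMPLETED plus common SLURM failure states.
# The set actually used in the PR may differ.
TERMINAL_JOB_STATES = frozenset(
    {
        "COMPLETED",
        "FAILED",
        "CANCELLED",
        "TIMEOUT",
        "OUT_OF_MEMORY",
        "NODE_FAIL",
        "BOOT_FAIL",
        "DEADLINE",
    }
)


def is_terminal(state: str) -> bool:
    """True only for states known to be final."""
    return state in TERMINAL_JOB_STATES


def is_active(state: str) -> bool:
    """True while the job should still be polled.

    The empty string (sacct has not populated the record yet) counts as
    neither terminal nor active, so it cannot be mistaken for a live job.
    """
    return state != "" and state not in TERMINAL_JOB_STATES
```

With this shape, a transient state such as `COMPLETING`, or a state name introduced by a later SLURM release, falls on the "keep polling" side by default.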
cluv passes resource requests as env-var prefixes onto the remote
command, e.g.
    ssh rorqual 'bash --login -c "SBATCH_MEM=2G ... sbatch --parsable job.sh"'
On Mila this works — sbatch reads SBATCH_MEM from its environment and
honors it. On DRAC (rorqual, narval, fir, ...) the `bash --login` step
re-sources /etc/profile.d/*.sh, which resets SBATCH_* defaults before
sbatch ever sees them. The user's pyproject.toml ask is silently dropped
and the job runs at the cluster default (e.g. ReqMem=256M).
Fix: translate SBATCH_* keys with a known equivalent flag into CLI flags
on the sbatch command line. Flags are parsed by sbatch directly and
cannot be clobbered by anything the login shell sources afterward; they
work identically on Mila and DRAC.
User-facing pyproject.toml keys are unchanged — only the wire format on
the remote shell changes. Any SBATCH_* key without a known flag (or any
non-SBATCH env var like GIT_COMMIT) falls through as a plain env var.
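As a rough illustration of the translation described above: the three mapping entries below are the ones named in this PR, while the function `build_sbatch_command` and its exact output layout are assumptions made for the sketch, not cluv's real code.

```python
import shlex

# Illustrative subset of the mapping; the real table may contain more keys.
SBATCH_ENV_TO_FLAG = {
    "SBATCH_MEM": "--mem",
    "SBATCH_TIME": "--time",
    "SBATCH_ACCOUNT": "--account",
}


def build_sbatch_command(env: dict[str, str], script: str) -> str:
    """Build the remote sbatch invocation (hypothetical helper).

    Keys with a known flag become CLI flags; everything else stays an
    env-var prefix in front of the command.
    """
    flags = []
    passthrough = []
    for key, value in env.items():
        flag = SBATCH_ENV_TO_FLAG.get(key)
        if flag is not None:
            flags.append(f"{flag}={shlex.quote(value)}")
        else:
            passthrough.append(f"{key}={shlex.quote(value)}")
    return " ".join([*passthrough, "sbatch", "--parsable", *flags, shlex.quote(script)])
```

For example, `build_sbatch_command({"SBATCH_MEM": "2G", "GIT_COMMIT": "abc123"}, "job.sh")` yields a command in which the memory request travels as `--mem=2G` and `GIT_COMMIT` remains an env-var prefix.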
Verified end-to-end with SBATCH_MEM = "2G" and a 5 GiB numpy alloc:
    cluster   ReqMem (pre)   ReqMem (post)   terminal state
    mila      2G             2G              OUT_OF_MEMORY (full)
    rorqual   256M           2G              OUT_OF_MEMORY (full)
Thanks @wietzesuijker! We'll take a look to confirm the issue, then review your PR. :)
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@            Coverage Diff             @@
##           master      #71       +/-   ##
===========================================
- Coverage   60.01%   41.73%   -18.28%
===========================================
  Files          14       14
  Lines        1138     1150      +12
===========================================
- Hits          683      480     -203
- Misses        455      670     +215
By the way, for context: An alternative could also be to not use the For example:
Same for sbatch. I suspect this is something to do with the .profile / .bash_profile / .bashrc setup on these clusters. I am not sure if this is specific to me, or if all researchers will have the same issue. (btw, @obilaniu, I could use some advice on this if you have any).
Today, if you set `SBATCH_MEM = "2G"` in your `pyproject.toml` and `cluv submit` to a DRAC cluster (rorqual, narval, fir, ...), SLURM quietly ignores you and gives the job whatever the cluster default is. `sacct` will show `ReqMem=256M`. There's no warning. Your job just runs with less memory than you asked for. Mila is fine. This is happening on every cluv submit to DRAC.

The reason turns out to be a small mismatch in how cluv hands resource requests to sbatch. cluv builds the remote command like this: `ssh rorqual 'bash --login -c "SBATCH_MEM=2G ... sbatch --parsable job.sh"'`

The idea is that sbatch picks `SBATCH_MEM` out of its environment. That works on Mila. On DRAC, `bash --login` re-sources `/etc/profile.d/*.sh`, which resets `SBATCH_*` defaults a moment before sbatch starts reading. Your value is wiped before it ever gets a chance. Two control runs on rorqual make it concrete:

Same shell, same login, same partition. Only the channel changes. Env vars get clobbered; CLI flags don't.

The fix is to switch channels. Known `SBATCH_*` keys get translated into the matching `sbatch` CLI flags right before invoking sbatch. Flags are parsed by sbatch itself, so the login shell can't touch them. From your side as a user, nothing changes: you keep writing the same keys in your `pyproject.toml`. Only the wire format on the remote shell is different:

- `SBATCH_MEM = "2G"` → `... --mem=2G ...`
- `SBATCH_TIME = "00:05:00"` → `... --time=00:05:00 ...`
- `SBATCH_ACCOUNT = "rrg-foo"` → `... --account=rrg-foo ...`

The full mapping lives in `SBATCH_ENV_TO_FLAG` in `cluv/cli/submit.py`. Anything not in that table (`GIT_COMMIT`, your own custom vars, unusual `SBATCH_*` keys) keeps going through as a plain env var, so your existing setups don't need to change.

Two adjacent bugs in the same file showed up while debugging this and are bundled in. The first: `sacct` formats its `State` column as fixed-width text with a 10-character cap, and `OUT_OF_MEMORY` overflows that to `OUT_OF_ME+`, so any code reading the full string silently misses OOMs. `--parsable2` switches sacct to machine-readable output with no width cap. The second: the wait loop in `submit_first` treats anything outside `["PENDING", "RUNNING"]` as terminal, so a poll that catches a job in `COMPLETING` (the post-script cleanup phase, which can last minutes on busy nodes) exits one step early. Inverted to an explicit terminal-state list, with a small guard for the empty-string case when sacct hasn't populated yet.

Verification

End-to-end smoke with `SBATCH_MEM = "2G"` and a numpy alloc that overshoots:

| cluster | ReqMem (pre) | ReqMem (post) | terminal state |
|---|---|---|---|
| mila | 2G | 2G | `OUT_OF_MEMORY` (full) |
| rorqual | 256M | 2G | `OUT_OF_MEMORY` (full) |

Rorqual is where you can see the bug: pre-fix, the 2G ask got silently downgraded to the cluster default. Mila confirms the wire-format change doesn't break anything where the old env-var path already worked.