Skip to content

fix(submit): correct the cluv/sbatch contract on DRAC#71

Open
wietzesuijker wants to merge 3 commits into
mila-iqia:masterfrom
wietzesuijker:fix/cluv-sbatch-contract
Open

fix(submit): correct the cluv/sbatch contract on DRAC#71
wietzesuijker wants to merge 3 commits into
mila-iqia:masterfrom
wietzesuijker:fix/cluv-sbatch-contract

Conversation

@wietzesuijker
Copy link
Copy Markdown

Today, if you set SBATCH_MEM = "2G" in your pyproject.toml and cluv submit to a DRAC cluster (rorqual, narval, fir, ...), SLURM quietly ignores you and gives the job whatever the cluster default is. sacct will show ReqMem=256M. There's no warning. Your job just runs with less memory than you asked for. Mila is fine. This is happening on every cluv submit to DRAC.

The reason turns out to be a small mismatch in how cluv hands resource requests to sbatch. cluv builds the remote command like this:

ssh rorqual 'bash --login -c "SBATCH_MEM=2G ... sbatch --parsable job.sh"'

The idea is that sbatch picks SBATCH_MEM out of its environment. That works on Mila. On DRAC, bash --login re-sources /etc/profile.d/*.sh, which resets SBATCH_* defaults a moment before sbatch starts reading. Your value is wiped before it ever gets a chance. Two control runs on rorqual make it concrete:

$ ssh rorqual 'bash --login -c "SBATCH_MEM=2G sbatch --wrap=\"sleep 30\""'  =>  ReqMem=256M
$ ssh rorqual 'bash --login -c "sbatch --mem=2G --wrap=\"sleep 30\""'        =>  ReqMem=2G

Same shell, same login, same partition. Only the channel changes. Env vars get clobbered; CLI flags don't.

The fix is to switch channels. Known SBATCH_* keys get translated into the matching sbatch CLI flags right before invoking sbatch. Flags are parsed by sbatch itself, so the login shell can't touch them. From your side as a user, nothing changes: you keep writing the same keys in your pyproject.toml. Only the wire format on the remote shell is different:

pyproject.toml (unchanged) wire format
SBATCH_MEM = "2G" ... --mem=2G ...
SBATCH_TIME = "00:05:00" ... --time=00:05:00 ...
SBATCH_ACCOUNT = "rrg-foo" ... --account=rrg-foo ...

The full mapping lives in SBATCH_ENV_TO_FLAG in cluv/cli/submit.py. Anything not in that table (GIT_COMMIT, your own custom vars, unusual SBATCH_* keys) keeps going through as a plain env var, so your existing setups don't need to change.

Two adjacent bugs in the same file showed up while debugging this and are bundled in. The first: sacct formats its State column as fixed-width text with a 10-character cap, and OUT_OF_MEMORY overflows that to OUT_OF_ME+, so any code reading the full string silently misses OOMs. --parsable2 switches sacct to machine-readable output with no width cap. The second: the wait loop in submit_first treats anything outside ["PENDING", "RUNNING"] as terminal, so a poll that catches a job in COMPLETING (the post-script cleanup phase, which can last minutes on busy nodes) exits one step early. Inverted to an explicit terminal-state list, with a small guard for the empty-string case when sacct hasn't populated yet.

Verification

End-to-end smoke with SBATCH_MEM = "2G" and a numpy alloc that overshoots:

cluster ReqMem (before) ReqMem (after) terminal state
Mila 2G 2G OUT_OF_MEMORY (full)
Rorqual 256M 2G OUT_OF_MEMORY (full)

Rorqual is where you can see the bug: pre-fix, the 2G ask got silently downgraded to the cluster default. Mila confirms the wire-format change doesn't break anything where the old env-var path already worked.

By default sacct formats the State column as fixed-width text with a
10-char cap and a trailing '+' on overflow. The only common state that
overflows is OUT_OF_MEMORY (13 chars) → OUT_OF_ME+, so any consumer
string-comparing against the full name silently fails.

Latent on master because get_job_status is only called from submit_first,
which inspects PENDING/RUNNING (both 7 chars). Fix it now anyway:
--parsable2 switches sacct to pipe-separated machine output without
column-width constraints, so state strings are always full.
…ansient

Master had RUNNING_JOB_STATES = ["PENDING", "RUNNING"] and treated
everything else as terminal. A poll that landed in COMPLETING (the
post-script cleanup phase — cgroup teardown, log sync; can last seconds
to minutes) would exit one step early. Same for less common transient
states like SUSPENDED, REQUEUED, RESIZING, STAGE_OUT.

Inverting the predicate is the more defensible shape: enumerate the
terminal set (COMPLETED + failure modes) and treat anything else as
"keep polling". An unknown state from a future SLURM release now
defaults to safe-wait instead of silent-misclassify.

The single call site in submit_first gets the inverted check with an
empty-string guard so a sacct query that hasn't populated yet doesn't
claim the cluster.

Adds TestTerminalJobStates with parametrized terminal/transient/unknown
cases as a guard against future state-set drift.
cluv passes resource requests as env-var prefixes onto the remote
command, e.g.

    ssh rorqual 'bash --login -c "SBATCH_MEM=2G ... sbatch --parsable job.sh"'

On Mila this works — sbatch reads SBATCH_MEM from its environment and
honors it. On DRAC (rorqual, narval, fir, ...) the `bash --login` step
re-sources /etc/profile.d/*.sh, which resets SBATCH_* defaults before
sbatch ever sees them. The user's pyproject.toml ask is silently dropped
and the job runs at the cluster default (e.g. ReqMem=256M).

Fix: translate SBATCH_* keys with a known equivalent flag into CLI flags
on the sbatch command line. Flags are parsed by sbatch directly and
cannot be clobbered by anything the login shell sources afterward; they
work identically on Mila and DRAC.

User-facing pyproject.toml keys are unchanged — only the wire format on
the remote shell changes. Any SBATCH_* key without a known flag (or any
non-SBATCH env var like GIT_COMMIT) falls through as a plain env var.

Verified end-to-end with SBATCH_MEM = "2G" and a 5 GiB numpy alloc:

    cluster  ReqMem (pre)  ReqMem (post)  terminal state
    mila     2G            2G             OUT_OF_MEMORY (full)
    rorqual  256M          2G             OUT_OF_MEMORY (full)
@lebrice
Copy link
Copy Markdown
Contributor

lebrice commented May 24, 2026

Thanks @wietzesuijker ! We'll take a look to confirm the issue, then review your PR. :)

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 87.50000% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 41.73%. Comparing base (c62c3c3) to head (e898b0e).

Files with missing lines Patch % Lines
cluv/cli/submit.py 87.50% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master      #71       +/-   ##
===========================================
- Coverage   60.01%   41.73%   -18.28%     
===========================================
  Files          14       14               
  Lines        1138     1150       +12     
===========================================
- Hits          683      480      -203     
- Misses        455      670      +215     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@lebrice
Copy link
Copy Markdown
Contributor

lebrice commented May 26, 2026

By the way, for context: An alternative could also be to not use the --login flag of bash when running the sbatch commands over SSH. The reason I used it was because on some of the clusters, the uv and sbatch commands were not available with non-interactive shells.

For example:

  • ssh killarney which uv does not work for me
  • ssh killarney 'bash --login -c "which uv"' works

Same for sbatch.

I suspect this is something to do with the .profile / .bash_profile / .bashrc setup on these clusters. I am not sure if this is specific to me, or if all researchers will have the same issue.

(btw, @obilaniu , I could use some advice on this if you have any).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants