feat(submit): OOM-aware resubmit + learned memory estimator#73
Open
wietzesuijker wants to merge 12 commits into
Open
feat(submit): OOM-aware resubmit + learned memory estimator#73wietzesuijker wants to merge 12 commits into
wietzesuijker wants to merge 12 commits into
Conversation
By default sacct formats the State column as fixed-width text with a 10-char cap and a trailing '+' on overflow. The only common state that overflows is OUT_OF_MEMORY (13 chars) → OUT_OF_ME+, so any consumer string-comparing against the full name silently fails. Latent on master because get_job_status is only called from submit_first, which inspects PENDING/RUNNING (both 7 chars). Fix it now anyway: --parsable2 switches sacct to pipe-separated machine output without column-width constraints, so state strings are always full.
…ansient Master had RUNNING_JOB_STATES = ["PENDING", "RUNNING"] and treated everything else as terminal. A poll that landed in COMPLETING (the post-script cleanup phase — cgroup teardown, log sync; can last seconds to minutes) would exit one step early. Same for less common transient states like SUSPENDED, REQUEUED, RESIZING, STAGE_OUT. Inverting the predicate is the more defensible shape: enumerate the terminal set (COMPLETED + failure modes) and treat anything else as "keep polling". An unknown state from a future SLURM release now defaults to safe-wait instead of silent-misclassify. The single call site in submit_first gets the inverted check with an empty-string guard so a sacct query that hasn't populated yet doesn't claim the cluster. Adds TestTerminalJobStates with parametrized terminal/transient/unknown cases as a guard against future state-set drift.
cluv passes resource requests as env-var prefixes onto the remote
command, e.g.
ssh rorqual 'bash --login -c "SBATCH_MEM=2G ... sbatch --parsable job.sh"'
On Mila this works — sbatch reads SBATCH_MEM from its environment and
honors it. On DRAC (rorqual, narval, fir, ...) the `bash --login` step
re-sources /etc/profile.d/*.sh, which resets SBATCH_* defaults before
sbatch ever sees them. The user's pyproject.toml ask is silently dropped
and the job runs at the cluster default (e.g. ReqMem=256M).
Fix: translate SBATCH_* keys with a known equivalent flag into CLI flags
on the sbatch command line. Flags are parsed by sbatch directly and
cannot be clobbered by anything the login shell sources afterward; they
work identically on Mila and DRAC.
User-facing pyproject.toml keys are unchanged — only the wire format on
the remote shell changes. Any SBATCH_* key without a known flag (or any
non-SBATCH env var like GIT_COMMIT) falls through as a plain env var.
Verified end-to-end with SBATCH_MEM = "2G" and a 5 GiB numpy alloc:
cluster ReqMem (pre) ReqMem (post) terminal state
mila 2G 2G OUT_OF_MEMORY (full)
rorqual 256M 2G OUT_OF_MEMORY (full)
Introduces opt-in `RetryConfig` (on_oom DSL list + max_hops) attached as `CluvConfig.retry`. Default is None so existing pyproject.toml files are unaffected. DSL strings are parsed at first use (by salvo.policy.parse), not at config load, so cluv.config stays free of a pysalvo import.
Adds `get_max_rss_mb(remote, job_id)` that walks sacct rows (`--format=MaxRSS --units=M --parsable2`) and returns the peak step RSS in MiB, or None when sacct reports nothing parseable. Kept separate from `get_job_status` so `submit_first`'s polling stays a single-column read. Used to populate `salvo.policy.OomContext.max_rss_mb` in the upcoming retry loop; `apply_oom` is designed to also accept None and fall back to the multiplicative factor alone (per open-question 4 in the proposal).
Opt-in. With [tool.cluv.retry] set in pyproject.toml, after sbatch returns and the job state becomes OUT_OF_MEMORY, submit() asks salvo.policy.apply_oom for the next memory ask, mutates the env-var dict that get_sbatch_command already builds, and calls sbatch again on the same remote until the policy says fail or max_hops is reached. - pysalvo>=0.2 added to dependencies (pulled in lazily inside the retry path so users without [tool.cluv.retry] take no import cost) - six fake-remote tests pin the loop semantics: bumps mem, grows across hops, caps at max_hops, terminates on fail-step, returns immediately on non-OOM terminal, no-op when retry is None - CLUV_HOP env var threaded through env_overrides for the running job to read
JSON cache at ~/.cache/cluv/history/<cluster>/<key>.json (override via CLUV_HISTORY_DIR). Records are deduped by job_id and written atomically. The sacct backfill stamps cluv submissions with a Comment field "cluv:v1:<key>" so a single sacct call can recover history per spec.
Stamps each submission with --comment=cluv:v1:<spec_key> so sacct backfill can identify cluv jobs later. When [tool.cluv.estimate] is configured, _resolve_estimate loads the local cache (backfilling from sacct on cold cache) and overrides SBATCH_MEM with the estimator's prediction. Renames _retry_on_oom to _watch_job_chain and threads write_back through it so each terminal job persists a JobRecord for future estimates. Both retry-only and estimate-only paths reuse the same loop.
cluv estimate <cluster> <job.sh> [-- program-args...] reproduces the spec_key and dry-runs the estimator, printing the historical records, P95, growth slope, confidence, and the SBATCH_MEM it would set. cluv history list|backfill|clear lets users inspect and manage the local cache that the estimator and the retry loop both feed.
salvo.history (cluv history backend) ships from 0.3.0; lock follows the v0.3.1 release for --mem-per-gpu / --qos=long, Mila login-N hostname detection, salvo.runtime.entrypoint, preempt resubmit, and concurrency hardening.
4ee5c97 to
e14db2b
Compare
Contributor
|
Hey @wietzesuijker , this is too much code, without a prior issue and design discussion, for us to approve it. I'll leave it open for now so that we may play with it, but it is unlikely that we would merge such a PR without a thorough design discussion and code review. |
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
feat(submit): OOM-aware resubmit + learned memory estimator
Two opt-in capabilities so cluv handles memory sizing for you:
OUT_OF_MEMORY, cluv resubmits with bumpedSBATCH_MEMinstead of surfacing the failure. Each miss costs one extra hop instead of a manual edit + push + resubmit.(script, commit, args)tuple and asks for what your last run actually used. You stop tracking memory sizes by hand.Both are off by default. The only unconditional change is a
--comment=cluv:v1:<spec_key>stamp on every submission, harmless without the estimator.Turning them on
Both blocks accept more knobs (
max_hops,safety,window,min_samples,backfill); seeRetryConfigandEstimateConfig.Evidence
5 GiB float64 allocation, starting from
SBATCH_MEM = "2G":Job under test
The warm prediction came from the three rows the cold submit wrote. The Mila and Rorqual values differ because sacct samples peak memory differently on each cluster; the estimator handles a known artifact (
MaxRSS << ReqMemon a COMPLETED row) by falling back to the request, which is why Mila lands at 9830M.How identity works
spec_key = blake2s(job_script + git_commit + program_args). The first submit for a new key pulls 30 days of sacct on that cluster and seeds a local cache (~/.cache/cluv/history/<cluster>/<key>.json); subsequent submits read it. After each terminal state, cluv writes one row back.Inspecting the cache
AI (Claude) supported my development of this PR.