Skip to content

feat(submit): OOM-aware resubmit + learned memory estimator#73

Open
wietzesuijker wants to merge 12 commits into
mila-iqia:masterfrom
wietzesuijker:feat/oom-retry-via-salvo
Open

feat(submit): OOM-aware resubmit + learned memory estimator#73
wietzesuijker wants to merge 12 commits into
mila-iqia:masterfrom
wietzesuijker:feat/oom-retry-via-salvo

Conversation

@wietzesuijker
Copy link
Copy Markdown

@wietzesuijker wietzesuijker commented May 22, 2026

feat(submit): OOM-aware resubmit + learned memory estimator

Two opt-in capabilities so cluv handles memory sizing for you:

  • Retry: when sacct reports OUT_OF_MEMORY, cluv resubmits with bumped SBATCH_MEM instead of surfacing the failure. Each miss costs one extra hop instead of a manual edit + push + resubmit.
  • Estimator: before each submit, cluv reads past sacct rows for the same (script, commit, args) tuple and asks for what your last run actually used. You stop tracking memory sizes by hand.

Both are off by default. The only unconditional change is a --comment=cluv:v1:<spec_key> stamp on every submission, harmless without the estimator.

Turning them on

[tool.cluv.retry]
on_oom = ["bump_mem(2x, max=16G)", "fail"]

[tool.cluv.estimate]
enabled = true

Both blocks accept more knobs (max_hops, safety, window, min_samples, backfill); see RetryConfig and EstimateConfig.

Evidence

5 GiB float64 allocation, starting from SBATCH_MEM = "2G":

cluster cold submit warm submit (predicted)
Mila 2G OOM → 4G OOM → 8G COMPLETED 9830M, single COMPLETED
Rorqual 2G OOM → 4G OOM → 8G COMPLETED 6284M, single COMPLETED
Job under test
import numpy as np
n = int(5.0 * (1024**3) / 8)
arr = np.ones(n, dtype=np.float64)
arr += 1.0

The warm prediction came from the three rows the cold submit wrote. The Mila and Rorqual values differ because sacct samples peak memory differently on each cluster; the estimator handles a known artifact (MaxRSS << ReqMem on a COMPLETED row) by falling back to the request, which is why Mila lands at 9830M.

How identity works

spec_key = blake2s(job_script + git_commit + program_args). The first submit for a new key pulls 30 days of sacct on that cluster and seeds a local cache (~/.cache/cluv/history/<cluster>/<key>.json); subsequent submits read it. After each terminal state, cluv writes one row back.

Inspecting the cache

cluv estimate <cluster> <job.sh> [-- program-args...]   # dry-run what the next submit would ask for
cluv history list                                       # all (cluster, key) pairs
cluv history backfill <cluster>                         # pull tagged jobs from sacct
cluv history clear [<cluster>] [--key <key>]            # drop a key, cluster, or everything

First 3 commits are #71; review there.


AI (Claude) supported my development of this PR.

By default sacct formats the State column as fixed-width text with a
10-char cap and a trailing '+' on overflow. The only common state that
overflows is OUT_OF_MEMORY (13 chars) → OUT_OF_ME+, so any consumer
string-comparing against the full name silently fails.

Latent on master because get_job_status is only called from submit_first,
which inspects PENDING/RUNNING (both 7 chars). Fix it now anyway:
--parsable2 switches sacct to pipe-separated machine output without
column-width constraints, so state strings are always full.
…ansient

Master had RUNNING_JOB_STATES = ["PENDING", "RUNNING"] and treated
everything else as terminal. A poll that landed in COMPLETING (the
post-script cleanup phase — cgroup teardown, log sync; can last seconds
to minutes) would exit one step early. Same for less common transient
states like SUSPENDED, REQUEUED, RESIZING, STAGE_OUT.

Inverting the predicate is the more defensible shape: enumerate the
terminal set (COMPLETED + failure modes) and treat anything else as
"keep polling". An unknown state from a future SLURM release now
defaults to safe-wait instead of silent-misclassify.

The single call site in submit_first gets the inverted check with an
empty-string guard so a sacct query that hasn't populated yet doesn't
claim the cluster.

Adds TestTerminalJobStates with parametrized terminal/transient/unknown
cases as a guard against future state-set drift.
cluv passes resource requests as env-var prefixes onto the remote
command, e.g.

    ssh rorqual 'bash --login -c "SBATCH_MEM=2G ... sbatch --parsable job.sh"'

On Mila this works — sbatch reads SBATCH_MEM from its environment and
honors it. On DRAC (rorqual, narval, fir, ...) the `bash --login` step
re-sources /etc/profile.d/*.sh, which resets SBATCH_* defaults before
sbatch ever sees them. The user's pyproject.toml ask is silently dropped
and the job runs at the cluster default (e.g. ReqMem=256M).

Fix: translate SBATCH_* keys with a known equivalent flag into CLI flags
on the sbatch command line. Flags are parsed by sbatch directly and
cannot be clobbered by anything the login shell sources afterward; they
work identically on Mila and DRAC.

User-facing pyproject.toml keys are unchanged — only the wire format on
the remote shell changes. Any SBATCH_* key without a known flag (or any
non-SBATCH env var like GIT_COMMIT) falls through as a plain env var.

Verified end-to-end with SBATCH_MEM = "2G" and a 5 GiB numpy alloc:

    cluster  ReqMem (pre)  ReqMem (post)  terminal state
    mila     2G            2G             OUT_OF_MEMORY (full)
    rorqual  256M          2G             OUT_OF_MEMORY (full)
Introduces opt-in `RetryConfig` (on_oom DSL list + max_hops) attached as
`CluvConfig.retry`. Default is None so existing pyproject.toml files are
unaffected. DSL strings are parsed at first use (by salvo.policy.parse),
not at config load, so cluv.config stays free of a pysalvo import.
Adds `get_max_rss_mb(remote, job_id)` that walks sacct rows
(`--format=MaxRSS --units=M --parsable2`) and returns the peak step RSS
in MiB, or None when sacct reports nothing parseable. Kept separate from
`get_job_status` so `submit_first`'s polling stays a single-column read.

Used to populate `salvo.policy.OomContext.max_rss_mb` in the upcoming
retry loop; `apply_oom` is designed to also accept None and fall back to
the multiplicative factor alone (per open-question 4 in the proposal).
Opt-in. With [tool.cluv.retry] set in pyproject.toml, after sbatch
returns and the job state becomes OUT_OF_MEMORY, submit() asks
salvo.policy.apply_oom for the next memory ask, mutates the env-var
dict that get_sbatch_command already builds, and calls sbatch again
on the same remote until the policy says fail or max_hops is reached.

- pysalvo>=0.2 added to dependencies (pulled in lazily inside the
  retry path so users without [tool.cluv.retry] take no import cost)
- six fake-remote tests pin the loop semantics: bumps mem, grows
  across hops, caps at max_hops, terminates on fail-step, returns
  immediately on non-OOM terminal, no-op when retry is None
- CLUV_HOP env var threaded through env_overrides for the running
  job to read
JSON cache at ~/.cache/cluv/history/<cluster>/<key>.json (override via
CLUV_HISTORY_DIR). Records are deduped by job_id and written atomically.
The sacct backfill stamps cluv submissions with a Comment field
"cluv:v1:<key>" so a single sacct call can recover history per spec.
Stamps each submission with --comment=cluv:v1:<spec_key> so sacct
backfill can identify cluv jobs later. When [tool.cluv.estimate] is
configured, _resolve_estimate loads the local cache (backfilling from
sacct on cold cache) and overrides SBATCH_MEM with the estimator's
prediction. Renames _retry_on_oom to _watch_job_chain and threads
write_back through it so each terminal job persists a JobRecord for
future estimates. Both retry-only and estimate-only paths reuse the
same loop.
cluv estimate <cluster> <job.sh> [-- program-args...] reproduces the
spec_key and dry-runs the estimator, printing the historical records,
P95, growth slope, confidence, and the SBATCH_MEM it would set.

cluv history list|backfill|clear lets users inspect and manage the
local cache that the estimator and the retry loop both feed.
salvo.history (cluv history backend) ships from 0.3.0; lock follows the
v0.3.1 release for --mem-per-gpu / --qos=long, Mila login-N hostname
detection, salvo.runtime.entrypoint, preempt resubmit, and concurrency
hardening.
@wietzesuijker wietzesuijker force-pushed the feat/oom-retry-via-salvo branch from 4ee5c97 to e14db2b Compare May 23, 2026 01:56
@lebrice
Copy link
Copy Markdown
Contributor

lebrice commented May 24, 2026

Hey @wietzesuijker , this is too much code, without a prior issue and design discussion, for us to approve it. I'll leave it open for now so that we may play with it, but it is unlikely that we would merge such a PR without a thorough design discussion and code review.

@wietzesuijker
Copy link
Copy Markdown
Author

Thanks @lebrice. I suggest prioritizing #71 and #72 first.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants