feat(submit): OOM-aware resubmit + learned memory estimator by wietzesuijker · Pull Request #73 · mila-iqia/cluv

wietzesuijker · 2026-05-22T01:52:40Z

feat(submit): OOM-aware resubmit + learned memory estimator

Two opt-in capabilities so cluv handles memory sizing for you:

Retry: when sacct reports OUT_OF_MEMORY, cluv resubmits with bumped SBATCH_MEM instead of surfacing the failure. Each miss costs one extra hop instead of a manual edit + push + resubmit.
Estimator: before each submit, cluv reads past sacct rows for the same (script, commit, args) tuple and asks for what your last run actually used. You stop tracking memory sizes by hand.

Both are off by default. The only unconditional change is a --comment=cluv:v1:<spec_key> stamp on every submission, harmless without the estimator.

Turning them on

[tool.cluv.retry]
on_oom = ["bump_mem(2x, max=16G)", "fail"]

[tool.cluv.estimate]
enabled = true

Both blocks accept more knobs (max_hops, safety, window, min_samples, backfill); see RetryConfig and EstimateConfig.

Evidence

5 GiB float64 allocation, starting from SBATCH_MEM = "2G":

cluster	cold submit	warm submit (predicted)
Mila	2G OOM → 4G OOM → 8G COMPLETED	9830M, single COMPLETED
Rorqual	2G OOM → 4G OOM → 8G COMPLETED	6284M, single COMPLETED

Job under test

import numpy as np
n = int(5.0 * (1024**3) / 8)
arr = np.ones(n, dtype=np.float64)
arr += 1.0

The warm prediction came from the three rows the cold submit wrote. The Mila and Rorqual values differ because sacct samples peak memory differently on each cluster; the estimator handles a known artifact (MaxRSS << ReqMem on a COMPLETED row) by falling back to the request, which is why Mila lands at 9830M.

How identity works

spec_key = blake2s(job_script + git_commit + program_args). The first submit for a new key pulls 30 days of sacct on that cluster and seeds a local cache (~/.cache/cluv/history/<cluster>/<key>.json); subsequent submits read it. After each terminal state, cluv writes one row back.

Inspecting the cache

cluv estimate <cluster> <job.sh> [-- program-args...]   # dry-run what the next submit would ask for
cluv history list                                       # all (cluster, key) pairs
cluv history backfill <cluster>                         # pull tagged jobs from sacct
cluv history clear [<cluster>] [--key <key>]            # drop a key, cluster, or everything

First 3 commits are #71; review there.

AI (Claude) supported my development of this PR.

By default sacct formats the State column as fixed-width text with a 10-char cap and a trailing '+' on overflow. The only common state that overflows is OUT_OF_MEMORY (13 chars) → OUT_OF_ME+, so any consumer string-comparing against the full name silently fails. Latent on master because get_job_status is only called from submit_first, which inspects PENDING/RUNNING (both 7 chars). Fix it now anyway: --parsable2 switches sacct to pipe-separated machine output without column-width constraints, so state strings are always full.

…ansient Master had RUNNING_JOB_STATES = ["PENDING", "RUNNING"] and treated everything else as terminal. A poll that landed in COMPLETING (the post-script cleanup phase — cgroup teardown, log sync; can last seconds to minutes) would exit one step early. Same for less common transient states like SUSPENDED, REQUEUED, RESIZING, STAGE_OUT. Inverting the predicate is the more defensible shape: enumerate the terminal set (COMPLETED + failure modes) and treat anything else as "keep polling". An unknown state from a future SLURM release now defaults to safe-wait instead of silent-misclassify. The single call site in submit_first gets the inverted check with an empty-string guard so a sacct query that hasn't populated yet doesn't claim the cluster. Adds TestTerminalJobStates with parametrized terminal/transient/unknown cases as a guard against future state-set drift.

cluv passes resource requests as env-var prefixes onto the remote command, e.g. ssh rorqual 'bash --login -c "SBATCH_MEM=2G ... sbatch --parsable job.sh"' On Mila this works — sbatch reads SBATCH_MEM from its environment and honors it. On DRAC (rorqual, narval, fir, ...) the `bash --login` step re-sources /etc/profile.d/*.sh, which resets SBATCH_* defaults before sbatch ever sees them. The user's pyproject.toml ask is silently dropped and the job runs at the cluster default (e.g. ReqMem=256M). Fix: translate SBATCH_* keys with a known equivalent flag into CLI flags on the sbatch command line. Flags are parsed by sbatch directly and cannot be clobbered by anything the login shell sources afterward; they work identically on Mila and DRAC. User-facing pyproject.toml keys are unchanged — only the wire format on the remote shell changes. Any SBATCH_* key without a known flag (or any non-SBATCH env var like GIT_COMMIT) falls through as a plain env var. Verified end-to-end with SBATCH_MEM = "2G" and a 5 GiB numpy alloc: cluster ReqMem (pre) ReqMem (post) terminal state mila 2G 2G OUT_OF_MEMORY (full) rorqual 256M 2G OUT_OF_MEMORY (full)

Introduces opt-in `RetryConfig` (on_oom DSL list + max_hops) attached as `CluvConfig.retry`. Default is None so existing pyproject.toml files are unaffected. DSL strings are parsed at first use (by salvo.policy.parse), not at config load, so cluv.config stays free of a pysalvo import.

Adds `get_max_rss_mb(remote, job_id)` that walks sacct rows (`--format=MaxRSS --units=M --parsable2`) and returns the peak step RSS in MiB, or None when sacct reports nothing parseable. Kept separate from `get_job_status` so `submit_first`'s polling stays a single-column read. Used to populate `salvo.policy.OomContext.max_rss_mb` in the upcoming retry loop; `apply_oom` is designed to also accept None and fall back to the multiplicative factor alone (per open-question 4 in the proposal).

Opt-in. With [tool.cluv.retry] set in pyproject.toml, after sbatch returns and the job state becomes OUT_OF_MEMORY, submit() asks salvo.policy.apply_oom for the next memory ask, mutates the env-var dict that get_sbatch_command already builds, and calls sbatch again on the same remote until the policy says fail or max_hops is reached. - pysalvo>=0.2 added to dependencies (pulled in lazily inside the retry path so users without [tool.cluv.retry] take no import cost) - six fake-remote tests pin the loop semantics: bumps mem, grows across hops, caps at max_hops, terminates on fail-step, returns immediately on non-OOM terminal, no-op when retry is None - CLUV_HOP env var threaded through env_overrides for the running job to read

…t-in

JSON cache at ~/.cache/cluv/history/<cluster>/<key>.json (override via CLUV_HISTORY_DIR). Records are deduped by job_id and written atomically. The sacct backfill stamps cluv submissions with a Comment field "cluv:v1:<key>" so a single sacct call can recover history per spec.

Stamps each submission with --comment=cluv:v1:<spec_key> so sacct backfill can identify cluv jobs later. When [tool.cluv.estimate] is configured, _resolve_estimate loads the local cache (backfilling from sacct on cold cache) and overrides SBATCH_MEM with the estimator's prediction. Renames _retry_on_oom to _watch_job_chain and threads write_back through it so each terminal job persists a JobRecord for future estimates. Both retry-only and estimate-only paths reuse the same loop.

cluv estimate <cluster> <job.sh> [-- program-args...] reproduces the spec_key and dry-runs the estimator, printing the historical records, P95, growth slope, confidence, and the SBATCH_MEM it would set. cluv history list|backfill|clear lets users inspect and manage the local cache that the estimator and the retry loop both feed.

salvo.history (cluv history backend) ships from 0.3.0; lock follows the v0.3.1 release for --mem-per-gpu / --qos=long, Mila login-N hostname detection, salvo.runtime.entrypoint, preempt resubmit, and concurrency hardening.

lebrice · 2026-05-24T16:43:35Z

Hey @wietzesuijker , this is too much code, without a prior issue and design discussion, for us to approve it. I'll leave it open for now so that we may play with it, but it is unlikely that we would merge such a PR without a thorough design discussion and code review.

wietzesuijker · 2026-05-24T18:49:38Z

Thanks @lebrice. I suggest prioritizing #71 and #72 first.

wietzesuijker added 12 commits May 21, 2026 16:51

feat(config): add [tool.cluv.estimate] schema for memory estimator op…

ff854ad

…t-in

fix(history): cap default backfill window at 30 days (mila sacct limit)

a0c5bf5

chore(deps): pin pysalvo>=0.3 and lock to 0.3.1

e14db2b

salvo.history (cluv history backend) ships from 0.3.0; lock follows the v0.3.1 release for --mem-per-gpu / --qos=long, Mila login-N hostname detection, salvo.runtime.entrypoint, preempt resubmit, and concurrency hardening.

wietzesuijker force-pushed the feat/oom-retry-via-salvo branch from 4ee5c97 to e14db2b Compare May 23, 2026 01:56

wietzesuijker mentioned this pull request May 25, 2026

RFC: OOM-aware resubmit + learned memory estimator #76

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(submit): OOM-aware resubmit + learned memory estimator#73

feat(submit): OOM-aware resubmit + learned memory estimator#73
wietzesuijker wants to merge 12 commits into
mila-iqia:masterfrom
wietzesuijker:feat/oom-retry-via-salvo

wietzesuijker commented May 22, 2026 •

edited

Loading

Uh oh!

lebrice commented May 24, 2026

Uh oh!

wietzesuijker commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

wietzesuijker commented May 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

feat(submit): OOM-aware resubmit + learned memory estimator

Turning them on

Evidence

How identity works

Inspecting the cache

Uh oh!

lebrice commented May 24, 2026

Uh oh!

wietzesuijker commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

wietzesuijker commented May 22, 2026 •

edited

Loading