Problem
OOM → manual mem bump → resubmit. 1-2 human round-trips per failure, when the right value is discoverable from sacct history.
Proposal
Two opt-in capabilities, stacked PRs:
PR 1: OOM-aware retry
- Detect OUT_OF_MEMORY via sacct --parsable2
- Resubmit with bumped memory (configurable multiplier, max retries, ceiling)
- Config: [tool.cluv.retry]
PR 2: Learned memory estimator (depends on PR 1)
- Local sacct-backfilled history cache (SQLite)
- Per-script estimator (p95 MaxRSS + headroom)
- cluv estimate and cluv history subcommands
- Integrates into watch loop to pre-size submissions
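To make the retry policy concrete, here is a minimal sketch of how the next memory request could be derived from the multiplier, retry limit, and ceiling. The function name `next_memory_gb`, rounding up to whole GB, and giving up once the ceiling has been reached are assumptions for illustration, not the behaviour of the prototype (which delegates retry handling to pysalvo).

```python
import math
from typing import Optional


def next_memory_gb(
    current_gb: float,
    retries_done: int,
    *,
    multiplier: float = 1.5,
    max_retries: int = 3,
    ceiling_gb: float = 64.0,
) -> Optional[float]:
    """Return the memory request for the next attempt, or None to stop retrying.

    Hypothetical helper: the names and the give-up rules are assumptions.
    """
    # Stop once the retry budget is spent.
    if retries_done >= max_retries:
        return None
    # Assumption: if the job already ran at the ceiling, a retry cannot help.
    if current_gb >= ceiling_gb:
        return None
    # Bump, round up to a whole GB, and clamp to the ceiling.
    return min(math.ceil(current_gb * multiplier), ceiling_gb)


if __name__ == "__main__":
    request = 8.0
    retries = 0
    while (bumped := next_memory_gb(request, retries)) is not None:
        print(f"retry {retries + 1}: {request:g} GB -> {bumped:g} GB")
        request, retries = bumped, retries + 1
```

With the defaults from the Interface section, the loop stops after three retries or as soon as a job that already ran at the ceiling fails again, whichever comes first.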
Interface
[tool.cluv.retry]
enabled = true
multiplier = 1.5
max_retries = 3
ceiling_gb = 64
[tool.cluv.estimate]
enabled = true
headroom = 0.2
backfill_days = 30
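As a rough illustration of the estimator settings, the sketch below computes a p95-plus-headroom estimate from a list of historical MaxRSS samples. It assumes `headroom` is a fraction (0.2 meaning 20% on top of p95), uses the nearest-rank percentile definition, and works in MB; the function name `estimate_memory_mb` is hypothetical, and the prototype may define any of these differently.

```python
import math
from typing import Optional, Sequence


def estimate_memory_mb(
    max_rss_mb: Sequence[float], headroom: float = 0.2
) -> Optional[int]:
    """Estimate a memory request (MB) as p95 of past MaxRSS plus headroom.

    Returns None when there is no history for the script.
    """
    if not max_rss_mb:
        return None
    ordered = sorted(max_rss_mb)
    # Nearest-rank p95: 1-based rank ceil(0.95 * n), done in integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    p95 = ordered[rank - 1]
    # Assumption: headroom is a fraction added on top of p95.
    return math.ceil(p95 * (1 + headroom))


if __name__ == "__main__":
    # 19 ordinary runs and one outlier: p95 ignores the single spike.
    history = [1000.0] * 19 + [4000.0]
    print(estimate_memory_mb(history, headroom=0.25))
```

Using p95 rather than the maximum keeps a single outlier run from inflating every later request; any run that does exceed the estimate would then fall through to the OOM retry path.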
Working prototype
feat/oom-retry-via-salvo on PR #73. Decomposed into feat/oom-retry and feat/memory-estimator for reviewability.
Design decisions (looking for feedback)
Config namespace: Placed as peer sections under [tool.cluv] (alongside results_path, env, clusters) rather than nesting under [tool.cluv.submit.*], because the estimator also serves cluv estimate independently.
Dependency footprint: Retry adds pysalvo (pure Python, only transitive dep is pydantic, already present). Estimator uses stdlib sqlite3. No optional-dependency groups exist in cluv today, so both live in core deps. Happy to split if preferred.
OOM detection: Piggybacks on the existing watch poll via get_job_status. No additional polling or daemons, just a new branch for the OOM state.
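For reference, a minimal sketch of the detection step: parsing the pipe-delimited output of `sacct --parsable2` and checking whether any row reports OUT_OF_MEMORY. The helper name `job_hit_oom` and the choice of columns (`--format=JobID,State,MaxRSS`) are assumptions, and how the result is wired into get_job_status is not shown because that function's signature is not given here.

```python
import csv
import io


def job_hit_oom(sacct_output: str) -> bool:
    """Return True if any job or step row has an OUT_OF_MEMORY state.

    Expects text from something like:
        sacct -j <jobid> --parsable2 --format=JobID,State,MaxRSS
    which prints a header line followed by '|'-separated rows.
    """
    reader = csv.DictReader(io.StringIO(sacct_output), delimiter="|")
    return any(
        (row.get("State") or "").startswith("OUT_OF_MEMORY") for row in reader
    )


if __name__ == "__main__":
    # Illustrative sample output, not captured from a real cluster.
    sample = (
        "JobID|State|MaxRSS\n"
        "123|OUT_OF_MEMORY|\n"
        "123.batch|OUT_OF_MEMORY|15998K\n"
        "123.extern|COMPLETED|0\n"
    )
    print(job_hit_oom(sample))
```

Checking every row rather than only the first covers the case where the state is reported on a step such as the batch step.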