RFC: OOM-aware resubmit + learned memory estimator #76

@wietzesuijker

Problem

Today the loop is: job OOMs → manual memory bump → resubmit. That costs 1-2 human round-trips per failure, even though the right value is discoverable from sacct history.

Proposal

Two opt-in capabilities, stacked PRs:

PR 1: OOM-aware retry

  • Detect OUT_OF_MEMORY via sacct --parsable2
  • Resubmit with bumped memory (configurable multiplier, max retries, ceiling)
  • Config: [tool.cluv.retry]
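
A minimal sketch of the two pieces PR 1 describes, assuming `sacct --parsable2` output (pipe-delimited, header row first) has already been captured as text. The function names `parse_sacct`, `is_oom`, and `next_memory_gb` are hypothetical and not taken from the prototype; rounding up to whole GB is also an assumption.

```python
import math
from typing import Optional


def parse_sacct(text: str) -> list[dict[str, str]]:
    """Parse `sacct --parsable2` output: '|'-delimited rows under a header row."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split("|")
    return [dict(zip(header, ln.split("|"))) for ln in lines[1:]]


def is_oom(rows: list[dict[str, str]]) -> bool:
    """True if any row (job or job step) reports the OUT_OF_MEMORY state."""
    return any(row.get("State", "").startswith("OUT_OF_MEMORY") for row in rows)


def next_memory_gb(
    current_gb: int,
    retries_used: int,
    *,
    multiplier: float = 1.5,
    max_retries: int = 3,
    ceiling_gb: int = 64,
) -> Optional[int]:
    """Return the bumped request in GB, or None when no further retry is allowed."""
    if retries_used >= max_retries or current_gb >= ceiling_gb:
        return None
    # Round up to a whole GB (an assumption) and never exceed the ceiling.
    return min(math.ceil(current_gb * multiplier), ceiling_gb)


sample = "JobID|State|ReqMem\n123|OUT_OF_MEMORY|8G\n123.batch|OUT_OF_MEMORY|8G\n"
if is_oom(parse_sacct(sample)):
    print(next_memory_gb(8, retries_used=0))
```

Returning `None` rather than raising keeps the "give up" decision with the caller, which can then surface the failure the same way it does today.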

PR 2: Learned memory estimator (depends on PR 1)

  • Local sacct-backfilled history cache (SQLite)
  • Per-script estimator (p95 MaxRSS + headroom)
  • cluv estimate and cluv history subcommands
  • Integrates into watch loop to pre-size submissions
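
To make the estimator concrete, here is a rough sketch of "p95 MaxRSS + headroom" over a SQLite history table, using only the stdlib. The table layout (`runs` with `job_id`, `script`, `max_rss_mb`), the function names, and the nearest-rank percentile definition are all assumptions for illustration; the prototype may store and compute these differently.

```python
import math
import sqlite3
from typing import Optional


def open_history(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical history cache with one row per finished job."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS runs ("
        "job_id TEXT PRIMARY KEY, script TEXT NOT NULL, max_rss_mb REAL NOT NULL)"
    )
    return conn


def record_run(conn: sqlite3.Connection, job_id: str, script: str, max_rss_mb: float) -> None:
    """Insert or overwrite one job's peak memory, so repeated backfills stay idempotent."""
    conn.execute(
        "INSERT OR REPLACE INTO runs (job_id, script, max_rss_mb) VALUES (?, ?, ?)",
        (job_id, script, max_rss_mb),
    )


def estimate_mb(conn: sqlite3.Connection, script: str, headroom: float = 0.2) -> Optional[int]:
    """Nearest-rank p95 of MaxRSS for a script, scaled by (1 + headroom); None if no history."""
    values = [
        row[0]
        for row in conn.execute(
            "SELECT max_rss_mb FROM runs WHERE script = ? ORDER BY max_rss_mb", (script,)
        )
    ]
    if not values:
        return None
    rank = (95 * len(values) + 99) // 100  # integer ceil(0.95 * n)
    return math.ceil(values[rank - 1] * (1 + headroom))


conn = open_history()
for i in range(1, 21):  # twenty fake runs: 100 MB, 200 MB, ..., 2000 MB
    record_run(conn, f"job-{i}", "train.py", 100.0 * i)
print(estimate_mb(conn, "train.py"))
```

With few samples the nearest-rank p95 is simply the maximum observed value, so early estimates lean conservative until more history accumulates.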

Interface

[tool.cluv.retry]
enabled = true
multiplier = 1.5
max_retries = 3
ceiling_gb = 64

[tool.cluv.estimate]
enabled = true
headroom = 0.2
backfill_days = 30

Working prototype

feat/oom-retry-via-salvo on PR #73. Decomposed into feat/oom-retry and feat/memory-estimator for reviewability.

Design decisions (looking for feedback)

Config namespace: Both sections sit directly under [tool.cluv] (as peers of results_path, env, clusters) rather than nested under [tool.cluv.submit.*], because the estimator also serves cluv estimate independently.

Dependency footprint: Retry adds pysalvo (pure Python, only transitive dep is pydantic, already present). Estimator uses stdlib sqlite3. No optional-dependency groups exist in cluv today, so both live in core deps. Happy to split if preferred.

OOM detection: Piggybacks on the existing watch poll via get_job_status. No additional polling or daemons, just a new branch for the OOM state.
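
A sketch of what that extra branch might look like inside a single poll step. This assumes the status arrives as a plain string such as `"OUT_OF_MEMORY"`; the real return type of `get_job_status` is not shown here, and `handle_status` and the injected `resubmit` callable are hypothetical names.

```python
import math
from typing import Callable, Optional


def handle_status(
    status: str,
    mem_gb: int,
    retries_used: int,
    resubmit: Callable[[int], str],
    *,
    multiplier: float = 1.5,
    max_retries: int = 3,
    ceiling_gb: int = 64,
) -> Optional[str]:
    """One poll step: return the new job id if an OOM triggered a resubmit, else None."""
    if status != "OUT_OF_MEMORY":
        return None  # every other state is left to the existing watch-loop branches
    if retries_used >= max_retries or mem_gb >= ceiling_gb:
        return None  # retry budget spent or already at the ceiling: caller reports the failure
    new_mem = min(math.ceil(mem_gb * multiplier), ceiling_gb)
    return resubmit(new_mem)


def fake_resubmit(mem_gb: int) -> str:
    """Stand-in for the real submission call, used only for this illustration."""
    return f"resubmitted-with-{mem_gb}G"


print(handle_status("OUT_OF_MEMORY", 8, 0, fake_resubmit))
```

Passing `resubmit` in as a callable keeps the branch testable without a scheduler and leaves the existing poll cadence untouched.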
