RFC: OOM-aware resubmit + learned memory estimator

## Problem

Today an OOM failure means a manual memory bump and a resubmit: 1-2 human round-trips per failure, even though the right value is discoverable from `sacct` history.

## Proposal

Two opt-in capabilities, stacked PRs:

### [PR 1: OOM-aware retry](https://github.com/wietzesuijker/cluv/compare/master...feat/oom-retry)
- Detect `OUT_OF_MEMORY` via `sacct --parsable2`
- Resubmit with bumped memory (configurable multiplier, max retries, ceiling)
- Config: `[tool.cluv.retry]`
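
A minimal sketch of the detect-and-bump step, assuming the `sacct --parsable2` output includes a `State` column. The names `is_oom` and `next_memory_gb` are hypothetical, not cluv's API, and the default values simply mirror the example config in this RFC.

```python
from typing import Optional


def is_oom(sacct_output: str) -> bool:
    """Return True if any row of pipe-delimited `sacct --parsable2` output
    reports the OUT_OF_MEMORY state. The first line is the header row."""
    lines = [ln for ln in sacct_output.splitlines() if ln.strip()]
    if not lines:
        return False
    header = lines[0].split("|")
    if "State" not in header:
        return False
    state_col = header.index("State")
    for row in lines[1:]:
        fields = row.split("|")
        if len(fields) > state_col and fields[state_col].startswith("OUT_OF_MEMORY"):
            return True
    return False


def next_memory_gb(
    current_gb: float,
    retries_done: int,
    *,
    multiplier: float = 1.5,
    max_retries: int = 3,
    ceiling_gb: float = 64,
) -> Optional[float]:
    """Memory request for the next attempt, or None when retries are
    exhausted or the request already sits at the ceiling."""
    if retries_done >= max_retries or current_gb >= ceiling_gb:
        return None
    return min(current_gb * multiplier, ceiling_gb)
```

With these defaults, a job that OOMs at 8 GB would be retried at 12 GB, and a job at 48 GB would be clamped to the 64 GB ceiling instead of 72 GB.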

### [PR 2: Learned memory estimator](https://github.com/wietzesuijker/cluv/compare/feat/oom-retry...feat/memory-estimator) (depends on PR 1)
- Local sacct-backfilled history cache (SQLite)
- Per-script estimator (p95 MaxRSS + headroom)
- `cluv estimate` and `cluv history` subcommands
- Integrates into watch loop to pre-size submissions
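
The estimator idea can be sketched as follows. This is an illustration under stated assumptions, not the code in the PR: the `runs` table schema, the function names, the nearest-rank percentile method, and applying headroom as `p95 * (1 + headroom)` are all hypothetical choices.

```python
import sqlite3
from typing import Optional, Sequence

# Hypothetical schema: one row per finished job, backfilled from sacct.
SCHEMA = (
    "CREATE TABLE IF NOT EXISTS runs ("
    "job_id TEXT PRIMARY KEY, script TEXT NOT NULL, max_rss_gb REAL NOT NULL)"
)


def p95_with_headroom(samples_gb: Sequence[float], headroom: float = 0.2) -> Optional[float]:
    """Nearest-rank 95th percentile of the samples, scaled by (1 + headroom).
    Returns None when there is no history to learn from."""
    if not samples_gb:
        return None
    ordered = sorted(samples_gb)
    # 1-based nearest rank: ceil(0.95 * n), done in integers to avoid float error.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1] * (1 + headroom)


def estimate_for_script(conn: sqlite3.Connection, script: str, headroom: float = 0.2) -> Optional[float]:
    """Look up the recorded MaxRSS values for one script and estimate a request."""
    rows = conn.execute(
        "SELECT max_rss_gb FROM runs WHERE script = ?", (script,)
    ).fetchall()
    return p95_with_headroom([r[0] for r in rows], headroom)


# Usage: 20 past runs of train.py with MaxRSS of 1..20 GB.
conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO runs VALUES (?, ?, ?)",
    [(str(i), "train.py", float(i)) for i in range(1, 21)],
)
estimate = estimate_for_script(conn, "train.py")  # p95 is 19 GB, so about 22.8 GB
```

A script with no recorded runs yields `None`, in which case the submission would presumably fall back to whatever memory the user requested.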

## Interface

```toml
[tool.cluv.retry]
enabled = true
multiplier = 1.5
max_retries = 3
ceiling_gb = 64

[tool.cluv.estimate]
enabled = true
headroom = 0.2
backfill_days = 30
```

## Working prototype

[`feat/oom-retry-via-salvo`](https://github.com/wietzesuijker/cluv/tree/feat/oom-retry-via-salvo) on [PR #73](https://github.com/mila-iqia/cluv/pull/73). Decomposed into [`feat/oom-retry`](https://github.com/wietzesuijker/cluv/tree/feat/oom-retry) and [`feat/memory-estimator`](https://github.com/wietzesuijker/cluv/tree/feat/memory-estimator) for reviewability.

## Design decisions (looking for feedback)

**Config namespace:** `[tool.cluv.retry]` and `[tool.cluv.estimate]` sit directly under `[tool.cluv]` (alongside `results_path`, `env`, `clusters`) rather than nesting under `[tool.cluv.submit.*]`, because the estimator also serves `cluv estimate` independently.

**Dependency footprint:** Retry adds [`pysalvo`](https://pypi.org/project/pysalvo/) (pure Python, only transitive dep is pydantic, already present). Estimator uses stdlib `sqlite3`. No optional-dependency groups exist in cluv today, so both live in core deps. Happy to split if preferred.

**OOM detection:** Piggybacks on the existing `watch` poll via [`get_job_status`](https://github.com/wietzesuijker/cluv/blob/2434383/cluv/cli/submit.py#L444-L449). No additional polling or daemons, just a new branch for the OOM state.
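
To illustrate the "one extra branch" shape, here is a sketch of a poll loop with the OOM case added. It is not the code in `submit.py`: `watch`, `get_status`, and `on_oom` are hypothetical stand-ins, and the real `get_job_status` signature may differ.

```python
import time
from typing import Callable, Optional, Tuple

# Slurm states after which the loop stops polling.
TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL"}


def watch(
    job_id: str,
    get_status: Callable[[str], Optional[str]],
    on_oom: Callable[[str], Optional[str]],
    poll_seconds: float = 30.0,
) -> Tuple[str, str]:
    """Poll one job until it finishes. On OUT_OF_MEMORY, ask `on_oom` to
    resubmit; it returns the new job id, or None when retries are exhausted.
    Returns the final (job_id, state)."""
    while True:
        state = get_status(job_id)
        if state == "OUT_OF_MEMORY":
            new_job_id = on_oom(job_id)
            if new_job_id is None:
                return job_id, state
            job_id = new_job_id  # keep polling, now following the resubmitted job
        elif state in TERMINAL_STATES:
            return job_id, state
        time.sleep(poll_seconds)
```

The same loop and the same poll interval serve both the original job and its resubmissions, which is why no extra daemon is needed.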
