program.md: the auto-optim optimization loop (the runbook)

This is the "skill" the agent runs over and over to minimize Wan 2.2 VAE decode latency without failing the quality gate. You (the human) iterate on THIS file; the agent iterates on optimize.py. You program the org, not the Python.

This is a synthesis of the field's working practice: Karpathy's autoresearch loop, the GPU kernel-optimization agents (KernelBench, Sakana robust-kbench, Meta KernelAgent, Kevin-32B, GPU MODE), and Anthropic's long-running-agent guidance. Sources are listed at the bottom. AGENTS.md is the short always-loaded contract; this file is the full procedure, loaded when you run the loop.

0. The one principle

The verifier is adversarial-grade infrastructure, not a test you wrote for yourself. Three independent groups (Sakana, Cognition, GPU MODE) caught their agent cheating the evaluator before it cheated the problem. Sakana's headline 3.13× collapsed to 1.49× once benchmark-gaming kernels were removed. Everything below exists so that the single number we optimize, latency, cannot be moved by anything except a genuinely faster, genuinely correct decode.

1. Roles & files

| File | Status | Who edits |
| --- | --- | --- |
| `optimize.py` (`build_decoder`) | MUTABLE: the only thing the agent changes | agent |
| `goals.json` | FROZEN target/thresholds/stop-conditions (JSON on purpose: models tamper with JSON far less than Markdown) | human only |
| `harness/{bench,verify,make_reference,run_experiment,_common}.py` | FROZEN evaluator | human only |
| `model/`, `assets/` | FROZEN model + checkpoint + frozen reference | human only |
| `results.tsv` | append-only ledger (driver writes; survives `git reset`) | driver |
| `leaderboard.md` | generated best-so-far | driver |
| `PROGRESS.md` | journal + dead-end log + per-round reflexion | agent |
| `program.md` (this file) | the methodology | human only |

Two control loops, kept separate: the human optimizes this spec; the agent optimizes the code. The frozen evaluator is the integrity guarantee: the metric can't be gamed because the agent can't touch the scorer.
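One way to make the frozen/mutable split checkable is to compare the list of changed paths against the roles table. The sketch below is illustrative only and is not part of the harness; the function name `frozen_violations` is hypothetical, and the frozen sets are copied from the table above.

```python
from pathlib import PurePosixPath

# Frozen paths copied from the roles table. This guard is a hypothetical
# illustration, not a description of what the real driver does.
FROZEN_DIRS = {"harness", "model", "assets"}
FROZEN_FILES = {"goals.json", "program.md"}


def frozen_violations(changed_paths):
    """Return the changed paths that touch a frozen file or directory."""
    violations = []
    for raw in changed_paths:
        parts = PurePosixPath(raw).parts
        top = parts[0] if parts else ""
        if raw in FROZEN_FILES or top in FROZEN_DIRS:
            violations.append(raw)
    return violations
```

The input could come from `git diff --name-only`, one path per line; an empty result means the change stayed inside the agent's lane.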

2. Setup (once, with the human)

Agree on a run tag (e.g. jun25); the branch autooptim/<tag> must not already exist.
Create the branch: git checkout -b autooptim/<tag>.
Read AGENTS.md, goals.json, this file, and PROGRESS.md (best-so-far + dead ends).
Confirm a free GPU (nvidia-smi) and export CUDA_VISIBLE_DEVICES=<free idx>. Never benchmark on a busy GPU: timings become noise.
Ensure assets/reference_fp32.pt exists; if it does not, run python harness/make_reference.py.
The first experiment is ALWAYS the untouched baseline (it establishes 1.0×).
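The "confirm a free GPU" step can be scripted by parsing nvidia-smi's CSV query output. This is a minimal sketch: the helper name `pick_idle_gpu` and both idle thresholds are assumptions made for illustration, not values from goals.json.

```python
import csv
import io


def pick_idle_gpu(smi_csv, max_util_pct=5, max_mem_mib=500):
    """Return the index of the least-loaded idle GPU, or None if all are busy.

    smi_csv is the text printed by:
      nvidia-smi --query-gpu=index,utilization.gpu,memory.used \
                 --format=csv,noheader,nounits
    The thresholds are illustrative assumptions.
    """
    candidates = []
    for row in csv.reader(io.StringIO(smi_csv)):
        if len(row) != 3:
            continue  # skip blank or malformed lines
        try:
            index, util, mem = (int(field.strip()) for field in row)
        except ValueError:
            continue  # e.g. "[N/A]" reported for a field
        if util <= max_util_pct and mem <= max_mem_mib:
            candidates.append((util, mem, index))
    return min(candidates)[2] if candidates else None
```

If the function returns an index, export it as CUDA_VISIBLE_DEVICES; if it returns None, wait rather than benchmark on a shared device.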

3. The objective & the gate

One scalar: mean decode latency (s) of assets/latent_3s.pt, subject to the hard quality gate in goals.json. A change that fails the gate has fitness = ∞ (discarded).

The gate compares the decode against the frozen fp32 reference (the same decoder, slow and exact): shape-match AND max_abs_diff ≤ tol AND PSNR ≥ floor, with tol and floor taken from goals.json. This answers exactly the right question, "did the optimization change the output?", and nothing else gets a vote. Speedup is always reported against the fixed baseline (run #0).
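The three-part gate and the infinite-fitness rule can be written down compactly. The following is a sketch on flat Python sequences rather than tensors, and it is not the code in harness/verify.py. The `data_range` default of 2.0 (outputs in [-1, 1]) is an assumption; the real PSNR peak definition lives in the frozen harness.

```python
import math


def psnr(candidate, reference, data_range):
    """Peak signal-to-noise ratio in dB over flat sequences of floats."""
    mse = sum((c - r) ** 2 for c, r in zip(candidate, reference)) / len(reference)
    if mse == 0.0:
        return math.inf  # identical outputs
    return 10.0 * math.log10(data_range ** 2 / mse)


def passes_gate(cand, cand_shape, ref, ref_shape, tol, psnr_floor, data_range=2.0):
    """shape-match AND max_abs_diff <= tol AND PSNR >= floor."""
    if tuple(cand_shape) != tuple(ref_shape):
        return False
    max_abs_diff = max(abs(c - r) for c, r in zip(cand, ref))
    return max_abs_diff <= tol and psnr(cand, ref, data_range) >= psnr_floor


def fitness(latency_s, gate_ok):
    """Latency when the gate passes, otherwise infinity (the change is discarded)."""
    return latency_s if gate_ok else math.inf


def speedup(baseline_latency_s, latency_s):
    """Speedup relative to the fixed baseline (run #0)."""
    return baseline_latency_s / latency_s
```

Because a failed gate maps to infinity, no latency improvement can ever outrank a correct but slower decode.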

4. The experiment loop (LOOP, do not stop on your own)

LOOP:
  1. Orient: read PROGRESS.md (best + dead-ends) and the top of leaderboard.md.
  2. Diagnose from the LAST profile. Every KEEP auto-runs harness/profile.py (warmup +
       detailed profiler), appends the breakdown to PROGRESS.md, and saves a full report +
       a chrome/perfetto trace under runs/. READ IT before hypothesizing. Two numbers decide
       the next move:
         - GPU util HIGH -> COMPUTE-bound: attack the top kernel/op (dtype, layout, fusion).
         - GPU util LOW / idle HIGH -> HOST-bound: the GPU is waiting (launch overhead,
           CPU<->GPU serialization, Python, gaps). Recover it with fewer/larger kernels,
           overlap, CUDA graphs, or compiling a bigger region. The biggest wins hide here.
       No current profile? Run it: CUDA_VISIBLE_DEVICES=$GPU python harness/run_profile.py --exp <n>
  3. Hypothesize ONE change that attacks what the profile shows. Prefer (in order):
       a) the profiled top cost;
       b) an untried technique from "candidate directions" in PROGRESS.md;
       c) a variation of the last near-miss.
  4. Implement it, edit ONLY optimize.py, inside build_decoder.
  5. Evaluate: python harness/run_experiment.py --desc "<=10 words"
       (the driver runs verify.py THEN bench.py, logs, and keeps-or-reverts).
  6. Record in PROGRESS.md (the reflexion entry, §6).
  7. On a KEEP: snapshot optimize.py to experiments/exp_NNN_desc.py.
  8. Go to 1.
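The compute-bound versus host-bound decision in step 2 reduces to one ratio: how much of the wall time the GPU spent busy. A minimal sketch follows; the function name and the 0.8 cutoff are assumptions for illustration, since the loop above says only "HIGH" and "LOW".

```python
def classify_bottleneck(gpu_busy_s, wall_s, busy_threshold=0.8):
    """Label a profile from the fraction of wall time the GPU was busy.

    busy_threshold is an illustrative assumption, not a harness value.
    """
    if wall_s <= 0:
        raise ValueError("wall_s must be positive")
    util = gpu_busy_s / wall_s
    label = "compute-bound" if util >= busy_threshold else "host-bound"
    return label, util
```

A "host-bound" label points at launch overhead and serialization (fewer/larger kernels, CUDA graphs, a bigger compiled region); "compute-bound" points at the top kernel itself.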

Search discipline (from the kernel agents): keep a mental beam of the top ~2 configs and explore distinct bottleneck directions rather than re-tuning one knob; when you show yourself prior attempts, order them slowest→fastest so the trajectory is legible.

4b. Act on your diagnosis: do NOT nibble knobs

Every profile names a #1 reducible cost. Your next hypothesis MUST attack that cost. Sort your possible moves into two kinds:

Knob tweaks: compile mode/flags, cudnn.benchmark, autotune options, dtype/layout toggles. Cheap, and quickly exhausted.
Structural moves: fuse the named hot op into the compiled region, eliminate copies / clones / format-conversions, restructure the hot loop, fold an eager residual into the graph, change how the streaming cache works. Harder, and this is where the remaining wins live once the knobs are spent.

Anti-nibbling rule. A change under ~1% of the current best is within run-to-run noise; it is NOT progress, even if the driver keeps it. If your last 2 kept experiments each gained < 1%, you are nibbling: stop tuning knobs and make a STRUCTURAL change that attacks the current #1 profiled cost, even if it's harder or riskier. Re-read your own "Current bottleneck" note and go after it directly; do not reach for another flag.

If you have genuinely attempted the structural moves the profile points to and only sub-1% noise remains, you have found the floor: reproduce the best once, summarize the milestone ladder in PROGRESS.md, and stop (that is a legitimate result, not a failure).
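The anti-nibbling rule is mechanical enough to check from the ledger. This sketch assumes you have the latencies of the kept experiments in order (baseline first); the function name `is_nibbling` is hypothetical.

```python
def is_nibbling(kept_latencies, window=2, min_gain=0.01):
    """True if each of the last `window` keeps improved by less than min_gain.

    kept_latencies: latencies (s) of kept experiments in order, baseline first.
    Each gain is measured relative to the best that preceded it.
    """
    if len(kept_latencies) < window + 1:
        return False  # not enough history to judge
    recent = kept_latencies[-(window + 1):]
    gains = [(prev - cur) / prev for prev, cur in zip(recent, recent[1:])]
    return all(gain < min_gain for gain in gains)
```

When it returns True, the next hypothesis should be a structural move against the #1 profiled cost, not another flag.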

5. Accept / reject (parsimony + regression guard)

Keep only if it PASSES the gate AND is faster than the current best (driver-enforced).
Parsimony: all else equal, simpler wins. A change that holds latency flat but deletes complexity is a keep. A 0.5% speedup that adds 30 lines of fragile code is a discard. A ~0% change from removing code: keep.
Regression guard: the driver reverts optimize.py on any non-improving result, so a regression can never accumulate. The best-known-good config is always recoverable.
Soft resource ceiling: peak memory may rise some for a real latency win, but must not blow up (see goals.json); ask the human before exceeding the stated cap.
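The driver-enforced part of these rules can be sketched as a single decision function. This is an illustration, not the code in harness/run_experiment.py: the specific `discard_*` suffixes and the `ask_human` outcome are hypothetical labels, and the parsimony judgment (simpler wins at equal latency) is deliberately left out because it needs a human or agent call.

```python
def decide(latency_s, quality_ok, peak_mem_gb, best_latency_s, mem_cap_gb):
    """Return a status string for one experiment.

    Status names other than "keep" are hypothetical; the ledger only
    specifies keep | discard_*.
    """
    if not quality_ok:
        return "discard_quality"  # a failed gate always loses
    if latency_s >= best_latency_s:
        return "discard_slower"  # non-improving results are reverted
    if peak_mem_gb > mem_cap_gb:
        return "ask_human"  # faster, but over the stated memory cap
    return "keep"
```

Checking the gate first means a fast but wrong decode can never reach the latency comparison.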

6. Logging & memory (the durable record)

results.tsv is the append-only ledger (tab-separated; commas break descriptions). The driver writes one row per experiment: exp · commit · latency_s · std_s · max_abs_diff · psnr · quality_ok · peak_mem_gb · status{keep|discard_*} · desc. It is the source of truth and survives context compaction and git reset.
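For reference, appending a row to a tab-separated ledger with Python's csv module looks roughly like this. The column names come from the list above; whether the real results.tsv carries a header row is an assumption of this sketch, as is the helper name `append_result`.

```python
import csv

COLUMNS = ["exp", "commit", "latency_s", "std_s", "max_abs_diff", "psnr",
           "quality_ok", "peak_mem_gb", "status", "desc"]


def append_result(path, row):
    """Append one experiment row; write a header first if the file is empty."""
    # Tabs or newlines inside the description would corrupt the row layout.
    desc = str(row.get("desc", "")).replace("\t", " ").replace("\n", " ")
    clean = {**row, "desc": desc}
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t")
        if handle.tell() == 0:  # append mode opens at end of file
            writer.writeheader()
        writer.writerow(clean)
```

Opening in append mode means existing rows are never rewritten, which is what makes the file safe to treat as the source of truth.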

After every experiment, append a reflexion entry to PROGRESS.md. The schema is adapted from Meta KernelAgent's and keeps the fields that stop you re-grinding:

exp NNN  <change>  ->  <latency or FAIL+reason>
  diagnosis_correct: <was your bottleneck guess right?>
  fix_effective:     <did it do what you expected?>
  lesson:            <what this tells us>
  avoid / try-next:  <what NOT to repeat / what to try next>
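A small formatter keeps every entry in the same shape so later rounds can scan them quickly. This is only a convenience sketch; the function name and argument names are hypothetical, and the field labels are copied from the template above.

```python
def format_reflexion(exp, change, outcome, diagnosis_correct, fix_effective,
                     lesson, next_step):
    """Render one reflexion entry in the template's layout."""
    return (
        f"exp {exp:03d}  {change}  ->  {outcome}\n"
        f"  diagnosis_correct: {diagnosis_correct}\n"
        f"  fix_effective:     {fix_effective}\n"
        f"  lesson:            {lesson}\n"
        f"  avoid / try-next:  {next_step}\n"
    )


def append_reflexion(path, entry):
    """Append the entry to the journal file, separated by a blank line."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write("\n" + entry)
```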

Read the dead-end log BEFORE proposing anything. Never retry a logged dead end unless you can state why the prior failure no longer applies, and write that reason down. "Out of ideas" is not a reason to stop (§8); it's a reason to read papers, recombine near-misses, or try a more radical change.

7. The verifier is adversarial: anti-reward-hacking rules

These are the exploits other projects actually observed, and the standing defenses. You must not do any of these, and the harness is built to catch them anyway.

🚫 No special-casing the benchmark input, caching outputs across runs, or mutating the input tensor in place. (The harness reloads the latent fresh and compares against a reference frozen on disk, computed by a path that never touches optimize.py.)
🚫 No wrapping a wrong fast path in try/except with a slow correct fallback. No making the decode trivially cheap by skipping real work.
🚫 No editing or deleting anything under harness/, model/, or goals.json. Editing the evaluator to pass is the textbook reward hack and is strictly forbidden.
✅ Correctness is parity with the frozen reference, not your own judgment. A result with quality_ok = false is a failure regardless of latency.
✅ Generalization check (hardening): if you suspect a change only helps the one fixed input, it should still pass on an unseen latent of the same shape. Ask the human to add a held-out latent if you're optimizing anything input-shape-specific.
✅ Evidence or it didn't happen: never claim "faster" or "done" without pasting the driver's actual output (the latency line + the gate verdict). Assertions don't count.

Assume residual reward-hacking is always possible. If a result looks too good, distrust it and re-verify before logging a keep.
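To illustrate one of these defenses, here is a sketch of a guard against in-place input mutation. It is not the frozen harness's implementation: `run_with_input_guard` is a hypothetical name, and nested Python lists stand in for the latent tensor (with real tensors, a comparison such as torch.equal would replace the `!=`).

```python
import copy


def run_with_input_guard(decode_fn, latent):
    """Run decode_fn on a private copy and confirm the copy was not mutated."""
    pristine = copy.deepcopy(latent)  # untouched reference copy
    working = copy.deepcopy(latent)   # the only object decode_fn ever sees
    output = decode_fn(working)
    if working != pristine:
        raise RuntimeError("decode mutated its input in place")
    return output
```

Handing the candidate a private copy also means the caller's latent stays clean for the next run, so a mutation can never leak between experiments.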

8. Persistence & stop conditions

NEVER STOP on your own. Once the loop has begun, do not pause to ask "should I keep going?" The human may be away and expects autonomous progress until manually stopped or a goals.json stop condition fires. Running out of ideas is not a stop condition; think harder: re-read the in-scope files, read the papers referenced in the dead-end log, combine previous near-misses, attempt a more radical change.

Context hygiene for long runs: redirect noisy command output to a log file (> run.log 2>&1, never tee into your context); extract only the metric/verdict lines you need; read full logs only on failure (tail -n 50 run.log). The durable record is results.tsv + PROGRESS.md + git, not the chat. If compaction happens, those files re-orient you.
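The same hygiene can be applied from Python when a command is launched programmatically. In this sketch all output goes to a log file and only lines matching a pattern come back; the default pattern `latency|quality_ok` is an assumption about what the driver prints, and `run_quietly` is a hypothetical helper name.

```python
import re
import subprocess


def run_quietly(cmd, log_path, keep=r"latency|quality_ok"):
    """Run cmd with stdout and stderr sent to log_path; return matching lines.

    The default `keep` pattern is a guess at the driver's metric lines.
    """
    with open(log_path, "w", encoding="utf-8") as log:
        proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT)
    pattern = re.compile(keep)
    with open(log_path, encoding="utf-8", errors="replace") as log:
        lines = [line.rstrip("\n") for line in log if pattern.search(line)]
    return proc.returncode, lines
```

On a non-zero return code, read the tail of the log file; otherwise the handful of returned lines is all that needs to enter the context.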

Legitimate termination (any one):

Target met AND verified: at/under goals.json target, gate green, reproduced once, and a fresh-context grader (a separate subagent that sees only the diff + the gate) confirms the win. Then summarize the milestone ladder in PROGRESS.md and stop.
No-progress → escalate, not quit: if 8 experiments pass with no gate-passing improvement, write a summary of what's exhausted and what's left to PROGRESS.md and escalate to the human. (A Stop hook may gate end-of-turn on a logged experiment; note Claude Code overrides a Stop hook after 8 consecutive blocks, so it must reflect real progress, not "loop forever.")
Budget cap: the experiment/wall-clock/$ cap in goals.json is the hard backstop.
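The no-progress condition can be computed from the status column of the ledger. A minimal sketch, assuming statuses are listed oldest first and that the baseline run is recorded as a keep; both function names are hypothetical.

```python
def experiments_since_last_keep(statuses):
    """Count trailing experiments after the most recent "keep"."""
    count = 0
    for status in reversed(statuses):
        if status == "keep":
            break
        count += 1
    return count


def should_escalate(statuses, limit=8):
    """True once `limit` experiments have passed with no kept improvement."""
    return experiments_since_last_keep(statuses) >= limit
```

When `should_escalate` fires, the right move is the written summary and a hand-off to the human, not a silent stop.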

Watchdogs: a single experiment should take roughly one compile + a few timed runs; if a run hangs far beyond that, kill it and log a crash. If a run crashes on something trivial (typo, missing import), fix and re-run; if the idea is fundamentally broken, log crash and move on after a few attempts.
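A watchdog of this kind can be built on the timeout support in Python's subprocess module. The sketch below is one possible shape; the status strings and the helper name are hypothetical, and the timeout should be sized to "one compile + a few timed runs" for the machine at hand.

```python
import subprocess


def run_with_watchdog(cmd, timeout_s):
    """Run cmd; kill it and report a crash if it exceeds timeout_s seconds.

    Status strings are illustrative, not the driver's vocabulary.
    """
    try:
        # On timeout, subprocess.run kills the child before raising.
        proc = subprocess.run(cmd, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return "crash_timeout"
    return "ok" if proc.returncode == 0 else "crash_exit"
```

Note that only the direct child is killed; if an experiment spawns its own worker processes, those need separate cleanup.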

9. Provenance (what this synthesizes)

Karpathy autoresearch: two-loop separation, single mutable file + frozen scorer, git keep-or-revert ledger, results.tsv, parsimony rule, NEVER-STOP + escalation, context hygiene, baseline-as-run-0. github.com/karpathy/autoresearch
KernelBench (Stanford, arXiv 2502.10517): fast_p (correct AND faster), measure vs a fixed baseline, ≥5 randomized inputs, per-dtype allclose (fp32 1e-4, fp16/bf16 1e-2), CUDA-event timing with warmup + many repeats.
Sakana robust-kbench (arXiv 2509.14279): the reward-hacking catalog + defenses (multiple inputs/seeds/init, forward+backward, unseen-shape generalization, run candidate before reference, copy inputs, contamination filter, "assume residual hacking").
Meta KernelAgent: beam over top-K, hardware-guided diagnose→fix, the per-round Reflexion schema (diagnosis_correct / fix_effective / lessons / avoid / try-next), regression guard, early-stop on stagnation.
Anthropic long-running-agent guidance: frozen state as JSON, evidence-not-assertions, fresh-context grader, durable on-disk log as source of truth, compaction steering, no-progress→escalate. anthropic.com/engineering/effective-harnesses-for-long-running-agents
