This is the "skill" the agent runs over and over to minimize Wan 2.2 VAE decode latency without failing the quality gate. You (the human) iterate on THIS file; the agent iterates on `optimize.py`. You program the org, not the Python.

This is a synthesis of the field's working practice: Karpathy's `autoresearch` loop, the GPU kernel-optimization agents (KernelBench, Sakana robust-kbench, Meta KernelAgent, Kevin-32B, GPU MODE), and Anthropic's long-running-agent guidance. Sources are listed at the bottom. `AGENTS.md` is the short always-loaded contract; this file is the full procedure, loaded when you run the loop.
The verifier is adversarial-grade infrastructure, not a test you wrote for yourself. Three independent groups (Sakana, Cognition, GPU MODE) caught their agent cheating the evaluator before it cheated the problem. Sakana's headline 3.13× collapsed to 1.49× once benchmark-gaming kernels were removed. Everything below exists so that the single number we optimize, latency, cannot be moved by anything except a genuinely faster, genuinely correct decode.
| File | Status | Who edits |
|---|---|---|
| `optimize.py` (`build_decoder`) | MUTABLE, the only thing the agent changes | agent |
| `goals.json` | FROZEN target/thresholds/stop-conditions (JSON on purpose: models tamper with JSON far less than Markdown) | human only |
| `harness/{bench,verify,make_reference,run_experiment,_common}.py` | FROZEN evaluator | human only |
| `model/`, `assets/` | FROZEN model + checkpoint + frozen reference | human only |
| `results.tsv` | append-only ledger (driver writes; survives git reset) | driver |
| `leaderboard.md` | generated best-so-far | driver |
| `PROGRESS.md` | journal + dead-end log + per-round reflexion | agent |
| `program.md` (this file) | the methodology | human only |
Two control loops, kept separate: the human optimizes this spec; the agent optimizes the code. The frozen evaluator is the integrity guarantee: the metric can't be gamed because the agent can't touch the scorer.
- Agree a run tag (e.g. `jun25`); the branch `autooptim/<tag>` must not already exist. `git checkout -b autooptim/<tag>`.
- Read `AGENTS.md`, `goals.json`, this file, and `PROGRESS.md` (best-so-far + dead ends).
- Confirm a free GPU (`nvidia-smi`) and `export CUDA_VISIBLE_DEVICES=<free idx>`. Never benchmark on a busy GPU: timings become noise.
- Ensure `assets/reference_fp32.pt` exists, else `python harness/make_reference.py`.
- First experiment is ALWAYS the untouched baseline (establishes 1.0×).
One scalar: mean decode latency (s) of `assets/latent_3s.pt`, subject to the hard
quality gate in `goals.json`. A change that fails the gate has fitness = ∞ (discarded).
The gate compares the decode against the frozen fp32 reference (the same decoder,
slow and exact): shape-match AND `max_abs_diff` ≤ tol AND PSNR ≥ floor. This asks
exactly the right question, "did the optimization change the output?", and nothing
else gets a vote. Speedup is always reported against the fixed baseline (run #0).
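The gate and the fitness rule above can be sketched in a few lines. This is a minimal illustration, not the frozen `harness/verify.py`: it works on flat lists of floats instead of tensors, and the function names, the `peak` signal range used in the PSNR formula, and the example thresholds are all assumptions (the real thresholds live in `goals.json`).

```python
import math


def quality_gate(candidate, reference, tol, psnr_floor, peak=1.0):
    """Shape-match AND max_abs_diff <= tol AND PSNR >= floor.

    Inputs are flat lists of floats; a length check stands in for the
    tensor shape check. `peak` is an assumed signal range.
    """
    if not reference or len(candidate) != len(reference):
        return {"max_abs_diff": math.inf, "psnr": 0.0, "quality_ok": False}
    diffs = [abs(c - r) for c, r in zip(candidate, reference)]
    max_abs_diff = max(diffs)
    mse = sum(d * d for d in diffs) / len(diffs)
    # Identical outputs have zero error, i.e. infinite PSNR.
    psnr = math.inf if mse == 0 else 10.0 * math.log10(peak * peak / mse)
    ok = max_abs_diff <= tol and psnr >= psnr_floor
    return {"max_abs_diff": max_abs_diff, "psnr": psnr, "quality_ok": ok}


def fitness(latency_s, gate):
    """Latency if the gate passes, otherwise infinity (discarded)."""
    return latency_s if gate["quality_ok"] else math.inf


def speedup(baseline_latency_s, latency_s):
    """Speedup is always relative to the fixed baseline (run #0)."""
    return baseline_latency_s / latency_s
```

Note that a fast but wrong decode gets fitness ∞ no matter how low its latency is, so it can never displace a slower correct one.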
LOOP:
1. Orient: read `PROGRESS.md` (best + dead-ends) and the top of `leaderboard.md`.
2. Diagnose from the LAST profile. Every KEEP auto-runs `harness/profile.py` (warmup + detailed profiler), appends the breakdown to `PROGRESS.md`, and saves a full report + a chrome/perfetto trace under `runs/`. READ IT before hypothesizing. Two numbers decide the next move:
   - GPU util HIGH -> COMPUTE-bound: attack the top kernel/op (dtype, layout, fusion).
   - GPU util LOW / idle HIGH -> HOST-bound: the GPU is waiting (launch overhead, CPU<->GPU serialization, Python, gaps). Recover it with fewer/larger kernels, overlap, CUDA graphs, or compiling a bigger region. The biggest wins hide here.

   No current profile? Run it: `CUDA_VISIBLE_DEVICES=$GPU python harness/run_profile.py --exp <n>`
3. Hypothesize ONE change that attacks what the profile shows. Prefer (in order):
   a) the profiled top cost;
   b) an untried technique from "candidate directions" in `PROGRESS.md`;
   c) a variation of the last near-miss.
4. Implement it: edit ONLY `optimize.py`, inside `build_decoder`.
5. Evaluate: `python harness/run_experiment.py --desc "<=10 words"` (the driver runs `verify.py` THEN `bench.py`, logs, and keeps-or-reverts).
6. Record in `PROGRESS.md` (the reflexion entry, §6).
7. On a KEEP: snapshot `optimize.py` to `experiments/exp_NNN_desc.py`.
8. Go to 1.
Search discipline (from the kernel agents): keep a mental beam of the top ~2 configs and explore distinct bottleneck directions rather than re-tuning one knob; when you show yourself prior attempts, order them slowest→fastest so the trajectory is legible.
Every profile names a #1 reducible cost. Your next hypothesis MUST attack that cost. Sort your possible moves into two kinds:
- Knob tweaks: compile mode/flags, `cudnn.benchmark`, autotune options, dtype/layout toggles. Cheap, and quickly exhausted.
- Structural moves: fuse the named hot op into the compiled region, eliminate copies / clones / format-conversions, restructure the hot loop, fold an eager residual into the graph, change how the streaming cache works. Harder, and this is where the remaining wins live once the knobs are spent.

Anti-nibbling rule. A change under ~1% of the current best is within run-to-run noise; it is NOT progress, even if the driver keeps it. If your last 2 kept experiments each gained < 1%, you are nibbling: stop tuning knobs and make a STRUCTURAL change that attacks the current #1 profiled cost, even if it's harder or riskier. Re-read your own "Current bottleneck" note and go after it directly; do not reach for another flag.
If you have genuinely attempted the structural moves the profile points to and only sub-1% noise remains, you have found the floor: reproduce the best once, summarize the milestone ladder in PROGRESS.md, and stop (that is a legitimate result, not a failure).
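The anti-nibbling test is mechanical enough to sketch. The snippet below is a hedged illustration that assumes you have the latencies of the kept experiments in order (for example, pulled from `results.tsv`); the function names and the `noise_floor` and `window` parameters are hypothetical, chosen to mirror the "< 1%" and "last 2 kept" wording above.

```python
def relative_gains(kept_latencies):
    """Fractional gain of each kept experiment over the previous kept best."""
    return [
        (prev - cur) / prev
        for prev, cur in zip(kept_latencies, kept_latencies[1:])
    ]


def is_nibbling(kept_latencies, noise_floor=0.01, window=2):
    """True when the last `window` kept experiments each gained < noise_floor."""
    gains = relative_gains(kept_latencies)
    if len(gains) < window:
        return False  # not enough history to call it nibbling
    return all(g < noise_floor for g in gains[-window:])
```

For instance, kept latencies of 1.0, 0.8, 0.796, 0.793 end in two gains of roughly 0.5% and 0.4%, so `is_nibbling` returns `True` and the next move should be structural.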
- Keep only if it PASSES the gate AND is faster than the current best (driver-enforced).
- Parsimony: all else equal, simpler wins. A change that holds latency flat but deletes complexity is a keep. A 0.5% speedup that adds 30 lines of fragile code is a discard. A ~0% change from removing code: keep.
- Regression guard: the driver reverts `optimize.py` on any non-improving result, so a regression can never accumulate. The best-known-good config is always recoverable.
- Soft resource ceiling: peak memory may rise some for a real latency win, but must not blow up (see `goals.json`); ask the human before exceeding the stated cap.
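The driver-enforced part of these rules (gate first, then strictly faster than the current best, otherwise revert) might look like the sketch below. It is an assumption about shape, not the actual driver: the function name is hypothetical, the suffixes after `discard_` are illustrative, and the parsimony and memory-cap judgments are deliberately left out because they need a human or agent decision.

```python
def keep_or_revert(latency_s, quality_ok, best_latency_s):
    """Return (status, new_best_latency_s) for one experiment.

    Status names after 'discard_' are illustrative, not the driver's own.
    """
    if not quality_ok:
        # A gate failure is a discard regardless of latency.
        return "discard_quality", best_latency_s
    if latency_s >= best_latency_s:
        # Non-improving: revert, so regressions never accumulate.
        return "discard_slower", best_latency_s
    return "keep", latency_s
```

Because the best latency only ever moves on a `keep`, the best-known-good config stays recoverable after any number of failed experiments.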
results.tsv is the append-only ledger (tab-separated; commas break descriptions). The
driver writes one row per experiment: exp · commit · latency_s · std_s · max_abs_diff · psnr · quality_ok · peak_mem_gb · status{keep|discard_*} · desc. It is the source of
truth and survives context compaction and git reset.
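As an illustration of the ledger format, here is a minimal sketch of appending one row with Python's `csv` module using a tab delimiter. The column order follows the list above; the helper name `append_row` and the choice to flatten tabs and newlines in `desc` are assumptions, and the real driver may format values differently.

```python
import csv
import io

COLUMNS = ["exp", "commit", "latency_s", "std_s", "max_abs_diff", "psnr",
           "quality_ok", "peak_mem_gb", "status", "desc"]


def append_row(handle, row):
    """Write one experiment as a tab-separated line.

    Tabs and newlines inside desc are flattened to spaces so a
    description can never split into extra columns or rows.
    """
    desc = str(row["desc"]).replace("\t", " ").replace("\n", " ")
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n")
    writer.writerow(dict(row, desc=desc))


# In-memory demo with made-up values; a real ledger would use
# open("results.tsv", "a", newline="") so rows are only ever appended.
demo = io.StringIO()
append_row(demo, {"exp": 0, "commit": "abc1234", "latency_s": 1.25,
                  "std_s": 0.01, "max_abs_diff": 0.0, "psnr": "inf",
                  "quality_ok": True, "peak_mem_gb": 10.5,
                  "status": "keep", "desc": "baseline, untouched"})
```

With a tab delimiter, the comma in the description needs no quoting, which is the reason the ledger is TSV rather than CSV.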
After every experiment, append a reflexion entry to PROGRESS.md (adapted from Meta KernelAgent's schema, the fields that stop you re-grinding):
exp NNN <change> -> <latency or FAIL+reason>
diagnosis_correct: <was your bottleneck guess right?>
fix_effective: <did it do what you expected?>
lesson: <what this tells us>
avoid / try-next: <what NOT to repeat / what to try next>
Read the dead-end log BEFORE proposing anything. Never retry a logged dead end unless you can state why the prior failure no longer applies, and write that reason down. "Out of ideas" is not a reason to stop (§8); it's a reason to read papers, recombine near-misses, or try a more radical change.
These are the exploits other projects actually observed, and the standing defenses. You must not do any of these, and the harness is built to catch them anyway.
- 🚫 No special-casing the benchmark input, caching outputs across runs, or mutating the input tensor in place. (The harness reloads the latent fresh and compares against a reference frozen on disk, computed by a path that never touches `optimize.py`.)
- 🚫 No wrapping a wrong fast path in `try/except` with a slow correct fallback. No making the decode trivially cheap by skipping real work.
- 🚫 No editing or deleting anything under `harness/`, `model/`, or `goals.json`. Editing the evaluator to pass is the textbook reward hack and is strictly forbidden.
- ✅ Correctness is parity with the frozen reference, not your own judgment. A result with `quality_ok = false` is a failure regardless of latency.
- ✅ Generalization check (hardening): if you suspect a change only helps the one fixed input, it should still pass on an unseen latent of the same shape. Ask the human to add a held-out latent if you're optimizing anything input-shape-specific.
- ✅ Evidence or it didn't happen: never claim "faster" or "done" without pasting the driver's actual output (the latency line + the gate verdict). Assertions don't count.
Assume residual reward-hacking is always possible. If a result looks too good, distrust it and re-verify before logging a keep.
NEVER STOP on your own. Once the loop has begun, do not pause to ask "should I keep
going?" The human may be away and expects autonomous progress until manually stopped or a
`goals.json` stop condition fires. Running out of ideas is not a stop condition. Think
harder: re-read the in-scope files, read the papers referenced in the dead-end log,
combine previous near-misses, attempt a more radical change.
Context hygiene for long runs: redirect noisy command output to a log file
(`> run.log 2>&1`, never `tee` into your context); extract only the metric/verdict lines
you need; read full logs only on failure (`tail -n 50 run.log`). The durable record is
`results.tsv` + `PROGRESS.md` + git, not the chat. If compaction happens, those files
re-orient you.
Legitimate termination (any one):
- Target met AND verified: at/under the `goals.json` target, gate green, reproduced once, and a fresh-context grader (a separate subagent that sees only the diff + the gate) confirms the win. Then summarize the milestone ladder in `PROGRESS.md` and stop.
- No-progress → escalate, not quit: if 8 experiments go by with no gate-passing improvement, write a summary of what's exhausted and what's left to `PROGRESS.md` and escalate to the human. (A Stop hook may gate end-of-turn on a logged experiment; note Claude Code overrides a Stop hook after 8 consecutive blocks, so it must reflect real progress, not "loop forever.")
- Budget cap: the experiment/wall-clock/$ cap in `goals.json` is the hard backstop.
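The no-progress condition can be read straight off the ledger's status column. The following is a small sketch under the assumption that a row with status `keep` is the only kind that counts as a gate-passing improvement; the function names and the `patience` parameter are hypothetical, with the default taken from the "8 experiments" wording above.

```python
def experiments_since_last_keep(statuses):
    """Count trailing non-keep rows, newest last."""
    count = 0
    for status in reversed(statuses):
        if status == "keep":
            break
        count += 1
    return count


def should_escalate(statuses, patience=8):
    """True once `patience` experiments in a row produced no keep."""
    return experiments_since_last_keep(statuses) >= patience
```

A `True` result here means "write the summary and escalate", not "stop silently".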
Watchdogs: a single experiment should take roughly one compile + a few timed runs; if a
run hangs far beyond that, kill it and log a crash. If a run crashes on something trivial
(typo, missing import), fix and re-run; if the idea is fundamentally broken, log crash
and move on after a few attempts.
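The watchdog and the log-to-file habit combine naturally. Below is a minimal sketch, assuming the experiment is launched as a subprocess; `run_with_watchdog`, `tail`, and the `timeout_s` budget are hypothetical names and values, not part of the harness. It relies on `subprocess.run` killing the child process when its `timeout` expires.

```python
import subprocess
import sys


def run_with_watchdog(cmd, log_path, timeout_s):
    """Run cmd with all output sent to log_path.

    Returns 'ok', 'crash' (non-zero exit), or 'timeout' (hung run).
    """
    with open(log_path, "w") as log:
        try:
            proc = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT,
                                  timeout=timeout_s)
        except subprocess.TimeoutExpired:
            # subprocess.run has already killed the child at this point.
            return "timeout"
    return "ok" if proc.returncode == 0 else "crash"


def tail(log_path, n=50):
    """Last n lines of the log, to be read only after a failure."""
    with open(log_path) as log:
        return log.readlines()[-n:]
```

A call such as `run_with_watchdog([sys.executable, "harness/run_experiment.py", "--desc", "..."], "run.log", timeout_s)` keeps the noisy output out of context while still bounding how long a hung run can block the loop.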
- Karpathy `autoresearch`: two-loop separation, single mutable file + frozen scorer, git keep-or-revert ledger, `results.tsv`, parsimony rule, NEVER-STOP + escalation, context hygiene, baseline-as-run-0. github.com/karpathy/autoresearch
- KernelBench (Stanford, arXiv 2502.10517): `fast_p` (correct AND faster), measure vs a fixed baseline, ≥5 randomized inputs, per-dtype `allclose` (fp32 1e-4, fp16/bf16 1e-2), CUDA-event timing with warmup + many repeats.
- Sakana robust-kbench (arXiv 2509.14279): the reward-hacking catalog + defenses (multiple inputs/seeds/init, forward+backward, unseen-shape generalization, run candidate before reference, copy inputs, contamination filter, "assume residual hacking").
- Meta KernelAgent: beam over top-K, hardware-guided diagnose→fix, the per-round Reflexion schema (diagnosis_correct / fix_effective / lessons / avoid / try-next), regression guard, early-stop on stagnation.
- Anthropic long-running-agent guidance: frozen state as JSON, evidence-not-assertions, fresh-context grader, durable on-disk log as source of truth, compaction steering, no-progress→escalate. anthropic.com/engineering/effective-harnesses-for-long-running-agents