Short, always-loaded contract. The full procedure is program.md (the runbook),
read it before you start. Target spec + thresholds + stop conditions live in goals.json
(frozen JSON). Current best + dead-end log live in PROGRESS.md.
Your job: minimize Wan 2.2 VAE decode latency on a single H100 without failing the
quality gate, by running the keep-or-revert loop in program.md. You do not stop on
your own; only a goals.json stop condition ends the loop.
export CUDA_VISIBLE_DEVICES=<free gpu> # never benchmark on a busy GPU
python harness/make_reference.py # once: build the frozen reference
python harness/run_experiment.py --desc "..." # after EVERY edit to optimize.py
python harness/verify.py # quality gate only (exit 0 = pass)
python harness/bench.py # latency only
python harness/run_profile.py --exp <n> # detailed warmup+profiler (auto-run on every KEEP)Every KEEP auto-runs profile.py and appends the breakdown (top ops + GPU-util/idle) to
PROGRESS.md. The next hypothesis comes from that profile, not a guess, if GPU util is
low the decode is host-bound (recover idle time); if high, attack the top op.
run_experiment.py runs verify.py (quality vs the frozen fp32 reference) then bench.py
(latency). A result is KEPT only if it PASSES the gate AND is faster than the current best;
everything else is auto-reverted. The frozen evaluator decides, you do not grade your own
work. Project success = a kept result at/under the goals.json target, confirmed by a
fresh-context grader.
✅ Always: edit only optimize.py; run the driver after every edit; profile before
guessing; append a reflexion entry to PROGRESS.md (what you tried / why it failed); prefer
the simplest change that moves the metric; paste the driver's real output as evidence.
goals.json cap.
🚫 NEVER (YOU MUST NOT):
- Edit or delete
goals.json, anything inharness/, or anything inmodel/. Editing the evaluator to pass is reward hacking, forbidden. - Special-case the benchmark input, cache outputs across runs, mutate the input in place, or
wrap a wrong fast path in
try/exceptwith a slow fallback. - Mark/claim a result good when
quality_okis false. - Claim "faster"/"done" without pasting the driver output.
- Stop the loop to ask whether to continue. "Hard" or "slow" is not a stop condition.
Stuck for 8 experiments → summarize in
PROGRESS.mdand escalate; never silently quit.
The decode must produce the same video. "Faster" only counts if the gate stays green,
enforced, not hoped for. See program.md §7 for the full anti-reward-hacking rules.