paiml · noahgift · May 20, 2026 · May 20, 2026
diff --git a/evidence/phase-6/v1004-3knob-dispatch-recipe-2026-05-20.md b/evidence/phase-6/v1004-3knob-dispatch-recipe-2026-05-20.md
@@ -0,0 +1,122 @@
+# V1_004 dispatch recipe — using the post-M32d 3-knob toolkit
+
+[Top spec](../../docs/specifications/claude-code-parity-apr-poc.md) | [Phase 6 plan](../../docs/specifications/phase-6-under-contract-bench-plan.md) | [M32d shipped](m32d-shipped-2026-05-20.md) | [M32d bench pattern](m32d-bench-pattern-2026-05-20.md)
+
+**Status (2026-05-20, M288)**: Recipe doc for the next-step V1_004 bench dispatches. Post-M32d (paiml/aprender#1832 MERGED), three independent knobs are defined for `QuantizedGenerateConfig` to tune the 30B-MoE student's behavior under contract; two of them (sampling and repetition penalty) have implementation PRs, while streaming SSE is contract-only so far. The M287 evidence (uniform `driver_error, turns_before_error: 4` across 6 fixtures) is the greedy-decoding baseline. This doc records the recommended sequence of follow-up dispatches to find a configuration that lands `oracle_passed` on at least one fixture.
+
+## The 3-knob toolkit
+
+| Knob | Contract | Code PR | Mechanism | Tuning intuition |
+|---|---|---|---|---|
+| **Sampling** | qwen3-moe-sampling-v1 | aprender#1842 | temperature/top_k/top_p on logits | Lower temperature → more confident next-token; higher temperature → more exploration. `top_k=50, top_p=0.95` is the standard chat preset. |
+| **Repetition penalty** | qwen3-moe-repetition-penalty-v1 | aprender#1844 | down-weights recently-generated tokens | `repeat_penalty=1.1-1.3` breaks textual loops. Tighter `repeat_last_n` (e.g. 32) penalizes only recent context; wider (e.g. 128) penalizes more aggressively. |
+| **Streaming SSE** | qwen3-moe-streaming-sse-v1 | aprender#1835 (contract only; impl not yet shipped) | per-token SSE emit | Doesn't change generated content, only delivery. Useful for UX in interactive chat; orthogonal to V1_004 discharge. |
+
+## Why each knob might help V1_004
+
+Empirical observation from [M287](m32d-bench-pattern-2026-05-20.md): under greedy decoding, the 30B-MoE generates ~1024 tokens of exploratory commentary per turn instead of tool calls. The bench hits the 900s per-turn timeout after 4 such commentary turns. Each knob addresses a different theory of why:
+
+- **Sampling at low temperature** (e.g. 0.3): keeps most probability mass on the model's most-confident next token while removing greedy's strict determinism. Under greedy argmax, a near-flat distribution is decided by tiny logit differences or by tie-break (which favors the BPE-numerically-lower index, often a comment/text token), so the same commentary path is chosen every turn. Sampling at temperature 0.3 still strongly favors the top-1 token but lets a near-tied alternative (possibly an action token) win some of the time.
+
+- **Repetition penalty** (e.g. 1.2): the M287 turn-1 example shows the model produced the SAME Rust snippet 3 times in one turn. With penalty=1.2 applied to the last 64 tokens, the second instance of that snippet would have its tokens down-weighted, forcing the model off the textual loop and possibly into a tool-call.
+
+- **Top-k=50 with top-p=0.95**: typical chat decoding parameters, likely close to those the model was tuned and evaluated with. The greedy baseline forces argmax everywhere, which the model may not be calibrated for.
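+
+The effect of temperature and of a repetition penalty can be written out explicitly. This is a sketch under assumptions: the softmax-with-temperature form is standard, but the penalty rule shown is the common divide-or-multiply formulation and is not confirmed to be what aprender#1844 implements. The symbols `z_i` (logit of token i), `T` (temperature), `\theta` (penalty) and `W` (the window of the last `repeat_last_n` generated tokens) are notation introduced here.
+
+```latex
+p_i(T) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)},
+\qquad
+\frac{p_a(T)}{p_b(T)} = \exp\!\left(\frac{z_a - z_b}{T}\right)
+
+% Worked example for a logit gap z_a - z_b = 1:
+%   T = 1.0 : ratio = e^{1}     \approx 2.72
+%   T = 0.3 : ratio = e^{1/0.3} \approx 28.0
+
+% Assumed penalty rule, applied only to tokens i in the window W:
+z_i' = \begin{cases} z_i / \theta & \text{if } z_i > 0 \\ z_i \cdot \theta & \text{if } z_i \le 0 \end{cases}
+\qquad (i \in W,\ \theta > 1)
+```
+
+Greedy decoding is the `T → 0` limit of the first expression, so temperature 0.3 sits between greedy and unscaled (`T = 1`) sampling.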
+
+## Recommended dispatch sequence
+
+Each sub-bench takes roughly 10-15 hours of wall-clock time. The operator dispatches one at a time and inspects `scores.json` before moving on. Stop early if V1_004 discharges.
+
+### Sub-bench A: Sampling alone (temperature 0.3, top-k/top-p at chat defaults)
+
+```bash
+# Assumes the companion env-var plumbing has shipped, so the bench passes
+# gen_config to apr code, which seeds QuantizedGenerateConfig (see note below).
+APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
+PHASE6_COMPLIANCE_ENFORCED=1 \
+PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
+APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \
+APR_AGENT_MAX_TOKENS_CAP=1024 \
+APR_AGENT_TEMPERATURE=0.3 \
+APR_AGENT_TOP_K=50 \
+APR_AGENT_TOP_P=0.95 \
+bash scripts/phase-6-bench.sh 2>&1 | tee /tmp/phase-6-30b-temp-0.3.log
+```
+
+Compare result vs the M287 greedy baseline (`evidence/under-contract/scores.json`):
+- If `turns_before_error` rises above 4 OR any fixture `oracle_passed`: sampling helps. Lock in temp=0.3 and move to sub-bench B.
+- If the pattern is unchanged (`driver_error turns=4` across all fixtures): sampling alone doesn't break the loop. Move to sub-bench C (rep penalty).
+
+(Note: `APR_AGENT_TEMPERATURE` + `APR_AGENT_TOP_K` + `APR_AGENT_TOP_P` env-var plumbing in the bench script + apr code dispatcher is NOT YET shipped. It's the bench-side companion work for aprender#1842. If those env vars aren't read yet, the bench will use greedy regardless. Operator should verify before dispatching: `grep APR_AGENT_TEMPERATURE scripts/phase-6-bench.sh` + check apr code's env var read logic.)
+
+### Sub-bench B: Sampling + repetition penalty
+
+```bash
+# Same as A plus:
+APR_AGENT_REPEAT_PENALTY=1.2 \
+APR_AGENT_REPEAT_LAST_N=64 \
+bash scripts/phase-6-bench.sh
+```
+
+Use IF sub-bench A showed any improvement (turn count or oracle_passed). Combines the two complementary effects: sampling distributes probability away from low-confidence tokens; rep penalty kills loops.
+
+### Sub-bench C: Repetition penalty alone
+
+```bash
+# Greedy decoding + repetition penalty only. Export the base env from sub-bench A first
+# (APR_MODEL, PHASE6_*, timeouts, APR_AGENT_MAX_TOKENS_CAP); leave the three sampling vars unset.
+APR_AGENT_REPEAT_PENALTY=1.2 \
+APR_AGENT_REPEAT_LAST_N=64 \
+bash scripts/phase-6-bench.sh
+```
+
+Use IF sub-bench A showed no improvement. Rep penalty alone (with greedy) is the minimal change from the M287 baseline. If THIS shows improvement, the bottleneck was specifically text-loop-driven, not exploration-driven.
+
+### Sub-bench D: Sampling + repetition penalty + lower max_tokens cap
+
+```bash
+# Most aggressive: combine everything + force shorter responses
+APR_AGENT_MAX_TOKENS_CAP=256 \
+APR_AGENT_TEMPERATURE=0.3 APR_AGENT_TOP_K=50 APR_AGENT_TOP_P=0.95 \
+APR_AGENT_REPEAT_PENALTY=1.3 APR_AGENT_REPEAT_LAST_N=128 \
+bash scripts/phase-6-bench.sh
+```
+
+Use if A-C all fail. The 256-token cap forces the model to either tool-call quickly or end each turn sooner, which fits more turns into the per-fixture wall-clock budget.
+
+## Outcome interpretation
+
+For each sub-bench's `scores.json`, look at:
+
+1. **`student_pass_rate`**: ANY value > 0 discharges V1_004. Stop and ship the M289 discharge celebration.
+2. **`turns_before_error` distribution**: if 90% of fixtures still show `4`, the loop isn't broken. If some show 8-15, the knob shifted behavior but didn't land oracle.
+3. **`oracle_failed_after_max_turns` vs `driver_error`**: if class shifts from `driver_error` (timeout) to `oracle_failed_after_max_turns` (reached max_turns but never converged), the model is acting faster but still not solving. Try higher max_turns or smaller model.
+4. **`compliance_cost_ratio`** (treatment/control): if both treatment and control are still at 0%, the compound effect (sampling + contract) cannot be measured without a student that can actually solve under either regime.
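+
+The `turns_before_error` check in item 2 can be done without knowing the full layout of `scores.json`. The helper below is a hypothetical sketch, not part of the bench tooling: it assumes only that the key appears literally as `"turns_before_error": <integer>` somewhere in the file, and the function name is made up for illustration.
+
+```shell
+# Hypothetical helper: tally how many fixtures hit each turns_before_error value.
+# Assumes the key is serialized literally as "turns_before_error": <int>.
+turns_histogram() {
+  grep -oE '"turns_before_error"[[:space:]]*:[[:space:]]*[0-9]+' "$1" \
+    | grep -oE '[0-9]+$' \
+    | sort -n | uniq -c
+}
+# Usage: turns_histogram evidence/under-contract/scores.json
+```
+
+Each output line is a count followed by a turn value, so a single line ending in `4` means the loop is unbroken, while extra lines with larger values mean the knob shifted behavior.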
+
+## Companion-side prerequisites (must ship before dispatching)
+
+These env vars need to flow from bench script → apr code → `QuantizedGenerateConfig`:
+
+- `APR_AGENT_TEMPERATURE` (f32)
+- `APR_AGENT_TOP_K` (usize)
+- `APR_AGENT_TOP_P` (f32)
+- `APR_AGENT_REPEAT_PENALTY` (f32)
+- `APR_AGENT_REPEAT_LAST_N` (usize)
+
+`scripts/phase-6-bench.sh` line 291 passes `--driver-per-turn-timeout=${APR_TIMEOUT_S}` to ccpa-arena-bench. The ccpa-arena-bench → apr code → QuantizedGenerateConfig chain needs corresponding plumbing for the 5 new env vars. If they're not plumbed, the operator dispatches will silently use defaults (greedy/no-penalty).
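+
+Because unplumbed variables fail silently, a pre-flight check before each dispatch may be worthwhile. The sketch below is hypothetical (the function name is invented here): it only reports which of the five names appear nowhere in the files it is given. A hit shows the name is mentioned, not that the value actually reaches `QuantizedGenerateConfig`, so the apr code read logic still needs a manual look.
+
+```shell
+# Hypothetical pre-flight check: list env-var names that no given file mentions.
+# Returns non-zero if at least one name is missing.
+missing_plumbing() {
+  missing=0
+  for var in APR_AGENT_TEMPERATURE APR_AGENT_TOP_K APR_AGENT_TOP_P \
+             APR_AGENT_REPEAT_PENALTY APR_AGENT_REPEAT_LAST_N; do
+    if ! grep -q -- "$var" "$@" 2>/dev/null; then
+      echo "not plumbed: $var"
+      missing=1
+    fi
+  done
+  return "$missing"
+}
+# Usage: missing_plumbing scripts/phase-6-bench.sh || echo "do not dispatch yet"
+```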
+
+Action item: separate companion PR (M289) to add the env-var plumbing. Tracked as "Phase 3" of aprender#1843 (closed, superseded by #1844). The plumbing is small (~50 LOC across 2 files) but is operator-coordinated since the bench bin lives in this repo.
+
+## What this doc is NOT
+
+- NOT a V1_004 discharge celebration — none of these sub-benches have been dispatched yet
+- NOT a binding sequence — operator may choose any subset based on time budget
+- NOT a substitute for the M287 evidence — that pattern (uniform `driver_error turns=4` under greedy) is the baseline these sub-benches compare against
+
+## Cross-references
+
+- [aprender#1832 (M32d)](https://github.com/paiml/aprender/pull/1832) — KV cache; the prerequisite for any of these sub-benches to be tractable
+- [aprender#1842 (sampling impl)](https://github.com/paiml/aprender/pull/1842) — temperature/top_k/top_p in `sample_from_logits`
+- [aprender#1844 (rep-penalty impl)](https://github.com/paiml/aprender/pull/1844) — supersedes #1843; stacked on #1842
+- [aprender#1835 (streaming SSE contract)](https://github.com/paiml/aprender/pull/1835) — third sibling contract; impl not yet shipped (orthogonal to V1_004)
+- [m32d-shipped-2026-05-20.md](m32d-shipped-2026-05-20.md) — M286 doc: M32d empirical + V1_004 dispatch readiness
+- [m32d-bench-pattern-2026-05-20.md](m32d-bench-pattern-2026-05-20.md) — M287 doc: post-M32d bench pattern (greedy baseline)