Merged
122 changes: 122 additions & 0 deletions evidence/phase-6/v1004-3knob-dispatch-recipe-2026-05-20.md
@@ -0,0 +1,122 @@
# V1_004 dispatch recipe — using the post-M32d 3-knob toolkit

[Top spec](../../docs/specifications/claude-code-parity-apr-poc.md) | [Phase 6 plan](../../docs/specifications/phase-6-under-contract-bench-plan.md) | [M32d shipped](m32d-shipped-2026-05-20.md) | [M32d bench pattern](m32d-bench-pattern-2026-05-20.md)

**Status (2026-05-20, M288)**: Recipe doc for the next-step V1_004 bench dispatches. Post-M32d (paiml/aprender#1832 MERGED), three independent knobs are now available in `QuantizedGenerateConfig` to tune the 30B-MoE student's behavior under contract. The M287 evidence (uniform `driver_error, turns_before_error: 4` across 6 fixtures) is the greedy-decoding baseline. This doc records the recommended sequence of follow-up dispatches to find a configuration that lands `oracle_passed` on at least one fixture.

## The 3-knob toolkit

| Knob | Contract | Code PR | Mechanism | Tuning intuition |
|---|---|---|---|---|
| **Sampling** | qwen3-moe-sampling-v1 | aprender#1842 | temperature/top_k/top_p on logits | Lower temperature → more confident next-token; higher temperature → more exploration. `top_k=50, top_p=0.95` is the standard chat preset. |
| **Repetition penalty** | qwen3-moe-repetition-penalty-v1 | aprender#1844 | down-weights recently-generated tokens | `repeat_penalty=1.1-1.3` breaks textual loops. Tighter `repeat_last_n` (e.g. 32) penalizes only recent context; wider (e.g. 128) penalizes more aggressively. |
| **Streaming SSE** | qwen3-moe-streaming-sse-v1 | aprender#1835 (contract only; impl not yet shipped) | per-token SSE emit | Doesn't change generated content, only delivery. Useful for UX in interactive chat; orthogonal to V1_004 discharge. |
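
To make the sampling row concrete, here is a minimal Rust sketch of how temperature, top-k, and top-p are commonly composed before a token is drawn. It is not taken from aprender#1842: the function name `filtered_distribution`, the order of the filters, and the toy logits are all assumptions for illustration.

```rust
/// Hypothetical helper (not the aprender API): turn raw logits into the
/// distribution a sampler would draw from. Returns (token_index, probability)
/// pairs in descending probability order, renormalized to sum to 1.
fn filtered_distribution(
    logits: &[f32],
    temperature: f32,
    top_k: usize,
    top_p: f32,
) -> Vec<(usize, f32)> {
    assert!(!logits.is_empty(), "need at least one logit");
    assert!(temperature > 0.0, "temperature 0 is greedy; handle it separately");

    // 1. Temperature: divide logits by T, then softmax (max-subtracted for stability).
    let scaled: Vec<f32> = logits.iter().map(|l| l / temperature).collect();
    let max = scaled.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = scaled.iter().map(|s| (s - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    let mut probs: Vec<(usize, f32)> =
        exps.iter().enumerate().map(|(i, e)| (i, e / sum)).collect();

    // 2. Sort by descending probability (stable sort: ties keep the lower index first).
    probs.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());

    // 3. Top-k: keep at most k candidates.
    probs.truncate(top_k.max(1));

    // 4. Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    let mut cumulative = 0.0_f32;
    let mut keep = 0;
    for (_, p) in &probs {
        cumulative += *p;
        keep += 1;
        if cumulative >= top_p {
            break;
        }
    }
    probs.truncate(keep);

    // 5. Renormalize the survivors.
    let total: f32 = probs.iter().map(|(_, p)| *p).sum();
    probs.iter().map(|&(i, p)| (i, p / total)).collect()
}

fn main() {
    // Toy logits: tokens 0 and 1 are nearly tied, tokens 2 and 3 form the tail.
    let logits = [2.0_f32, 1.9, 0.5, -1.0];
    for t in [1.0_f32, 0.3] {
        let dist = filtered_distribution(&logits, t, 50, 0.95);
        println!("T={t}: {dist:?}");
    }
}
```

With these toy logits both temperatures rank token 0 first. Lowering the temperature to 0.3 shrinks the nucleus from three candidates to two, but the close runner-up still keeps around 40% of the mass, so the run is no longer deterministic the way greedy is.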

## Why each knob might help V1_004

Empirical observation from [M287](m32d-bench-pattern-2026-05-20.md): under greedy decoding, the 30B-MoE generates ~1024 tokens of exploratory commentary per turn instead of tool calls. The bench hits the 900s per-turn timeout after 4 such commentary turns. Each knob addresses a different theory of why:

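- **Sampling at low temperature** (e.g. 0.3): replaces deterministic argmax with a draw from a sharpened distribution. Dividing the logits by 0.3 widens the gap between candidates relative to temperature 1.0 but never reorders them, so the top-1 token is the same one greedy would pick. The theory under test is that greedy handles near-flat distributions badly: argmax breaks ties by index (favoring the numerically lower BPE index, often a comment/text token), whereas sampling gives a close runner-up (often an action token) a real chance of being chosen while still suppressing the low-probability tail.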

- **Repetition penalty** (e.g. 1.2): the M287 turn-1 example shows the model produced the SAME Rust snippet 3 times in one turn. With penalty=1.2 applied to the last 64 tokens, the second instance of that snippet would have its tokens down-weighted, forcing the model off the textual loop and possibly into a tool-call.

- **Top-k=50 with top-p=0.95**: typical chat decoding parameters, close to what the model was likely tuned and evaluated with. The greedy baseline forces argmax everywhere, a regime the model may not be calibrated for.
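
The repetition-penalty theory can be illustrated with a small Rust sketch. It assumes the widely used convention in which a penalized positive logit is divided by the penalty and a negative one is multiplied by it; whether aprender#1844 follows exactly this rule is an assumption, and `apply_repeat_penalty`, `argmax`, and the four-token vocabulary are hypothetical.

```rust
use std::collections::HashSet;

/// Hypothetical helper: down-weight every token that appears in the last
/// `repeat_last_n` generated tokens. Each distinct token is penalized once.
fn apply_repeat_penalty(
    logits: &mut [f32],
    generated: &[usize],
    penalty: f32,
    repeat_last_n: usize,
) {
    let start = generated.len().saturating_sub(repeat_last_n);
    let recent: HashSet<usize> = generated[start..].iter().copied().collect();
    for tok in recent {
        if let Some(l) = logits.get_mut(tok) {
            // Positive logits shrink and negative logits grow more negative,
            // so the token's probability drops in both cases (for penalty > 1).
            *l = if *l > 0.0 { *l / penalty } else { *l * penalty };
        }
    }
}

/// Greedy pick: the first maximum wins, so exact ties resolve to the lower index.
fn argmax(logits: &[f32]) -> usize {
    let mut best = 0;
    for (i, l) in logits.iter().enumerate() {
        if *l > logits[best] {
            best = i;
        }
    }
    best
}

fn main() {
    // Toy vocabulary (assumption): token 2 belongs to a looping snippet,
    // token 3 would start a tool call.
    let mut logits = vec![0.1_f32, -0.5, 3.0, 2.8];
    let generated = [2, 0, 2, 1, 2];

    println!("greedy pick before penalty: {}", argmax(&logits));
    apply_repeat_penalty(&mut logits, &generated, 1.2, 64);
    println!("greedy pick after penalty:  {}", argmax(&logits));
}
```

In this toy case the looping token's logit drops from 3.0 to 2.5, so greedy switches to the untouched token 3. With a window too narrow to include the repeated token, the penalty never fires, which is why `repeat_last_n` matters as much as the penalty value.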

## Recommended dispatch sequence

Each sub-bench takes roughly 10-15 hours of wall-clock time. The operator dispatches one at a time and inspects `scores.json` before moving on. Stop early if V1_004 discharges.

### Sub-bench A: Sampling alone (temperature 0.3 with the top-k/top-p preset)

```bash
# Assumes the bench-side env-var plumbing for the sampling knobs has shipped
# (the bench passes gen_config to apr code, which seeds QuantizedGenerateConfig).
# Until it has, the three APR_AGENT_* sampling vars are ignored and the run is greedy.
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 APR_AGENT_HTTP_TIMEOUT_S=1500 \
APR_AGENT_MAX_TOKENS_CAP=1024 \
APR_AGENT_TEMPERATURE=0.3 \
APR_AGENT_TOP_K=50 \
APR_AGENT_TOP_P=0.95 \
bash scripts/phase-6-bench.sh 2>&1 | tee /tmp/phase-6-30b-temp-0.3.log
```

Compare result vs the M287 greedy baseline (`evidence/under-contract/scores.json`):
- If `turns_before_error` rises above 4 OR any fixture reports `oracle_passed`: sampling helps. Lock in temp=0.3 and move to sub-bench B.
- If the pattern is unchanged (`driver_error` with `turns_before_error: 4` on every fixture): sampling alone doesn't break the loop. Move to sub-bench C (rep penalty).

(Note: the `APR_AGENT_TEMPERATURE`, `APR_AGENT_TOP_K`, and `APR_AGENT_TOP_P` env-var plumbing in the bench script and the apr code dispatcher is NOT YET shipped. It is the bench-side companion work for aprender#1842. Until those env vars are read, the bench will use greedy decoding regardless of what is set. Before dispatching, the operator should verify with `grep APR_AGENT_TEMPERATURE scripts/phase-6-bench.sh` and check apr code's env-var read logic.)

### Sub-bench B: Sampling + repetition penalty

```bash
# Same as A plus:
APR_AGENT_REPEAT_PENALTY=1.2 \
APR_AGENT_REPEAT_LAST_N=64 \
bash scripts/phase-6-bench.sh
```

Use IF sub-bench A showed any improvement (turn count or oracle_passed). Combines the two complementary effects: sampling distributes probability away from low-confidence tokens; rep penalty kills loops.

### Sub-bench C: Repetition penalty alone

```bash
# Greedy decoding + repetition penalty only.
# "..." stands for the remaining sub-bench A settings, minus the three sampling vars.
APR_MODEL=... PHASE6_COMPLIANCE_ENFORCED=1 ... \
APR_AGENT_REPEAT_PENALTY=1.2 \
APR_AGENT_REPEAT_LAST_N=64 \
bash scripts/phase-6-bench.sh
```

Use IF sub-bench A showed no improvement. Rep penalty alone (with greedy) is the minimal change from the M287 baseline. If THIS shows improvement, the bottleneck was specifically text-loop-driven, not exploration-driven.

### Sub-bench D: Sampling + repetition penalty + lower max_tokens cap

```bash
# Most aggressive: sampling + repetition penalty + shorter responses.
# Same model/compliance/timeout settings as A, with these overrides:
APR_AGENT_MAX_TOKENS_CAP=256 \
APR_AGENT_TEMPERATURE=0.3 APR_AGENT_TOP_K=50 APR_AGENT_TOP_P=0.95 \
APR_AGENT_REPEAT_PENALTY=1.3 APR_AGENT_REPEAT_LAST_N=128 \
bash scripts/phase-6-bench.sh
```

Use if A-C all fail. The 256-token cap forces the model to either tool-call quickly or end each commentary turn sooner, which allows more turns within the per-fixture wall budget.

## Outcome interpretation

For each sub-bench's `scores.json`, look at:

1. **`student_pass_rate`**: ANY value > 0 discharges V1_004. Stop and ship the M289 discharge celebration.
2. **`turns_before_error` distribution**: if 90% of fixtures still show `4`, the loop isn't broken. If some show 8-15, the knob shifted behavior but didn't land oracle.
3. **`oracle_failed_after_max_turns` vs `driver_error`**: if class shifts from `driver_error` (timeout) to `oracle_failed_after_max_turns` (reached max_turns but never converged), the model is acting faster but still not solving. Try higher max_turns or smaller model.
4. **`compliance_cost_ratio`** (treatment/control): if both treatment and control are still at 0%, the compound effect (sampling + contract) is unmeasurable without a student that can actually solve under either regime.
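
The first three checks can be read as a small decision procedure. The Rust sketch below is one possible encoding, not a parser for the real `scores.json`: the `FixtureResult` shape, the `Verdict` names, and the order in which the rules are applied are assumptions; only the outcome class names and the 90% threshold come from the list above.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome {
    OraclePassed,
    DriverError,
    OracleFailedAfterMaxTurns,
}

/// Hypothetical per-fixture record; the real scores.json layout may differ.
struct FixtureResult {
    outcome: Outcome,
    turns: u32,
}

#[derive(Debug, PartialEq)]
enum Verdict {
    /// At least one fixture passed: student_pass_rate > 0, V1_004 discharges.
    Discharged,
    /// 90% or more of fixtures still time out at turn 4.
    LoopUnbroken,
    /// Some fixtures now reach max_turns without converging.
    FasterButUnsolved,
    /// Turn counts moved, but no fixture passed or reached max_turns.
    BehaviorShifted,
}

fn triage(results: &[FixtureResult]) -> Verdict {
    assert!(!results.is_empty(), "need at least one fixture result");
    let total = results.len() as f32;

    // Check 1: any pass discharges V1_004.
    if results.iter().any(|r| r.outcome == Outcome::OraclePassed) {
        return Verdict::Discharged;
    }

    // Check 2: is the M287 pattern (driver_error at turn 4) still dominant?
    let stuck = results
        .iter()
        .filter(|r| r.outcome == Outcome::DriverError && r.turns == 4)
        .count() as f32;
    if stuck / total >= 0.9 {
        return Verdict::LoopUnbroken;
    }

    // Check 3: did the failure class shift from timeout to max-turns?
    if results
        .iter()
        .any(|r| r.outcome == Outcome::OracleFailedAfterMaxTurns)
    {
        return Verdict::FasterButUnsolved;
    }

    Verdict::BehaviorShifted
}

fn main() {
    // The M287 greedy baseline: six fixtures, all driver_error at turn 4.
    let baseline: Vec<FixtureResult> = (0..6)
        .map(|_| FixtureResult { outcome: Outcome::DriverError, turns: 4 })
        .collect();
    println!("{:?}", triage(&baseline));
}
```

Applied to the M287 baseline this yields `LoopUnbroken`, which is the reference point each sub-bench is compared against.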

## Companion-side prerequisites (must ship before dispatching)

These env vars need to flow from bench script → apr code → `QuantizedGenerateConfig`:

- `APR_AGENT_TEMPERATURE` (f32)
- `APR_AGENT_TOP_K` (usize)
- `APR_AGENT_TOP_P` (f32)
- `APR_AGENT_REPEAT_PENALTY` (f32)
- `APR_AGENT_REPEAT_LAST_N` (usize)

`scripts/phase-6-bench.sh` line 291 passes `--driver-per-turn-timeout=${APR_TIMEOUT_S}` to ccpa-arena-bench. The ccpa-arena-bench → apr code → QuantizedGenerateConfig chain needs corresponding plumbing for the 5 new env vars. If they're not plumbed, the operator dispatches will silently use defaults (greedy/no-penalty).
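
For orientation, the receiving end of that plumbing could look roughly like the Rust sketch below. It is not the apr code implementation: the `SamplingOverrides` struct, its field names, and the default values (chosen here to mean greedy with no penalty) are assumptions, and the real `QuantizedGenerateConfig` may be shaped differently. Only the five env-var names and their types come from the list above.

```rust
use std::env;
use std::str::FromStr;

/// Hypothetical stand-in for the relevant QuantizedGenerateConfig fields.
#[derive(Debug, Clone, PartialEq)]
struct SamplingOverrides {
    temperature: f32,    // assumed: 0.0 means greedy
    top_k: usize,        // assumed: 0 means disabled
    top_p: f32,          // assumed: 1.0 means disabled
    repeat_penalty: f32, // assumed: 1.0 means no penalty
    repeat_last_n: usize,
}

/// Unset variable -> default. Set but unparsable -> error, so a typo cannot
/// silently fall back to greedy.
fn parse_or<T, F>(get: &F, key: &str, default: T) -> Result<T, String>
where
    T: FromStr,
    F: Fn(&str) -> Option<String>,
{
    match get(key) {
        None => Ok(default),
        Some(raw) => raw
            .trim()
            .parse::<T>()
            .map_err(|_| format!("{key}={raw:?} is not a valid value")),
    }
}

/// `get` abstracts the environment lookup so the logic can be tested
/// without mutating the process environment.
fn overrides_from<F>(get: F) -> Result<SamplingOverrides, String>
where
    F: Fn(&str) -> Option<String>,
{
    Ok(SamplingOverrides {
        temperature: parse_or(&get, "APR_AGENT_TEMPERATURE", 0.0)?,
        top_k: parse_or(&get, "APR_AGENT_TOP_K", 0)?,
        top_p: parse_or(&get, "APR_AGENT_TOP_P", 1.0)?,
        repeat_penalty: parse_or(&get, "APR_AGENT_REPEAT_PENALTY", 1.0)?,
        repeat_last_n: parse_or(&get, "APR_AGENT_REPEAT_LAST_N", 64)?,
    })
}

fn main() {
    match overrides_from(|key| env::var(key).ok()) {
        // Printing the effective config lets the operator confirm the knobs arrived.
        Ok(cfg) => println!("effective sampling config: {cfg:?}"),
        Err(msg) => {
            eprintln!("refusing to dispatch: {msg}");
            std::process::exit(2);
        }
    }
}
```

Logging the effective config at startup is one way to guard against the silent-default failure mode: a 10-15 hour run that printed `temperature: 0.0` in its first line can be cancelled immediately.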

Action item: separate companion PR (M289) to add the env-var plumbing. Tracked as "Phase 3" of aprender#1843 (closed, superseded by #1844). The plumbing is small (~50 LOC across 2 files) but is operator-coordinated since the bench bin lives in this repo.

## What this doc is NOT

- NOT a V1_004 discharge celebration — none of these sub-benches have been dispatched yet
- NOT a binding sequence — operator may choose any subset based on time budget
- NOT a substitute for the M287 evidence — that pattern (uniform `driver_error turns=4` under greedy) is the baseline these sub-benches compare against

## Cross-references

- [aprender#1832 (M32d)](https://github.com/paiml/aprender/pull/1832) — KV cache; the prerequisite for any of these sub-benches to be tractable
- [aprender#1842 (sampling impl)](https://github.com/paiml/aprender/pull/1842) — temperature/top_k/top_p in `sample_from_logits`
- [aprender#1844 (rep-penalty impl)](https://github.com/paiml/aprender/pull/1844) — supersedes #1843; stacked on #1842
- [aprender#1835 (streaming SSE contract)](https://github.com/paiml/aprender/pull/1835) — third sibling contract; impl not yet shipped (orthogonal to V1_004)
- [m32d-shipped-2026-05-20.md](m32d-shipped-2026-05-20.md) — M286 doc: M32d empirical + V1_004 dispatch readiness
- [m32d-bench-pattern-2026-05-20.md](m32d-bench-pattern-2026-05-20.md) — M287 doc: post-M32d bench pattern (greedy baseline)