89 changes: 89 additions & 0 deletions evidence/phase-6/v1004-3knob-plumbing-shipped-2026-05-20.md
@@ -0,0 +1,89 @@
# V1_004 3-knob plumbing SHIPPED — M288 prerequisites resolved

[M288 dispatch recipe](v1004-3knob-dispatch-recipe-2026-05-20.md) | [M287 bench pattern](m32d-bench-pattern-2026-05-20.md) | [M286 M32d shipped](m32d-shipped-2026-05-20.md)

**Status (2026-05-20, M289)**: Companion-side update. The prerequisite of the [M288 dispatch recipe](v1004-3knob-dispatch-recipe-2026-05-20.md), env-var plumbing for the 3-knob toolkit, is now SHIPPED in paiml/aprender#1846 (open, pending merge). The operator can dispatch the M288 sub-benches (A/B/C/D) as soon as #1846 merges and the apr binary is rebuilt.

## What changed since M288

[M288](v1004-3knob-dispatch-recipe-2026-05-20.md) noted (verbatim):

> Note: `APR_AGENT_TEMPERATURE` + `APR_AGENT_TOP_K` + `APR_AGENT_TOP_P` env-var plumbing in the bench script + apr code dispatcher is NOT YET shipped. It's the bench-side companion work for aprender#1842. If those env vars aren't read yet, the bench will use greedy regardless.

That gap is now closed. paiml/aprender#1846 ships:

1. **`ChatCompletionRequest` extensions**: 5 optional sampling fields, 4 of them new (`top_k`, `repeat_penalty`, `repeat_last_n`, `seed`) plus the existing `top_p`. These are aprender-specific extensions to the OpenAI schema; all are `#[serde(default)]`, so existing clients are unaffected.
2. **`try_qwen3_moe_backend` wire-up**: threads the 5 fields from the HTTP request through as overrides on `QuantizedGenerateConfig::default()`. Previously these were hardcoded to the defaults (greedy).
3. **`AprServeDriver::build_openai_body` env-var reads**: reads 6 env vars (`APR_AGENT_TEMPERATURE`, `APR_AGENT_TOP_K`, `APR_AGENT_TOP_P`, `APR_AGENT_REPEAT_PENALTY`, `APR_AGENT_REPEAT_LAST_N`, `APR_AGENT_SEED`) and includes each one in the HTTP body when it is set. The resulting path: apr code dispatches → apr serve receives JSON → `try_qwen3_moe_backend` consumes → `sample_from_logits` applies.
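
To make the "include when set" behaviour in item 3 concrete, here is a minimal bash sketch. It is not the Rust implementation from #1846: the function name `sampling_fields_json` and the output layout are illustrative assumptions. Only the env-var names and JSON field names come from this doc.

```bash
#!/usr/bin/env bash
# Sketch only: mimics the "emit a field only when its env var is set" rule
# described for AprServeDriver::build_openai_body. The real code is Rust;
# sampling_fields_json is a hypothetical name used for illustration.
sampling_fields_json() {
  # env var name : JSON field name
  local pairs=(
    "APR_AGENT_TEMPERATURE:temperature"
    "APR_AGENT_TOP_K:top_k"
    "APR_AGENT_TOP_P:top_p"
    "APR_AGENT_REPEAT_PENALTY:repeat_penalty"
    "APR_AGENT_REPEAT_LAST_N:repeat_last_n"
    "APR_AGENT_SEED:seed"
  )
  local out="" pair var field value
  for pair in "${pairs[@]}"; do
    var="${pair%%:*}"
    field="${pair##*:}"
    value="${!var:-}"               # indirect expansion; empty when unset
    [ -z "$value" ] && continue     # unset or empty: omit the field entirely
    out+="${out:+, }\"${field}\": ${value}"
  done
  printf '{%s}\n' "$out"
}
```

With only `APR_AGENT_TEMPERATURE=0.3` and `APR_AGENT_REPEAT_PENALTY=1.2` set, the function prints `{"temperature": 0.3, "repeat_penalty": 1.2}`; with none set it prints `{}`, which corresponds to the server falling back to its defaults (greedy).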

## End-to-end flow (post-#1846)

```
operator shell
  APR_AGENT_TEMPERATURE=0.3 APR_AGENT_REPEAT_PENALTY=1.2 ...
  └─→ scripts/phase-6-bench.sh (inherits env)
      └─→ ccpa-arena-bench --driver-binary apr code ... (env inherited)
          └─→ apr code (env inherited)
              └─→ AprServeDriver::launch (env inherited)
                  └─→ apr serve (HTTP server; subprocess of apr code)
                      └─→ AprServeDriver::build_openai_body (READS env vars)
                          └─→ HTTP POST /v1/chat/completions {temperature: 0.3, ...}
                              └─→ try_qwen3_moe_backend (PARSES request)
                                  └─→ QuantizedGenerateConfig {temperature: 0.3, repeat_penalty: 1.2, ...}
                                      └─→ run_qwen3_moe_generate
                                          └─→ sample_from_logits (APPLIES penalty + sampling)
```

Every link in the chain is now wired. Three steps remain between the operator and the V1_004 discharge measurement:

1. `gh pr merge 1846 --squash --admin` (operator authorization)
2. Rebuild + install apr binary from new main: `cd /home/noah/src/aprender && cargo build --release -p apr-cli --bin apr && cp /mnt/nvme-raid0/targets/aprender/release/apr /home/noah/.local/bin/apr`
3. Dispatch one of the [M288 sub-benches](v1004-3knob-dispatch-recipe-2026-05-20.md#recommended-dispatch-sequence)
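
Before step 3, it may be worth confirming that the installed binary really is the fresh build, since a stale `apr` would silently fall back to greedy. The check below is a sketch: `apr_binary_is_current` is a hypothetical helper name, and its default paths are simply the two paths used in step 2.

```bash
#!/usr/bin/env bash
# Sketch: succeed only when the installed apr binary is byte-identical to the
# freshly built one. apr_binary_is_current is a hypothetical helper; the
# default paths are copied from step 2 above.
apr_binary_is_current() {
  local built="${1:-/mnt/nvme-raid0/targets/aprender/release/apr}"
  local installed="${2:-/home/noah/.local/bin/apr}"
  [ -f "$built" ] && [ -f "$installed" ] && cmp -s "$built" "$installed"
}

if apr_binary_is_current; then
  echo "apr binary matches the release build"
else
  echo "apr binary is stale or missing; repeat step 2" >&2
fi
```

`cmp -s` compares the files byte for byte and reports only through its exit status, so the check does not depend on timestamps or on any `apr` subcommand.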

## Recommended first dispatch

Sub-bench B from M288 (sampling + repetition penalty combined) is the strongest single-shot candidate to break the M287 driver_error pattern:

```bash
APR_MODEL=/home/noah/models/Qwen3-Coder-30B-A3B-Instruct-Q4_K_M.gguf \
PHASE6_COMPLIANCE_ENFORCED=1 \
PHASE6_MAX_TURNS=20 \
PHASE6_WALL_SECONDS=3600 \
APR_TIMEOUT_S=900 \
APR_AGENT_HTTP_TIMEOUT_S=1500 \
APR_AGENT_MAX_TOKENS_CAP=1024 \
APR_AGENT_TEMPERATURE=0.3 \
APR_AGENT_TOP_K=50 \
APR_AGENT_TOP_P=0.95 \
APR_AGENT_REPEAT_PENALTY=1.2 \
APR_AGENT_REPEAT_LAST_N=64 \
bash scripts/phase-6-bench.sh 2>&1 | tee /tmp/phase-6-30b-3knob-b.log
```

Expected wall-clock time: ~10-15 hours. Acceptance: a run that produces `evidence/under-contract/scores.json` with `student_pass_rate > 0` discharges V1_004.
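
The acceptance criterion can be checked mechanically once the bench finishes. The sketch below assumes that `scores.json` contains a field of the form `"student_pass_rate": <number>`; the actual schema is not described in this doc, so treat the pattern as an assumption and adjust it to the real file. `v1004_discharged` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Sketch: exit status 0 when scores.json reports student_pass_rate > 0.
# ASSUMPTION: the file contains `"student_pass_rate": <number>` somewhere;
# the real scores.json schema is not specified in this doc.
v1004_discharged() {
  local scores="${1:-evidence/under-contract/scores.json}"
  local rate
  # Take the first occurrence of the field and keep only its numeric value.
  rate="$(grep -o '"student_pass_rate"[[:space:]]*:[[:space:]]*[0-9.eE+-]*' "$scores" 2>/dev/null \
          | head -n 1 | sed 's/.*:[[:space:]]*//')"
  # A missing file or missing field counts as "not discharged".
  [ -n "$rate" ] && awk -v r="$rate" 'BEGIN { exit !(r + 0 > 0) }'
}
```

Usage: `v1004_discharged && echo "V1_004 discharged"`. The numeric comparison goes through `awk` because the rate is presumably a fraction, which bash integer arithmetic cannot compare.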

## Status reconciliation across all aprender PRs

| PR | Status | Content |
|---|---|---|
| #1832 | ✅ MERGED | M32d KV cache (19× speedup) |
| #1837 | ✅ MERGED | qwen3-moe-sampling-v1 contract |
| #1842 | ✅ MERGED | sampling impl + tests (absorbed #1844 in squash) |
| #1844 | ✅ MERGED | rep-penalty impl + tests |
| #1843 | ❌ CLOSED | superseded by #1844 |
| #1835 | 🟡 OPEN | qwen3-moe-streaming-sse-v1 contract (workspace-test pending) |
| #1846 | 🟡 OPEN | 3-knob HTTP wire-up (this PR's prerequisite) |
| #1829 | ❌ CLOSED | M32d engineer playbook (superseded by in-session #1832) |
| #1826 | ✅ MERGED | M32d scope doc |

## Companion-side state

CCPA M281-M288 (8 docs) track the full upstream arc plus the dispatch recipe. M289 (this doc) closes the loop by noting that the plumbing is shipped.

The currently-running greedy V1_004 bench (started 05:58Z, on fixture ~8-10 by now) is the empirical baseline against which any 3-knob sub-bench compares. Don't dispatch the new bench until the baseline finishes — the comparison is the value.

## What this doc is NOT

- NOT a V1_004 discharge — sub-benches not dispatched yet
- NOT an aprender#1846 merge — operator decision
- NOT a guarantee that the 3-knob toolkit will discharge V1_004 — M287 pattern is the empirical hypothesis-under-test; sampling/penalty may or may not break it. The bench is the measurement.