docs: coordinator v2 result — pass@1=0.665, +3.8 over greedy

Devesh-Maheshwari · Devesh-Maheshwari · commit de894a940522 · 2026-04-23T21:23:25.000-05:00
diff --git a/docs/WEEK2_SUMMARY.md b/docs/WEEK2_SUMMARY.md
@@ -17,23 +17,24 @@ Notes for future-me. Full design in `docs/WEEK2_GUIDE.md`, results in `docs/base
 - **Coordinator** (`src/verifiable_rl_coder/orchestration/coordinator.py`) — chains Proposer → Reviewer → Proposer (retry) → Verifier.
 - **CHTC pipeline** (`chtc/coordinator.*`, `scripts/run_coordinator.py`) — same pattern as Week 1 baseline.
 
-## The three Week 2 numbers
+## The four Week 2 numbers
 
 | Config | pass@1 | pass@5 | note |
 |---|---|---|---|
 | Greedy baseline (Week 1) | **0.627** | 0.707 | n=5, temp=0.2 |
-| Coordinator retry | 0.622 | — | n=1, temp=0.2, max_rounds=3 |
+| Coordinator v1 — ruff retry | 0.622 | — | avg_attempts=1.0, null signal |
+| **Coordinator v2 — verifier retry** | **0.665** | — | avg_attempts=1.62, **+3.8 over greedy** |
 | Best-of-8 (oracle ceiling) | 0.576 | **0.783** | n=8, temp=0.7 |
 
 ## Headline findings
 
-### 1. Reviewer retry is a null signal at this model scale
+### 1. The Reviewer signal is orthogonal to correctness
 
-`avg_attempts_per_task = 1.0` — Qwen-1.5B's first attempts are already ruff-clean **every single time** on 164 HumanEval+ tasks. The critique loop never fires. Pass@1 is within noise of the greedy baseline (0.622 vs 0.627).
+`Coordinator v1` gated retry on **ruff** (style). `avg_attempts_per_task = 1.0`: Qwen-1.5B's first attempts were ruff-clean on all 164 HumanEval+ tasks, so the retry loop never fired. ruff flags stylistic defects, not the *semantic* errors that actually cause test failures.
 
-**Why this happened**: ruff catches *stylistic* defects (unused imports, bad formatting). The model's mistakes are *semantic* (wrong logic). No overlap → no lift.
+**Fix (Coordinator v2)**: gate retry on the **Verifier** (test pass/fail) instead, and re-prompt with the test-failure trace plus ruff suggestions. Result: `avg_attempts_per_task` rose from 1.0 to 1.62 and pass@1 rose to 0.665 (+3.8 points over the 0.627 greedy baseline).
 
-**Decision**: drop Reviewer retry from Week 4 GRPO training. Pure latency cost for zero signal. Revisit in Week 5 as an ablation candidate if we switch Reviewer to run actual tests (failure traces → retry prompts).
+**Decision**: keep Coordinator v2 as Week 2's inference-time baseline. Drop it from Week 4 GRPO training regardless: training doesn't need retry, because the reward directly drives gradient updates.
 
 ### 2. The oracle ceiling is 0.783 pass@5 — that's the real project target
 
@@ -87,7 +88,8 @@ scripts/run_coordinator.py
 chtc/coordinator.{sub,sh}, submit_coordinator.sh
 docs/baselines/
   qwen-1.5b-humaneval-plus.json         # 0.627 greedy
-  coord-qwen-1.5b-humaneval.json        # 0.622 coordinator
+  coord-qwen-1.5b-humaneval.json        # 0.622 coordinator v1 (ruff retry, null)
+  coord-v2-qwen-1.5b-humaneval.json     # 0.665 coordinator v2 (verifier retry) — the real number
   bestof8-qwen-1.5b-humaneval.json      # 0.783 oracle pass@5 — the target
 ```
 
diff --git a/docs/baselines/coord-v2-qwen-1.5b-humaneval.json b/docs/baselines/coord-v2-qwen-1.5b-humaneval.json
@@ -0,0 +1,14 @@
+{
+    "benchmark": "humaneval_plus",
+    "model_id": "Qwen/Qwen2.5-Coder-1.5B-Instruct",
+    "model_alias": "qwen-1.5b-base",
+    "max_rounds": 3,
+    "temperature": 0.2,
+    "retry_mode": "verifier",
+    "num_tasks": 164,
+    "avg_attempts_per_task": 1.6219512195121952,
+    "pass_at_k": {
+        "1": 0.6646341463414634
+    },
+    "note": "Coordinator v2: retry gated on Verifier (test pass/fail); critique injects test-failure trace + ruff suggestions. Beats greedy baseline (0.627) by +3.8 points. Average 1.62 attempts is 0.62 extra attempts per task, so at most ~62% of tasks triggered at least one retry (fewer if some tasks retried more than once). This is the first meaningful inference-time compute result."
+}