Commit de894a9

docs: coordinator v2 result — pass@1=0.665, +3.8 over greedy
1 parent 8153fe3 commit de894a9

2 files changed

Lines changed: 23 additions & 7 deletions

docs/WEEK2_SUMMARY.md

Lines changed: 9 additions & 7 deletions
@@ -17,23 +17,24 @@ Notes for future-me. Full design in `docs/WEEK2_GUIDE.md`, results in `docs/base
 - **Coordinator** (`src/verifiable_rl_coder/orchestration/coordinator.py`): chains Proposer → Reviewer → Proposer (retry) → Verifier.
 - **CHTC pipeline** (`chtc/coordinator.*`, `scripts/run_coordinator.py`): same pattern as Week 1 baseline.

-## The three Week 2 numbers
+## The four Week 2 numbers

 | Config | pass@1 | pass@5 | note |
 |---|---|---|---|
 | Greedy baseline (Week 1) | **0.627** | 0.707 | n=5, temp=0.2 |
-| Coordinator retry | 0.622 | | n=1, temp=0.2, max_rounds=3 |
+| Coordinator v1 (ruff retry) | 0.622 | | avg_attempts=1.0, null signal |
+| **Coordinator v2 (verifier retry)** | **0.665** | | avg_attempts=1.62, **+3.8 points over greedy** |
 | Best-of-8 (oracle ceiling) | 0.576 | **0.783** | n=8, temp=0.7 |

 ## Headline findings

-### 1. Reviewer retry is a null signal at this model scale
+### 1. The Reviewer signal is orthogonal to correctness

-`avg_attempts_per_task = 1.0`: Qwen-1.5B's first attempts are already ruff-clean **every single time** on 164 HumanEval+ tasks. The critique loop never fires. Pass@1 is within noise of the greedy baseline (0.622 vs 0.627).
+`Coordinator v1` gated retry on **ruff** (style). `avg_attempts_per_task = 1.0`: Qwen's first attempts are already ruff-clean **every single time** on 164 tasks. The retry loop never fired because ruff never flagged the *semantic* errors that actually cause test failures.

-**Why this happened**: ruff catches *stylistic* defects (unused imports, bad formatting). The model's mistakes are *semantic* (wrong logic). No overlap → no lift.
+**Fix (Coordinator v2)**: gate retry on the **Verifier** (test pass/fail) instead. Re-prompt with the test-failure trace and ruff suggestions combined. Result: avg_attempts rose from 1.0 to 1.62 and pass@1 rose to 0.665 (+3.8 points over greedy).

-**Decision**: drop Reviewer retry from Week 4 GRPO training. Pure latency cost for zero signal. Revisit in Week 5 as an ablation candidate if we switch Reviewer to run actual tests (failure traces → retry prompts).
+**Decision**: keep Coordinator v2 as Week 2's inference-time baseline. Drop it from Week 4 GRPO training regardless: training doesn't need retry, because the reward directly drives gradient updates.

 ### 2. The oracle ceiling is 0.783 pass@5: that's the real project target

@@ -87,7 +88,8 @@ scripts/run_coordinator.py
 chtc/coordinator.{sub,sh}, submit_coordinator.sh
 docs/baselines/
 qwen-1.5b-humaneval-plus.json # 0.627 greedy
-coord-qwen-1.5b-humaneval.json # 0.622 coordinator
+coord-qwen-1.5b-humaneval.json # 0.622 coordinator v1 (ruff retry, null)
+coord-v2-qwen-1.5b-humaneval.json # 0.665 coordinator v2 (verifier retry), the real number
 bestof8-qwen-1.5b-humaneval.json # 0.783 oracle pass@5, the target
 ```

Lines changed: 14 additions & 0 deletions
@@ -0,0 +1,14 @@
+{
+  "benchmark": "humaneval_plus",
+  "model_id": "Qwen/Qwen2.5-Coder-1.5B-Instruct",
+  "model_alias": "qwen-1.5b-base",
+  "max_rounds": 3,
+  "temperature": 0.2,
+  "retry_mode": "verifier",
+  "num_tasks": 164,
+  "avg_attempts_per_task": 1.6219512195121952,
+  "pass_at_k": {
+    "1": 0.6646341463414634
+  },
+  "note": "Coordinator v2: retry gated on Verifier (test pass/fail), critique injects test-failure trace + ruff suggestions. Beats greedy baseline (0.627) by +3.8 points. Average 1.62 attempts means 0.62 retries per task on average, so at most ~62% of tasks triggered a retry (fewer if some tasks retried more than once, which max_rounds=3 allows). This is the first meaningful inference-time compute result."
+}

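The headline figures in this JSON can be cross-checked with a little arithmetic. The sketch below uses only values that appear in the diff; the one assumption, flagged in the comments, is that `max_rounds` counts total attempts per task, so a task can retry at most `max_rounds - 1` times.

```python
# Values copied from the result file and the summary table in this commit.
num_tasks = 164
avg_attempts = 1.6219512195121952
pass_at_1 = 0.6646341463414634
greedy_pass_at_1 = 0.627
max_rounds = 3

total_attempts = round(avg_attempts * num_tasks)  # 266
total_retries = total_attempts - num_tasks        # 102
tasks_passed = round(pass_at_1 * num_tasks)       # 109
delta_points = (pass_at_1 - greedy_pass_at_1) * 100

# ASSUMPTION: max_rounds is the cap on total attempts, so each retrying task
# contributes between 1 and (max_rounds - 1) retries.
max_retrying = min(total_retries, num_tasks)
min_retrying = -(-total_retries // (max_rounds - 1))  # ceiling division

print(f"attempts={total_attempts}, retries={total_retries}, passed={tasks_passed}")
print(f"delta over greedy: {delta_points:.2f} points")
print(f"tasks that retried: between {min_retrying} and {max_retrying} "
      f"({min_retrying / num_tasks:.0%} to {max_retrying / num_tasks:.0%})")
```

Under that assumption, the average of 1.62 attempts pins down the total number of retries (102) but not how many distinct tasks retried; the result file would need per-task attempt counts to settle that.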