### 1. Reviewer retry is a null signal at this model scale
+
### 1. The Reviewer signal is orthogonal to correctness
-
`avg_attempts_per_task = 1.0` — Qwen-1.5B's first attempts are already ruff-clean **every single time** on 164 HumanEval+ tasks. The critique loop never fires. Pass@1 is within noise of the greedy baseline (0.622 vs 0.627).
+
`Coordinator v1` gated retry on **ruff** (style). `avg_attempts_per_task = 1.0` — Qwen's first attempts are already ruff-clean **every single time** on 164 tasks. The retry loop never fired because ruff never flagged the *semantic* errors that actually cause test failures.
-
**Why this happened**: ruff catches *stylistic* defects (unused imports, bad formatting). The model's mistakes are *semantic* (wrong logic). No overlap → no lift.
+
**Fix (Coordinator v2)**: gate retry on the **Verifier** (test pass/fail) instead, and re-prompt with the test-failure trace combined with ruff suggestions. Result: average attempts per task rose from 1.0 to 1.62, and pass@1 rose to 0.665 (+3.8 points over the 0.627 greedy baseline).
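A verifier-gated retry loop of this kind might look roughly like the sketch below. It is not the actual Coordinator v2 code: `Verdict`, `build_retry_prompt`, `solve_with_retry`, the prompt wording, and the `max_attempts` default are all assumptions made for illustration, and `generate`, `verify`, and `lint` are caller-supplied callables standing in for the model, the test runner, and ruff.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Verdict:
    passed: bool
    trace: str = ""  # test-failure output; empty when the tests pass


def build_retry_prompt(task: str, code: str, trace: str, lint_notes: str) -> str:
    """Combine the failing attempt, the test trace, and lint suggestions."""
    return (
        f"{task}\n\n"
        f"Previous attempt:\n{code}\n\n"
        f"Test failure:\n{trace}\n\n"
        f"Lint suggestions:\n{lint_notes or 'none'}\n\n"
        "Return a corrected solution."
    )


def solve_with_retry(
    task: str,
    generate: Callable[[str], str],    # hypothetical: prompt -> candidate code
    verify: Callable[[str], Verdict],  # hypothetical: code -> test verdict
    lint: Callable[[str], str],        # hypothetical: code -> lint suggestions
    max_attempts: int = 2,             # assumed cap; the source does not state one
) -> Tuple[str, int]:
    """Return the last candidate and the number of attempts used."""
    prompt = task
    code = ""
    for attempt in range(1, max_attempts + 1):
        code = generate(prompt)
        verdict = verify(code)
        if verdict.passed:
            return code, attempt
        # Retry is gated on the test verdict; lint output only enriches the critique.
        prompt = build_retry_prompt(task, code, verdict.trace, lint(code))
    return code, max_attempts
```

Averaging the returned attempt count over all tasks gives the attempts-per-task figure. Gating on `verdict.passed` rather than on lint output is what lets the loop fire on semantic failures.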
-
**Decision**: drop Reviewer retry from Week 4 GRPO training. Pure latency cost for zero signal. Revisit in Week 5 as an ablation candidate if we switch Reviewer to run actual tests (failure traces → retry prompts).
+
**Decision**: keep Coordinator v2 as Week 2's inference-time baseline. Drop it from Week 4 GRPO training regardless: training doesn't need retry, because the reward directly drives gradient updates.
### 2. The oracle ceiling is 0.783 pass@5 — that's the real project target
"note": "Coordinator v2: retry gated on Verifier (test pass/fail), critique injects test-failure trace + ruff suggestions. Beats greedy baseline (0.627) by +3.8 points. Average 1.62 attempts means 0.62 retries per task on average, so at most ~62% of tasks triggered a retry (exactly ~62% if each task is capped at one retry). This is the first meaningful inference-time compute result."