Five-minute decision doc. The detail lives in SCORECARD.md and the per-benchmark findings*.md docs; this page is the synthesis. Every claim links to its source so you can drill straight into the evidence.

Read KNOWN-LIMITATIONS.md before quoting any cell. Most caveats live there, not here.

Last updated: 2026-05-02. Reflects microbench-phase-b-2026-05-02 (N=10 + 27B-no-think third arm). Pre-no-think readers: the picture has shifted.
Operating point: All arms are Cyankiwi 4-bit AWQ on 2× RTX PRO 6000 Blackwell at 500 W cap. Other quants, VRAM tiers, hardware classes, and languages are not characterized — see What this benchmark doesn't characterize below. The within-quant comparison here is informative; absolute model capability at higher precisions is a separate question.
No model is overall best. The three arms have orthogonal strengths and statistically indistinguishable headline ship rates (74–96%). Pick by task class:
- Default for most non-coding tasks: 27B-no-think — 95.8% ship rate across 12 cells × N=10, beats both originals on raw shipping
- Hallucination-sensitive or research-driven work: 27B (either mode) — 27B-thinking is the only arm that ships market-research at >70%; no-think 10/10 on adversarial-hallucination
- Bounded structured writing where speed/cost matters: Coder-Next (60–100× cheaper per shipped run on p3_business and p3_doc)
If you have to pick one arm without knowing your task: 27B-no-think. It dominates raw ship rate, matches thinking-mode on substantive quality (per the pairwise study), and has no cell where it's clearly the worst of the three.
flowchart TD
Start([Pick a local model for your task])
Start --> Q1{Does wrong-output<br/>cause real harm?<br/>e.g. security review,<br/>fact-cited research}
Q1 -->|Yes| A1["27B-no-think<br/>10/10 ship · 100% accuracy<br/>0 dangerous errors<br/>on adversarial-hallucination"]
Q1 -->|No| Q2{Internet research<br/>+ live citations?}
Q2 -->|Yes| A2["27B-thinking<br/>8/10 on p3_market<br/>Coder-Next 0/10 reproducible"]
Q2 -->|No| Q3{Bounded structured output<br/>+ speed/cost critical?<br/>e.g. business memo,<br/>doc synthesis, triage}
Q3 -->|Yes| A3["Coder-Next<br/>10/10 ship<br/>60-100x cheaper per ship<br/>than 27B variants"]
Q3 -->|No| A4["27B-no-think default<br/>95.8% overall ship rate<br/>Wilson 95% 90.5% – 98.2%"]
Start -.if either applies.-> Q5{Long-horizon >30 min<br/>OR high-stakes single-shot,<br/>no verifier?}
Q5 -->|Yes| A5["None of these single-shot<br/>Use cloud + verifier<br/>OR break into smaller tasks"]
The dotted path is independent of the main flow — even if your task fits the "main path" answers, if it also meets the long-horizon-or-high-stakes condition, the local-model answer collapses to "use cloud or add a verifier."
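The flowchart can also be read as a short function. The sketch below is only an illustration of the branch order: the `Task` fields and the `pick_arm` name are hypothetical, and the returned strings are labels taken from the chart, not harness identifiers. The dotted-path condition is checked first because it overrides every main-path answer.

```python
from dataclasses import dataclass


@dataclass
class Task:
    # Hypothetical flags mirroring the four questions in the flowchart.
    long_horizon_or_high_stakes: bool = False  # >30 min, or single-shot with no verifier
    wrong_output_causes_harm: bool = False     # e.g. security review, fact-cited research
    needs_live_research: bool = False          # internet research + live citations
    bounded_and_cost_critical: bool = False    # e.g. business memo, doc synthesis, triage


def pick_arm(task: Task) -> str:
    """Return the arm the flowchart recommends for a task."""
    # Dotted path: independent of the main flow, so it is evaluated first.
    if task.long_horizon_or_high_stakes:
        return "none single-shot: use cloud + verifier, or break into smaller tasks"
    if task.wrong_output_causes_harm:
        return "27B-no-think"
    if task.needs_live_research:
        return "27B-thinking"
    if task.bounded_and_cost_critical:
        return "Coder-Next"
    return "27B-no-think"  # default branch


print(pick_arm(Task(needs_live_research=True)))
```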
| Model | Architecture | Quantization | Notes |
|---|---|---|---|
| Qwen3-Coder-Next-AWQ | MoE 80B/3B-active | 4-bit AWQ (Cyankiwi) | Code-specialized; not a thinking model. --tool-call-parser qwen3_coder |
| Qwen3.6-27B-AWQ (thinking) | Dense 27B | 4-bit AWQ (Cyankiwi) | Default thinking mode. --tool-call-parser qwen3_xml |
| Qwen3.6-27B-AWQ (no-think) | Dense 27B | 4-bit AWQ (Cyankiwi) | Same model, --no-think flag. Added in microbench-phase-b-2026-05-02 |
All runs: vLLM, --temperature 0.3, --max-model-len 262144, Tower2 hardware (2× RTX PRO 6000 Blackwell).
→ 27B-no-think. Source: 95.8% ship rate across 12 cells × N=10 (Wilson 95% [90.5%, 98.2%]). On the 4 hardest cells (p2_hallucination, p3_business, p3_doc, p3_market) it's 33/38 = 86.8% — still the highest of the three.
Caveat: ship rate ≠ PASS rate. The PASS-rate grader sweep on the no-think tarballs is pending. On the cells where quality has been hand-graded, no-think and thinking are substantively equivalent.
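For readers who want to recompute intervals from raw counts, here is a minimal sketch of the standard Wilson score interval. The function name and the `z = 1.96` default are choices made for this example; the findings docs may use a different implementation or rounding. With these defaults it reproduces the 5/10 and 0/10 intervals quoted in this document.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point noise at the extremes.
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(5, 10)
print(f"5/10 ship: [{low:.1%}, {high:.1%}]")
```

At N=10 the interval is wide no matter what the point estimate is, which is why the headline ship rates are described as statistically indistinguishable.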
→ 27B (either mode). Source: p2_hallucination (6 real bugs vs 9 confident-but-fake fabrications):
- 27B-no-think: 10/10 ship, 100% accuracy, 0 dangerous errors
- 27B-thinking: 7/10 ship, all shipping runs at 100% accuracy
- Coder-Next: 5/10 ship at N=10 (Wilson [23.7%, 76.3%]); runs that do ship sometimes report 2 confirmed fabrications as real
The headline failure cost is asymmetric: Coder-Next failures are cheap to detect but expensive to recover from (you have to verify everything). 27B failures are expensive to detect (you have to read prose) but cheap to recover (the analysis is usually correct).
→ 27B-thinking (with no-think as a backup). Source:
- 27B-thinking: 8/10 ship on p3_market at N=10 (Wilson 95% [44.4%, 96.5%])
- 27B-no-think: 7/10 ship; comparable, but with a scroll-loop pathology that requires operator monitoring
- Coder-Next: 0/10 at N=10 (Wilson [0%, 27.8%]) — confirmed reproducible failure shape
Citation validity: 27B's URLs were sampled at 75% valid (12/14 testable; methodology). Error mode is URL drift, not fabricated facts (every checkable factual claim matched live data).
→ Coder-Next — when it works for your task class. Source:
- p3_business (memo): Coder-Next 10/10 at $0.0006/run. 27B-thinking 9/10 at $0.0035. 60× cheaper per ship.
- p3_doc (700-word brief from 5 sources): Coder-Next 10/10 at $0.0007/run. 27B-thinking 6/10 at $0.0201. ~100× cheaper per ship.
- p2_triage (customer-support): Coder-Next 96.7% category accuracy vs 27B's 86.7%, and 1.0 min vs 3.3 min wall
→ None of these three models, single-shot. Source: on the 75-PR audit, 27B writes 75 verdict files but only 3 are real reviews (72 are template stubs); Coder-Next produces nothing across 5 attempts. Both find degenerate failure modes within 30–60 minutes.
If you must use a local model for long-horizon work, structure it as many small independent runs with verification between, not one long agentic chain.
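One way to picture "many small independent runs with verification between" is the loop below. It is a sketch under stated assumptions: `run_one` and `verify` are hypothetical callables you would supply (a single bounded model run, and a check such as "the verdict file exists and is not a template stub"), and nothing here is part of the benchmark harness.

```python
from typing import Callable, Iterable, TypeVar

Item = TypeVar("Item")
Result = TypeVar("Result")


def run_in_verified_chunks(
    items: Iterable[Item],
    run_one: Callable[[Item], Result],
    verify: Callable[[Item, Result], bool],
    max_attempts: int = 2,
) -> tuple[list[tuple[Item, Result]], list[Item]]:
    """Run each item as its own short task and keep only verified results."""
    accepted: list[tuple[Item, Result]] = []
    rejected: list[Item] = []
    for item in items:
        for _ in range(max_attempts):
            result = run_one(item)    # fresh, independent run: no shared agentic chain
            if verify(item, result):  # gate before moving on to the next item
                accepted.append((item, result))
                break
        else:
            # Every attempt failed verification; hand off to a human or another model.
            rejected.append(item)
    return accepted, rejected
```

Because each item is verified before the next one starts, a degenerate run costs one bounded attempt instead of silently filling the rest of the batch with stubs.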
→ None of these three models. Source: Coder-Next 2/3 wrong on PR #1057 with fabricated technical evidence (line citations to non-existent issues, a fake test_stderr_truncation.py). 27B doesn't fabricate but also doesn't write the spec-required verdict.md. For high-stakes single-shot, use cloud or add a verifier loop.
Headline ship rates (phase-b, source)
Bars show ship rate as a fraction (each bar is 10 segments regardless of N), so N=3 cells stay visually comparable to N=10 cells.
| Cell | Coder-Next | 27B (thinking) | 27B (no-think) | Cell winner |
|---|---|---|---|---|
| p1_bugfix | ██████████ 2/2 | ░░░░░░░░░░ 0/3 *† | ██████████ 10/10 | 27B-no-think |
| p1_refactor | ██████████ 3/3 | ███░░░░░░░ 1/3 *† | ██████████ 10/10 | 27B-no-think |
| p1_testwrite | ███████░░░ 2/3 | ░░░░░░░░░░ 0/3 *† | ██████████ 10/10 | 27B-no-think |
| p2_ci | ██████████ 3/3 | ██████████ 3/3 | ██████████ 10/10 | tied ship; 27B wins quality (CHANGELOG) |
| p2_extract | ██████████ 3/3 | ██████████ 3/3 | ██████████ 10/10 | tied ship; 27B more accurate |
| p2_hallucination | █████░░░░░ 5/10 | ███████░░░ 7/10 | ██████████ 10/10 | 27B-no-think |
| p2_triage | ██████████ 3/3 | ██████████ 3/3 | ██████████ 10/10 | tied ship; Coder-Next more accurate (96.7% vs 86.7%) |
| p3_business | ██████████ 10/10 | █████████░ 9/10 | ████████░░ 8/10 | Coder-Next |
| p3_doc | ██████████ 10/10 | ██████░░░░ 6/10 | ████████░░ 8/10 | Coder-Next (ship); PASS-rate caveat below |
| p3_market | ░░░░░░░░░░ 0/10 | ████████░░ 8/10 | ███████░░░ 7/10 | 27B-thinking |
| p3_pm | ██████████ 3/3 | ██████████ 3/3 | ██████████ 10/10 | 27B-no-think (ship); all miss multi-week risks |
| p3_writing | ██████████ 3/3 | ██████████ 3/3 | ██████████ 10/10 | 27B-no-think (ship); see findings |
† 27B-thinking N=3 baselines used an older harness sha. The 1/9 P1 ship rate may include harness-related effects; see phase-b caveats.
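The bar convention in the table above can be sketched in a few lines. The helper name is hypothetical, and the rounding rule (Python's built-in `round` on the filled-segment count) is an assumption that happens to match the 1/3, 2/3 and 5/10 rows; the script that actually generated the table may differ.

```python
def ship_bar(shipped: int, attempts: int, width: int = 10) -> str:
    """Render a ship rate as a fixed-width bar so N=3 and N=10 cells line up."""
    if attempts <= 0 or not 0 <= shipped <= attempts:
        raise ValueError("need attempts > 0 and 0 <= shipped <= attempts")
    filled = round(width * shipped / attempts)  # segments scale with the rate, not with N
    return "█" * filled + "░" * (width - filled) + f" {shipped}/{attempts}"


for shipped, attempts in [(1, 3), (2, 3), (5, 10)]:
    print(ship_bar(shipped, attempts))
```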
Pairwise quality study (N=1, claude-grading-claude — see caveats):
| Cell | Coder-Next | 27B-thinking | 27B-no-think |
|---|---|---|---|
| p2_ci (CHANGELOG vs test name) | 3.4 / 5 (regresses v0.3.0 API) | 4.6 / 5 | 4.2 / 5 |
| p2_extract (20-field accuracy + reasoning) | 4.2 / 5 | 4.6 / 5 | 4.4 / 5 |
| p2_triage (urgency calibration) | 4 / 5 | 4 / 5 | 4 / 5 (no-think aligns with thinking, both > Coder-Next on calibration) |
Headline from the quality study: 27B-thinking and 27B-no-think are tightly correlated on substantive decisions. The difference between them is verbosity of reasoning prose, not output correctness. Treat them as one "27B model" with a thinking-flag for prose density.
Coder-Next distinguishes itself by reasoning style — it trusts artifact-local signals (test names, structural hints) over external documentation. This is a strength on tasks where the test/code IS the spec, and a weakness on tasks where the spec lives in a CHANGELOG / API contract / external doc.
| Benchmark | Coder-Next | 27B | Notes |
|---|---|---|---|
| dreamserver-1-pr-audit (PR #1057) | 1/3 correct verdict (cherry-picked); 2/3 wrong with fabricated evidence | 3/3 implicit-correct verdicts in review.md; 0/3 wrote verdict.md | Spec compliance ⊥ correctness |
| dreamserver-75-pr-audit | 0/5 deliverables (3 distinct degenerate failure modes) | 75 verdict files but only 3 real reviews; 72 stubs; 0 tests | Both fail at multi-hour scope |
| wallstreet-intern-test | 1/3 ships (DOCU BUY) | 1/3 ships (GTLB BUY) | Verdicts not graded (opinion). Coder-Next caveat: same fabrication risk as PR audit |
27B-no-think on dreamserver-scope tasks is untested. The no-think mode's clean shipping on shorter tasks suggests it might also handle multi-section memos cleanly — but that's hypothesis, not finding.
All cost numbers are upper-bound (wall × power.limit at $0.13/kWh). Real GPU draw is lower; use for ranking, not absolute economics. Hardware specificity caveat in KNOWN-LIMITATIONS.md.
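The upper-bound formula is simple enough to recompute by hand. The sketch below assumes the power limit enters as a single 500 W figure, which is an inference from the stated cap rather than something this page spells out. Under that assumption it reproduces the 133 min / $0.144 and 53 min / $0.057 worst-case figures quoted in this document.

```python
def cost_upper_bound_usd(
    wall_seconds: float,
    power_limit_watts: float = 500.0,  # assumption: the 500 W cap, used as one figure
    usd_per_kwh: float = 0.13,
) -> float:
    """Upper-bound energy cost: wall time x power limit x electricity price."""
    kwh = wall_seconds * power_limit_watts / 3_600_000.0  # W*s -> kWh
    return kwh * usd_per_kwh


print(f"${cost_upper_bound_usd(133 * 60):.3f}")  # the 133-minute p95 run
```

Medians of wall time and medians of cost are taken separately in the tables, so applying this to a median wall time will not always give the median cost in the same row.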
Median wall × cost per attempt (source)
| Cell | Coder-Next | 27B (thinking) | 27B (no-think) |
|---|---|---|---|
| p2_extract | 17 s / $0.0004 | 71 s / $0.0015 | 49 s / $0.0009 |
| p2_triage | 62 s / $0.0013 | 197 s / $0.0043 | 154 s / $0.0028 |
| p2_hallucination | 422 s / $0.0092 | 171 s / $0.0037 | 127 s / $0.0023 |
| p3_business | 31 s / $0.0006 | 163 s / $0.0035 | 171 s / $0.0031 |
| p3_doc | 37 s / $0.0007 | 1113 s / $0.0201 | 144 s / $0.0026 |
| p3_market | 2294 s / $0.0435 | 1720 s / $0.0330 | 2277 s / $0.0411 |
When a model fails to ship, you still paid for the wall. Cost-per-shipped-run is the real economics:
| Cell | Coder-Next | 27B (thinking) | 27B (no-think) |
|---|---|---|---|
| p2_hallucination | $0.0318 (5/10) | $0.0045 (7/10) | $0.0023 (10/10) |
| p3_business | $0.0006 (10/10) | $0.0039 (9/9) | $0.0536 (8/10) |
| p3_doc | $0.0007 (10/10) | $0.0712 (6/8) | $0.0495 (8/10) |
| p3_market | ∞ (0/10) | $0.0459 (8/10) | $0.0493 (7/8) |
No model wins on cost across all four cells — picking by cost requires picking by task class.
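As a rough illustration of the metric, the sketch below divides total spend across all attempts by the number of runs that shipped. That definition is an assumption: this page does not spell out the exact formula, so treat the function (and its name) as a reading of "you still paid for the wall", not the scorecard's implementation. The input costs in the example are made up.

```python
import math


def cost_per_shipped_run(attempt_costs_usd: list[float], shipped: int) -> float:
    """Total cost of every attempt, failed or not, divided by shipped runs."""
    if shipped < 0 or shipped > len(attempt_costs_usd):
        raise ValueError("shipped must be between 0 and the number of attempts")
    if shipped == 0:
        return math.inf  # matches the ∞ shown for a 0/10 cell
    return sum(attempt_costs_usd) / shipped


# Hypothetical cell: ten attempts at $0.01 each, five of which shipped.
print(cost_per_shipped_run([0.01] * 10, shipped=5))
```

A cheap-per-attempt arm with a low ship rate can therefore cost more per usable output than a slower arm that ships every time.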
| Cell | Worst arm | p95 wall | Failure |
|---|---|---|---|
| p3_doc | 27B (thinking) | 133 min / $0.144 | identical-call-loop on word-trim |
| p3_business | 27B (no-think) | 53 min / $0.057 | identical-call-loop |
| p2_hallucination | Coder-Next | 28 min / $0.036 | stuck-no-progress |
The 27B-thinking p95 of 133 min on p3_doc is the single most expensive failure mode in the dataset. 27B-no-think halves the loop rate (4/10 → 2/10).
Each arm has a signature failure profile. Knowing the shape lets you build around it.
Coder-Next:

- stuck_no_workspace_change_for_500_iters: reads code without writing artifacts
- api_error: HTTP Error 400 (fills the 262K context budget without converging, notably on p3_market)
- wall_killed_low_progress_bash_loop: bash-shaped degenerate loops
- Confidently-wrong-with-fabricated-evidence when it does ship: the dangerous mode (see PR #1057 evidence)
- Doesn't enter wall_killed_identical_call_loop (different loop substrate than 27B)

27B-thinking:

- wall_killed_identical_call_loop: word-budget retry loops on tight-output tasks
  - Subclass: word-trim-loop (source): write brief → count words → over budget → trim → recount → loop
- model_stopped (floor-failure) on some cells where no-think completes; faster-because-failed isn't a win
- partial-no-spec-output: writes good content in the wrong file (e.g. analysis in review.md instead of the required verdict.md)
- Doesn't fabricate confident-but-wrong claims with citations

27B-no-think:

- wall_killed_identical_call_loop (lower rate than thinking: 2/10 vs 4/10 on p3_doc)
- Two new pathologies discovered during the no-think grid (now in tooling/FAILURE-TAXONOMY.md):
  - scroll-loop: model walks an HTML response in fixed-byte slices for 30+ iters; raw command hashes differ each iter, so the harness's same-content guard doesn't fire. Run tooling/scripts/check_substance.py every 5 min on long chains to catch it.
  - runaway-generation: single response exceeds the max-output-tokens budget without stopping (137,855 tokens in p3_market_27b-nothink_v5)
- Caveat: lighter reasoning prose than thinking-mode. Decision quality is equivalent on hand-graded cells; provenance prose is sparser.
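To make the scroll-loop evasion concrete, here is one hypothetical way a guard could catch it: hash each command after replacing digit runs with a placeholder, so commands that differ only in byte offsets collapse to one shape. This is an illustrative sketch only. It is not the harness's same-content guard and not what tooling/scripts/check_substance.py does, and the function names, window and threshold are invented for the example.

```python
import hashlib
import re
from collections import Counter

_DIGITS = re.compile(r"\d+")


def command_shape(command: str) -> str:
    """Hash a command with every run of digits replaced by a placeholder."""
    normalized = _DIGITS.sub("N", command)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def looks_like_scroll_loop(commands: list[str], window: int = 30, threshold: float = 0.8) -> bool:
    """True if most of the last `window` commands share one digit-normalized shape."""
    recent = commands[-window:]
    if len(recent) < window:
        return False  # not enough history to judge
    shapes = Counter(command_shape(c) for c in recent)
    most_common_count = shapes.most_common(1)[0][1]
    return most_common_count / len(recent) >= threshold


# Raw commands are all distinct (different offsets), yet they share one shape.
scroll = [f"head -c {4096 * (i + 1)} page.html | tail -c 4096" for i in range(30)]
print(len(set(scroll)), looks_like_scroll_loop(scroll))
```

A guard like this would also flag legitimate numeric sweeps, so it is better suited to raising an alert for the operator than to killing a run outright.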
These would tighten the picture:
- PASS-rate grader sweep on no-think tarballs. Current data is done_signal rate, not PASS rate. The p3_doc 8/10 ship rate could reflect a real PASS rate, or it could be shipping briefs that miss the 700-word limit. Status: pending.
- 27B-no-think on dreamserver-scope tasks. The no-think arm hasn't been run against the 1-PR or 75-PR audit. The scroll-loop monitoring methodology proven on phase-b would transfer; the verdict-production issue 27B-thinking had on PR #1057 might improve with no-think, but that's a hypothesis.
- 27B-no-think on wallstreet. Untested. Given 8/10 ship on p3_business and p3_doc, it is plausible that it would handle the multi-section memo cleanly, but that is unmeasured.
- Citation validity at full sample on p3_market 27B. 18/33 URLs sampled, 75% valid. Remaining 15 unverified.
- Per-claim rubric on cloud entries. Cloud Opus-4.7 / GPT-5.5 entries weren't graded with the same methodology as local entries; cloud-vs-local comparison is currently categorical only ("cloud ships, local mostly doesn't"), not per-claim accuracy. See KNOWN-LIMITATIONS.md § Comparison-to-cloud.
The findings above apply to a single operating point. Outside this point, the picture shifts in ways this study doesn't measure. Each item below is a real follow-up that contributors are welcome to pick up — see ROADMAP.md for the prioritized list.
All three arms use Cyankiwi 4-bit AWQ community quants. Multiple field reports (see KNOWN-LIMITATIONS.md § Cyankiwi 4-bit AWQ field reports) suggest these specific quants underperform the official Qwen FP8 quants and Unsloth UD4 GGUFs of the same base models — describing degraded output coherence and increased loop pathologies on certain task shapes.
What this means for the data here:
- Within-quant comparison (Coder-Next vs 27B at the same Cyankiwi 4-bit AWQ) is informative — the differential is a model-behavior gap, not a quant artifact.
- Absolute model capability at higher precision (FP8 / UD4 / BF16) is not characterized.
- Effects that depend on a thinking-mechanism (the --no-think ship-rate jump, the word-trim loop reduction) are unlikely to be quant-specific; they're about the trace, not the weights' precision.
The FP8 re-run is the highest-priority follow-up.
Tested at 96 GB-per-GPU. The published vLLM flags (--max-model-len 262144, --gpu-memory-utilization 0.92) will OOM on consumer 24-48 GB cards. At those tiers the choice isn't "which model wins at 4-bit AWQ" — it's "27B Q8 fits cleanly but Coder-Next needs Q4-with-CPU-offload, which dominates the wall time." That's a different study entirely; this one doesn't address it.
The comparison is at an Nvidia/dense-VRAM operating point. On Mac M-series unified memory the dense-vs-MoE compute tradeoff inverts: 3B-active wins on tokens-per-second (Coder-Next looks much better), while full-dense compute is the bottleneck (27B looks much worse). The harness is portable (only the vLLM launch swaps for MLX), so this is a sibling study someone with M-series hardware could run.
Phase 1 coding tasks (p1_bugfix, p1_refactor, p1_testwrite) all use a Python project (logalyzer). No C, JavaScript, systems-programming, browser front-end, or low-level work tested. Coder-Next is code-specialized; its relative performance on languages it's tuned harder for is plausibly different from what shows up here. Phase 2 / Phase 3 tasks are mostly business/text and language-agnostic.
All measurements on one Blackwell rig at 500 W cap. Cross-rig variance not bounded. Power-cap effects are characterized separately in hardware-tests/vllm-power-sweep-2026-04-29/ but only on this rig.
| If you want… | Read |
|---|---|
| The full per-cell tables with Wilson CIs | microbench-phase-b-2026-05-02/findings.md |
| The original N=3 baseline (still current for the 8 non-Phase-B cells) | microbench-2026-04-28/findings.md |
| Side-by-side hand-graded deliverable quality | microbench-phase-b-2026-05-02/findings-pairwise-quality-three-model.md |
| Long-horizon agentic failure modes | dreamserver-75-pr-audit/findings-2026-04-27-local-models.md |
| The single grand summary table | SCORECARD.md |
| Caveats before quoting any number | KNOWN-LIMITATIONS.md |
| The failure-mode vocabulary | tooling/FAILURE-TAXONOMY.md |
| How to add your own model to this comparison | tooling/ADDING-A-MODEL.md |