Single-table comparison across the published entries. The detail lives in entry READMEs and findings docs; this is the synthesis.
For a head-to-head decision between Coder-Next, 27B-thinking, and 27B-no-think organized by task class, see
`COMPARISON.md`. This SCORECARD is the grand summary; COMPARISON is the model-selection synthesis. Read
`KNOWN-LIMITATIONS.md` before quoting any cell. Several columns are hand-graded against ground truth where it exists and "not graded" where it doesn't. Confidence levels are noted per column.
The microbench tables below use both p-codes (used in receipt names and tooling) and human-readable names (used in folder names). Mapping:
| p-code | Human name | What it tests |
|---|---|---|
| `p1_bugfix` | bug-fixing | Fix planted bugs in `logalyzer/` |
| `p1_refactor` | refactoring | Refactor `logalyzer/` per spec |
| `p1_testwrite` | test-writing | Write tests for `logalyzer/` |
| `p2_ci` | ci-failure-debugging | Diagnose + fix CI failures in `discountkit/` |
| `p2_extract` | structured-extraction | 20-field JSON extraction from press release |
| `p2_hallucination` | adversarial-hallucination | Distinguish 6 real bugs from 9 fabricated |
| `p2_triage` | customer-support-triage | Closed-vocab classification + dup-cluster recall |
| `p3_business` | business-memo | Bias-detection memo from a deal pack |
| `p3_doc` | doc-synthesis | 700-word brief from 5 source docs |
| `p3_market` | market-research | 5-product comparison with cited live URLs |
| `p3_pm` | project-management | Workstream + risk synthesis from meeting notes |
| `p3_writing` | writing-editing | 3-audience rewrite of a post-mortem |
- Runs published: how many of the model's attempts on this task are represented in MMBT, vs how many were attempted total (in the source bench repo). Cherry-picked-best-of-N is the publishing default for entries where any attempt shipped; the other attempts are described in entry READMEs.
- Spec compliance: did the run produce all required artifacts the task spec asked for? Strong evidence — file existence is checkable.
- Factual accuracy: for tasks with verifiable ground truth, does the model's verdict match? Strong evidence on dreamserver-1-pr-audit (PR #1057 has a known-correct MERGE per the canonical hand-written review and the Opus-4.7 audit). Not graded on dreamserver-75-pr-audit (would need per-PR ground truth across all 75) or on wallstreet (BUY/HOLD/SELL is opinion, not verifiable as right/wrong without market hindsight).
- Fabricated claims: count of hand-graded false-but-confident technical claims in the verdict / review (e.g. citing line numbers for issues that aren't in the diff, asserting behavior the code doesn't have). Strong evidence on dreamserver-1-pr-audit. Not graded elsewhere yet — would need a per-claim rubric pass over hundreds of claims.
- Tests actually run: did the agent invoke the upstream test suite during the run? Strong evidence — visible in the transcript.
- Wall: median (or single-run) wall time. Hardware-specific (Tower2 — see KNOWN-LIMITATIONS).
- Cost (upper): upper-bound USD estimate from `cost.json`. Assumes the GPU drew at its `power.limit` for the entire run; real draw is lower. Suggestive only — for ranking, not for absolute economics.
- Failure mode: the primary `label.json` taxonomy entry (`tooling/FAILURE-TAXONOMY.md`).
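The upper-bound convention can be written down directly. A minimal sketch; the function name and field handling are mine, and the 350 W cap and $0.13/kWh rate in the example are illustrative assumptions, not values read from any receipt:

```python
def upper_bound_cost_usd(wall_seconds: float, power_limit_watts: float,
                         usd_per_kwh: float = 0.13) -> float:
    """Upper-bound run cost: assume the GPU drew its full power limit
    for the entire wall time. Real draw is lower, so this over-estimates."""
    kwh = power_limit_watts * wall_seconds / 3600 / 1000  # W·s → kWh
    return kwh * usd_per_kwh

# Illustrative: a 24-minute run on a hypothetical 350 W-capped GPU.
print(round(upper_bound_cost_usd(24 * 60, 350), 4))  # → 0.0182
```

Because the assumption is uniform across runs, the numbers rank models fairly even though each absolute figure is inflated.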
Audit 75 open PRs in a live repository and produce a traceable maintainer triage repo.
| Model | Runs | Spec | Factual accuracy | Fabricated | Tests | Wall | Cost | Failure mode |
|---|---|---|---|---|---|---|---|---|
| Opus-4.7 (cloud) | 1/1 | ✓ full | not graded (would need per-PR ground truth across all 75) | not graded | not visible from artifacts | ~5 hr | n/a | success-shipped |
| GPT-5.5 (cloud) | 1/1 | ✓ full + `verify_coverage.py` self-check passes | not graded | not graded | not visible from artifacts | not recorded | n/a | success-shipped |
| Qwen3.6-27B-AWQ (local) | 1/3 published | △ 75/75 verdict.md files but only 3 are real reviews; 72 are template stubs | partial (3 reviewed PRs match ground truth; 72 stubs unverified) | 0 in the 3 deep reviews | 0 | 24 min | $0.031 | scaffold-and-stop |
| Qwen3-Coder-Next-AWQ (local) | 0/5 | ✗ no deliverable across 5 attempts | n/a | n/a | n/a | 1-42 min | $0.001-$0.054 | identical-call-loop, cyclic-name-slop, stuck-in-research |
Same task spec scaled to a single PR. Ground truth on PR #1057 established three independent ways: canonical hand-written review, the actual public diff, Opus-4.7's audit. All three agree: MERGE. The catalog-handling architectural distinction (`_handle_model_list` vs `_handle_model_download`) is the trap that separates surface-pattern matchers from architectural readers.
| Model | Runs | Spec | Factual accuracy | Fabricated | Tests | Wall | Cost | Failure mode |
|---|---|---|---|---|---|---|---|---|
| Qwen3-Coder-Next-AWQ | 1/3 published (cherry-picked correct) | ✓ 13/13 files, tag, `done()` | 2/3 wrong across the three runs (this entry's run is the 1 correct; v1 and v3 said REJECT incorrectly) | 1 in v1, 4 in v3 including a fabricated `test_stderr_truncation.py` | repro script, no execution | 3 min | $0.004 | success-shipped (cherry-picked) |
| Qwen3.6-27B-AWQ | 1/3 published | ✗ 7/13 files; no `verdict.md`, no tag, no `done()` in any of 3 runs | 3/3 implicit-MERGE-correct (in `review.md`'s Summary of Findings table; never in a `verdict.md`) | 0 | pytest invoked, 38 tests on both branches | 7 min | $0.009 | partial-no-spec-output |
| Qwen3.6-35B-A3B-AWQ | 1/1 | ✗ 0/13 — zero artifacts | n/a (nothing produced) | n/a | pytest run but no artifacts written | 1.7 min | $0.002 | floor-failure |
Build a complete investment memo on any publicly traded US company with $1B-$10B market cap. Every number traceable from raw source → model → memo. Recommendations (BUY/HOLD/SELL) are opinion, not graded as right/wrong — the verifiable axes are spec compliance, source traceability, and fabrication count.
| Model | Company | Rec | Runs | Spec | Factual accuracy | Fabricated | Wall | Cost | Failure mode |
|---|---|---|---|---|---|---|---|---|---|
| Opus-4.7 (cloud) | Vita Coco (COCO) | HOLD ($46 vs $52 spot) | 1/1 | ✓ full memo + machine-readable verification | not graded (opinion) | not graded | not recorded | n/a | success-shipped |
| GPT-5.5 (cloud) | YETI Holdings (YETI) | HOLD ($41) | 1/1 | ✓ full memo + verification + board-deck follow-on | not graded (opinion) | not graded | not recorded | n/a | success-shipped |
| Qwen3.6-27B-AWQ (local) | GitLab (GTLB) | BUY | 1/3 published | ✓ full memo + 17 KB XLSX | not graded (opinion) | not graded | 27 min | $0.032 | success-shipped (cherry-picked) |
| Qwen3-Coder-Next-AWQ (local) | DocuSign (DOCU) | BUY | 1/3 published | ✓ full memo + 10.6 KB XLSX | not graded (opinion). Caveat: this model's PR-audit verdicts were 2/3 wrong with fabricated evidence; the same risk likely extends to BUY calls | not graded | 11 min | $0.013 | success-shipped (cherry-picked) |
| Qwen3.6-35B-A3B-AWQ (local) | — | — | 0/3 | ✗ no usable deliverable | n/a | n/a | 0.2-7 min | $0.0002-$0.0085 | floor-failure / api-error / stuck-in-research |
Smaller-scope task families than the dreamserver/wallstreet benchmarks above — each task is a 5-30 minute deliverable rather than a multi-hour audit. Phase 1 = coding (programmatic graders). Phase 2 = structured business tasks (programmatic graders). Phase 3 = unbounded business/writing tasks (mix of programmatic + hand-grading placeholders). N=3 per cell. Cherry-picked-best-of-N is not the publishing default here — the table reports N=3 PASS rates so variance is visible. See `benchmarks/microbench-2026-04-28/findings.md` for cross-cutting analysis. Entries are published only for the most signal-rich task families to avoid bloating the repo with 60+ tiny folders; the full results table is reproduced below for completeness.

A task-design issue affecting two task families is called out separately: `p1_testwrite` and `p1_refactor` use a shared starter (`logalyzer/`) with a known broken import (`from collections import Iterable` — removed in Python 3.10). Both models are 0/3 PASS on these — but the failure is a fixing-the-starter-vs-task-scope tension, not pure model failure. See findings doc § "Test-writing and refactoring task-design issue".
| Phase | Task | 27B PASS | Coder-Next PASS | 27B median wall | Coder median wall | 27B median cost | Coder median cost | Notable |
|---|---|---|---|---|---|---|---|---|
| 1 | bug-fixing (logalyzer) | 3/3 | 2/3 | 18.0 min | 11.5 min | $0.023 | $0.015 | both ship; coder-v3 killed at iter 540 (post-completion drift) |
| 1 | test-writing (logalyzer) | 0/3 † | 0/3 † | 9.6 min | 14.0 min | $0.012 | $0.018 | task-design issue (broken import) — see caveat |
| 1 | refactoring (logalyzer) | 0/3 † | 0/3 † | 5.4 min | 5.4 min | $0.007 | $0.007 | task-design issue — see caveat |
| 2 | structured extraction | 3/3 | 3/3 | 1.2 min | 0.3 min | $0.0015 | $0.0004 | 27B 100% on 20 fields; coder ~92% |
| 2 | CI failure debugging | 3/3 | 3/3 | 2.1 min | 1.2 min | $0.003 | $0.0015 | both clean; coder cheaper |
| 2 | adversarial hallucination | 3/3 | 1/3 | 3.4 min | 25.9 min | $0.004 | $0.034 | 27B 100% / 0 dangerous; coder 2/3 stuck-detector fired, ship-with-2-dangerous-errors |
| 2 | customer support triage | 3/3 | 3/3 | 3.3 min | 1.0 min | $0.004 | $0.0013 | coder 96.7% category, 27B 86.7% (both 100% dup-cluster recall) |
| 3 | document synthesis | 0/3 †† | 2/3 | 32.7 min ‡ | 0.6 min | $0.043 ‡ | $0.0008 | 27B 8/8 facts every run but couldn't trim to 700 words (765, 775, 768); 2 of 3 stuck in identical-call-loop trying to trim. coder hit limit 2/3. |
| 3 | business memo | 2/3 | 3/3 | 2.8 min | 0.5 min | $0.0037 | $0.0007 | both 8/8 bias signals every run; 27B v3 hit 708 words (1 over) |
| 3 | market research | 3/3 ★ | 0/3 | 18.9 min | 19.1 min | $0.025 | $0.025 | 27B drives the internet-research workflow Coder-Next doesn't. All 3 27B runs evaluated all 5 products with 12-18 inline cites to 29-33 distinct URLs. Coder-Next 0/3 STRUCTURAL_FAIL across all 3 runs. |
| 3 | writing/editing (3-audience rewrite) | 0/3 | 2/3 | 2.8 min | 0.4 min | $0.0036 | $0.0005 | 27B 0/3 all single-subdimension fails (customer_email missing required keyword); ceo_brief + legal_summary PASS in all 3 |
| 3 | project management synthesis | 0/3 | 1/3 | 1.3 min | 0.3 min | $0.0017 | $0.0003 | both: workstreams 6/6 every run, but only 2-3/6 risks recalled (multi-week risks missed) |
† `p1_testwrite`/`p1_refactor` failures are correlated with the starter-codebase task-design issue; see microbench findings doc § "Test-writing and refactoring task-design issue" before drawing model-quality conclusions from these rows.

†† All 3 27B doc-synthesis runs captured all 8 planted facts but couldn't trim to the 700-word limit. 2 of 3 (v2, v3) hit identical-call-loop on the same `brief.md` content for 50-130+ iters and were manually advanced to keep the chain moving. The pattern is a documented 27B failure shape, not a transient bug.
‡ 27B doc-synthesis median wall is dominated by the wall-killed v2/v3 runs (32.7 min, $0.043). The cleanly-completed v1 was 8 min / $0.011.
★ Inversion vs the prior expectation in the findings doc: 27B can drive sustained internet-research workflows that Coder-Next doesn't. Citation-validity pass (18 of 33 URLs from `p3_market_27b_v1` validated on 2026-04-28): 9 strong-valid (factual claim exactly matches live page), 3 partial-valid (claim mostly right with minor specificity issues), 2 confirmed-wrong URLs (404), 4 inaccessible to the validator. Of 14 testable URLs, 12 (86%) are mostly-valid and 9 (64%) are strict-valid. Measured `citations_valid_pct = 75` (was 90 estimate). `fabricated_stats_count = 0` — every checkable factual claim (prices, certifications, products) matched live data. Critical observation: the error mode is URL drift (wrong or dead URLs cited), not fabricated facts — a meaningfully different failure shape than the dreamserver-1-pr-audit Coder-Next variance that fabricated technical evidence with confident citations.
Headline reads from this table (post 27B Phase 3 completion):
- 27B is reliable on tight-schema tasks. Phase 2's 12 programmatic-graded runs: 12/12 PASS. The "27B doesn't ship" framing from the dreamserver-PR-audit benchmark was task-class-specific — when the deliverable is a constrained-shape JSON or markdown-with-clear-keys, 27B ships cleanly.
- 27B has a documented word-limit-trim failure mode. Doc-synthesis: 8/8 planted facts captured every single run, but 0/3 PASS because the model cannot reliably compress to a tight word limit. 2 of 3 runs entered identical-call-loops trying. Coder-Next handled this better (2/3 PASS).
- Big inversion on market research. 27B was 3/3 STRUCTURAL_PASS (5-product evaluations, 12-18 inline cites, 29-33 distinct URLs); Coder-Next was 0/3 STRUCTURAL_FAIL. Internet-research workflows aren't hopeless for local models — they're a 27B strength, just not Coder-Next's. (Citation validity is hand-grading placeholder; this is structural completion only.)
- Coder-Next has a real hallucination-resistance gap. Adversarial-hallucination: 27B 3/3 100% accurate / 0 dangerous; Coder-Next 1/3 with the one ship-attempt landing 2 confirmed-fabrications-as-real (right at the safety threshold). Same failure shape as the documented dreamserver-1-pr-audit Coder-Next variance.
- Cost-per-attempt: Coder-Next is 4-12× cheaper when it ships. When it doesn't ship (stuck-detector cases), it spends 25+ minutes and ~$0.03 producing nothing, which inverts the economics for hallucination-resistance-required tasks.
- Both miss multi-week risks on PM-synthesis. Project management: workstream + decision recall is excellent (6/6 + 3-4/4 every run for both models), but risks 2-3/6 across all runs and both models — multi-week-spanning risks systematically dropped.
Update (2026-05-02): `microbench-phase-b-2026-05-02` bumps the four highest-signal cells of this table to N=10 with proper Wilson 95% CIs, and adds 27B-no-think as a third arm across the full 12-family grid. Several N=3 hints from the table above are now bounded — see § "microbench-phase-b-2026-05-02" below.
Bumps the 4 differential cells from N=3 → N=10 and adds 27B-no-think across all 12 families. ~240 runs total. See `benchmarks/microbench-phase-b-2026-05-02/findings.md` for the full breakdown.
| Model | Coverage | Ship rate | Wilson 95% CI |
|---|---|---|---|
| Qwen3-Coder-Next-AWQ | 4 cells × N=10 + 8 cells × N=3 = 63 runs | 47/63 = 74.6% | [62.5%, 83.9%] |
| Qwen3.6-27B-AWQ (thinking) | 4 cells × N=10 + 8 cells × N=3 = 62 runs | 46/62 = 74.2% | [62.0%, 83.7%] |
| Qwen3.6-27B-AWQ (no-think) | 12 cells × N=10 = 118 graded + 2 op-labeled | 113/118 = 95.8% | [90.5%, 98.2%] |
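The intervals above follow the standard Wilson score formula. A minimal sketch — the function name is mine, and the repo's own CI script may differ slightly in rounding or continuity correction on some rows; the 27B-no-think row reproduces exactly:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 → 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return center - half, center + half

lo, hi = wilson_ci(113, 118)      # 27B-no-think ship rate
print(f"[{lo:.1%}, {hi:.1%}]")    # → [90.5%, 98.2%]
```

Unlike the naive normal approximation, Wilson stays sensible at extreme rates — it is what bounds the 0/10 Coder-Next `p3_market` result to [0%, 27.8%] below.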
Single row per (cell × model) with ship rate, median wall, median cost, $/shipped-run, and primary failure mode. Cost numbers are upper-bound (wall × `power.limit` at $0.13/kWh).
| Cell | Model | Ship | Median wall | Median $ | $/ship | Primary failure mode |
|---|---|---|---|---|---|---|
| p2_hallucination | Coder-Next | 5/10 | 422 s | $0.0092 | $0.032 | stuck_no_workspace_change_for_500_iters (5/10) |
| p2_hallucination | 27B (thinking) | 7/10 | 171 s | $0.0037 | $0.0045 | (none on the 7 ships; 3 model_stopped) |
| p2_hallucination | 27B (no-think) | 10/10 | 127 s | $0.0023 | $0.0023 | none |
| p3_business | Coder-Next | 10/10 | 31 s | $0.0006 | $0.0006 | none |
| p3_business | 27B (thinking) | 9/10 | 163 s | $0.0035 | $0.0039 | 1 wall_killed_identical_call_loop |
| p3_business | 27B (no-think) | 8/10 | 171 s | $0.0031 | $0.0536 | 2 wall_killed_identical_call_loop |
| p3_doc | Coder-Next | 10/10 | 37 s | $0.0007 | $0.0007 | none |
| p3_doc | 27B (thinking) | 6/10 | 1113 s | $0.0201 | $0.0712 | 4 wall_killed_identical_call_loop (word-trim) |
| p3_doc | 27B (no-think) | 8/10 ★ | 144 s | $0.0026 | $0.0495 | 2 wall_killed_identical_call_loop (word-trim, halved) |
| p3_market | Coder-Next | 0/10 | 2294 s | $0.0435 | ∞ | 5 stuck + 4 api_error: HTTP 400 + 1 wall_killed |
| p3_market | 27B (thinking) | 8/10 | 1720 s | $0.0330 | $0.046 | 2 api_error: timed out (transient) |
| p3_market | 27B (no-think) | 7/10 | 2277 s | $0.0411 | $0.049 | 1 runaway-gen + 2 op-SIGTERM scroll-loop |
★ `p3_doc` 27B-no-think 8/10 vs 6/10 thinking-mode is the standout finding — disabling thinking halves the word-limit-trim loop rate (4/10 → 2/10).
Reading the table for a deployment decision:
- Lowest $/ship for a given cell:
  - `p2_hallucination` → 27B-no-think
  - `p3_business` → Coder-Next (60-100× cheaper than 27B variants)
  - `p3_doc` → Coder-Next (70× cheaper than 27B variants)
  - `p3_market` → 27B-thinking (Coder-Next is unusable; 27B-no-think slightly cheaper but with a higher pathology rate)
- Highest reliability per cell: 27B-no-think on `p2_hallucination` (10/10), Coder-Next on `p3_business`/`p3_doc` (10/10), 27B-thinking on `p3_market` (8/10).
- No single model wins all four cells. Mixed-model deployment is justified by this data if you care about either ship rate or $/ship across all four.
- 27B-no-think is the most reliable shipper of the three on like-for-like cells (86.8% vs 75% vs 62.5%). The pre-Phase-B framing of "27B vs Coder-Next" needs a third arm — for tasks where ship rate matters more than thinking-mode polish, no-think 27B is the operational pick.
- 27B-no-think rescues `p3_doc` from the documented 27B word-trim loop (4/10 wall_killed → 2/10 wall_killed).
- Coder-Next's `p3_market` 0/3 → 0/10 at N=10, confirmed as a stable failure shape, Wilson 95% [0%, 27.8%]. Coder-Next does not drive internet-research workflows.
- Coder-Next's `p2_hallucination` 1/3 PASS → 5/10 stuck at N=10, Wilson 95% [23.7%, 76.3%] — bounded as a real ~50% failure shape, not a 1-of-N flake.
- Two new pathologies surfaced (now in `tooling/FAILURE-TAXONOMY.md`):
  - `scroll-loop` (sub-label of `identical-call-loop`) — the model walks an HTML response in fixed-byte slices; raw command hashes differ, so the harness's content-hash same-content guard doesn't fire. Caught in `p3_market_27b-nothink_v1` (155 iters) and `_v8` (31 iters).
  - `runaway-generation` (new primary) — a single model response exceeds the harness's max-output-tokens budget without stopping. Caught in `p3_market_27b-nothink_v5` (137,855 tokens).
- Ship rate ≠ PASS rate. PASS-rate analysis pending the batch-grader sweep against the no-think tarballs.
- Cross-batch comparisons on N=3 P1 cells include harness-drift effects (different file_sha256 between batches). Within the 4 N=10 differential cells, harness is consistent across all three model arms.
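The scroll-loop gap above comes from hashing raw commands: successive reads differ only in their byte offsets, so the hashes never repeat. One way to catch that shape is to mask numerals before hashing. A hypothetical sketch — this is not the harness's actual guard or the internals of `check_substance.py`:

```python
import hashlib
import re
from collections import Counter

def normalized_call_hash(command: str) -> str:
    """Hash a tool call with numeric arguments masked, so
    'read page.html bytes 0-4096' and 'bytes 4096-8192' collide."""
    masked = re.sub(r"\d+", "N", command)
    return hashlib.sha256(masked.encode()).hexdigest()

def looks_like_scroll_loop(recent_commands: list[str], threshold: int = 20) -> bool:
    """Fire if one masked call shape dominates the recent window."""
    counts = Counter(normalized_call_hash(c) for c in recent_commands)
    return bool(counts) and max(counts.values()) >= threshold

window = [f"read page.html bytes {i * 4096}-{(i + 1) * 4096}" for i in range(30)]
print(looks_like_scroll_loop(window))  # → True
```

The threshold and window size are tuning assumptions; too aggressive a mask would also collide legitimately distinct numeric commands.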
Strong claims:
- Cloud entries (Opus-4.7, GPT-5.5) reliably ship complete deliverables on all three benchmarks. Local 30B-class quantized entries do not.
- Spec compliance and verdict accuracy are different axes. On dreamserver-1-pr-audit, Coder-Next has 100% spec compliance and 33% factual accuracy. 27B has 0% spec compliance (no `verdict.md`) and 100% factual accuracy in the implicit verdicts present in `review.md`. From the artifact alone you can't tell which mode you're in for any given Coder-Next run; the wrong runs include fabricated evidence (line citations to non-existent issues, fake test scripts).
- Cost-per-attempt at N=1: Coder-Next is ~4× cheaper than 27B. For an ensemble-with-verification deployment shape, the economics favor running Coder-Next 3+ times with verification over running 27B once.
- 35B-A3B-AWQ at 4-bit is below the floor for these tasks: 0 of 3 wallstreet attempts shipped; 0 of 1 dreamserver-1-pr-audit attempts shipped. Higher-precision quantizations untested.
Weaker / not-yet-supported:
- "Coder-Next is X% wrong on PR review in general" — current evidence is 2/3 wrong on a single PR. Need more PRs and more N to pin a real rate.
- "27B is reliably better than Coder-Next for analytical work" — likely true, but the evidence is qualitative (the 3 hand-written reviews in `dreamserver-75-pr-audit/Qwen3.6-27B-AWQ/` are clean; 27B's `review.md` content on PR #1057 is excellent). Phase 3 hand-grading sharpens this: 27B prose quality 5/5 on doc-synthesis, business-memo bias-pushback 5/5; Coder-Next 4/5 on the same axes.
- "Cloud models are N× better than local on this benchmark" — the categorical gap is clear (cloud ships, local mostly doesn't), but per-claim accuracy for the cloud entries isn't graded with the same methodology used on the local entries.
- "27B citations on the market-research microbench are valid" — 18 of 33 URLs (~55%) were sampled and validated: 86% mostly-valid / 64% strict-valid out of 14 testable URLs (4 were inaccessible to the validator from this IP). Measured `citations_valid_pct = 75`. Important nuance: factual content (prices, certifications) is 100% accurate in the validated sample; the error mode is URL drift, not fabrication. The remaining 15 URLs are unverified — the sample is large enough to assert most citations are valid, but not "all 33."
The recommendations below are conditional on the task class this benchmark covers — long-horizon agentic work, structured deliverables, real-world-shaped tasks. They don't speak to interactive chat, single-question Q&A, or coding completion. For those, this benchmark has no signal.
- Hallucination resistance is required. The single sharpest local-model superiority signal in this repo: on the adversarial-hallucination microbench (15 issues, 6 real / 9 fabricated, agent must classify), 27B was 3/3 PASS with 100% accuracy and 0 dangerous errors; Coder-Next was 1/3 PASS with 2 confirmed-fabrications-as-real on the one shipping run. For security review, factual research, anything where confidently-wrong is dangerous, 27B is the pick.
- Internet-research-driven workflows. The second-sharpest local-model superiority signal: the market-research microbench saw 27B 3/3 STRUCTURAL_PASS (5 products, 12-18 inline cites to 29-33 distinct URLs) and Coder-Next 0/3 STRUCTURAL_FAIL. 27B drives sustained multi-step research that Coder-Next doesn't. Caveat: STRUCTURAL_PASS only — sample-grade the citations rather than consuming them blind.
- Tight-schema structured tasks. 27B was 100% on 20-field extraction across 3 runs, 100% duplicate-cluster recall on triage, 12/12 PASS on Phase 2 programmatic graders. The "27B doesn't ship" framing from the dreamserver-PR-audit benchmark turned out to be task-class-specific (unbounded markdown narrative). When the deliverable shape is constrained, 27B ships cleanly.
- Human-in-the-loop review work where intermediate analysis matters more than spec-compliant deliverables. 27B's `review.md` and `research/questions.md` content was the highest-quality across all six N=1 runs on PR #1057. A reviewer reading that as research notes would get a substantively correct read on the catalog-handling distinction.
- Tasks where truthfulness matters more than artifact obedience. 27B's failures are usually "didn't finish" or "didn't write the spec-required file"; it doesn't fabricate confident-but-wrong claims with citations.
- Tasks where you control the downstream consumption. If your pipeline expects markdown notes (not a structured `verdict.md` JSON), 27B's output is usable as-is.
The third arm introduced in microbench-phase-b-2026-05-02 — the same base model as 27B-thinking, but with the `<think>` trace disabled. It has become the recommended default for many cells based on the N=10 expansion data:
- Default for most non-coding tasks at N=10 ship-rate grain. 27B-no-think hit 95.8% across the full 12-cell grid (Wilson 95% [90.5%, 98.2%]) and 86.8% on the 4 hardest cells. If you're picking a single local model for a bulk run and you're not specifically trying to extract a polished reasoning narrative, no-think 27B ships more reliably than either thinking-mode 27B or Coder-Next.
- Doc synthesis with a tight word limit. Halves the documented 27B word-trim loop rate (4/10 wall_killed → 2/10 on `p3_doc`). The mechanism: thinking-mode amplifies "deliberate-without-progressing" loops; no-think writes once and commits.
- Adversarial hallucination. 27B-no-think 10/10 ship at $0.0023/run — the cleanest result of any arm on this cell. Beats thinking-mode 7/10 and Coder-Next 5/10.
- Anything where you'd otherwise pick 27B-thinking. Per the pairwise quality study, no-think and thinking are substantively equivalent on hand-graded deliverable correctness — the difference is verbosity of reasoning prose, not output decisions. For decision-making, treat them as one "27B model" with a thinking-flag for prose density.
When to prefer thinking over no-think:
- When you specifically want dense reasoning prose alongside the deliverable (e.g. for human reviewers tracing the model's logic). No-think output is leaner.
- On `p3_business` at the margin (9/10 thinking vs 8/10 no-think — within sampling noise; either works).
Caveats:
- PASS rate not yet measured on the no-think tarballs — the higher ship rate could reflect genuinely higher PASS rates, or it could be shipping briefs/memos that don't quite meet spec (e.g. exceeding word limits). Grader sweep pending. This makes the no-think headline provisional on the harder cells.
- Two new pathologies surfaced during the no-think grid: `scroll-loop` (model walks an HTML response in fixed-byte slices) and `runaway-generation` (single response exceeds the max-output-tokens budget). Both are caught by `tooling/scripts/check_substance.py` but not by the harness's own stuck-detector — operator monitoring is required on long chains.
- Untested at dreamserver scope — the no-think arm hasn't been run against the 1-PR or 75-PR audit. Hypothesized to help with the `verdict.md` production issue 27B-thinking had on PR #1057, but unmeasured.
- Pipelines where artifact shape is required and a verifier exists. If your downstream consumes `verdict.md`/tag/`done()` and you have a separate check for correctness (a second model, a human pass, regression tests), Coder-Next ships reliably and is ~4× cheaper than 27B.
- Bounded business-memo, triage, and writing-rewrite tasks. Coder-Next was 3/3 PASS on business-memo (bias-signal recall), 3/3 PASS on triage at 96.7% category accuracy (better than 27B's 86.7%), and 2/3 PASS on writing-editing. Below 27B's cost-per-attempt by 4-12× and competitive on accuracy.
- Ensemble-with-verification setups. Run Coder-Next 3-5 times on the same input, take majority vote, flag dissent for human review. The variance characteristic is documented; you can build around it.
- Time-to-output matters. ~3 min/attempt at N=1 vs ~7 min for 27B. If artifact completion + speed beats verdict-accuracy in your loss function, Coder-Next is the pick.
- Single-shot autonomous high-stakes verdicts (security review, financial recommendations consumed without verification, anything cited downstream). Coder-Next's fabrication risk is documented; 27B's no-ship risk means the verdict you'd cite isn't there in machine-readable form.
- Long-horizon (>30 min) unattended work. Both models find degenerate failure modes within 30-60 minutes on the 75-PR task. Coder-Next loops; 27B Goodharts the spec or hits the per-response token cap.
- Internet-research-driven workflows on Coder-Next specifically. Coder-Next was 0/3 STRUCTURAL_FAIL on the market-research microbench (stuck-in-research, api-error). 27B does fine on the same task — see "When to use 27B" above. If you only have Coder-Next, have a human gather sources first.
- Tasks where fabricated-but-plausible technical claims are dangerous. If a wrong cited line number or invented test would mislead someone with cleanup cost > the win from automation, don't use Coder-Next single-shot. 27B is safer here, but its no-ship failure means the real review work has to be done by hand.
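The ensemble-with-verification shape recommended above can be sketched as a simple majority vote with a dissent flag. The function name, verdict labels, and 60% agreement threshold are illustrative choices, not tooling from this repo:

```python
from collections import Counter

def ensemble_verdict(verdicts: list[str], min_agreement: float = 0.6):
    """Majority vote over N runs on the same input.
    Returns (verdict, needs_human_review)."""
    counts = Counter(verdicts)
    top, top_n = counts.most_common(1)[0]
    if top_n / len(verdicts) >= min_agreement:
        return top, False          # consensus: ship the majority verdict
    return None, True              # dissent: escalate to a human

# Three hypothetical Coder-Next runs on one PR (labels from the PR-audit task):
print(ensemble_verdict(["MERGE", "MERGE", "REJECT"]))  # → ('MERGE', False)
print(ensemble_verdict(["MERGE", "REJECT"]))           # → (None, True)
```

With the documented ~2/3-wrong single-run variance, a vote like this converts per-run noise into an explicit escalation signal rather than a silently wrong verdict.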
- The current evidence in this repo argues against using it for this task class. 0 of 3 wallstreet attempts shipped; 0 of 1 PR-audit attempts produced any artifact. Higher-precision quantizations might help; 4-bit AWQ at 3B active params doesn't clear the floor.
- Long-horizon autonomous work where correctness matters and verification budget is limited. Both cloud entries shipped complete deliverables on all three benchmarks. The categorical gap to local is large — we observed local-model fabrication and incomplete deliverables across local entries; cloud entries didn't show those failure shapes.
- Tasks where artifact completion is non-negotiable. Cloud entries reliably produce the spec-shaped output. Local entries don't.
- This said, the cloud entries here aren't graded with the same per-claim methodology used on the local entries (see KNOWN-LIMITATIONS § comparison-to-cloud). The cloud-vs-local gap is currently established at the categorical level (shipping vs not), not at the per-claim accuracy level.
These additions would tighten the recommendations above; until they land, the recommendations are best read as "based on this evidence" rather than "definitive":
- Validate the remaining 15 unsampled URLs and the 4 inaccessible-from-this-validator URLs on the 27B market-research entry. Currently 18/33 sampled, measured citations_valid_pct = 75. The 4 inaccessible URLs (PCMag, ZDNet, two LastPass pages) are blocked by Cloudflare from the validator's IP — they could be sampled from a different IP, or the agent's specific cited content could be cross-referenced from archive.org. Would tighten the measured number from 75 (sample) to a fully-measured rate.
- Per-claim rubric applied uniformly to cloud entries. Phase 3 hand-grading is now done for the local entries (prose, stance, source skepticism, balance, citations, tone fit, faithfulness, fabrication count); cloud entries (Opus-4.7, GPT-5.5) on the older benchmarks haven't been graded with the same rubric. Would let cloud-vs-local comparisons go beyond "shipping vs not."
- Failed-run artifacts published (receipts + transcripts for the 5+ unsuccessful local-model runs not currently in MMBT). Would let a reader see expected failure modes per model.
- N=10+ on the highest-signal cells (Coder-Next on dreamserver-1-pr-audit, 27B on the same; both on `microbench-2026-04-28` adversarial-hallucination). Would bound the variance the current N=3 only suggests.
- Different PR shapes in the dreamserver-1-pr-audit family — the current PR has subtle architectural distinctions; a docs-only PR or a security PR would test different failure modes.
- Higher-precision quantizations of the same models (FP8, BF16). Particularly for 35B-A3B which fails at 4-bit; might be a quantization-headroom issue rather than a base-model issue.
None of these are in scope for the current MMBT publication. They're separate experiments.