
# Scorecard

A single-page comparison across the published entries. The detail lives in entry READMEs and findings docs; this is the synthesis.

For a head-to-head decision between Coder-Next, 27B-thinking, and 27B-no-think organized by task class, see COMPARISON.md. This SCORECARD is the grand summary; COMPARISON is the model-selection synthesis.

Read KNOWN-LIMITATIONS.md before quoting any cell. Several columns are hand-graded against ground truth where it exists and marked "not graded" where it doesn't. Confidence levels are noted per column.

## Cell-name legend (microbench)

The microbench tables below alternate between p-codes (used in receipt names and tooling) and human-readable names (used in folder names). Mapping:

| p-code | Human name | What it tests |
| --- | --- | --- |
| `p1_bugfix` | bug-fixing | Fix planted bugs in `logalyzer/` |
| `p1_refactor` | refactoring | Refactor `logalyzer/` per spec |
| `p1_testwrite` | test-writing | Write tests for `logalyzer/` |
| `p2_ci` | ci-failure-debugging | Diagnose + fix CI failures in `discountkit/` |
| `p2_extract` | structured-extraction | 20-field JSON extraction from a press release |
| `p2_hallucination` | adversarial-hallucination | Distinguish 6 real bugs from 9 fabricated |
| `p2_triage` | customer-support-triage | Closed-vocab classification + dup-cluster recall |
| `p3_business` | business-memo | Bias-detection memo from a deal pack |
| `p3_doc` | doc-synthesis | 700-word brief from 5 source docs |
| `p3_market` | market-research | 5-product comparison with cited live URLs |
| `p3_pm` | project-management | Workstream + risk synthesis from meeting notes |
| `p3_writing` | writing-editing | 3-audience rewrite of a post-mortem |

## What the columns mean

- Runs published: how many of the model's attempts on this task are represented in MMBT, vs how many were attempted in total (in the source bench repo). Cherry-picked-best-of-N is the publishing default for entries where any attempt shipped; the other attempts are described in entry READMEs.
- Spec compliance: did the run produce all the artifacts the task spec required? Strong evidence — file existence is checkable.
- Factual accuracy: for tasks with verifiable ground truth, does the model's verdict match it? Strong evidence on dreamserver-1-pr-audit (PR #1057 has a known-correct MERGE per the canonical hand-written review and the Opus-4.7 audit). Not graded on dreamserver-75-pr-audit (would need per-PR ground truth across all 75) or on wallstreet (BUY/HOLD/SELL is opinion, not verifiable as right/wrong without market hindsight).
- Fabricated claims: count of hand-graded false-but-confident technical claims in the verdict / review (e.g. citing line numbers for issues that aren't in the diff, or asserting behavior the code doesn't have). Strong evidence on dreamserver-1-pr-audit. Not graded elsewhere yet — that would need a per-claim rubric pass over hundreds of claims.
- Tests actually run: did the agent invoke the upstream test suite during the run? Strong evidence — visible in the transcript.
- Wall: median (or single-run) wall time. Hardware-specific (Tower2 — see KNOWN-LIMITATIONS).
- Cost (upper): upper-bound USD estimate from cost.json. Assumes the GPU drew its power.limit for the entire run; real draw is lower. Suggestive only — for ranking, not absolute economics. A sketch of the arithmetic follows this list.
- Failure mode: the primary label.json taxonomy entry (tooling/FAILURE-TAXONOMY.md).
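For concreteness, the cost-ceiling arithmetic in one place. A minimal sketch, assuming the $0.13/kWh rate quoted in the phase-b table below; the 300 W cap and the function name are illustrative, not the repo's actual tooling:

```python
def upper_bound_cost_usd(wall_seconds: float,
                         power_limit_watts: float,
                         usd_per_kwh: float = 0.13) -> float:
    """Cost ceiling: pretend the GPU drew its full power.limit for the
    whole run. Real draw is lower, so this over-estimates."""
    kwh = power_limit_watts * wall_seconds / 3_600_000  # watt-seconds -> kWh
    return kwh * usd_per_kwh

# e.g. a 24-minute run on a GPU capped at 300 W:
print(f"${upper_bound_cost_usd(24 * 60, 300):.3f}")  # -> $0.016
```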

## dreamserver-75-pr-audit

Audit 75 open PRs in a live repository and produce a traceable maintainer triage repo.

| Model | Runs | Spec | Factual accuracy | Fabricated | Tests | Wall | Cost | Failure mode |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Opus-4.7 (cloud) | 1/1 | ✓ full | not graded (would need per-PR ground truth across all 75) | not graded | not visible from artifacts | ~5 hr | n/a | success-shipped |
| GPT-5.5 (cloud) | 1/1 | ✓ full + verify_coverage.py self-check passes | not graded | not graded | not visible from artifacts | not recorded | n/a | success-shipped |
| Qwen3.6-27B-AWQ (local) | 1/3 published | △ 75/75 verdict.md files, but only 3 are real reviews; 72 are template stubs | partial (3 reviewed PRs match ground truth; 72 stubs unverified) | 0 in the 3 deep reviews | 0 | 24 min | $0.031 | scaffold-and-stop |
| Qwen3-Coder-Next-AWQ (local) | 0/5 | ✗ no deliverable across 5 attempts | n/a | n/a | n/a | 1-42 min | $0.001-$0.054 | identical-call-loop, cyclic-name-slop, stuck-in-research |

## dreamserver-1-pr-audit (PR #1057, known-correct verdict: MERGE)

Same task spec scaled down to a single PR. Ground truth on PR #1057 was established three independent ways: the canonical hand-written review, the actual public diff, and Opus-4.7's audit. All three agree: MERGE. The catalog-handling architectural distinction (`_handle_model_list` vs `_handle_model_download`) is the trap that separates surface-pattern matchers from architectural readers.

| Model | Runs | Spec | Factual accuracy | Fabricated | Tests | Wall | Cost | Failure mode |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-Coder-Next-AWQ | 1/3 published (cherry-picked correct) | ✓ 13/13 files, tag, done() | 2/3 wrong across the three runs (this entry's run is the 1 correct; v1 and v3 said REJECT incorrectly) | 1 in v1, 4 in v3 (including a fabricated test_stderr_truncation.py repro script) | no execution | 3 min | $0.004 | success-shipped (cherry-picked) |
| Qwen3.6-27B-AWQ | 1/3 published | ✗ 7/13 files; no verdict.md, no tag, no done() in any of 3 runs | 3/3 implicit-MERGE-correct (in review.md's Summary of Findings table; never in a verdict.md) | 0 | pytest invoked, 38 tests on both branches | 7 min | $0.009 | partial-no-spec-output |
| Qwen3.6-35B-A3B-AWQ | 1/1 | ✗ 0/13 — zero artifacts | n/a (nothing produced) | n/a | pytest run but no artifacts written | 1.7 min | $0.002 | floor-failure |

## wallstreet-intern-test

Build a complete investment memo on any publicly traded US company with $1B-$10B market cap. Every number traceable from raw source → model → memo. Recommendations (BUY/HOLD/SELL) are opinion, not graded as right/wrong — the verifiable axes are spec compliance, source traceability, and fabrication count.

| Model | Company | Rec | Runs | Spec | Factual accuracy | Fabricated | Wall | Cost | Failure mode |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Opus-4.7 (cloud) | Vita Coco (COCO) | HOLD ($46 vs $52 spot) | 1/1 | ✓ full memo + machine-readable verification | not graded (opinion) | not graded | not recorded | n/a | success-shipped |
| GPT-5.5 (cloud) | YETI Holdings (YETI) | HOLD ($41) | 1/1 | ✓ full memo + verification + board-deck follow-on | not graded (opinion) | not graded | not recorded | n/a | success-shipped |
| Qwen3.6-27B-AWQ (local) | GitLab (GTLB) | BUY | 1/3 published | ✓ full memo + 17 KB XLSX | not graded (opinion) | not graded | 27 min | $0.032 | success-shipped (cherry-picked) |
| Qwen3-Coder-Next-AWQ (local) | DocuSign (DOCU) | BUY | 1/3 published | ✓ full memo + 10.6 KB XLSX | not graded (opinion); caveat: this model's PR-audit verdicts were 2/3 wrong with fabricated evidence, and the same risk likely extends to BUY calls | not graded | 11 min | $0.013 | success-shipped (cherry-picked) |
| Qwen3.6-35B-A3B-AWQ (local) | — | — | 0/3 | ✗ no usable deliverable | n/a | n/a | 0.2-7 min | $0.0002-$0.0085 | floor-failure / api-error / stuck-in-research |

## microbench-2026-04-28 (12 task families × 2 models × N=3)

Smaller-scope task families than the dreamserver/wallstreet benchmarks above — each task is a 5-30 minute deliverable rather than a multi-hour audit. Phase 1 = coding (programmatic graders). Phase 2 = structured business tasks (programmatic graders). Phase 3 = unbounded business/writing tasks (a mix of programmatic graders and hand-grading placeholders). N=3 per cell. Cherry-picked-best-of-N is not the publishing default here — the table reports N=3 PASS rates so variance is visible. See benchmarks/microbench-2026-04-28/findings.md for cross-cutting analysis. Entries are published only for the most signal-rich task families to avoid bloating the repo with 60+ tiny folders; the full results table is reproduced below for completeness.

One task-design issue, called out separately, affects two families: p1_testwrite and p1_refactor use a shared starter (`logalyzer/`) with a known broken import (`from collections import Iterable`, removed in Python 3.10). Both models scored 0/3 PASS on these, but the failure reflects the tension between fixing the starter and staying within task scope, not pure model failure. See the findings doc § "Test-writing and refactoring task-design issue".
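For reference, the broken starter import and its modern replacement (a minimal illustration; the starter's surrounding code isn't shown here):

```python
# The starter's import -- deprecated since Python 3.3, removed in 3.10,
# so it raises ImportError on any current interpreter:
#   from collections import Iterable

# The working form:
from collections.abc import Iterable

print(isinstance([1, 2, 3], Iterable))  # True
```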

| Phase | Task | 27B PASS | Coder-Next PASS | 27B median wall | Coder median wall | 27B median cost | Coder median cost | Notable |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | bug-fixing (logalyzer) | 3/3 | 2/3 | 18.0 min | 11.5 min | $0.023 | $0.015 | both ship; coder-v3 killed at iter 540 (post-completion drift) |
| 1 | test-writing (logalyzer) | 0/3 † | 0/3 † | 9.6 min | 14.0 min | $0.012 | $0.018 | task-design issue (broken import) — see caveat |
| 1 | refactoring (logalyzer) | 0/3 † | 0/3 † | 5.4 min | 5.4 min | $0.007 | $0.007 | task-design issue — see caveat |
| 2 | structured extraction | 3/3 | 3/3 | 1.2 min | 0.3 min | $0.0015 | $0.0004 | 27B 100% on 20 fields; coder ~92% |
| 2 | CI failure debugging | 3/3 | 3/3 | 2.1 min | 1.2 min | $0.003 | $0.0015 | both clean; coder cheaper |
| 2 | adversarial hallucination | 3/3 | 1/3 | 3.4 min | 25.9 min | $0.004 | $0.034 | 27B 100% / 0 dangerous; coder 2/3 stuck-detector fired, ship-with-2-dangerous-errors |
| 2 | customer support triage | 3/3 | 3/3 | 3.3 min | 1.0 min | $0.004 | $0.0013 | coder 96.7% category, 27B 86.7% (both 100% dup-cluster recall) |
| 3 | document synthesis | 0/3 †† | 2/3 | 32.7 min ‡ | 0.6 min | $0.043 ‡ | $0.0008 | 27B 8/8 facts every run but couldn't trim to 700 words (765, 775, 768); 2 of 3 stuck in identical-call-loop trying to trim; coder hit limit 2/3 |
| 3 | business memo | 2/3 | 3/3 | 2.8 min | 0.5 min | $0.0037 | $0.0007 | both 8/8 bias signals every run; 27B v3 hit 708 words (1 over) |
| 3 | market research | 3/3 ★ | 0/3 | 18.9 min | 19.1 min | $0.025 | $0.025 | 27B drives the internet-research workflow Coder-Next doesn't; all 3 27B runs evaluated all 5 products with 12-18 inline cites to 29-33 distinct URLs; Coder-Next STRUCTURAL_FAIL across all 3 runs |
| 3 | writing/editing (3-audience rewrite) | 0/3 | 2/3 | 2.8 min | 0.4 min | $0.0036 | $0.0005 | 27B 0/3, all single-subdimension fails (customer_email missing a required keyword); ceo_brief + legal_summary PASS in all 3 |
| 3 | project management synthesis | 0/3 | 1/3 | 1.3 min | 0.3 min | $0.0017 | $0.0003 | both: workstreams 6/6 every run, but only 2-3/6 risks recalled (multi-week risks missed) |

† p1_testwrite / p1_refactor failures are correlated with the starter-codebase task-design issue; see the microbench findings doc § "Test-writing and refactoring task-design issue" before drawing model-quality conclusions from these rows.

†† All 3 27B doc-synthesis runs captured all 8 planted facts but couldn't trim to the 700-word limit. 2 of 3 (v2, v3) hit an identical-call-loop on the same brief.md content for 50-130+ iters and were manually advanced to keep the chain moving. The pattern is a documented 27B failure shape, not a transient bug.

‡ 27B doc-synthesis median wall is dominated by the wall-killed v2/v3 runs (32.7 min, $0.043). The cleanly-completed v1 was 8 min / $0.011.

★ An inversion of the prior expectation in the findings doc: 27B can drive sustained internet-research workflows that Coder-Next doesn't. Citation-validity pass (18 of 33 URLs from p3_market_27b_v1 validated on 2026-04-28): 9 strong-valid (factual claim exactly matches live page), 3 partial-valid (claim mostly right with minor specificity issues), 2 confirmed-wrong URLs (404), and 4 inaccessible to the validator. Of 14 testable URLs, 12 (86%) are mostly-valid and 9 (64%) are strict-valid. Measured citations_valid_pct = 75 (was a 90 estimate). fabricated_stats_count = 0 — every checkable factual claim (prices, certifications, products) matched live data. The critical observation: the error mode is URL drift (wrong or dead URLs cited), not fabricated facts — a meaningfully different failure shape from the dreamserver-1-pr-audit Coder-Next variance, which fabricated technical evidence with confident citations.

Headline reads from this table (post 27B Phase 3 completion):

- 27B is reliable on tight-schema tasks. Phase 2's 12 programmatically graded runs: 12/12 PASS. The "27B doesn't ship" framing from the dreamserver-PR-audit benchmark was task-class-specific — when the deliverable is a constrained-shape JSON or markdown-with-clear-keys, 27B ships cleanly.
- 27B has a documented word-limit-trim failure mode. Doc-synthesis: 8/8 planted facts captured in every run, but 0/3 PASS because the model cannot reliably compress to a tight word limit; 2 of 3 runs entered identical-call-loops trying. Coder-Next handled this better (2/3 PASS).
- Big inversion on market research. 27B was 3/3 STRUCTURAL_PASS (5-product evaluations, 12-18 inline cites, 29-33 distinct URLs); Coder-Next was 0/3 STRUCTURAL_FAIL. Internet-research workflows aren't hopeless for local models — they're a 27B strength, just not Coder-Next's. (Citation validity was subsequently measured — see the ★ note above; this row is structural completion only.)
- Coder-Next has a real hallucination-resistance gap. Adversarial-hallucination: 27B 3/3 with 100% accuracy and 0 dangerous errors; Coder-Next 1/3, with the one ship attempt landing 2 confirmed-fabrications-as-real (right at the safety threshold). Same failure shape as the documented dreamserver-1-pr-audit Coder-Next variance.
- Cost-per-attempt: Coder-Next is 4-12× cheaper when it ships. When it doesn't ship (stuck-detector cases), it spends 25+ minutes and ~$0.03 producing nothing, which inverts the economics for tasks that require hallucination resistance.
- Both miss multi-week risks on PM synthesis. Workstream + decision recall is excellent (6/6 and 3-4/4 every run for both models), but risk recall is 2-3/6 across all runs and both models — multi-week-spanning risks are systematically dropped.

Update (2026-05-02): microbench-phase-b-2026-05-02 bumps the four highest-signal cells of this table to N=10 with proper Wilson 95% CIs, and adds 27B-no-think as a third arm across the full 12-family grid. Several N=3 hints from the table above are now bounded — see § "microbench-phase-b-2026-05-02" below.


## microbench-phase-b-2026-05-02 (N=10 expansion + 27B-no-think third arm)

Bumps the 4 differential cells from N=3 → N=10 and adds 27B-no-think across all 12 families. ~240 runs total. See benchmarks/microbench-phase-b-2026-05-02/findings.md for full breakdown.

### Headline ship rates (done_signal — not PASS rate; PASS pending grader sweep)

| Model | Coverage | Ship rate | Wilson 95% CI |
| --- | --- | --- | --- |
| Qwen3-Coder-Next-AWQ | 4 cells × N=10 + 8 cells × N=3 = 63 runs | 47/63 = 74.6% | [62.5%, 83.9%] |
| Qwen3.6-27B-AWQ (thinking) | 4 cells × N=10 + 8 cells × N=3 = 62 runs | 46/62 = 74.2% | [62.0%, 83.7%] |
| Qwen3.6-27B-AWQ (no-think) | 12 cells × N=10 = 118 graded + 2 op-labeled | 113/118 = 95.8% | [90.5%, 98.2%] |
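The CIs are standard Wilson score intervals. A minimal dependency-free sketch (my implementation, not the repo's tooling) that reproduces the no-think row's bounds and the 0/10 bound quoted in the headline reads below:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

print(wilson_ci(0, 10))     # ~(0.000, 0.278) -- Coder-Next p3_market bound
print(wilson_ci(113, 118))  # ~(0.905, 0.982) -- the 27B-no-think row
```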

### Integrated decision table — 4 differential cells × 3 models at N=10

Single row per (cell × model) with ship rate, median wall, median cost, $/shipped-run, and primary failure mode. Cost numbers are upper-bound (wall × power.limit at $0.13/kWh).

| Cell | Model | Ship | Median wall | Median $ | $/ship | Primary failure mode |
| --- | --- | --- | --- | --- | --- | --- |
| p2_hallucination | Coder-Next | 5/10 | 422 s | $0.0092 | $0.032 | stuck_no_workspace_change_for_500_iters (5/10) |
| p2_hallucination | 27B (thinking) | 7/10 | 171 s | $0.0037 | $0.0045 | none on the 7 ships; 3 model_stopped |
| p2_hallucination | 27B (no-think) | 10/10 | 127 s | $0.0023 | $0.0023 | none |
| p3_business | Coder-Next | 10/10 | 31 s | $0.0006 | $0.0006 | none |
| p3_business | 27B (thinking) | 9/10 | 163 s | $0.0035 | $0.0039 | 1 wall_killed_identical_call_loop |
| p3_business | 27B (no-think) | 8/10 | 171 s | $0.0031 | $0.0536 | 2 wall_killed_identical_call_loop |
| p3_doc | Coder-Next | 10/10 | 37 s | $0.0007 | $0.0007 | none |
| p3_doc | 27B (thinking) | 6/10 | 1113 s | $0.0201 | $0.0712 | 4 wall_killed_identical_call_loop (word-trim) |
| p3_doc | 27B (no-think) | 8/10 ★ | 144 s | $0.0026 | $0.0495 | 2 wall_killed_identical_call_loop (word-trim, halved) |
| p3_market | Coder-Next | 0/10 | 2294 s | $0.0435 | — | 5 stuck + 4 api_error: HTTP 400 + 1 wall_killed |
| p3_market | 27B (thinking) | 8/10 | 1720 s | $0.0330 | $0.046 | 2 api_error: timed out (transient) |
| p3_market | 27B (no-think) | 7/10 | 2277 s | $0.0411 | $0.049 | 1 runaway-gen + 2 op-SIGTERM scroll-loop |

★ The standout finding is p3_doc: 27B-no-think ships 8/10 vs 6/10 for thinking mode — disabling thinking halves the word-limit-trim loop rate (4/10 → 2/10).

Reading the table for a deployment decision:

- Lowest $/ship for a given cell:
  - p2_hallucination → 27B-no-think
  - p3_business → Coder-Next (60-100× cheaper than the 27B variants)
  - p3_doc → Coder-Next (70× cheaper than the 27B variants)
  - p3_market → 27B-thinking (Coder-Next is unusable; 27B-no-think is slightly cheaper but has a higher pathology rate)
- Highest reliability per cell: 27B-no-think on p2_hallucination (10/10), Coder-Next on p3_business/p3_doc (10/10), 27B-thinking on p3_market (8/10).
- No single model wins all four cells. Mixed-model deployment is justified by this data if you care about either ship rate or $/ship across all four.

### Headline reads (updates to the picture above)

- 27B-no-think is the most reliable shipper of the three on like-for-like cells (86.8% vs 75% vs 62.5%). The pre-Phase-B framing of "27B vs Coder-Next" needs a third arm — for tasks where ship rate matters more than thinking-mode polish, no-think 27B is the operational pick.
- 27B-no-think rescues p3_doc from the documented 27B word-trim loop (4/10 wall_killed → 2/10 wall_killed).
- Coder-Next's p3_market 0/3 → 0/10 at N=10 is confirmed as a stable failure shape, Wilson 95% [0%, 27.8%]. Coder-Next does not drive internet-research workflows.
- Coder-Next's p2_hallucination 1/3 PASS → 5/10 stuck at N=10, Wilson 95% [23.7%, 76.3%] — bounded as a real ~50% failure shape, not a 1-of-N flake.
- Two new pathologies surfaced (now in tooling/FAILURE-TAXONOMY.md; a detection sketch follows this list):
  - scroll-loop (sub-label of identical-call-loop) — the model walks an HTML response in fixed-byte slices; the raw command hashes differ, so the harness's content-hash same-content guard doesn't fire. Caught in p3_market_27b-nothink_v1 (155 iters) and _v8 (31 iters).
  - runaway-generation (new primary) — a single model response exceeds the harness's max-output-tokens budget without stopping. Caught in p3_market_27b-nothink_v5 (137,855 tokens).
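Why a scroll-loop evades a raw-hash guard, and one way to catch it: a hypothetical sketch. The function, window, and threshold are illustrative; the repo's actual check lives in tooling/scripts/check_substance.py and may work differently.

```python
import re
from collections import Counter

def scroll_loop_suspect(commands: list[str],
                        window: int = 20, threshold: float = 0.8) -> bool:
    """Flag a likely scroll-loop: the same command repeated with only
    numeric arguments (byte offsets) changing. Raw hashes of these
    commands all differ, so an exact same-content guard never fires."""
    recent = commands[-window:]
    if len(recent) < window:
        return False
    # Canonicalize: collapse every number so offset-walks look identical.
    templates = [re.sub(r"\d+", "N", c) for c in recent]
    _, count = Counter(templates).most_common(1)[0]
    return count / window >= threshold

# read(page.html, 0, 4096), read(page.html, 4096, 8192), ... all
# canonicalize to read(page.html, N, N) and trip the detector.
```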

### Caveats (in addition to those on the original microbench table)

- Ship rate ≠ PASS rate. PASS-rate analysis is pending the batch-grader sweep against the no-think tarballs.
- Cross-batch comparisons on the N=3 P1 cells include harness-drift effects (different file_sha256 between batches). Within the 4 N=10 differential cells, the harness is consistent across all three model arms.

## What the data supports

Strong claims:

- Cloud entries (Opus-4.7, GPT-5.5) reliably ship complete deliverables on all three benchmarks. Local 30B-class quantized entries do not.
- Spec compliance and verdict accuracy are different axes. On dreamserver-1-pr-audit, Coder-Next has 100% spec compliance and 33% factual accuracy; 27B has 0% spec compliance (no verdict.md) and 100% factual accuracy in the implicit verdicts present in review.md. From the artifact alone you can't tell which mode you're in for any given Coder-Next run, and the wrong runs include fabricated evidence (line citations to non-existent issues, fake test scripts).
- Cost-per-attempt at N=1: Coder-Next is ~4× cheaper than 27B. For an ensemble-with-verification deployment shape, the economics favor running Coder-Next 3+ times and verifying over running 27B once.
- 35B-A3B-AWQ at 4-bit is below the floor for these tasks: 0 of 3 wallstreet attempts shipped; 0 of 1 dreamserver-1-pr-audit attempts shipped. Higher-precision quantizations are untested.

Weaker / not-yet-supported:

  • "Coder-Next is X% wrong on PR review in general" — current evidence is 2/3 wrong on a single PR. Need more PRs and more N to pin a real rate.
  • "27B is reliably better than Coder-Next for analytical work" — likely true but evidence is qualitative (the 3 hand-written reviews on dreamserver-75-pr-audit/Qwen3.6-27B-AWQ/ are clean; 27B's review.md content on PR #1057 is excellent). Phase 3 hand-grading sharpens this: 27B prose quality 5/5 on doc-synthesis, business-memo bias-pushback 5/5; Coder-Next 4/5 on the same axes.
  • "Cloud models are N× better than local on this benchmark" — categorical gap is clear (cloud ships, local mostly doesn't), but per-claim accuracy for the cloud entries isn't graded with the same methodology used on the local entries.
  • "27B citations on the market-research microbench are valid" — sampled 18 of 33 URLs (~55%) validated. 86% mostly-valid / 64% strict-valid out of 14 testable URLs (4 were inaccessible to the validator from this IP). Measured citations_valid_pct = 75. Important nuance: factual content (prices, certifications) is 100% accurate in the validated sample; the error mode is URL drift, not fabrication. The remaining 15 URLs are unverified — sample is large enough to assert most citations are valid but not "all 33."

## Model selection guide

The recommendations below are conditional on the task class this benchmark covers — long-horizon agentic work, structured deliverables, real-world-shaped tasks. They don't speak to interactive chat, single-question Q&A, or coding completion. For those, this benchmark has no signal.

### When to use Qwen3.6-27B-AWQ

- Hallucination resistance is required. The single sharpest local-model superiority signal in this repo: on the adversarial-hallucination microbench (15 issues, 6 real / 9 fabricated, agent must classify), 27B was 3/3 PASS with 100% accuracy and 0 dangerous errors; Coder-Next was 1/3 PASS with 2 confirmed-fabrications-as-real on its one shipping run. For security review, factual research, or anything where confidently-wrong is dangerous, 27B is the pick.
- Internet-research-driven workflows. The second-sharpest local-model superiority signal: the market-research microbench saw 27B 3/3 STRUCTURAL_PASS (5 products, 12-18 inline cites to 29-33 distinct URLs) and Coder-Next 0/3 STRUCTURAL_FAIL. 27B drives sustained multi-step research that Coder-Next doesn't. Caveat: STRUCTURAL_PASS only — sample-grade the citations rather than consuming them blind.
- Tight-schema structured tasks. 27B was 100% on 20-field extraction across 3 runs, 100% on duplicate-cluster recall in triage, and 12/12 PASS on Phase 2 programmatic graders. The "27B doesn't ship" framing from the dreamserver-PR-audit benchmark turned out to be task-class-specific (unbounded markdown narrative). When the deliverable shape is constrained, 27B ships cleanly.
- Human-in-the-loop review work where intermediate analysis matters more than spec-compliant deliverables. 27B's review.md and research/questions.md content was the highest-quality across all six N=1 runs on PR #1057. A reviewer reading that as research notes would get a substantively correct read on the catalog-handling distinction.
- Tasks where truthfulness matters more than artifact obedience. 27B's failures are usually "didn't finish" or "didn't write the spec-required file"; it doesn't fabricate confident-but-wrong claims with citations.
- Tasks where you control the downstream consumption. If your pipeline expects markdown notes (not a structured verdict.md JSON), 27B's output is usable as-is.

### When to use Qwen3.6-27B-AWQ with `--no-think` (added in phase-b)

The third arm introduced in microbench-phase-b-2026-05-02 — the same base model as 27B-thinking, but with the `<think>` trace disabled. Based on the N=10 expansion data, it has become the recommended default for many cells:

- Default for most non-coding tasks at the N=10 ship-rate grain. 27B-no-think hit 95.8% across the full 12-cell grid (Wilson 95% [90.5%, 98.2%]) and 86.8% on the 4 hardest cells. If you're picking a single local model for a bulk run and you're not specifically trying to extract a polished reasoning narrative, no-think 27B ships more reliably than either thinking-mode 27B or Coder-Next.
- Doc synthesis with a tight word limit. Disabling thinking halves the documented 27B word-trim loop rate (4/10 wall_killed → 2/10 on p3_doc). The mechanism: thinking mode amplifies "deliberate-without-progressing" loops; no-think writes once and commits.
- Adversarial hallucination. 27B-no-think shipped 10/10 at $0.0023/run — the cleanest result of any arm on this cell. Beats thinking mode's 7/10 and Coder-Next's 5/10.
- Anything where you'd otherwise pick 27B-thinking. Per the pairwise quality study, no-think and thinking are substantively equivalent on hand-graded deliverable correctness — the difference is the verbosity of the reasoning prose, not the output decisions. For decision-making, treat them as one "27B model" with a thinking flag for prose density.

When to prefer thinking over no-think:

- When you specifically want dense reasoning prose alongside the deliverable (e.g. for human reviewers tracing the model's logic). No-think output is leaner.
- On p3_business at the margin (9/10 thinking vs 8/10 no-think — within sampling noise; either works).

Caveats:

- PASS rate is not yet measured on the no-think tarballs — the higher ship rate could reflect genuinely higher PASS rates, or it could mean shipping briefs/memos that don't quite meet spec (e.g. exceeding word limits). The grader sweep is pending, which makes the no-think headline provisional on the harder cells.
- Two new pathologies surfaced during the no-think grid: scroll-loop (the model walks an HTML response in fixed-byte slices) and runaway-generation (a single response exceeds the max-output-tokens budget). Both are caught by tooling/scripts/check_substance.py but not by the harness's own stuck-detector — operator monitoring is required on long chains.
- Untested at dreamserver scope — the no-think arm hasn't been run against the 1-PR or 75-PR audit. It's hypothesized to help with the verdict.md production issue 27B-thinking had on PR #1057, but that's unmeasured.

### When to use Qwen3-Coder-Next-AWQ

- Pipelines where artifact shape is required and a verifier exists. If your downstream consumes verdict.md/tag/done() and you have a separate check for correctness (a second model, a human pass, regression tests), Coder-Next ships reliably and is ~4× cheaper than 27B.
- Bounded business-memo, triage, and writing-rewrite tasks. Coder-Next was 3/3 PASS on business-memo (bias-signal recall), 3/3 PASS on triage at 96.7% category accuracy (better than 27B's 86.7%), and 2/3 PASS on writing-editing. 4-12× below 27B's cost per attempt and competitive on accuracy.
- Ensemble-with-verification setups. Run Coder-Next 3-5 times on the same input, take the majority vote, and flag dissent for human review (a sketch follows this list). The variance characteristic is documented; you can build around it.
- Time-to-output matters. ~3 min/attempt at N=1 vs ~7 min for 27B. If artifact completion + speed beats verdict accuracy in your loss function, Coder-Next is the pick.
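A minimal sketch of that majority-vote shape; the function, verdict labels, and agreement threshold are illustrative, not repo tooling:

```python
from collections import Counter

def ensemble_verdict(verdicts: list[str],
                     min_agreement: float = 0.6) -> tuple[str, bool]:
    """Majority vote over N runs; flag dissent for human review."""
    top, count = Counter(verdicts).most_common(1)[0]
    needs_review = count / len(verdicts) < min_agreement
    return top, needs_review

# Three Coder-Next runs on the same PR, one dissenting:
verdict, needs_review = ensemble_verdict(["MERGE", "MERGE", "REJECT"])
print(verdict, needs_review)  # MERGE False -- 2/3 clears the 0.6 bar
```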

### When to avoid both local models

- Single-shot autonomous high-stakes verdicts (security review, financial recommendations consumed without verification, anything cited downstream). Coder-Next's fabrication risk is documented; 27B's no-ship risk means the verdict you'd cite isn't there in machine-readable form.
- Long-horizon (>30 min) unattended work. Both models find degenerate failure modes within 30-60 minutes on the 75-PR task. Coder-Next loops; 27B Goodharts the spec or hits the per-response token cap.
- Internet-research-driven workflows on Coder-Next specifically. Coder-Next was 0/3 STRUCTURAL_FAIL on the market-research microbench (stuck-in-research, api-error). 27B does fine on the same task — see "When to use 27B" above. If you only have Coder-Next, have a human gather sources first.
- Tasks where fabricated-but-plausible technical claims are dangerous. If a wrong cited line number or an invented test would mislead someone with a cleanup cost greater than the win from automation, don't use Coder-Next single-shot. 27B is safer here, but its no-ship failure means the real review work has to be done by hand.

### When to use Qwen3.6-35B-A3B-AWQ

- The current evidence in this repo argues against using it for this task class: 0 of 3 wallstreet attempts shipped, and 0 of 1 PR-audit attempts produced any artifact. Higher-precision quantizations might help; 4-bit AWQ with 3B active params doesn't clear the floor.

### When to use cloud (Opus-4.7 / GPT-5.5 class)

- Long-horizon autonomous work where correctness matters and verification budget is limited. Both cloud entries shipped complete deliverables on all three benchmarks. The categorical gap to local is large — we observed fabrication and incomplete deliverables across local entries; the cloud entries didn't show those failure shapes.
- Tasks where artifact completion is non-negotiable. Cloud entries reliably produce the spec-shaped output. Local entries don't.
- That said, the cloud entries here aren't graded with the same per-claim methodology used on the local entries (see KNOWN-LIMITATIONS § comparison-to-cloud). The cloud-vs-local gap is currently established at the categorical level (shipping vs not), not at the per-claim accuracy level.

## What would change this picture

These additions would tighten the recommendations above; until they land, the recommendations are best read as "based on this evidence" rather than "definitive":

1. Validate the remaining 15 unsampled URLs and the 4 inaccessible-from-this-validator URLs on the 27B market-research entry. Currently 18/33 are sampled; measured citations_valid_pct = 75. The 4 inaccessible URLs (PCMag, ZDNet, two LastPass pages) are blocked by Cloudflare from the validator's IP — they could be sampled from a different IP, or the agent's specific cited content could be cross-referenced from archive.org. This would tighten the measured number from 75 (sample) to a fully-measured rate.
2. A per-claim rubric applied uniformly to cloud entries. Phase 3 hand-grading is now done for the local entries (prose, stance, source skepticism, balance, citations, tone fit, faithfulness, fabrication count); the cloud entries (Opus-4.7, GPT-5.5) on the older benchmarks haven't been graded with the same rubric. This would let cloud-vs-local comparisons go beyond "shipping vs not."
3. Failed-run artifacts published (receipts + transcripts for the 5+ unsuccessful local-model runs not currently in MMBT). This would let a reader see the expected failure modes per model.
4. N=10+ on the highest-signal cells (Coder-Next on dreamserver-1-pr-audit, 27B on the same; both on microbench-2026-04-28/adversarial-hallucination). This would bound the variance the current N=3 only suggests.
5. Different PR shapes in the dreamserver-1-pr-audit family — the current PR has subtle architectural distinctions; a docs-only PR or a security PR would test different failure modes.
6. Higher-precision quantizations of the same models (FP8, BF16) — particularly for 35B-A3B, which fails at 4-bit; that might be a quantization-headroom issue rather than a base-model issue.

None of these are in scope for the current MMBT publication. They're separate experiments.