
# Scorecard

A single-page comparison across the published entries. The detail lives in entry READMEs and findings docs; this is the synthesis.

For a head-to-head decision between Coder-Next, 27B-thinking, and 27B-no-think organized by task class, see COMPARISON.md. This SCORECARD is the grand summary; COMPARISON is the model-selection synthesis.

Read KNOWN-LIMITATIONS.md before quoting any cell. Several columns are hand-graded against ground truth where it exists and marked "not graded" where it doesn't. Confidence levels are noted per column.

## Cell-name legend (microbench)

The microbench tables below alternate between p-codes (used in receipt names and tooling) and human-readable names (used in folder names). Mapping:

| p-code | Human name | What it tests |
| --- | --- | --- |
| `p1_bugfix` | bug-fixing | Fix planted bugs in `logalyzer/` |
| `p1_refactor` | refactoring | Refactor `logalyzer/` per spec |
| `p1_testwrite` | test-writing | Write tests for `logalyzer/` |
| `p2_ci` | ci-failure-debugging | Diagnose + fix CI failures in `discountkit/` |
| `p2_extract` | structured-extraction | 20-field JSON extraction from a press release |
| `p2_hallucination` | adversarial-hallucination | Distinguish 6 real bugs from 9 fabricated |
| `p2_triage` | customer-support-triage | Closed-vocab classification + dup-cluster recall |
| `p3_business` | business-memo | Bias-detection memo from a deal pack |
| `p3_doc` | doc-synthesis | 700-word brief from 5 source docs |
| `p3_market` | market-research | 5-product comparison with cited live URLs |
| `p3_pm` | project-management | Workstream + risk synthesis from meeting notes |
| `p3_writing` | writing-editing | 3-audience rewrite of a post-mortem |

## What the columns mean

- Runs published: how many of the model's attempts on this task are represented in MMBT, vs how many were attempted in total (in the source bench repo). Cherry-picked-best-of-N is the publishing default for entries where any attempt shipped; the other attempts are described in entry READMEs.
- Spec compliance: did the run produce all the artifacts the task spec required? Strong evidence — file existence is checkable.
- Factual accuracy: for tasks with verifiable ground truth, does the model's verdict match it? Strong evidence on dreamserver-1-pr-audit (PR #1057 has a known-correct MERGE per the canonical hand-written review and the Opus-4.7 audit). Not graded on dreamserver-75-pr-audit (would need per-PR ground truth across all 75) or on wallstreet (BUY/HOLD/SELL is opinion, not verifiable as right/wrong without market hindsight).
- Fabricated claims: count of hand-graded false-but-confident technical claims in the verdict / review (e.g. citing line numbers for issues that aren't in the diff, or asserting behavior the code doesn't have). Strong evidence on dreamserver-1-pr-audit. Not graded elsewhere yet — that would need a per-claim rubric pass over hundreds of claims.
- Tests actually run: did the agent invoke the upstream test suite during the run? Strong evidence — visible in the transcript.
- Wall: median (or single-run) wall time. Hardware-specific (Tower2 — see KNOWN-LIMITATIONS).
- Cost (upper): upper-bound USD estimate from cost.json. Assumes the GPU drew its power.limit for the entire run; real draw is lower. Suggestive only — for ranking, not absolute economics. A sketch of the arithmetic follows this list.
- Failure mode: the primary label.json taxonomy entry (tooling/FAILURE-TAXONOMY.md).
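For concreteness, the cost-ceiling arithmetic in one place. A minimal sketch, assuming the $0.13/kWh rate quoted in the phase-b table below; the 300 W cap and the function name are illustrative, not the repo's actual tooling:

```python
def upper_bound_cost_usd(wall_seconds: float,
                         power_limit_watts: float,
                         usd_per_kwh: float = 0.13) -> float:
    """Cost ceiling: pretend the GPU drew its full power.limit for the
    whole run. Real draw is lower, so this over-estimates."""
    kwh = power_limit_watts * wall_seconds / 3_600_000  # watt-seconds -> kWh
    return kwh * usd_per_kwh

# e.g. a 24-minute run on a GPU capped at 300 W:
print(f"${upper_bound_cost_usd(24 * 60, 300):.3f}")  # -> $0.016
```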

## dreamserver-75-pr-audit

Audit 75 open PRs in a live repository and produce a traceable maintainer triage repo.

| Model | Runs | Spec | Factual accuracy | Fabricated | Tests | Wall | Cost | Failure mode |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Opus-4.7 (cloud) | 1/1 | ✓ full | not graded (would need per-PR ground truth across all 75) | not graded | not visible from artifacts | ~5 hr | n/a | success-shipped |
| GPT-5.5 (cloud) | 1/1 | ✓ full + verify_coverage.py self-check passes | not graded | not graded | not visible from artifacts | not recorded | n/a | success-shipped |
| Qwen3.6-27B-AWQ (local) | 1/3 published | △ 75/75 verdict.md files, but only 3 are real reviews; 72 are template stubs | partial (3 reviewed PRs match ground truth; 72 stubs unverified) | 0 in the 3 deep reviews | 0 | 24 min | $0.031 | scaffold-and-stop |
| Qwen3-Coder-Next-AWQ (local) | 0/5 | ✗ no deliverable across 5 attempts | n/a | n/a | n/a | 1-42 min | $0.001-$0.054 | identical-call-loop, cyclic-name-slop, stuck-in-research |

## dreamserver-1-pr-audit (PR #1057, known-correct verdict: MERGE)

Same task spec scaled down to a single PR. Ground truth on PR #1057 was established three independent ways: the canonical hand-written review, the actual public diff, and Opus-4.7's audit. All three agree: MERGE. The catalog-handling architectural distinction (`_handle_model_list` vs `_handle_model_download`) is the trap that separates surface-pattern matchers from architectural readers.

| Model | Runs | Spec | Factual accuracy | Fabricated | Tests | Wall | Cost | Failure mode |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-Coder-Next-AWQ | 1/3 published (cherry-picked correct) | ✓ 13/13 files, tag, done() | 2/3 wrong across the three runs (this entry's run is the 1 correct; v1 and v3 said REJECT incorrectly) | 1 in v1, 4 in v3 (including a fabricated test_stderr_truncation.py repro script) | no execution | 3 min | $0.004 | success-shipped (cherry-picked) |
| Qwen3.6-27B-AWQ | 1/3 published | ✗ 7/13 files; no verdict.md, no tag, no done() in any of 3 runs | 3/3 implicit-MERGE-correct (in review.md's Summary of Findings table; never in a verdict.md) | 0 | pytest invoked, 38 tests on both branches | 7 min | $0.009 | partial-no-spec-output |
| Qwen3.6-35B-A3B-AWQ | 1/1 | ✗ 0/13 — zero artifacts | n/a (nothing produced) | n/a | pytest run but no artifacts written | 1.7 min | $0.002 | floor-failure |

## wallstreet-intern-test

Build a complete investment memo on any publicly traded US company with $1B-$10B market cap. Every number traceable from raw source → model → memo. Recommendations (BUY/HOLD/SELL) are opinion, not graded as right/wrong — the verifiable axes are spec compliance, source traceability, and fabrication count.

| Model | Company | Rec | Runs | Spec | Factual accuracy | Fabricated | Wall | Cost | Failure mode |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Opus-4.7 (cloud) | Vita Coco (COCO) | HOLD ($46 vs $52 spot) | 1/1 | ✓ full memo + machine-readable verification | not graded (opinion) | not graded | not recorded | n/a | success-shipped |
| GPT-5.5 (cloud) | YETI Holdings (YETI) | HOLD ($41) | 1/1 | ✓ full memo + verification + board-deck follow-on | not graded (opinion) | not graded | not recorded | n/a | success-shipped |
| Qwen3.6-27B-AWQ (local) | GitLab (GTLB) | BUY | 1/3 published | ✓ full memo + 17 KB XLSX | not graded (opinion) | not graded | 27 min | $0.032 | success-shipped (cherry-picked) |
| Qwen3-Coder-Next-AWQ (local) | DocuSign (DOCU) | BUY | 1/3 published | ✓ full memo + 10.6 KB XLSX | not graded (opinion); caveat: this model's PR-audit verdicts were 2/3 wrong with fabricated evidence, and the same risk likely extends to BUY calls | not graded | 11 min | $0.013 | success-shipped (cherry-picked) |
| Qwen3.6-35B-A3B-AWQ (local) | — | — | 0/3 | ✗ no usable deliverable | n/a | n/a | 0.2-7 min | $0.0002-$0.0085 | floor-failure / api-error / stuck-in-research |

## microbench-2026-04-28 (12 task families × 2 models × N=3)

Smaller-scope task families than the dreamserver/wallstreet benchmarks above — each task is a 5-30 minute deliverable rather than a multi-hour audit. Phase 1 = coding (programmatic graders). Phase 2 = structured business tasks (programmatic graders). Phase 3 = unbounded business/writing tasks (a mix of programmatic graders and hand-grading placeholders). N=3 per cell. Cherry-picked-best-of-N is not the publishing default here — the table reports N=3 PASS rates so variance is visible. See benchmarks/microbench-2026-04-28/findings.md for cross-cutting analysis. Entries are published only for the most signal-rich task families to avoid bloating the repo with 60+ tiny folders; the full results table is reproduced below for completeness.

One task-design issue, called out separately, affects two families: p1_testwrite and p1_refactor use a shared starter (`logalyzer/`) with a known broken import (`from collections import Iterable`, removed in Python 3.10). Both models scored 0/3 PASS on these, but the failure reflects the tension between fixing the starter and staying within task scope, not pure model failure. See the findings doc § "Test-writing and refactoring task-design issue".
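For reference, the broken starter import and its modern replacement (a minimal illustration; the starter's surrounding code isn't shown here):

```python
# The starter's import -- deprecated since Python 3.3, removed in 3.10,
# so it raises ImportError on any current interpreter:
#   from collections import Iterable

# The working form:
from collections.abc import Iterable

print(isinstance([1, 2, 3], Iterable))  # True
```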

| Phase | Task | 27B PASS | Coder-Next PASS | 27B median wall | Coder median wall | 27B median cost | Coder median cost | Notable |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | bug-fixing (logalyzer) | 3/3 | 2/3 | 18.0 min | 11.5 min | $0.023 | $0.015 | both ship; coder-v3 killed at iter 540 (post-completion drift) |
| 1 | test-writing (logalyzer) | 0/3 † | 0/3 † | 9.6 min | 14.0 min | $0.012 | $0.018 | task-design issue (broken import) — see caveat |
| 1 | refactoring (logalyzer) | 0/3 † | 0/3 † | 5.4 min | 5.4 min | $0.007 | $0.007 | task-design issue — see caveat |
| 2 | structured extraction | 3/3 | 3/3 | 1.2 min | 0.3 min | $0.0015 | $0.0004 | 27B 100% on 20 fields; coder ~92% |
| 2 | CI failure debugging | 3/3 | 3/3 | 2.1 min | 1.2 min | $0.003 | $0.0015 | both clean; coder cheaper |
| 2 | adversarial hallucination | 3/3 | 1/3 | 3.4 min | 25.9 min | $0.004 | $0.034 | 27B 100% / 0 dangerous; coder 2/3 stuck-detector fired, ship-with-2-dangerous-errors |
| 2 | customer support triage | 3/3 | 3/3 | 3.3 min | 1.0 min | $0.004 | $0.0013 | coder 96.7% category, 27B 86.7% (both 100% dup-cluster recall) |
| 3 | document synthesis | 0/3 †† | 2/3 | 32.7 min ‡ | 0.6 min | $0.043 ‡ | $0.0008 | 27B 8/8 facts every run but couldn't trim to 700 words (765, 775, 768); 2 of 3 stuck in identical-call-loop trying to trim; coder hit limit 2/3 |
| 3 | business memo | 2/3 | 3/3 | 2.8 min | 0.5 min | $0.0037 | $0.0007 | both 8/8 bias signals every run; 27B v3 hit 708 words (1 over) |
| 3 | market research | 3/3 ★ | 0/3 | 18.9 min | 19.1 min | $0.025 | $0.025 | 27B drives the internet-research workflow Coder-Next doesn't; all 3 27B runs evaluated all 5 products with 12-18 inline cites to 29-33 distinct URLs; Coder-Next STRUCTURAL_FAIL across all 3 runs |
| 3 | writing/editing (3-audience rewrite) | 0/3 | 2/3 | 2.8 min | 0.4 min | $0.0036 | $0.0005 | 27B 0/3, all single-subdimension fails (customer_email missing a required keyword); ceo_brief + legal_summary PASS in all 3 |
| 3 | project management synthesis | 0/3 | 1/3 | 1.3 min | 0.3 min | $0.0017 | $0.0003 | both: workstreams 6/6 every run, but only 2-3/6 risks recalled (multi-week risks missed) |

† p1_testwrite / p1_refactor failures are correlated with the starter-codebase task-design issue; see the microbench findings doc § "Test-writing and refactoring task-design issue" before drawing model-quality conclusions from these rows.

†† All 3 27B doc-synthesis runs captured all 8 planted facts but couldn't trim to the 700-word limit. 2 of 3 (v2, v3) hit an identical-call-loop on the same brief.md content for 50-130+ iters and were manually advanced to keep the chain moving. The pattern is a documented 27B failure shape, not a transient bug.

‡ 27B doc-synthesis median wall is dominated by the wall-killed v2/v3 runs (32.7 min, $0.043). The cleanly-completed v1 was 8 min / $0.011.

★ An inversion of the prior expectation in the findings doc: 27B can drive sustained internet-research workflows that Coder-Next doesn't. Citation-validity pass (18 of 33 URLs from p3_market_27b_v1 validated on 2026-04-28): 9 strong-valid (factual claim exactly matches live page), 3 partial-valid (claim mostly right with minor specificity issues), 2 confirmed-wrong URLs (404), and 4 inaccessible to the validator. Of 14 testable URLs, 12 (86%) are mostly-valid and 9 (64%) are strict-valid. Measured citations_valid_pct = 75 (was a 90 estimate). fabricated_stats_count = 0 — every checkable factual claim (prices, certifications, products) matched live data. The critical observation: the error mode is URL drift (wrong or dead URLs cited), not fabricated facts — a meaningfully different failure shape from the dreamserver-1-pr-audit Coder-Next variance, which fabricated technical evidence with confident citations.

Headline reads from this table (post 27B Phase 3 completion):

- 27B is reliable on tight-schema tasks. Phase 2's 12 programmatically graded runs: 12/12 PASS. The "27B doesn't ship" framing from the dreamserver-PR-audit benchmark was task-class-specific — when the deliverable is a constrained-shape JSON or markdown-with-clear-keys, 27B ships cleanly.
- 27B has a documented word-limit-trim failure mode. Doc-synthesis: 8/8 planted facts captured in every run, but 0/3 PASS because the model cannot reliably compress to a tight word limit; 2 of 3 runs entered identical-call-loops trying. Coder-Next handled this better (2/3 PASS).
- Big inversion on market research. 27B was 3/3 STRUCTURAL_PASS (5-product evaluations, 12-18 inline cites, 29-33 distinct URLs); Coder-Next was 0/3 STRUCTURAL_FAIL. Internet-research workflows aren't hopeless for local models — they're a 27B strength, just not Coder-Next's. (Citation validity was subsequently measured — see the ★ note above; this row is structural completion only.)
- Coder-Next has a real hallucination-resistance gap. Adversarial-hallucination: 27B 3/3 with 100% accuracy and 0 dangerous errors; Coder-Next 1/3, with the one ship attempt landing 2 confirmed-fabrications-as-real (right at the safety threshold). Same failure shape as the documented dreamserver-1-pr-audit Coder-Next variance.
- Cost-per-attempt: Coder-Next is 4-12× cheaper when it ships. When it doesn't ship (stuck-detector cases), it spends 25+ minutes and ~$0.03 producing nothing, which inverts the economics for tasks that require hallucination resistance.
- Both miss multi-week risks on PM synthesis. Workstream + decision recall is excellent (6/6 and 3-4/4 every run for both models), but risk recall is 2-3/6 across all runs and both models — multi-week-spanning risks are systematically dropped.

Update (2026-05-02): microbench-phase-b-2026-05-02 bumps the four highest-signal cells of this table to N=10 with proper Wilson 95% CIs, and adds 27B-no-think as a third arm across the full 12-family grid. Several N=3 hints from the table above are now bounded — see § "microbench-phase-b-2026-05-02" below.


## microbench-phase-b-2026-05-02 (N=10 expansion + 27B-no-think third arm)

Bumps the 4 differential cells from N=3 → N=10 and adds 27B-no-think across all 12 families. ~240 runs total. See benchmarks/microbench-phase-b-2026-05-02/findings.md for full breakdown.

### Headline ship rates (done_signal — not PASS rate; PASS pending grader sweep)

| Model | Coverage | Ship rate | Wilson 95% CI |
| --- | --- | --- | --- |
| Qwen3-Coder-Next-AWQ | 4 cells × N=10 + 8 cells × N=3 = 63 runs | 47/63 = 74.6% | [62.5%, 83.9%] |
| Qwen3.6-27B-AWQ (thinking) | 4 cells × N=10 + 8 cells × N=3 = 62 runs | 46/62 = 74.2% | [62.0%, 83.7%] |
| Qwen3.6-27B-AWQ (no-think) | 12 cells × N=10 = 118 graded + 2 op-labeled | 113/118 = 95.8% | [90.5%, 98.2%] |
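The CIs are standard Wilson score intervals. A minimal dependency-free sketch (my implementation, not the repo's tooling) that reproduces the no-think row's bounds and the 0/10 bound quoted in the headline reads below:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

print(wilson_ci(0, 10))     # ~(0.000, 0.278) -- Coder-Next p3_market bound
print(wilson_ci(113, 118))  # ~(0.905, 0.982) -- the 27B-no-think row
```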

### Integrated decision table — 4 differential cells × 3 models at N=10

Single row per (cell × model) with ship rate, median wall, median cost, $/shipped-run, and primary failure mode. Cost numbers are upper-bound (wall × power.limit at $0.13/kWh).

| Cell | Model | Ship | Median wall | Median $ | $/ship | Primary failure mode |
| --- | --- | --- | --- | --- | --- | --- |
| p2_hallucination | Coder-Next | 5/10 | 422 s | $0.0092 | $0.032 | stuck_no_workspace_change_for_500_iters (5/10) |
| p2_hallucination | 27B (thinking) | 7/10 | 171 s | $0.0037 | $0.0045 | none on the 7 ships; 3 model_stopped |
| p2_hallucination | 27B (no-think) | 10/10 | 127 s | $0.0023 | $0.0023 | none |
| p3_business | Coder-Next | 10/10 | 31 s | $0.0006 | $0.0006 | none |
| p3_business | 27B (thinking) | 9/10 | 163 s | $0.0035 | $0.0039 | 1 wall_killed_identical_call_loop |
| p3_business | 27B (no-think) | 8/10 | 171 s | $0.0031 | $0.0536 | 2 wall_killed_identical_call_loop |
| p3_doc | Coder-Next | 10/10 | 37 s | $0.0007 | $0.0007 | none |
| p3_doc | 27B (thinking) | 6/10 | 1113 s | $0.0201 | $0.0712 | 4 wall_killed_identical_call_loop (word-trim) |
| p3_doc | 27B (no-think) | 8/10 ★ | 144 s | $0.0026 | $0.0495 | 2 wall_killed_identical_call_loop (word-trim, halved) |
| p3_market | Coder-Next | 0/10 | 2294 s | $0.0435 | — | 5 stuck + 4 api_error: HTTP 400 + 1 wall_killed |
| p3_market | 27B (thinking) | 8/10 | 1720 s | $0.0330 | $0.046 | 2 api_error: timed out (transient) |
| p3_market | 27B (no-think) | 7/10 | 2277 s | $0.0411 | $0.049 | 1 runaway-gen + 2 op-SIGTERM scroll-loop |

★ The standout finding is p3_doc: 27B-no-think ships 8/10 vs 6/10 for thinking mode — disabling thinking halves the word-limit-trim loop rate (4/10 → 2/10).

Reading the table for a deployment decision:

- Lowest $/ship for a given cell:
  - p2_hallucination → 27B-no-think
  - p3_business → Coder-Next (60-100× cheaper than the 27B variants)
  - p3_doc → Coder-Next (70× cheaper than the 27B variants)
  - p3_market → 27B-thinking (Coder-Next is unusable; 27B-no-think is slightly cheaper but has a higher pathology rate)
- Highest reliability per cell: 27B-no-think on p2_hallucination (10/10), Coder-Next on p3_business/p3_doc (10/10), 27B-thinking on p3_market (8/10).
- No single model wins all four cells. Mixed-model deployment is justified by this data if you care about either ship rate or $/ship across all four.

### Headline reads (updates to the picture above)

- 27B-no-think is the most reliable shipper of the three on like-for-like cells (86.8% vs 75% vs 62.5%). The pre-Phase-B framing of "27B vs Coder-Next" needs a third arm — for tasks where ship rate matters more than thinking-mode polish, no-think 27B is the operational pick.
- 27B-no-think rescues p3_doc from the documented 27B word-trim loop (4/10 wall_killed → 2/10 wall_killed).
- Coder-Next's p3_market 0/3 → 0/10 at N=10 is confirmed as a stable failure shape, Wilson 95% [0%, 27.8%]. Coder-Next does not drive internet-research workflows.
- Coder-Next's p2_hallucination 1/3 PASS → 5/10 stuck at N=10, Wilson 95% [23.7%, 76.3%] — bounded as a real ~50% failure shape, not a 1-of-N flake.
- Two new pathologies surfaced (now in tooling/FAILURE-TAXONOMY.md; a detection sketch follows this list):
  - scroll-loop (sub-label of identical-call-loop) — the model walks an HTML response in fixed-byte slices; the raw command hashes differ, so the harness's content-hash same-content guard doesn't fire. Caught in p3_market_27b-nothink_v1 (155 iters) and _v8 (31 iters).
  - runaway-generation (new primary) — a single model response exceeds the harness's max-output-tokens budget without stopping. Caught in p3_market_27b-nothink_v5 (137,855 tokens).
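Why a scroll-loop evades a raw-hash guard, and one way to catch it: a hypothetical sketch. The function, window, and threshold are illustrative; the repo's actual check lives in tooling/scripts/check_substance.py and may work differently.

```python
import re
from collections import Counter

def scroll_loop_suspect(commands: list[str],
                        window: int = 20, threshold: float = 0.8) -> bool:
    """Flag a likely scroll-loop: the same command repeated with only
    numeric arguments (byte offsets) changing. Raw hashes of these
    commands all differ, so an exact same-content guard never fires."""
    recent = commands[-window:]
    if len(recent) < window:
        return False
    # Canonicalize: collapse every number so offset-walks look identical.
    templates = [re.sub(r"\d+", "N", c) for c in recent]
    _, count = Counter(templates).most_common(1)[0]
    return count / window >= threshold

# read(page.html, 0, 4096), read(page.html, 4096, 8192), ... all
# canonicalize to read(page.html, N, N) and trip the detector.
```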

### Caveats (in addition to those on the original microbench table)

- Ship rate ≠ PASS rate. PASS-rate analysis is pending the batch-grader sweep against the no-think tarballs.
- Cross-batch comparisons on the N=3 P1 cells include harness-drift effects (different file_sha256 between batches). Within the 4 N=10 differential cells, the harness is consistent across all three model arms.

## What the data supports

Strong claims:

- Cloud entries (Opus-4.7, GPT-5.5) reliably ship complete deliverables on all three benchmarks. Local 30B-class quantized entries do not.
- Spec compliance and verdict accuracy are different axes. On dreamserver-1-pr-audit, Coder-Next has 100% spec compliance and 33% factual accuracy; 27B has 0% spec compliance (no verdict.md) and 100% factual accuracy in the implicit verdicts present in review.md. From the artifact alone you can't tell which mode you're in for any given Coder-Next run, and the wrong runs include fabricated evidence (line citations to non-existent issues, fake test scripts).
- Cost-per-attempt at N=1: Coder-Next is ~4× cheaper than 27B. For an ensemble-with-verification deployment shape, the economics favor running Coder-Next 3+ times and verifying over running 27B once.
- 35B-A3B-AWQ at 4-bit is below the floor for these tasks: 0 of 3 wallstreet attempts shipped; 0 of 1 dreamserver-1-pr-audit attempts shipped. Higher-precision quantizations are untested.

Weaker / not-yet-supported:

  • "Coder-Next is X% wrong on PR review in general" — current evidence is 2/3 wrong on a single PR. Need more PRs and more N to pin a real rate.
  • "27B is reliably better than Coder-Next for analytical work" — likely true but evidence is qualitative (the 3 hand-written reviews on dreamserver-75-pr-audit/Qwen3.6-27B-AWQ/ are clean; 27B's review.md content on PR #1057 is excellent). Phase 3 hand-grading sharpens this: 27B prose quality 5/5 on doc-synthesis, business-memo bias-pushback 5/5; Coder-Next 4/5 on the same axes.
  • "Cloud models are N× better than local on this benchmark" — categorical gap is clear (cloud ships, local mostly doesn't), but per-claim accuracy for the cloud entries isn't graded with the same methodology used on the local entries.
  • "27B citations on the market-research microbench are valid" — sampled 18 of 33 URLs (~55%) validated. 86% mostly-valid / 64% strict-valid out of 14 testable URLs (4 were inaccessible to the validator from this IP). Measured citations_valid_pct = 75. Important nuance: factual content (prices, certifications) is 100% accurate in the validated sample; the error mode is URL drift, not fabrication. The remaining 15 URLs are unverified — sample is large enough to assert most citations are valid but not "all 33."

## Model selection guide

The recommendations below are conditional on the task class this benchmark covers — long-horizon agentic work, structured deliverables, real-world-shaped tasks. They don't speak to interactive chat, single-question Q&A, or coding completion. For those, this benchmark has no signal.

### When to use Qwen3.6-27B-AWQ

- Hallucination resistance is required. The single sharpest local-model superiority signal in this repo: on the adversarial-hallucination microbench (15 issues, 6 real / 9 fabricated, agent must classify), 27B was 3/3 PASS with 100% accuracy and 0 dangerous errors; Coder-Next was 1/3 PASS with 2 confirmed-fabrications-as-real on its one shipping run. For security review, factual research, or anything where confidently-wrong is dangerous, 27B is the pick.
- Internet-research-driven workflows. The second-sharpest local-model superiority signal: the market-research microbench saw 27B 3/3 STRUCTURAL_PASS (5 products, 12-18 inline cites to 29-33 distinct URLs) and Coder-Next 0/3 STRUCTURAL_FAIL. 27B drives sustained multi-step research that Coder-Next doesn't. Caveat: STRUCTURAL_PASS only — sample-grade the citations rather than consuming them blind.
- Tight-schema structured tasks. 27B was 100% on 20-field extraction across 3 runs, 100% on duplicate-cluster recall in triage, and 12/12 PASS on Phase 2 programmatic graders. The "27B doesn't ship" framing from the dreamserver-PR-audit benchmark turned out to be task-class-specific (unbounded markdown narrative). When the deliverable shape is constrained, 27B ships cleanly.
- Human-in-the-loop review work where intermediate analysis matters more than spec-compliant deliverables. 27B's review.md and research/questions.md content was the highest-quality across all six N=1 runs on PR #1057. A reviewer reading that as research notes would get a substantively correct read on the catalog-handling distinction.
- Tasks where truthfulness matters more than artifact obedience. 27B's failures are usually "didn't finish" or "didn't write the spec-required file"; it doesn't fabricate confident-but-wrong claims with citations.
- Tasks where you control the downstream consumption. If your pipeline expects markdown notes (not a structured verdict.md JSON), 27B's output is usable as-is.

### When to use Qwen3.6-27B-AWQ with `--no-think` (added in phase-b)

The third arm introduced in microbench-phase-b-2026-05-02 — the same base model as 27B-thinking, but with the `<think>` trace disabled. Based on the N=10 expansion data, it has become the recommended default for many cells:

- Default for most non-coding tasks at the N=10 ship-rate grain. 27B-no-think hit 95.8% across the full 12-cell grid (Wilson 95% [90.5%, 98.2%]) and 86.8% on the 4 hardest cells. If you're picking a single local model for a bulk run and you're not specifically trying to extract a polished reasoning narrative, no-think 27B ships more reliably than either thinking-mode 27B or Coder-Next.
- Doc synthesis with a tight word limit. Disabling thinking halves the documented 27B word-trim loop rate (4/10 wall_killed → 2/10 on p3_doc). The mechanism: thinking mode amplifies "deliberate-without-progressing" loops; no-think writes once and commits.
- Adversarial hallucination. 27B-no-think shipped 10/10 at $0.0023/run — the cleanest result of any arm on this cell. Beats thinking mode's 7/10 and Coder-Next's 5/10.
- Anything where you'd otherwise pick 27B-thinking. Per the pairwise quality study, no-think and thinking are substantively equivalent on hand-graded deliverable correctness — the difference is the verbosity of the reasoning prose, not the output decisions. For decision-making, treat them as one "27B model" with a thinking flag for prose density.

When to prefer thinking over no-think:

- When you specifically want dense reasoning prose alongside the deliverable (e.g. for human reviewers tracing the model's logic). No-think output is leaner.
- On p3_business at the margin (9/10 thinking vs 8/10 no-think — within sampling noise; either works).

Caveats:

- PASS rate is not yet measured on the no-think tarballs — the higher ship rate could reflect genuinely higher PASS rates, or it could mean shipping briefs/memos that don't quite meet spec (e.g. exceeding word limits). The grader sweep is pending, which makes the no-think headline provisional on the harder cells.
- Two new pathologies surfaced during the no-think grid: scroll-loop (the model walks an HTML response in fixed-byte slices) and runaway-generation (a single response exceeds the max-output-tokens budget). Both are caught by tooling/scripts/check_substance.py but not by the harness's own stuck-detector — operator monitoring is required on long chains.
- Untested at dreamserver scope — the no-think arm hasn't been run against the 1-PR or 75-PR audit. It's hypothesized to help with the verdict.md production issue 27B-thinking had on PR #1057, but that's unmeasured.

### When to use Qwen3-Coder-Next-AWQ

- Pipelines where artifact shape is required and a verifier exists. If your downstream consumes verdict.md/tag/done() and you have a separate check for correctness (a second model, a human pass, regression tests), Coder-Next ships reliably and is ~4× cheaper than 27B.
- Bounded business-memo, triage, and writing-rewrite tasks. Coder-Next was 3/3 PASS on business-memo (bias-signal recall), 3/3 PASS on triage at 96.7% category accuracy (better than 27B's 86.7%), and 2/3 PASS on writing-editing. 4-12× below 27B's cost per attempt and competitive on accuracy.
- Ensemble-with-verification setups. Run Coder-Next 3-5 times on the same input, take the majority vote, and flag dissent for human review (a sketch follows this list). The variance characteristic is documented; you can build around it.
- Time-to-output matters. ~3 min/attempt at N=1 vs ~7 min for 27B. If artifact completion + speed beats verdict accuracy in your loss function, Coder-Next is the pick.
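A minimal sketch of that majority-vote shape; the function, verdict labels, and agreement threshold are illustrative, not repo tooling:

```python
from collections import Counter

def ensemble_verdict(verdicts: list[str],
                     min_agreement: float = 0.6) -> tuple[str, bool]:
    """Majority vote over N runs; flag dissent for human review."""
    top, count = Counter(verdicts).most_common(1)[0]
    needs_review = count / len(verdicts) < min_agreement
    return top, needs_review

# Three Coder-Next runs on the same PR, one dissenting:
verdict, needs_review = ensemble_verdict(["MERGE", "MERGE", "REJECT"])
print(verdict, needs_review)  # MERGE False -- 2/3 clears the 0.6 bar
```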

### When to avoid both local models

- Single-shot autonomous high-stakes verdicts (security review, financial recommendations consumed without verification, anything cited downstream). Coder-Next's fabrication risk is documented; 27B's no-ship risk means the verdict you'd cite isn't there in machine-readable form.
- Long-horizon (>30 min) unattended work. Both models find degenerate failure modes within 30-60 minutes on the 75-PR task. Coder-Next loops; 27B Goodharts the spec or hits the per-response token cap.
- Internet-research-driven workflows on Coder-Next specifically. Coder-Next was 0/3 STRUCTURAL_FAIL on the market-research microbench (stuck-in-research, api-error). 27B does fine on the same task — see "When to use 27B" above. If you only have Coder-Next, have a human gather sources first.
- Tasks where fabricated-but-plausible technical claims are dangerous. If a wrong cited line number or an invented test would mislead someone with a cleanup cost greater than the win from automation, don't use Coder-Next single-shot. 27B is safer here, but its no-ship failure means the real review work has to be done by hand.

### When to use Qwen3.6-35B-A3B-AWQ

- The current evidence in this repo argues against using it for this task class: 0 of 3 wallstreet attempts shipped, and 0 of 1 PR-audit attempts produced any artifact. Higher-precision quantizations might help; 4-bit AWQ with 3B active params doesn't clear the floor.

### When to use cloud (Opus-4.7 / GPT-5.5 class)

- Long-horizon autonomous work where correctness matters and verification budget is limited. Both cloud entries shipped complete deliverables on all three benchmarks. The categorical gap to local is large — we observed fabrication and incomplete deliverables across local entries; the cloud entries didn't show those failure shapes.
- Tasks where artifact completion is non-negotiable. Cloud entries reliably produce the spec-shaped output. Local entries don't.
- That said, the cloud entries here aren't graded with the same per-claim methodology used on the local entries (see KNOWN-LIMITATIONS § comparison-to-cloud). The cloud-vs-local gap is currently established at the categorical level (shipping vs not), not at the per-claim accuracy level.

## What would change this picture

These additions would tighten the recommendations above; until they land, the recommendations are best read as "based on this evidence" rather than "definitive":

1. Validate the remaining 15 unsampled URLs and the 4 inaccessible-from-this-validator URLs on the 27B market-research entry. Currently 18/33 are sampled; measured citations_valid_pct = 75. The 4 inaccessible URLs (PCMag, ZDNet, two LastPass pages) are blocked by Cloudflare from the validator's IP — they could be sampled from a different IP, or the agent's specific cited content could be cross-referenced from archive.org. This would tighten the measured number from 75 (sample) to a fully-measured rate.
2. A per-claim rubric applied uniformly to cloud entries. Phase 3 hand-grading is now done for the local entries (prose, stance, source skepticism, balance, citations, tone fit, faithfulness, fabrication count); the cloud entries (Opus-4.7, GPT-5.5) on the older benchmarks haven't been graded with the same rubric. This would let cloud-vs-local comparisons go beyond "shipping vs not."
3. Failed-run artifacts published (receipts + transcripts for the 5+ unsuccessful local-model runs not currently in MMBT). This would let a reader see the expected failure modes per model.
4. N=10+ on the highest-signal cells (Coder-Next on dreamserver-1-pr-audit, 27B on the same; both on microbench-2026-04-28/adversarial-hallucination). This would bound the variance the current N=3 only suggests.
5. Different PR shapes in the dreamserver-1-pr-audit family — the current PR has subtle architectural distinctions; a docs-only PR or a security PR would test different failure modes.
6. Higher-precision quantizations of the same models (FP8, BF16) — particularly for 35B-A3B, which fails at 4-bit; that might be a quantization-headroom issue rather than a base-model issue.

None of these are in scope for the current MMBT publication. They're separate experiments.