12 task families, 2 local models, N=3 each. Smaller-scope tasks than the dreamserver-PR-audit / wallstreet-intern-test benchmarks above — each task is a 5-30 minute deliverable rather than a multi-hour audit. Designed to surface task-class-specific differences between Qwen3.6-27B-AWQ and Qwen3-Coder-Next-AWQ that the larger benchmarks couldn't isolate.
- findings.md: cross-cutting writeup. Headline reads, daily-driver-guide updates, caveats. Read this before drilling into individual task-family folders.
- ../../SCORECARD.md § "microbench-2026-04-28": single-table summary with PASS rates, costs, walls.
This benchmark has 12 task families and would produce 60+ per-model folders if every cell were published in full. To avoid bloating the repo, all 12 task families have entries — but published at two levels of detail:
Three highest-signal task families: full per-model entries (with transcripts and complete deliverable bundles):
| Task family | Why it's here | 27B | Coder-Next |
|---|---|---|---|
| adversarial-hallucination/ | Sharpest local-model superiority signal in the entire suite. 27B 100% accurate / 0 dangerous; Coder-Next 1/3 PASS, 2/3 stuck-detector fired, and the one shipping run had 2 fabrications confirmed as real. | Qwen3.6-27B-AWQ/ | Qwen3-Coder-Next-AWQ/ |
| market-research/ | Second-sharpest signal, and an inversion of the "both fail at internet research" expectation. 27B 3/3 STRUCTURAL_PASS (5 products, 12-18 inline cites to 29-33 distinct URLs); Coder-Next 0/3 STRUCTURAL_FAIL (no required output files produced). | Qwen3.6-27B-AWQ/ | Qwen3-Coder-Next-AWQ/ |
| doc-synthesis/ | Documented 27B failure mode: 8/8 planted facts captured every run, but 0/3 PASS because the model couldn't trim to the 700-word limit. Two of three runs entered identical-call loops on brief.md. Coder-Next handled the same task class better (2/3 PASS). | Qwen3.6-27B-AWQ/ | Qwen3-Coder-Next-AWQ/ |
Nine other task families: lean entries (per-model README + cost.json + grade.json + label.json + summary.json + receipt.json — no transcripts or deliverables to keep the repo size manageable):
| Task family | 27B | Coder-Next | Headline |
|---|---|---|---|
| bug-fixing/ | 3/3 PASS | 2/3 PASS | Both can fix planted bugs; 27B more reliable, Coder-Next faster |
| test-writing/ | 0/3 † | 0/3 † | Task-design issue (broken starter import); see findings.md caveat |
| refactoring/ | 0/3 † | 0/3 † | Same task-design issue |
| structured-extraction/ | 3/3 PASS | 3/3 PASS | 27B 100% on 20 fields; Coder-Next ~92% and 4× cheaper |
| ci-failure-debugging/ | 3/3 PASS | 3/3 PASS | Both clean; pick the cheaper |
| customer-support-triage/ | 3/3 PASS | 3/3 PASS | Coder-Next 96.7% vs 27B 86.7% category accuracy, and faster and cheaper |
| business-memo/ | 2/3 PASS | 3/3 PASS | Both reliable on bias detection; Coder-Next ships within the word limit consistently |
| writing-editing/ | 0/3 | 2/3 PASS | 27B's 0/3 is a single-keyword miss on the customer email (CEO and legal pass 5/5 every run) |
| project-management/ | 0/3 | 1/3 PASS | Both excellent on workstreams and decisions, both 2-3/6 on multi-week risks |
For the 9 lean entries, transcripts and deliverable artifacts are not mirrored in MMBT (to keep repo size manageable) but the task prompts, input starters, ground truth, and grader scripts are all in tooling/ — meaning a reader with the right hardware can rerun any of these task families and produce their own transcripts + deliverables. See tooling/REPRODUCING.md § "Reproducing the microbench" for the walkthrough.
Each model entry under a published task family has:
- README.md: entry-specific writeup (this file is the per-model commentary)
- cost.json: wall, tokens, GPU, energy upper bound
- grade.json: programmatic verdict (PASS/FAIL/STRUCTURAL_PASS/etc.) with per-dimension scores
- label.json: failure-mode classification per tooling/FAILURE-TAXONOMY.md
- receipt.json: exact vLLM args, harness git SHA, GPU snapshot
- summary.json: finish reason, iteration count, total tokens
- transcript.jsonl: full agent loop (model calls + tool calls + results)
- deliverable/: the actual artifacts the agent produced
Read cost.json + grade.json + label.json first — that's the headline. Then summary.json for run shape. Then drill into deliverable/ for the actual output, and transcript.jsonl for the play-by-play of how it got there.
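The triage order above can be sketched as a small helper. The filenames match this repo's entry layout, but the JSON keys used here (`verdict`, `wall_s`, `failure_mode`) are illustrative assumptions, not the repo's actual schema:

```python
import json
from pathlib import Path

def headline(entry_dir: str) -> str:
    """One-line triage summary for a model entry.

    Reads the three headline files the README says to read first.
    Keys 'verdict', 'wall_s', and 'failure_mode' are assumed names
    for illustration, not the repo's real schema.
    """
    entry = Path(entry_dir)
    grade = json.loads((entry / "grade.json").read_text())
    cost = json.loads((entry / "cost.json").read_text())
    label = json.loads((entry / "label.json").read_text())
    return (f"{grade.get('verdict', '?')} | "
            f"wall={cost.get('wall_s', '?')}s | "
            f"failure={label.get('failure_mode', 'none')}")
```

With that headline in hand, summary.json, deliverable/, and transcript.jsonl can be opened only for the runs that look interesting.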
All runs used:
- vLLM with --max-model-len 262144, --temperature 0.3, and model-specific reasoning/tool parsers
- Harness with --stuck-threshold 500 (long-horizon-friendly) and --docker-socket --gpus all
- Sandbox: bench/agent-pilot/Dockerfile (Python 3.11 + git + curl + standard tools)
- Hardware: Tower2, 2× NVIDIA RTX PRO 6000 Blackwell Workstation Edition, AMD Threadripper PRO 7965WX, 252 GB RAM, 1600W PSU on a dedicated 20A circuit

See tooling/REPRODUCING.md for the reproduction walkthrough.
Per-task starter inputs (the planted facts, schemas, audience briefs, etc.) are in this repo's tooling/ folder under the agent-pilot/inputs/phase[2-3]_*/ directories (see tooling/REPRODUCING.md § "Reproducing the microbench"). The ground-truth files (planted answers) live in agent-pilot/graders/ground_truth/ to keep them out of the agent-mountable input dirs; this separation was verified pre-launch.
- N=3 per cell. Confidence intervals are wide. The point estimates above are not population estimates.
- Hand-grading placeholders. Phase 3 graders include subjective dimensions (prose quality 1-5, fabrication count, audience-tone fit, citation validity) that are placeholder fields awaiting human grading. The PASS/FAIL above is from the programmatic axis only; the structural pass on market-research does not assert citation validity.
- Task-design issue. p1_testwrite and p1_refactor use a shared starter (logalyzer/) with a known broken import (from collections import Iterable, removed in Python 3.10). Both models went 0/3 PASS on these. The failure is a task-design / fix-the-starter question, not pure model failure. See findings.md § "Test-writing and refactoring task-design issue".
- Two manually-advanced 27B runs. p3_doc_27b_v2 and p3_doc_27b_v3 were SIGTERM'd mid-run after entering identical-call loops (50-130+ writes of the same brief.md content, never able to compress to the 700-word limit). The chain was advanced manually rather than waiting 5+ hours for the stuck-detector at iter 500. The pattern is documented as a 27B failure shape in findings.md, not a transient bug. Manual-advance notes in those runs' summary.json mark them wall_killed_identical_call_loop.
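For context on the broken-starter caveat: the `Iterable` alias was dropped from the top-level `collections` module in Python 3.10, and the working spelling on modern interpreters comes from `collections.abc`. A minimal illustration (not the starter's actual fix, which lives in logalyzer/):

```python
# Broken starter line (fails with ImportError on Python 3.10+):
#   from collections import Iterable
# Working import on all currently supported Python versions:
from collections.abc import Iterable

print(isinstance([1, 2, 3], Iterable))  # → True
print(isinstance(5, Iterable))          # → False
```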