Roadmap

Consolidated list of open questions, validation gaps, and contribution opportunities surfaced across the benchmark entries. Each item links to the source doc that motivated it.

Items marked [contributor-welcome] are scoped so that an external contributor with the right hardware can take them end-to-end via the tooling/ reproduction pack and submit results as a PR.

Last reviewed: 2026-05-03.

Active follow-ups (in priority order)

1. FP8 re-run of the 12-cell microbench grid [contributor-welcome]

Source: KNOWN-LIMITATIONS.md § Cyankiwi 4-bit AWQ field reports, benchmarks/microbench-phase-b-2026-05-02/findings.md § Recommended follow-ups

Multiple practitioners report that the Cyankiwi 4-bit AWQ quants underperform official Qwen FP8 of the same base models. Re-running the full 12-cell × N=10 grid on FP8 would show whether the current findings generalize across quants or are specific to the 4-bit AWQ quants.

What to do: pull official Qwen FP8 quants, run the 4-command friendly path in tooling/ADDING-A-MODEL.md for each model arm, submit a PR with the results.

Hardware: an FP8-capable GPU (RTX PRO 6000, H100, or similar).

2. PASS-rate grader sweep on the no-think tarballs

Source: benchmarks/microbench-phase-b-2026-05-02/findings.md

The current 95.8% headline for 27B-no-think is a done_signal (ship) rate, not a PASS rate. Running the existing graders against the 120 no-think workspace tarballs would convert ship rate to PASS rate. The p3_doc 8/10 ship rate in particular could reflect real PASSes, or it could reflect shipped briefs that exceed the 700-word limit.

Internal — uses the source bench's full transcripts, not just the published representatives.
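As a rough illustration of why ship rate and PASS rate can diverge on p3_doc, the sketch below flags a shipped brief that exceeds the 700-word limit mentioned above. It is not the project's grader: the function names are hypothetical, and counting whitespace-delimited tokens is an assumption about how "words" are measured. The real checks live in tooling/graders/ and may differ.

```python
WORD_LIMIT = 700  # limit quoted above for p3_doc briefs


def count_words(text: str) -> int:
    """Count whitespace-delimited tokens (assumed definition of a 'word')."""
    return len(text.split())


def over_word_limit(text: str, limit: int = WORD_LIMIT) -> bool:
    """Return True when a shipped brief is longer than the limit."""
    return count_words(text) > limit


def pass_rate(briefs: list[str], limit: int = WORD_LIMIT) -> float:
    """Fraction of shipped briefs that stay within the limit.

    This covers only the length criterion; a full grader would check more.
    """
    if not briefs:
        raise ValueError("no shipped briefs to grade")
    within = sum(1 for b in briefs if not over_word_limit(b, limit))
    return within / len(briefs)
```

A run that emits done_signal with a 701-word brief counts toward ship rate but would fail this check, which is the gap the grader sweep is meant to measure.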

3. M-series Mac sibling study [contributor-welcome]

Source: COMPARISON.md § Other hardware classes

On unified memory the dense-vs-MoE compute tradeoff is expected to invert: Coder-Next (3B-active) should win on tokens-per-second, while 27B (full-dense compute) becomes the bottleneck. This is untested.

What to do: run the same 12 cells on M-series via MLX. Only the vLLM launch commands need to be swapped; the harness is portable.

Hardware: M-series Mac with ≥48 GB unified memory.

4. Language-mix expansion for Phase 1 [contributor-welcome — task-design first]

Source: COMPARISON.md § Languages other than Python

Current Phase 1 cells (p1_bugfix, p1_refactor, p1_testwrite) all use a Python project (logalyzer). Adding C, JavaScript, or systems-programming starters would test whether Coder-Next's code specialization manifests differently outside Python.

This is task-design work first (find / write a starter project with planted bugs of comparable difficulty), then a benchmarking session. Not a one-command re-run.

5. Pairwise quality study extension to the 4 differential cells

Source: benchmarks/microbench-phase-b-2026-05-02/findings-pairwise-quality-three-model.md

The hand-graded quality study covers the 3 both-ship cells (p2_ci, p2_extract, p2_triage). The 4 differential cells (p2_hallucination, p3_business, p3_doc, p3_market), where models ship at different rates, haven't been hand-graded for the substantive quality of the runs that did ship.

6. Re-run N=3 P1 cells for 27B-thinking on the current harness

Source: benchmarks/microbench-phase-b-2026-05-02/findings.md § Caveats

The 27B-thinking 1/9 P1 ship rate may include harness-drift effects (older file_sha256: 7698067... vs current 7ea9592...). A re-run on the current harness would settle whether the low rate is drift or a real model regression.
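Before re-running, it may help to confirm which harness version a given checkout corresponds to. The sketch below hashes a file with SHA-256 and compares the digest against the truncated prefixes quoted above. Which file the recorded file_sha256 actually covers is not stated here, so the path in the usage comment is a placeholder, and both helper names are hypothetical.

```python
import hashlib
from pathlib import Path

# Truncated prefixes exactly as quoted above; the full digests are not given.
OLD_PREFIX = "7698067"
CURRENT_PREFIX = "7ea9592"


def file_sha256(path: Path) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def harness_version(digest: str) -> str:
    """Classify a digest by the known prefixes."""
    if digest.startswith(CURRENT_PREFIX):
        return "current"
    if digest.startswith(OLD_PREFIX):
        return "older"
    return "unknown"


# Usage (placeholder path, adjust to whichever file the harness hashes):
# print(harness_version(file_sha256(Path("path/to/harness_file"))))
```

Matching on a 7-character prefix is enough to tell the two quoted versions apart, but it is not a substitute for comparing full digests.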

7. Per-claim rubric pass on cloud entries

Source: KNOWN-LIMITATIONS.md § Comparison-to-cloud caveats

Cloud Opus-4.7 / GPT-5.5 entries weren't graded with the same per-claim rubric used on the local entries. Cloud-vs-local comparison is currently categorical only ("cloud ships, local mostly doesn't"), not per-claim accuracy. Building a uniform rubric and applying it to both classes would let head-to-head claims go beyond shipping rates.

8. Citation-validity full sweep on `p3_market` 27B

Source: SCORECARD.md, COMPARISON.md § What we don't know yet

We sampled 18 of 33 URLs (~55%) from one 27B market-research run and measured 75% valid in that sample. The remaining 15 URLs are unverified. A full sweep would replace the sampled estimate with a measured citation-validity number.
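To see how much the unverified URLs leave open, the following sketch computes the range the full-run validity rate could fall in, using only the figures above (18 sampled, 75% valid, 33 total). The function name is hypothetical. Note that 75% of 18 is 13.5, not a whole number, so the reported rate is presumably rounded and the bounds are approximate.

```python
def full_sweep_bounds(sample_rate: float, sampled: int, total: int) -> tuple[float, float]:
    """Bound the full-population validity rate from a partial sample.

    Lower bound: every unsampled URL turns out invalid.
    Upper bound: every unsampled URL turns out valid.
    """
    if not 0 < sampled <= total:
        raise ValueError("sampled must be between 1 and total")
    valid_in_sample = sample_rate * sampled
    remaining = total - sampled
    low = valid_in_sample / total
    high = (valid_in_sample + remaining) / total
    return low, high


low, high = full_sweep_bounds(0.75, 18, 33)
print(f"{low:.1%} to {high:.1%}")  # 40.9% to 86.4%
```

The width of that range is the case for the full sweep: until the remaining 15 URLs are checked, the sample supports an estimate but not a measured rate for the run.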

9. 27B-no-think on dreamserver-scope tasks

Source: COMPARISON.md § What we don't know yet

The no-think arm hasn't been run against the 1-PR or 75-PR audits. The substance-monitoring methodology proven on phase-b would transfer; the verdict-production issue 27B-thinking had on PR #1057 might improve with no-think — hypothesis only.

10. 27B-no-think on the wallstreet investment-memo task

Source: COMPARISON.md § What we don't know yet

Untested. Given the no-think mode's clean shipping on p3_business (8/10) and p3_doc (8/10), it is plausible that it would handle the multi-section memo cleanly, but this is unmeasured.

Welcomed contributions

Beyond the prioritized follow-ups above, contributions in these shapes are explicitly welcome:

New model entries — any vLLM-supported local model with a working tool-call parser. End-to-end walkthrough: tooling/ADDING-A-MODEL.md. Half-day to one-day operator time.
Same model, different quant — official FP8, Unsloth UD4 GGUF, BF16, etc. Same friendly path; only the HuggingFace path + vLLM launch flags change.
Field reports — anecdotal but specific reports of model behavior on real workflows. See FIELD-REPORTS.md for the template; one example use case is the Cyankiwi-vs-FP8 quant divergence many practitioners have reported.
Methodology improvements — better grader scripts, additional task families, refined failure-mode taxonomy entries. See tooling/FAILURE-TAXONOMY.md and tooling/graders/.
Bug reports on harness, graders, or analysis — open an issue.

Methodology improvements (longer-horizon)

Items that would require larger structural work, not just a re-run:

Per-claim rubric uniformly applied across local + cloud entries (item 7 above is one cell of this)
Larger N (N=30+) on highest-signal cells to tighten Wilson CIs from "real failure shape" to "bounded rate"
More PR shapes for the dreamserver benchmark family — current 1-PR audit pins to PR #1057 specifically. A docs-only PR, a security PR, and a refactor PR would give different complexity-ceiling data points
Higher-precision quantizations of 35B-A3B — currently fails at 4-bit AWQ; FP8 / BF16 untested
Long-horizon agentic improvements — both local arms find degenerate failure modes within 30-60 min on the 75-PR task. Methodology for keeping local agents productive past 30 min is an open research question
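For the larger-N item above, a minimal sketch of the standard Wilson score interval shows how much the interval narrows with more runs. The 8/10 figure is the p3_doc ship rate cited earlier in this doc; the 24/30 case is a hypothetical run at the same observed rate, included only to show the effect of N. The function name and the z = 1.96 (95%) default are choices made for this example.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must be between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


lo10, hi10 = wilson_interval(8, 10)
print(f"N=10: {lo10:.2f} to {hi10:.2f}")  # N=10: 0.49 to 0.94

# Hypothetical: same observed rate at N=30, to show the narrowing.
lo30, hi30 = wilson_interval(24, 30)
print(f"N=30: {lo30:.2f} to {hi30:.2f}")
```

At N=10 the interval spans roughly 45 percentage points, which supports a statement about the shape of a failure but not a tight rate. Tripling N narrows it substantially.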

How to use this doc

If you're picking work, start at the top of "Active follow-ups" — they're prioritized.

If you're contributing externally, look for [contributor-welcome] flags. Items 1, 3, and 4 are the highest-leverage external contributions because they unblock validity-boundary claims this benchmark can't make on its own hardware.

If you're maintaining: review this doc when major work lands. Items move from "Active" to "Done" via PRs that link back here; items added from new findings docs should also link back to their source.
