35 changes: 35 additions & 0 deletions COMPARISON.md
@@ -6,6 +6,8 @@
>
> **Last updated**: 2026-05-02 — reflects [`microbench-phase-b-2026-05-02`](benchmarks/microbench-phase-b-2026-05-02/) (N=10 + 27B-no-think third arm). Pre-no-think readers: the picture has shifted.

> **Operating point**: All arms are **Cyankiwi 4-bit AWQ** on **2× RTX PRO 6000 Blackwell at 500 W cap**. Other quants, VRAM tiers, hardware classes, and languages are **not characterized** — see [What this benchmark doesn't characterize](#what-this-benchmark-doesnt-characterize) below. The within-quant comparison here is informative; absolute model capability at higher precisions is a separate question.

## TL;DR

**No model is overall best.** The three arms have orthogonal strengths and statistically indistinguishable headline ship rates (74–96%). Pick by task class:
@@ -229,6 +231,39 @@ These would tighten the picture:

---

## What this benchmark doesn't characterize

The findings above apply to a single operating point. Outside this point, the picture shifts in ways this study doesn't measure. Each item below is a real follow-up that contributors are welcome to pick up — see [`ROADMAP.md`](ROADMAP.md) for the prioritized list.

### Other quants of the same models

All three arms use **Cyankiwi 4-bit AWQ** community quants. Multiple field reports (see [`KNOWN-LIMITATIONS.md` § Cyankiwi 4-bit AWQ field reports](KNOWN-LIMITATIONS.md#quantization-specificity)) suggest these specific quants underperform the official Qwen FP8 quants and Unsloth UD4 GGUFs of the same base models — describing degraded output coherence and increased loop pathologies on certain task shapes.

What this means for the data here:
- Within-quant comparison (Coder-Next vs 27B at the same Cyankiwi 4-bit AWQ) **is** informative — the differential is a model-behavior gap, not a quant artifact.
- Absolute model capability at higher precision (FP8 / UD4 / BF16) is **not** characterized.
- Effects that depend on the thinking mechanism (the `--no-think` ship-rate jump, the word-trim loop reduction) are **unlikely to be quant-specific**: they are about the trace, not the precision of the weights.

The FP8 re-run is the highest-priority follow-up.

### Other VRAM tiers

Tested at 96 GB per GPU. The published vLLM flags (`--max-model-len 262144`, `--gpu-memory-utilization 0.92`) will OOM on consumer 24–48 GB cards. At those tiers the choice isn't "which model wins at 4-bit AWQ"; it's "27B Q8 fits cleanly but Coder-Next needs Q4-with-CPU-offload, which dominates the wall time." That's a different study entirely; this one doesn't address it.
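
A rough way to reason about the tiers is to estimate weight memory alone. The sketch below is a back-of-envelope calculation, not a measurement: it assumes the nominal 27B parameter count implied by the model name, counts 1 GB as 1e9 bytes, and ignores quantization metadata, activations, runtime overhead, and the KV cache needed for `--max-model-len 262144` (which depends on the architecture and is not estimated here).

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes).

    Ignores quantization scales/zero-points, KV cache, activations,
    and runtime overhead, so real usage will be higher.
    """
    return n_params * bits_per_weight / 8 / 1e9


# Assumption: nominal parameter count taken from the "27B" model name.
PARAMS_27B = 27e9

for label, bits in [("4-bit", 4), ("8-bit", 8), ("16-bit", 16)]:
    gb = weight_footprint_gb(PARAMS_27B, bits)
    print(f"27B @ {label}: ~{gb:.1f} GB of weights")
```

Under this estimate, 8-bit 27B weights already exceed 24 GB on their own, so the "fits cleanly" framing would apply only toward the upper end of the 24–48 GB range, and only before the KV cache is accounted for.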

### Other hardware classes

The comparison is at an **Nvidia/dense-VRAM** operating point. On Mac M-series unified memory the dense-vs-MoE compute tradeoff is expected to invert: the 3B-active Coder-Next should win on tokens-per-second, while full-dense compute becomes the bottleneck for 27B. This is untested here. The harness is portable (only the vLLM launch swaps for MLX), so this is a sibling study someone with M-series hardware could run.

### Languages other than Python

Phase 1 coding tasks (`p1_bugfix`, `p1_refactor`, `p1_testwrite`) all use a Python project (`logalyzer`). No C, JavaScript, systems-programming, browser front-end, or low-level work was tested. Coder-Next is code-specialized; its relative performance on languages it's tuned harder for is plausibly different from what shows up here. Phase 2 / Phase 3 tasks are mostly business/text and language-agnostic.

### Single-rig hardware

All measurements on one Blackwell rig at 500 W cap. Cross-rig variance not bounded. Power-cap effects are characterized separately in [`hardware-tests/vllm-power-sweep-2026-04-29/`](hardware-tests/vllm-power-sweep-2026-04-29/) but only on this rig.

---

## Drilling deeper

| If you want… | Read |
57 changes: 57 additions & 0 deletions FIELD-REPORTS.md
@@ -0,0 +1,57 @@
# Field reports

Voluntary reports from practitioners running these (or related) models on real workflows. Anecdotal but specific — meant to complement the structured benchmark data with "here's what I actually saw on my hardware on my real task." Not graded; reader-judged.

## How to contribute a report

Open a PR adding an entry to the relevant section below using the template:

```markdown
### YYYY-MM-DD — <one-line summary>

- **Model + quant**: e.g. `Qwen3.6-27B-AWQ (Cyankiwi 4-bit)` or `Qwen3.6-27B (official FP8)`
- **Hardware**: GPU(s), VRAM, key flags (e.g. `--max-model-len`, `--gpu-memory-utilization`)
- **Inference engine**: vLLM / llama.cpp / MLX / etc., version
- **Use case**: one or two sentences on what you were trying to do
- **What you observed**: the concrete behavior — failure mode, success pattern, surprising result
- **Reproducible test case** (optional): if you can share a prompt or a starter, link it. If not, omit this field.
- **Reporter**: GitHub handle or "anonymous"
```

Reports are kept as written (lightly edited for formatting only). If a maintainer adds context or a follow-up note, it goes in a `> Maintainer note:` block beneath the report so the original is preserved.

**What this is for**: surfacing patterns that don't show up in N=10 microbench cells but do show up at scale on real workflows. Examples: quant-specific behavior degradation, language-specific failures, long-horizon failure modes the bench doesn't probe.

**What this isn't**: a leaderboard, a forum, or a venue for unsubstantiated claims. Reports without specifics (model + hardware + observed behavior) are not useful and won't be merged.
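
As a pre-submission sanity check, a contributor could verify that an entry fills in every non-optional template field. The following is a minimal sketch, not part of the repository's tooling: the function name `missing_fields` and the choice to treat everything except "Reproducible test case" as required are assumptions based on the template above.

```python
import re

# Assumption: every template field except the optional
# "Reproducible test case" is required.
REQUIRED_FIELDS = [
    "Model + quant",
    "Hardware",
    "Inference engine",
    "Use case",
    "What you observed",
    "Reporter",
]


def missing_fields(entry: str) -> list[str]:
    """Return required fields that are absent or left empty in a report entry."""
    missing = []
    for field in REQUIRED_FIELDS:
        # Matches lines such as "- **Hardware**: some text".
        pattern = rf"^- \*\*{re.escape(field)}\*\*:[ \t]*\S"
        if not re.search(pattern, entry, flags=re.MULTILINE):
            missing.append(field)
    return missing


example = """\
- **Model + quant**: `Qwen3.6-27B-AWQ (Cyankiwi 4-bit)`
- **Hardware**: 2x RTX PRO 6000 Blackwell
- **Use case**: microbench
- **Reporter**: anonymous
"""
print(missing_fields(example))
```

For the example entry this reports the two fields that were left out, which is the kind of gap that would make a report too unspecific to merge.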

---

## Reports

### Template — example entry

> This is the seed entry showing the format. Replace with real reports as they come in.

- **Model + quant**: `Qwen3.6-27B-AWQ` (Cyankiwi 4-bit AWQ)
- **Hardware**: 2× RTX PRO 6000 Blackwell, 96 GB each, 500 W cap
- **Inference engine**: vLLM 0.x.y, `--max-model-len 262144`, `--temperature 0.3`
- **Use case**: 12 task-family microbench (see [`benchmarks/microbench-phase-b-2026-05-02/`](benchmarks/microbench-phase-b-2026-05-02/))
- **What you observed**: 95.8% ship rate at N=10 across all 12 cells with `--no-think`. See `benchmarks/microbench-phase-b-2026-05-02/findings.md` for the structured writeup.
- **Reproducible test case**: see [`tooling/REPRODUCING.md`](tooling/REPRODUCING.md)
- **Reporter**: maintainer (the structured benchmark this complements)

---

## Patterns surfaced from reports

> Maintainer-curated summary of patterns that show up across multiple field reports. Updated as reports accumulate. Currently empty — will populate as reports land.

(no patterns yet — file is brand new as of 2026-05-03)

---

## Related

- Structured benchmark data: [`COMPARISON.md`](COMPARISON.md), [`SCORECARD.md`](SCORECARD.md)
- Open questions and contribution opportunities: [`ROADMAP.md`](ROADMAP.md)
- Failure-mode vocabulary: [`tooling/FAILURE-TAXONOMY.md`](tooling/FAILURE-TAXONOMY.md)
16 changes: 16 additions & 0 deletions KNOWN-LIMITATIONS.md
@@ -68,6 +68,22 @@ Cost numbers in `cost.json` are upper-bound estimates (assume the GPU drew at it

The local-model entries used 4-bit AWQ quantizations from the cyankiwi HuggingFace organization. Different quants of the same base model (FP8, BF16, different AWQ tools) will behave differently. The entries pin specific HuggingFace model paths in `launch-commands.md`; respect those when comparing.

#### Cyankiwi 4-bit AWQ field reports

As of 2026-05, multiple practitioners (in independent forum / community discussions) have reported that the Cyankiwi 4-bit AWQ quants of Qwen3.6-27B and Qwen3-Coder-Next underperform the official Qwen FP8 quants and Unsloth UD4 GGUFs of the same base models in their workflows. The reports describe degraded output coherence and increased loop pathologies on certain task shapes.

This benchmark uses Cyankiwi 4-bit AWQ throughout for three reasons:
1. Reproducible community release with stable HuggingFace paths
2. Fits the available VRAM-throughput envelope on Tower2 with room for `--max-model-len 262144` and large concurrency batches in the hardware-tests sweep
3. Consistent across all three model arms (apples-to-apples within the quant)

What this means for the data here:
- **Within-quant comparisons (Coder-Next vs 27B at the same Cyankiwi 4-bit AWQ) remain informative.** Differential behaviors — Coder-Next's `p3_market` 0/10 collapse, 27B's word-trim loop, the `--no-think` ship-rate jump — are model-mechanism findings that are unlikely to disappear at higher precision.
- **Absolute model capability at higher precisions (FP8 / Unsloth UD4 / BF16) is not characterized.** Headline numbers like "27B-no-think 95.8% ship rate" are quant-specific.
- **The ranking of cells where models tie at this quant could shift at FP8.** The both-ship cells (p2_ci, p2_extract, p2_triage) are the most likely to be sensitive.

The FP8 re-run of the same 12-cell grid is the highest-priority follow-up — see [`ROADMAP.md`](ROADMAP.md). Contributors with FP8-capable hardware are welcome to PR results via the [`tooling/ADDING-A-MODEL.md`](tooling/ADDING-A-MODEL.md) flow (which now explicitly covers the "same model, different quant" contribution path).

### Cloud-LLM hardware is different

Cloud entries (`Opus-4.7/`, `GPT-5.5/`) ran on the providers' inference infrastructure, not Tower2. Cross-comparison should account for that — "the cloud LLM is better" partly reflects "different hardware + different quantization-strategy + different inference engine," not just model differences.
4 changes: 4 additions & 0 deletions README.md
@@ -14,6 +14,10 @@ but I'm making it public so that other people can use it too.
| How to benchmark a new local model | [`tooling/ADDING-A-MODEL.md`](tooling/ADDING-A-MODEL.md) |
| How to replay a specific past run | [`tooling/REPRODUCING.md`](tooling/REPRODUCING.md) |

## Operating point (read before quoting)

All published runs use **Cyankiwi 4-bit AWQ** quants on **2× RTX PRO 6000 Blackwell at 500 W cap**. Other quants (official FP8, Unsloth UD4 GGUF, BF16), other VRAM tiers (24 GB / 48 GB), other hardware classes (Mac M-series unified memory), and languages other than Python are **not characterized** here. See [`COMPARISON.md` § What this benchmark doesn't characterize](COMPARISON.md#what-this-benchmark-doesnt-characterize) for the full validity-boundary list, and [`ROADMAP.md`](ROADMAP.md) for what's queued to fill those gaps.

## Layout

```text
```
109 changes: 109 additions & 0 deletions ROADMAP.md
@@ -0,0 +1,109 @@
# Roadmap

> Consolidated list of open questions, validation gaps, and contribution opportunities surfaced across the benchmark entries. Each item links to the source doc that motivated it.
>
> Items marked **[contributor-welcome]** are scoped so that an external contributor with the right hardware can take them end-to-end via the [`tooling/`](tooling/) reproduction pack and submit results as a PR.
>
> **Last reviewed**: 2026-05-03.

## Active follow-ups (in priority order)

### 1. FP8 re-run of the 12-cell microbench grid &nbsp; **[contributor-welcome]**

**Source**: [`KNOWN-LIMITATIONS.md` § Cyankiwi 4-bit AWQ field reports](KNOWN-LIMITATIONS.md#quantization-specificity), [`benchmarks/microbench-phase-b-2026-05-02/findings.md` § Recommended follow-ups](benchmarks/microbench-phase-b-2026-05-02/findings.md#recommended-follow-ups)

Multiple practitioners report that the Cyankiwi 4-bit AWQ quants underperform official Qwen FP8 of the same base models. Re-running the full 12-cell × N=10 grid on FP8 would show whether the current findings generalize across quants or should be bounded as quant-specific.

What to do: pull official Qwen FP8 quants, run the 4-command friendly path in [`tooling/ADDING-A-MODEL.md`](tooling/ADDING-A-MODEL.md) for each model arm, submit a PR with the results.

Hardware: needs FP8-capable GPU. RTX PRO 6000 / H100 / similar.

### 2. PASS-rate grader sweep on the no-think tarballs

**Source**: [`benchmarks/microbench-phase-b-2026-05-02/findings.md`](benchmarks/microbench-phase-b-2026-05-02/findings.md)

The current 95.8% headline for 27B-no-think is a `done_signal` (ship) rate, not a PASS rate. The existing graders need to be run against the 120 no-think workspace tarballs to convert ship rate to PASS rate. The `p3_doc` 8/10 ship rate in particular could reflect real PASSes or could be shipping briefs over the 700-word limit.

Internal — uses the source bench's full transcripts, not just the published representatives.
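
The distinction between ship rate and PASS rate can be sketched as follows. This is an illustration only: the `RunResult` record, its field names, and the sample data are hypothetical and do not reflect the harness's actual transcript or grader formats. The sample numbers are invented to show how an 8/10 ship rate could hide a lower PASS rate; they are not results.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    cell: str
    done_signal: bool  # the agent signalled completion (a "ship")
    grader_pass: bool  # hypothetical grader verdict on the shipped workspace


def ship_and_pass_rates(runs: list[RunResult]) -> tuple[float, float]:
    """Return (ship rate, PASS rate) over all runs in a cell."""
    if not runs:
        raise ValueError("no runs to summarize")
    shipped = sum(r.done_signal for r in runs)
    # A run only counts as a PASS if it shipped and the grader accepted it.
    passed = sum(r.done_signal and r.grader_pass for r in runs)
    return shipped / len(runs), passed / len(runs)


# Invented data for illustration: 8 of 10 runs ship, 5 of those pass grading.
runs = (
    [RunResult("p3_doc", True, True)] * 5
    + [RunResult("p3_doc", True, False)] * 3
    + [RunResult("p3_doc", False, False)] * 2
)
ship, passed = ship_and_pass_rates(runs)
print(f"ship rate {ship:.0%}, PASS rate {passed:.0%}")
```

Keeping both rates over the same denominator (all runs in the cell) makes the two headline numbers directly comparable.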

### 3. M-series Mac sibling study &nbsp; **[contributor-welcome]**

**Source**: [`COMPARISON.md` § Other hardware classes](COMPARISON.md#other-hardware-classes)

The dense-vs-MoE compute tradeoff inverts on unified memory: Coder-Next (3B-active) wins on tokens-per-second; 27B (full-dense compute) becomes the bottleneck. Untested.

What to do: run the same 12 cells on M-series via MLX. Only the vLLM launch commands swap; harness is portable.

Hardware: M-series Mac with ≥48 GB unified memory.

### 4. Language-mix expansion for Phase 1 &nbsp; **[contributor-welcome — task-design first]**

**Source**: [`COMPARISON.md` § Languages other than Python](COMPARISON.md#languages-other-than-python)

Current Phase 1 cells (`p1_bugfix`, `p1_refactor`, `p1_testwrite`) all use a Python project (`logalyzer`). Adding C, JavaScript, or systems-programming starters would test whether Coder-Next's code specialization manifests differently outside Python.

This is task-design work first (find / write a starter project with planted bugs of comparable difficulty), then a benchmarking session. Not a one-command re-run.

### 5. Pairwise quality study extension to the 4 differential cells

**Source**: [`benchmarks/microbench-phase-b-2026-05-02/findings-pairwise-quality-three-model.md`](benchmarks/microbench-phase-b-2026-05-02/findings-pairwise-quality-three-model.md)

The hand-graded quality study covers the 3 both-ship cells (p2_ci, p2_extract, p2_triage). The 4 differential cells (p2_hallucination, p3_business, p3_doc, p3_market) where models ship at different rates haven't been hand-graded for substantive quality of the runs that *did* ship.

### 6. Re-run N=3 P1 cells for 27B-thinking on the current harness

**Source**: [`benchmarks/microbench-phase-b-2026-05-02/findings.md` § Caveats](benchmarks/microbench-phase-b-2026-05-02/findings.md#caveats)

The 27B-thinking 1/9 P1 ship rate may include harness-drift effects (older `file_sha256: 7698067...` vs current `7ea9592...`). Definitively settle whether it's drift or a real model regression.
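
One way to rule harness drift in or out before re-running is to hash the harness file and compare it against the recorded digest. The sketch below assumes that the recorded `file_sha256` is a plain SHA-256 over the file's bytes and that only a truncated prefix is available; the helper names and the `harness_path` argument are hypothetical.

```python
import hashlib


def file_sha256(path: str) -> str:
    """Return the hex SHA-256 digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_recorded(harness_path: str, recorded_prefix: str) -> bool:
    """Check a file against a possibly truncated recorded digest.

    Trailing dots (as in '7ea9592...') are stripped before comparing.
    """
    prefix = recorded_prefix.rstrip(".").lower()
    return file_sha256(harness_path).startswith(prefix)
```

A run would then be tagged as "current harness" only if `matches_recorded(path, "7ea9592...")` holds for the file it was executed with.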

### 7. Per-claim rubric pass on cloud entries

**Source**: [`KNOWN-LIMITATIONS.md` § Comparison-to-cloud caveats](KNOWN-LIMITATIONS.md#comparison-to-cloud-caveats)

Cloud Opus-4.7 / GPT-5.5 entries weren't graded with the same per-claim rubric used on the local entries. Cloud-vs-local comparison is currently *categorical* only ("cloud ships, local mostly doesn't"), not per-claim accuracy. Building a uniform rubric and applying it to both classes would let head-to-head claims go beyond shipping rates.

### 8. Citation-validity full sweep on `p3_market` 27B

**Source**: [`SCORECARD.md`](SCORECARD.md), [`COMPARISON.md` § What we don't know yet](COMPARISON.md#what-we-dont-know-yet)

Sampled 18/33 URLs (~55%) from one 27B market-research run; measured 75% valid in that sample. The remaining 15 URLs are unverified. A full sweep would turn the citation-validity number from a sample estimate into a full measurement.

### 9. 27B-no-think on dreamserver-scope tasks

**Source**: [`COMPARISON.md` § What we don't know yet](COMPARISON.md#what-we-dont-know-yet)

The no-think arm hasn't been run against the 1-PR or 75-PR audits. The substance-monitoring methodology proven on phase-b would transfer; the verdict-production issue 27B-thinking had on PR #1057 *might* improve with no-think — hypothesis only.

### 10. 27B-no-think on the wallstreet investment-memo task

**Source**: [`COMPARISON.md` § What we don't know yet](COMPARISON.md#what-we-dont-know-yet)

Untested. Given the no-think mode's clean shipping on `p3_business` (8/10) and `p3_doc` (8/10), plausible it would handle the multi-section memo cleanly — but unmeasured.

## Welcomed contributions

Beyond the prioritized follow-ups above, contributions in these shapes are explicitly welcome:

- **New model entries** — any vLLM-supported local model with a working tool-call parser. End-to-end walkthrough: [`tooling/ADDING-A-MODEL.md`](tooling/ADDING-A-MODEL.md). Half-day to one-day operator time.
- **Same model, different quant** — official FP8, Unsloth UD4 GGUF, BF16, etc. Same friendly path; only the HuggingFace path + vLLM launch flags change.
- **Field reports** — anecdotal but specific reports of model behavior on real workflows. See [`FIELD-REPORTS.md`](FIELD-REPORTS.md) for the template; one example use case is the Cyankiwi-vs-FP8 quant divergence many practitioners have reported.
- **Methodology improvements** — better grader scripts, additional task families, refined failure-mode taxonomy entries. See [`tooling/FAILURE-TAXONOMY.md`](tooling/FAILURE-TAXONOMY.md) and [`tooling/graders/`](tooling/graders/).
- **Bug reports on harness, graders, or analysis** — open an issue.

## Methodology improvements (longer-horizon)

Items that would require larger structural work, not just a re-run:

- **Per-claim rubric** uniformly applied across local + cloud entries (item 7 above is one cell of this)
- **Larger N (N=30+) on highest-signal cells** to tighten Wilson CIs from "real failure shape" to "bounded rate"
- **More PR shapes for the dreamserver benchmark family** — current 1-PR audit pins to PR #1057 specifically. A docs-only PR, a security PR, and a refactor PR would give different complexity-ceiling data points
- **Higher-precision quantizations of 35B-A3B** — currently fails at 4-bit AWQ; FP8 / BF16 untested
- **Long-horizon agentic improvements** — both local arms find degenerate failure modes within 30-60 min on the 75-PR task. Methodology for keeping local agents productive past 30 min is an open research question
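
To illustrate why larger N matters for the Wilson CIs mentioned above, here is a minimal sketch of the standard Wilson score interval for a binomial proportion. The 8/10 input matches ship rates quoted elsewhere in this doc; the 24/30 input is a hypothetical same-rate result at N=30, included only to show how the interval narrows.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    center = (p_hat + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# 8/10 is a ship rate reported in this doc; 24/30 is hypothetical.
for shipped, n in [(8, 10), (24, 30)]:
    lo, hi = wilson_interval(shipped, n)
    print(f"{shipped}/{n}: [{lo:.2f}, {hi:.2f}] (width {hi - lo:.2f})")
```

At N=10 an 8/10 result spans roughly 0.49 to 0.94, which is why cells with different observed rates can still be statistically indistinguishable; the same observed rate at N=30 gives a noticeably tighter interval.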

## How to use this doc

If you're picking work, start at the top of "Active follow-ups" — they're prioritized.

If you're contributing externally, look for **[contributor-welcome]** flags. Items 1, 3, and 4 are the highest-leverage external contributions because they unblock validity-boundary claims this benchmark can't make on its own hardware.

If you're maintaining: review this doc when major work lands. Items move from "Active" to "Done" via PRs that link back here; items added from new findings docs should also link back to their source.