This benchmark answers one question: "can this model complete a 75-PR audit at all?" It is a categorical pass-or-fail test at multi-hour scope, not a fine-grained model-quality differentiator.
What it does support:
- Cloud-vs-local categorical gap: both cloud entries (Opus-4.7, GPT-5.5) ship complete deliverables; both local entries fail in different ways
- Long-horizon agentic failure-mode taxonomy — 27B writes 75/75 verdict files but only 3 are real reviews (72 are template stubs); Coder-Next produces no deliverable across 5 attempts (3 distinct degenerate failure modes)
- Existence proof that 30B-class quantized local models break at this task scope
What it does NOT support:
- Per-PR ground-truth accuracy comparison (would need 75 hand-graded ground-truth verdicts; we don't have them)
- Cloud-vs-cloud quality comparison at the per-claim level
- Model-vs-model differentiation between Coder-Next and 27B at this scope (both fail; the failure shapes differ but neither ships a real audit)
For the 5-minute model-selection question, see ../../COMPARISON.md. This benchmark contributes the "long-horizon agentic regime" data point.
Audit all 75 open pull requests against Light-Heart-Labs/DreamServer and
produce a complete triage report with per-PR merge/revise/reject
recommendations, tests/proof notes, cross-PR dependency analysis, maintainer
strategy, and a git repository as the deliverable.
This task combines:
- live GitHub repository triage,
- large backlog synthesis,
- cross-PR dependency analysis,
- code review,
- test/proof recording,
- repo construction,
- strategic maintainer communication.
The output is not just a final answer. The model has to build a durable, navigable audit repository with traceability for every verdict.
| Model | Entry | Final Tally |
|---|---|---|
| GPT-5.5 | GPT-5.5/ | 34 Merge / 40 Revise / 1 Reject |
| Claude Opus 4.7 (1M context) | Opus-4.7/ | 53 Merge / 6 Revise-small / 1 Revise-arch / 1 Reject / 14 Hold |
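Both tallies account for all 75 open PRs. A quick sanity check, with the counts transcribed from the table above:

```python
# Verify each entry's verdict counts cover all 75 open PRs.
tallies = {
    "GPT-5.5": {"Merge": 34, "Revise": 40, "Reject": 1},
    "Opus-4.7": {"Merge": 53, "Revise-small": 6, "Revise-arch": 1,
                 "Reject": 1, "Hold": 14},
}
for entry, counts in tallies.items():
    total = sum(counts.values())
    assert total == 75, f"{entry}: {total} verdicts, expected 75"
```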
The two entries use different verdict frameworks. GPT-5.5 uses a 3-bucket
Merge/Revise/Reject taxonomy. Opus-4.7 introduces a Hold category for
"needs maintainer judgment" decisions, splits Revise into small /
missing-tests / architectural, and uses an explicit 5-axis 0-20 risk score
documented as an ADR (Opus-4.7/decisions/0001-risk-scoring-methodology.md).
Both methodologies are reasonable; the ADR is what makes the Opus-4.7
verdicts comparable on their own terms.
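The ADR itself is the source of truth for the scoring axes. As a rough illustration of the arithmetic only, a 5-axis 0-20 score is consistent with five axes rated 0-4 each and summed; the axis names in this sketch are placeholders, not the ADR's actual axes.

```python
# Illustrative only: five axes rated 0-4 each, summed into a 0-20 risk score.
# Axis names are placeholders; the real definitions live in
# Opus-4.7/decisions/0001-risk-scoring-methodology.md.
AXES = ("blast_radius", "test_coverage", "reversibility",
        "complexity", "dependency_risk")

def risk_score(ratings: dict[str, int]) -> int:
    """Sum five 0-4 axis ratings into a 0-20 total."""
    assert set(ratings) == set(AXES), "rate each axis exactly once"
    assert all(0 <= r <= 4 for r in ratings.values()), "each rating is 0-4"
    return sum(ratings.values())

print(risk_score({axis: 1 for axis in AXES}))  # -> 5 (low overall risk)
```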
Each model entry should preserve its own artifact structure. The shared shape is:
- report/
- prs/pr-{number}/
- testing/
- analysis/
- research/
- decisions/
- sources.md
- tool-log.md
Both current entries follow this shape. The Opus-4.7 entry adds an
ACTIONABLE_FINDINGS_INDEX.md at its root for a fast scan of line-level
issues, mirroring a similar index file in the GPT-5.5 entry.
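A minimal sketch of a completeness check against that shape, assuming the listed items sit at the entry root (the real entries may nest them differently, e.g. per-PR material under prs/pr-{number}/):

```python
# Rough completeness check for a model entry against the shared shape.
# Assumes the listed items sit at the entry root; adjust if an entry
# nests them differently.
from pathlib import Path

SHARED_SHAPE = ["report", "prs", "testing", "analysis",
                "research", "decisions", "sources.md", "tool-log.md"]

def missing_items(entry_dir: str) -> list[str]:
    """Return shared-shape items not present in the given entry directory."""
    root = Path(entry_dir)
    return [item for item in SHARED_SHAPE if not (root / item).exists()]

print(missing_items("Opus-4.7"))  # [] when the entry is complete
```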