docs: validity boundaries, ROADMAP, FIELD-REPORTS, quant-contribution path by Lightheartdevs · Pull Request #15 · Light-Heart-Labs/MMBT-Messy-Model-Bench-Tests

Lightheartdevs · 2026-05-03T13:55:22Z

Summary

Phase 4 of the docs reorganization. Translates the substantive criticism from the recent LocalLLaMA discussion thread into validity-boundary docs and a contribution-pathway scaffold. No changes to data or methodology — purely about making the study's scope explicit and giving external contributors a clear way to fill the gaps it doesn't cover.

- README.md + COMPARISON.md: hoisted "operating point" callout (4-bit AWQ Cyankiwi on 2× RTX PRO 6000 Blackwell; other quants/VRAM/hardware not characterized) to entry-point visibility. Stops the "what quant?" and "what about my hardware?" questions before they're asked.
- COMPARISON.md: new "What this benchmark doesn't characterize" section before the Drilling-deeper table. Five paragraphs covering: other quants of the same models, other VRAM tiers, other hardware classes (Mac M-series), other languages (Python-only Phase 1), single-rig hardware variance.
- KNOWN-LIMITATIONS.md: expanded "Quantization specificity" subsection with a new "Cyankiwi 4-bit AWQ field reports" block. Acknowledges multiple practitioner reports that these specific quants underperform official FP8 / Unsloth UD4 of the same base models. Defends the within-quant comparison without defending absolute capability claims at higher precisions.
- microbench-phase-b/findings.md: extended Recommended follow-ups from 3 → 6 items; added FP8 re-run, M-series Mac sibling study, and language-mix expansion.
- New `ROADMAP.md`: consolidated open questions and contribution opportunities from across all findings docs. 10 prioritized active follow-ups with [contributor-welcome] flags on the 4 items external contributors can take end-to-end.
- tooling/ADDING-A-MODEL.md: added "Two contribution shapes" section clarifying that "same model, different quant" is a valid contribution path, not just adding a wholly new model.
- New `FIELD-REPORTS.md`: template for collecting voluntary practitioner reports of model behavior on real workflows. Complements the structured benchmark data with anecdotal-but-specific evidence.

What this PR is NOT: a response to specific commenters, a change to the benchmark data, or a folder restructure. The discourse is ephemeral; the docs are permanent. This translates the signal into validity-boundary docs and roadmap items.

Test plan

- Render check on the new "What this benchmark doesn't characterize" section in COMPARISON.md; verify the in-page anchor link from the operating-point callout resolves correctly
- Render check on ROADMAP.md and FIELD-REPORTS.md
- Verify the cross-references between COMPARISON.md, KNOWN-LIMITATIONS.md, ROADMAP.md, FIELD-REPORTS.md, and ADDING-A-MODEL.md all resolve (CI link-check should report this)
- Confirm `check-links` workflow passes with `checked=867 missing=0`
- 5-minute test (informal): a reader arriving via a public link should hit the operating-point framing within 30 seconds and understand what isn't measured within 90 seconds
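
For reviewers who want to reproduce the cross-reference check locally, the idea behind a link scan can be sketched as below. This is a minimal illustration only, not the repository's actual `check-links` workflow, whose implementation is not shown in this PR. The function names `check_links` and `summarize` are hypothetical, and the sketch assumes that only inline Markdown links with relative targets are counted. Its summary string mirrors the `checked=… missing=…` shape quoted in the test plan.

```python
import re
from pathlib import Path

# Matches inline Markdown links of the form [text](target).
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Matches targets that start with a URL scheme such as https: or mailto:.
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def check_links(root: Path) -> tuple[int, list[str]]:
    """Count relative links in every .md file under root and list the broken ones.

    Hypothetical helper: external URLs and pure in-page anchors are skipped,
    and a '#fragment' suffix is ignored when testing whether the file exists.
    """
    checked = 0
    missing: list[str] = []
    for md_file in sorted(root.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        for target in LINK_RE.findall(text):
            if SCHEME_RE.match(target) or target.startswith("#"):
                continue
            path_part = target.split("#", 1)[0]
            checked += 1
            if not (md_file.parent / path_part).exists():
                missing.append(f"{md_file.relative_to(root)} -> {target}")
    return checked, missing


def summarize(checked: int, missing: list[str]) -> str:
    """Format the result in the same shape as the line quoted in the test plan."""
    return f"checked={checked} missing={len(missing)}"
```

Calling `summarize(*check_links(Path(".")))` from the repository root would print a one-line summary. The counts will only match the CI figure if the real workflow uses the same counting rules, which is an assumption here.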

🤖 Generated with Claude Code

…on path

Phase 4 of docs reorganization. Translates the substantive criticism from the LocalLLaMA discussion thread into validity-boundary docs and a contribution-pathway scaffold. No changes to data or methodology; this is purely about making the study's scope explicit and giving external contributors a clear way to fill the gaps it doesn't cover.

- README.md + COMPARISON.md: hoisted "operating point" callout (4-bit AWQ Cyankiwi on 2x RTX PRO 6000 Blackwell; other quants/VRAM/hardware not characterized) to entry-point visibility. Stops the "what quant?" and "what about my hardware?" questions before they're asked.
- COMPARISON.md: new "What this benchmark doesn't characterize" section before the Drilling-deeper table. Five paragraphs covering: other quants of the same models, other VRAM tiers, other hardware classes (Mac M-series), other languages (Python-only Phase 1), single-rig hardware variance. Pre-empts the most common substantive criticisms by making them part of the doc instead of letting them surface in threads.
- KNOWN-LIMITATIONS.md: expanded "Quantization specificity" subsection with a new "Cyankiwi 4-bit AWQ field reports" block. Acknowledges multiple practitioner reports that these specific quants underperform official FP8 / Unsloth UD4 of the same models. Defends the within-quant comparison (still informative) without defending absolute capability claims (not characterized at higher precisions). Commits to FP8 re-run as the validation pass.
- microbench-phase-b/findings.md: extended Recommended follow-ups from 3 to 6 items; added FP8 re-run, M-series Mac sibling study, and language-mix expansion. Pointer to the new ROADMAP for the consolidated cross-doc view.
- New ROADMAP.md: consolidated open questions and contribution opportunities from across all findings docs. 10 prioritized active follow-ups with [contributor-welcome] flags on the 4 items external contributors can take end-to-end. Replaces the need to read 3 findings docs to know "where can I help?"
- tooling/ADDING-A-MODEL.md: added "Two contribution shapes" section at top. Clarifies that "same model, different quant" (e.g. official FP8, Unsloth UD4) is a valid contribution path, not just adding a wholly new model. Currently the highest-priority external contribution per ROADMAP.
- New FIELD-REPORTS.md: template for collecting voluntary practitioner reports of model behavior on real workflows. Complements the structured benchmark data with anecdotal-but-specific evidence. Initially seeded with the format example; populates as reports come in.

What this PR is NOT: a response to specific commenters, a change to the benchmark data, or a folder restructure. The discourse is ephemeral; the docs are permanent. This translates the signal into validity-boundary docs and roadmap items.

Broken-link scan: 0/867 (was 0/831 in Phase 3; +36 valid links added).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Lightheartdevs merged commit 6fd1466 into main May 3, 2026
1 check passed

Lightheartdevs merged 1 commit into main from submit/docs-phase-4-validity-boundaries
