docs: validity boundaries, ROADMAP, FIELD-REPORTS, quant-contribution path#15

Merged
Lightheartdevs merged 1 commit into main from submit/docs-phase-4-validity-boundaries on May 3, 2026

@Lightheartdevs

Contributor

Summary

Phase 4 of the docs reorganization. Translates the substantive criticism from the recent LocalLLaMA discussion thread into validity-boundary docs and a contribution-pathway scaffold. No changes to data or methodology — purely about making the study's scope explicit and giving external contributors a clear way to fill the gaps it doesn't cover.

  • README.md + COMPARISON.md — hoisted "operating point" callout (4-bit AWQ Cyankiwi on 2× RTX PRO 6000 Blackwell; other quants/VRAM/hardware not characterized) to entry-point visibility. Stops the "what quant?" and "what about my hardware?" questions before they're asked.
  • COMPARISON.md — new "What this benchmark doesn't characterize" section before the Drilling-deeper table. Five paragraphs covering: other quants of the same models, other VRAM tiers, other hardware classes (Mac M-series), other languages (Python-only Phase 1), single-rig hardware variance.
  • KNOWN-LIMITATIONS.md — expanded "Quantization specificity" subsection with a new "Cyankiwi 4-bit AWQ field reports" block. Acknowledges multiple practitioner reports that these specific quants underperform official FP8 / Unsloth UD4 of the same base models. Defends the within-quant comparison without defending absolute capability claims at higher precisions.
  • microbench-phase-b/findings.md — extended Recommended follow-ups from 3 → 6 items; added FP8 re-run, M-series Mac sibling study, and language-mix expansion.
  • New `ROADMAP.md` — consolidated open questions and contribution opportunities from across all findings docs. 10 prioritized active follow-ups with [contributor-welcome] flags on the 4 items external contributors can take end-to-end.
  • tooling/ADDING-A-MODEL.md — added "Two contribution shapes" section clarifying that "same model, different quant" is a valid contribution path, not just adding a wholly new model.
  • New `FIELD-REPORTS.md` — template for collecting voluntary practitioner reports of model behavior on real workflows. Complements the structured benchmark data with anecdotal-but-specific evidence.

What this PR is NOT: a response to specific commenters, a change to the benchmark data, or a folder restructure. The discourse is ephemeral; the docs are permanent. This translates the signal into validity-boundary docs and roadmap items.

Test plan

  • Render check on the new "What this benchmark doesn't characterize" section in COMPARISON.md — verify the in-page anchor link from the operating-point callout resolves correctly
  • Render check on ROADMAP.md and FIELD-REPORTS.md
  • Verify the cross-references between COMPARISON.md, KNOWN-LIMITATIONS.md, ROADMAP.md, FIELD-REPORTS.md, and ADDING-A-MODEL.md all resolve (CI link-check should report this)
  • Confirm `check-links` workflow passes with `checked=867 missing=0`
  • 5-minute test (informal): a reader arriving via a public link should hit the operating-point framing within 30 seconds and understand what isn't measured within 90 seconds
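
The link and anchor checks in the test plan can be approximated locally before CI runs. The following is a minimal sketch, not the repository's actual `check-links` workflow: the function names `slugify`, `check_links`, and `summary` are hypothetical, the heading-to-anchor rule is a simplified approximation of GitHub's behavior, and the `checked=… missing=…` output shape is only modeled on the figure quoted above.

```python
import re
from pathlib import Path

# Inline Markdown links: [text](target). Reference-style links are ignored here.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# ATX headings: one to six '#' characters followed by the heading text.
HEADING_RE = re.compile(r"^#{1,6}\s+(.*?)\s*$", re.MULTILINE)
# Anything with a URL scheme (https:, mailto:, ...) is treated as external.
SCHEME_RE = re.compile(r"^[a-z][a-z0-9+.\-]*:", re.IGNORECASE)


def slugify(heading: str) -> str:
    """Approximate GitHub's anchor rule: lowercase, drop punctuation, spaces to hyphens."""
    text = heading.strip().lower()
    text = re.sub(r"[^\w\- ]", "", text)
    return text.replace(" ", "-")


def check_links(root: Path) -> tuple[int, list[str]]:
    """Count relative links under root and list those whose file or anchor is missing."""
    checked = 0
    missing: list[str] = []
    for md in sorted(root.rglob("*.md")):
        text = md.read_text(encoding="utf-8")
        for target in LINK_RE.findall(text):
            if SCHEME_RE.match(target):
                continue  # external links are out of scope for this sketch
            checked += 1
            path_part, _, anchor = target.partition("#")
            dest = (md.parent / path_part) if path_part else md
            if not dest.exists():
                missing.append(f"{md}: {target}")
                continue
            if anchor and dest.suffix == ".md":
                headings = HEADING_RE.findall(dest.read_text(encoding="utf-8"))
                if anchor not in {slugify(h) for h in headings}:
                    missing.append(f"{md}: {target}")
    return checked, missing


def summary(checked: int, missing: list[str]) -> str:
    """Format the result in an assumed 'checked=N missing=M' shape."""
    return f"checked={checked} missing={len(missing)}"
```

Running `print(summary(*check_links(Path("."))))` from the repository root gives a quick local signal. The count may differ from the CI figure, since this sketch skips external URLs and reference-style links.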

🤖 Generated with Claude Code

…on path

Phase 4 of docs reorganization — translates the substantive criticism from
the LocalLLaMA discussion thread into validity-boundary docs and a
contribution-pathway scaffold. No changes to data or methodology; this is
purely about making the study's scope explicit and giving external
contributors a clear way to fill the gaps it doesn't cover.

- README.md + COMPARISON.md: hoisted "operating point" callout (4-bit AWQ
  Cyankiwi on 2x RTX PRO 6000 Blackwell; other quants/VRAM/hardware not
  characterized) to entry-point visibility. Stops the "what quant?" and
  "what about my hardware?" questions before they're asked.

- COMPARISON.md: new "What this benchmark doesn't characterize" section
  before the Drilling-deeper table. Five paragraphs covering: other
  quants of the same models, other VRAM tiers, other hardware classes
  (Mac M-series), other languages (Python-only Phase 1), single-rig
  hardware variance. Pre-empts the most common substantive criticisms
  by making them part of the doc instead of letting them surface in
  threads.

- KNOWN-LIMITATIONS.md: expanded "Quantization specificity" subsection
  with a new "Cyankiwi 4-bit AWQ field reports" block. Acknowledges
  multiple practitioner reports that these specific quants underperform
  official FP8 / Unsloth UD4 of the same models. Defends the within-quant
  comparison (still informative) without defending absolute capability
  claims (not characterized at higher precisions). Commits to FP8 re-run
  as the validation pass.

- microbench-phase-b/findings.md: extended Recommended follow-ups from 3
  to 6 items; added FP8 re-run, M-series Mac sibling study, and
  language-mix expansion. Pointer to the new ROADMAP for the consolidated
  cross-doc view.

- New ROADMAP.md: consolidated open questions and contribution
  opportunities from across all findings docs. 10 prioritized active
  follow-ups with [contributor-welcome] flags on the 4 items external
  contributors can take end-to-end. Replaces the need to read 3 findings
  docs to know "where can I help?"

- tooling/ADDING-A-MODEL.md: added "Two contribution shapes" section at
  top — clarifies that "same model, different quant" (e.g. official FP8,
  Unsloth UD4) is a valid contribution path, not just adding a wholly
  new model. Currently the highest-priority external contribution per
  ROADMAP.

- New FIELD-REPORTS.md: template for collecting voluntary practitioner
  reports of model behavior on real workflows. Complements the
  structured benchmark data with anecdotal-but-specific evidence.
  Initially seeded with the format example; populates as reports come in.

What this PR is NOT: a response to specific commenters, a change to the
benchmark data, or a folder restructure. The discourse is ephemeral; the
docs are permanent. This translates the signal into validity-boundary
docs and roadmap items.

Broken-link scan: 0 broken of 867 links checked (was 0 of 831 in Phase 3; +36 valid links added).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Lightheartdevs merged commit 6fd1466 into main on May 3, 2026 (1 check passed)