Status: RESEARCH_FIRST. Web verification of Tier A "language-native
idiom canon" sources for the philosophy stage of docs/code-llm.md.
Cross-link: papers/coding-philosophy-sources.md §1.
Method: WebFetch / WebSearch against each candidate. URLs marked
UNVERIFIED could not be confirmed live; new finds marked ★.
Token-estimate convention: rough order of magnitude. Method noted per row.
Word-to-token ratio assumed ~1.3 tok/word (English prose + code blocks).
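
As a rough illustration of the estimate convention above, the sketch below turns a whitespace word count into a token estimate using the assumed ~1.3 tok/word ratio. The helper names `estimate_tokens` and `format_estimate` are hypothetical, and a real tokenizer would give different counts, especially on code-heavy pages.

```python
# Ratio stated in the convention above (English prose + code blocks).
TOKENS_PER_WORD = 1.3


def estimate_tokens(text: str, tokens_per_word: float = TOKENS_PER_WORD) -> int:
    """Rough token estimate: whitespace-delimited word count times a fixed ratio."""
    word_count = len(text.split())
    return round(word_count * tokens_per_word)


def format_estimate(n: int) -> str:
    """Render an estimate at order-of-magnitude precision, e.g. ~250K or ~1.3M."""
    if n >= 1_000_000:
        return f"~{n / 1_000_000:.1f}M"
    if n >= 1_000:
        return f"~{n / 1_000:.0f}K"
    return f"~{n}"


if __name__ == "__main__":
    sample = "one two three four five six seven eight nine ten"
    print(format_estimate(estimate_tokens(sample)))
```

The output is only meant to place a source in the right order of magnitude, not to budget exact token counts.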
Documents migration rationale across editions. Did not fetch; mark verify-before-use.
decision points
- The Clippy lint corpus is the single largest Rust signal and is structured (rule + rationale + example), making it well suited to DPO negative/positive pairing into Tier E.
- RFCs include rejected/withdrawn ones; filter to merged RFCs only.
- The Rust Design FAQ (1.6.0 era) at doc.rust-lang.org/1.6.0/complement-design-faq.html exists but is stale (2015); include as historical only, do not bulk-train on it.
Confirmed via web search. Originally listed as "partial / quotable" — upgrade to full include. Confluence wiki was 403-blocked on fetch but the SEI public statement is unambiguous.
Native BSD-style canonical C-style. Verify exact license on the man page.
decision points
- The CERT C upgrade from "partial" to "full include w/ attribution" is the single biggest license win in Tier A: it is a large, high-quality, CC-BY corpus.
- Kernel + GNU + comp.lang.c FAQ: all quote-only. C's permissively licensed idiom canon is thinner than Rust/Go/Python; OpenBSD style(9) and CERT C are the load-bearing rows.
- No permissively licensed "C book" equivalent of "The Rust Book" exists; K&R is copyrighted by Prentice Hall. Accept the gap.
Original list assumed this exists. Web search confirms only a GitHub issue (ziglang/zig#1567) and unaffiliated zenofzig.com. Remove from Tier A or reframe as "Zig design principles extracted from docs."
Mine doc-comment strings from stdlib; canonical-idiom-by-example.
decision points
- The "Zig zen" entry in the original list is unfounded: there is no official document by that name. Either delete it or replace it with the Overview + Why-Zig pair.
- ziglang.org page licenses are not stated per-page; defer to repo (MIT). Confirm with the Zig team if doing a public release.
The original list says "ANSI rationale notes" — there is no freely-licensed ANSI/ISO SQL rationale document equivalent to C99-Rationale. Drop the row.
Postgres + SQLite together give us a permissive SQL corpus larger than any other language in Tier A. Risk: SQL-dialect skew toward these two engines. Acceptable for native-first prior since we want dialect-specific idiom.
Cross-cutting issues

| issue | affects | resolution proposal |
|---|---|---|
| Attribution-required licenses (CC-BY-*, PSF, PostgreSQL) need per-chunk provenance | Python, TS, Go, C (CERT), Postgres | Add `provenance:` front-matter to every training example; pipeline must not strip it. |
| Quote-only sources (GPL kernel, GFDL GNU, ISO docs, Oracle MySQL) cannot be bulk-included | C, SQL | Build a separate "quote-only excerpts" sub-corpus (≤ ~10% of a doc per fair-use heuristic); train as small-volume signal, not bulk. |
| "Zig zen" does not exist as an official page | Zig | Replace with learn/overview/ + learn/why_zig_rust_d_cpp/. Update papers/coding-philosophy-sources.md. |
| tc39/proposals index repo has no root LICENSE; per-proposal repos vary | TypeScript | Auto-scan each tc39/* proposal repo for LICENSE before inclusion; default-exclude on miss. |
| go.dev wiki license inheritance is implicit, not stated | Go (CodeReviewComments) | Email go-doc team or accept implicit CC-BY-4.0 via parent-domain footer. |
| PEP corpus is too large if bulk-included | Python | Pre-filter to design/style/process PEPs; skip Standards Track API PEPs unless they include rationale prose. |
| Translated/non-English idiom canon (open Q in source file §7) | all | Out of scope for v0.0 pass; revisit after English Tier A is locked. |
| LLM-generated tutorial pollution risk | all (esp. Python, JS/TS) | Strict allowlist of canonical domains (peps.python.org, doc.rust-lang.org, go.dev, ziglang.org, postgresql.org, sqlite.org, kernel.org, sei.cmu.edu); no medium.com / dev.to / blog mirrors. |
| Stale official docs (Rust 1.6 Design FAQ) | Rust | Date-gate: any doc that links to a pinned-version URL (e.g. /1.6.0/) is historical-only. |

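
The allowlist and date-gate resolutions above could be combined into a single URL gate, sketched below. This is a minimal illustration under stated assumptions, not the pipeline's actual code: the function names `is_allowed` and `classify` are hypothetical, subdomains of allowlisted domains are assumed to be acceptable, and the date-gate is simplified to inspect the document's own URL rather than the links inside it.

```python
import re
from urllib.parse import urlparse

# Canonical domains taken from the allowlist row above.
ALLOWED_DOMAINS = {
    "peps.python.org", "doc.rust-lang.org", "go.dev", "ziglang.org",
    "postgresql.org", "sqlite.org", "kernel.org", "sei.cmu.edu",
}

# Assumption: a "pinned-version URL" has a dotted version path segment,
# such as /1.6.0/ or /9.6/. Single-number segments (/16/) are not matched.
PINNED_VERSION = re.compile(r"/\d+\.\d+(?:\.\d+)?/")


def is_allowed(url: str) -> bool:
    """True if the URL's host is an allowlisted domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)


def classify(url: str) -> str:
    """Return 'exclude', 'historical-only', or 'include' for a candidate URL."""
    if not is_allowed(url):
        return "exclude"  # off-allowlist: mirrors, blogs, tutorial farms
    if PINNED_VERSION.search(urlparse(url).path):
        return "historical-only"  # date-gated: pinned to an old release
    return "include"


if __name__ == "__main__":
    for candidate in (
        "https://doc.rust-lang.org/1.6.0/complement-design-faq.html",
        "https://peps.python.org/pep-0020/",
        "https://medium.com/some-post",
    ):
        print(candidate, "->", classify(candidate))
```

URLs must carry a scheme for `urlparse` to find the host, so bare strings such as doc.rust-lang.org/1.6.0/... would need `https://` prepended first. Passing the allowlist says nothing about license status: kernel.org is allowlisted but remains quote-only.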
Token roll-up
Order-of-magnitude per-language estimate (full-include rows only, before dedup):

| language | est. tokens (full include) | notes |
|---|---|---|
| Python | ~12M-14M | dominated by HOWTOs + filtered PEP corpus + Google Style Guide |
| Rust | ~3M-6M | clippy lints + RFCs dominate |
| TypeScript | ~250K-350K | handbook + tsconfig only; TC39 proposals are license-gated and conservatively dropped |
| Go | ~600K-1.1M | go.dev blog dominates |
| C | ~500K-800K | CERT C dominates; rest is quote-only |
| Zig | ~400K-800K | stdlib doc-comments dominate |
| SQL | ~4M-6M | Postgres docs dominate |
| subtotal full-include | ~21M-30M tokens | |
| quote-only sub-corpus (kernel, ISO C99, GNU, MySQL etc.) | ~500K-1M (post-excerpt) | trained as small signal, not bulk |
| Tier A grand total | ~22M-31M tokens | |

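
The roll-up can be reproduced by summing the low and high ends of each per-language range, as in the sketch below. The figures are copied from the table; the names `ESTIMATES`, `QUOTE_ONLY`, and `add_ranges` are hypothetical. The raw sums come out slightly under the rounded table figures (about 20.75M-29.05M for the subtotal and 21.25M-30.05M for the grand total), which is within the order-of-magnitude convention used here.

```python
# (low, high) token estimates per language, full-include rows only.
ESTIMATES = {
    "Python": (12_000_000, 14_000_000),
    "Rust": (3_000_000, 6_000_000),
    "TypeScript": (250_000, 350_000),
    "Go": (600_000, 1_100_000),
    "C": (500_000, 800_000),
    "Zig": (400_000, 800_000),
    "SQL": (4_000_000, 6_000_000),
}

# Quote-only sub-corpus, post-excerpt.
QUOTE_ONLY = (500_000, 1_000_000)


def add_ranges(ranges):
    """Sum a collection of (low, high) ranges end by end."""
    ranges = list(ranges)
    low = sum(r[0] for r in ranges)
    high = sum(r[1] for r in ranges)
    return low, high


subtotal = add_ranges(ESTIMATES.values())
grand_total = add_ranges([subtotal, QUOTE_ONLY])

if __name__ == "__main__":
    print(f"subtotal:    {subtotal[0] / 1e6:.2f}M-{subtotal[1] / 1e6:.2f}M")
    print(f"grand total: {grand_total[0] / 1e6:.2f}M-{grand_total[1] / 1e6:.2f}M")
```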
Reconciliation with the 3B-tok target
docs/code-llm.md §STRUCT says philosophy = ~3B tok. Tier A even with
generous inclusion produces ~22M-31M tokens — roughly 1% of the
target. The 3B-tok number is dominated by:
- Tier D (hexa-canon, weight ×10 on a small corpus → effective volume)
- Tier B/C (engineering principles + post-mortems), to be researched next
- Repetition / weighted oversampling of Tier A high-signal docs (PEP-20, Go Proverbs, Rust API Guidelines) within the mix
Implication for sourcing: do NOT pad Tier A by lowering the quality bar.
The 3B-tok budget will come from oversampling + Tiers B/C/D, not from
admitting marginal Tier A sources. The native-first prior is taught by
quality and repetition, not by raw token volume.
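
The "roughly 1%" figure follows directly from the two totals. The short calculation below also shows how many verbatim repeats of Tier A it would take to fill the budget alone. That second number is purely illustrative, since the plan above is to fill the budget from Tiers B/C/D plus targeted oversampling, and the variable names are hypothetical.

```python
TARGET_TOKENS = 3_000_000_000            # philosophy-stage budget (~3B tok)
TIER_A_TOKENS = (22_000_000, 31_000_000)  # Tier A grand total, low and high

# Tier A as a percentage of the target.
share_pct = tuple(100 * t / TARGET_TOKENS for t in TIER_A_TOKENS)

# Repeats of the whole Tier A corpus needed to reach the target on its own.
# The smaller corpus estimate needs more repeats, so this pair runs high to low.
repeats_needed = tuple(TARGET_TOKENS / t for t in TIER_A_TOKENS)

if __name__ == "__main__":
    print(f"Tier A share of target: {share_pct[0]:.2f}%-{share_pct[1]:.2f}%")
    print(f"Repeats to fill target alone: "
          f"{repeats_needed[1]:.0f}x-{repeats_needed[0]:.0f}x")
```

At roughly 100 or more repeats, Tier A alone is clearly not a realistic way to reach the budget, which is consistent with keeping the quality bar fixed and sourcing volume elsewhere.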