Phase-by-phase sequencing for the `code` verb. Anchored on decision gates from `plan-decisions-pending.md` and surface coverage from `plan-domain-coverage.md`. Every phase exits on a measurable bar.
| field | value |
|---|---|
| status | LIVE |
| current phase | v0.1.0 → v0.1.1 (spec consolidation) |
| status flag | RESEARCH_FIRST — no GPU yet |
| last updated | 2026-05-11 |
| Aspect | State |
|---|---|
| recipe spec | docs/code-llm.md — ~150 lines, partial deltas |
| philosophy sources | papers/coding-philosophy-sources.md — pre-research draft |
| web research | 4/4 tier reports landed (A/B/C/E) |
| pending spec deltas | 5 batched items (D-018..D-022 — see decisions) |
| pending decisions | 12 open (3 critical-path) |
| code (verb tooling) | cli/hexa-forge.hexa selftest only |
| weights / training | 0/2 (RESEARCH_FIRST) |
| corpus on disk | 0 (no fetch yet) |
Two gates dominate the v0.1.x phase:
Gate G-CONSOLIDATE — close domain coverage decisions, batch-apply spec edits.
- inputs: D-014 (DB tier), D-015 (DB placement), D-017 (token target), D-018 (license downgrades), D-019 (URL fixes), D-020 (C+firmware), D-021 (hexa-fidelity), D-022 (5 NL), D-023 (NL diag carve-out)
- output: M-001 batch edit lands on `docs/code-llm.md`; M-002 patch lands on `papers/coding-philosophy-sources.md`
- exit: spec read-through is internally consistent + all tier findings reflected
Gate G-BASE — pick base weights + tokenizer + serving format.
- inputs: D-007 (base), D-008 (tokenizer), D-009 (license confirm), D-010 (eval lineage), D-011 (serving format)
- output: §REQUIRES + §FLOW + §VERIFY frozen for SFT
- exit: ready to enter v0.2.0 infrastructure phase
Gates are sequential — G-CONSOLIDATE first (cheap, paper-only),
G-BASE next (needs sampling experiments).
Goal: every accumulated decision shows up in `docs/code-llm.md`. No new research. No corpus fetch yet. Tightening only.
- Apply 5-NL §WHY block (D-022) — add line under core langs: `code-adjacent natural language: English (T0), Korean / Chinese / Russian / Japanese (T1)`.
- Apply NL diagnostic carve-out (D-023) — refusal contract line notes English-only for diagnostics per hexa-lang SPEC §7.
- Apply C + firmware (D-020) —
  - §WHY: add "systems & firmware" + "hardware reference literacy" lines
  - §STRUCT: add 2 rows (`firmware-native`, `hexa-firmware`)
  - §EVOLVE: add 4 bench rows (MCU-bench, linker-script literacy, memory-fit, hexa target gate)
  - §VERIFY: add 4 tools (`run_size`, `read_map`, `read_disasm`, `read_register`)
- Apply hexa-fidelity contract (D-021) — §VERIFY new bullet enumerating S0–S8 lint, `HX[CCCC]` codes, `@grace` 3-field, arena memory model, Z3/CVC5 ban, host-stdlib gate for `*-none-*` targets.
- Apply DB tier scope (D-014 + D-015) —
  - §WHY: add "databases & query engines" line
  - §STRUCT: add `db-native` row (T1 full + T2 quote)
  - §EVOLVE: add Spider/BIRD + EXPLAIN literacy + schema design rows
  - §VERIFY: add 4 tools (`run_query`, `explain_query`, `apply_migration`, `read_schema`)
- Re-state §STRUCT philosophy tokens (D-017) — replace `~3B tok` with `~500M-1B effective (raw ~30M + ×10 weight + ×5-10 synth)`. Note the synth requirement explicitly.
- Patch `papers/coding-philosophy-sources.md` (M-002) — apply D-018 license downgrades + D-019 URL fixes; cross-link to tier-*-findings.md as the live detail layer.
- Update TODO §1 — add "v0.1.1 consolidation" checklist; remove per-verb open-Qs from TODO since they live in the decision ledger now.
- `docs/code-llm.md` re-read end-to-end is internally consistent
- every D-NNN in `plan-decisions-pending.md` §3 maps to a paragraph in the recipe OR an entry in the ledger (no orphan decisions)
- `papers/coding-philosophy-sources.md` Tier A §1 reflects tier-a-findings corrections (Zig zen removed, ANSI SQL row removed, CERT C promoted, Google Style added)
- all CC-BY / PSF / PostgreSQL attribution requirements logged for the license-clean CI gate
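The "no orphan decisions" criterion above could be checked mechanically. The following is a minimal sketch, assuming that the recipe and the ledger both cite decisions by their literal `D-NNN` id; the function name `orphan_decisions` and its three text arguments are hypothetical and are not part of any existing tooling.

```python
import re

# Matches decision ids such as D-014 or D-023 (three digits, as used in this roadmap).
DECISION_ID = re.compile(r"\bD-\d{3}\b")


def orphan_decisions(pending_section: str, recipe_text: str, ledger_text: str) -> list:
    """Return ids listed in the pending section that appear in neither
    the recipe text nor the ledger text, sorted for stable output."""
    declared = set(DECISION_ID.findall(pending_section))
    covered = set(DECISION_ID.findall(recipe_text)) | set(DECISION_ID.findall(ledger_text))
    return sorted(declared - covered)


if __name__ == "__main__":
    pending = "D-014 DB tier\nD-022 5 NL\nD-023 NL diag carve-out"
    recipe = "§WHY: code-adjacent natural language (D-022)"
    ledger = "D-014: resolved"
    # Any id printed here has no home in the recipe or the ledger.
    print(orphan_decisions(pending, recipe, ledger))
```

An empty result would satisfy the criterion. A decision that is reflected in the recipe without its id being cited would show up as a false orphan, so the check is only as good as the citation habit.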
Out of scope: corpus fetch · base model selection · training stage execution
Cost: paper-only, single session. No GPU.
Goal: turn license claims into verified facts on disk. Sample 1-5% of each tier, hash + license-tag, measure real token counts.
- License-clean CI prototype — small Python tool that scans a directory + verifies LICENSE/SPDX header against the allowlist. Fails on GPL/AGPL/SSPL/NC. Hook into a sample fetch.
- Sampling pipeline —
  - 5% sample of Stack v2 permissive subset (D-009)
  - 100% fetch of small Tier A/B/C sources (PEPs, Rust Book, etc.)
  - canon `~/core/hexa-*` repo enumerate + token count (Tier D)
- Real tokens — feed sample through candidate tokenizer (Qwen default). Record per-source token counts → revise `plan-domain-coverage.md` §7.
- License gate alarms — flag any cell whose claimed license fails the scanner. Resolve before corpus expansion.
- Stack Exchange policy resolution (D-016) — decide pre-2024-07 Academic Torrents mirror vs full exclusion; record verdict.
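The license-clean CI prototype in the first task could start from something like the sketch below. It is an illustration under stated assumptions, not the planned tool: the allowlist contents, the 2 KB header window, and the names `classify`, `scan`, and `gate_passes` are all hypothetical. Only the fail-on-GPL/AGPL/SSPL/NC rule comes from this roadmap.

```python
import re
from pathlib import Path
from typing import Optional

# Assumption: an illustrative allowlist of SPDX ids; the real list is a project decision.
ALLOWLIST = {"MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "ISC",
             "CC-BY-4.0", "PSF-2.0", "PostgreSQL"}
# Denied families named in the roadmap: GPL, AGPL, SSPL, and any NonCommercial variant.
DENIED = re.compile(r"^(?:A?GPL|SSPL)-|-NC(?:-|$)")
SPDX_LINE = re.compile(r"SPDX-License-Identifier:\s*([A-Za-z0-9.+\-]+)")


def classify(spdx_id: Optional[str]) -> str:
    """Map an SPDX id to 'ok', 'denied', 'unknown', or 'missing'."""
    if spdx_id is None:
        return "missing"
    if DENIED.search(spdx_id):
        return "denied"
    return "ok" if spdx_id in ALLOWLIST else "unknown"


def scan(root: Path) -> dict:
    """Classify every file under root by the SPDX header in its first 2 KB."""
    results = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        with path.open("r", encoding="utf-8", errors="ignore") as fh:
            head = fh.read(2048)
        match = SPDX_LINE.search(head)
        results[path.relative_to(root).as_posix()] = classify(match.group(1) if match else None)
    return results


def gate_passes(results: dict) -> bool:
    """The gate is green only if every scanned file is on the allowlist."""
    return all(verdict == "ok" for verdict in results.values())
```

Anything not explicitly allowlisted fails the gate, so a license the scanner has never seen is treated as an alarm to resolve, in line with the "license gate alarms" task.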
- real token counts in `plan-domain-coverage.md` §7 (replace estimates)
- license-clean CI passes on the sampled corpus
- D-009 (Stack v2 perm subset) confirmed with measured volume
- D-016 (Stack Exchange) resolved
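The "sample, hash + license-tag" step from this phase's goal can be made reproducible by deriving the sampling decision from a hash of the file path rather than from a random draw. This is a minimal sketch under that assumption; `in_sample`, `manifest_entry`, and the manifest field names are hypothetical.

```python
import hashlib


def in_sample(relpath: str, rate: float) -> bool:
    """Deterministically decide whether a file belongs to the sample.

    The first 8 bytes of SHA-256(path) are mapped to a number in [0, 1);
    the file is kept when that number falls below the sampling rate.
    """
    digest = hashlib.sha256(relpath.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


def manifest_entry(relpath: str, content: bytes, license_tag: str) -> dict:
    """Record what was sampled: path, content hash, license tag, and size."""
    return {
        "path": relpath,
        "sha256": hashlib.sha256(content).hexdigest(),
        "license": license_tag,
        "bytes": len(content),
    }


if __name__ == "__main__":
    paths = [f"repo/file_{i}.py" for i in range(1000)]
    kept = [p for p in paths if in_sample(p, 0.05)]
    print(len(kept), "of", len(paths), "paths selected at a 5% rate")
```

Because the decision depends only on the path, rerunning the fetch selects the same files, and raising the rate from 1% to 5% keeps every file from the smaller sample.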
Out of scope: base model selection · training · full corpus fetch (sample only)
Cost: paper + light scripting. CPU + ~100 GB disk for samples. No GPU.
Goal: close the 5 critical-path open Qs. After this, no recipe ambiguity remains for SFT planning.
- D-007 base weights — benchmark Qwen2.5-Coder-7B vs DeepSeek-Coder-V2-Lite vs StarCoder2-15B on:
  - hexa-eval cold (zero hexa training)
  - 5-NL multilingual code comprehension
  - firmware target literacy (Cortex-M / RISC-V / ESP32)
  - DB query understanding (SQL + Cypher + Redis cmd)
  - 7B Q5/Q6 fit on M4 Mini 16GB (hardware tier check)

  Expected winner: Qwen2.5-Coder-7B (strongest CJK + multilingual + sized for M4 Mini target).
- D-008 tokenizer policy — for chosen base, decide:
  - extend BPE with hexa keywords (likely)
  - extend BPE with `@grace` / `@verify` / `@implements` / `L[*]` / `HX[CCCC]` tokens
  - retain base multilingual coverage (5 NL)
- D-011 serving format — sequence: GGUF first (broadest llama.cpp / Ollama / LM Studio reach) → MLX for M-series laptops → vLLM for server. Document conversion path.
- D-010 eval lineage storage — DuckDB recommended (cross-pattern with hexa-codex sister); decide.
- D-013 DPO scoring (preview) — sketch tree-sitter rule pack v1 for "idiomatic vs translated" detection; defer full impl to v0.2.0.
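For the hardware tier check listed under D-007, a back-of-envelope weight footprint is parameters × bits per weight ÷ 8. The sketch below assumes roughly 5.5 and 6.5 bits per weight for Q5 and Q6 quantization; real GGUF files mix quantization types per tensor, so these are approximations. The `reserve_gb` headroom for the OS, KV cache, and runtime is a hypothetical parameter, not a measured figure.

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


def fits(n_params: float, bits_per_weight: float, ram_gb: float, reserve_gb: float) -> bool:
    """True if the weights fit in RAM after setting aside reserve_gb
    for the OS, KV cache, and runtime overhead (an assumed headroom)."""
    return weight_footprint_gb(n_params, bits_per_weight) <= ram_gb - reserve_gb


if __name__ == "__main__":
    # Nominal parameter counts; the actual counts of the candidate models differ slightly.
    for label, params in [("7B", 7e9), ("15B", 15e9)]:
        for quant, bpw in [("Q5", 5.5), ("Q6", 6.5)]:
            size = weight_footprint_gb(params, bpw)
            ok = fits(params, bpw, ram_gb=16, reserve_gb=6)
            print(f"{label} {quant}: ~{size:.1f} GB weights, fits 16 GB with 6 GB reserve: {ok}")
```

This only bounds the weights. The real check in this phase is still to load the quantized model on the target machine and measure.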
- Recipe `Open questions (v0.1.0)` section in `docs/code-llm.md` reduced to zero items (or moved to v0.2.0 open Qs)
- Base weights decision recorded with measured numbers in §4 ledger
- Tokenizer manifest skeleton drafted (extension list, not yet trained)
Out of scope: training run (SFT) · full DPO pipeline impl · synth corpus generation
Cost: eval-only, requires running 3 base models locally for comparison. Single H100 or M4 Pro/Studio sufficient for eval. No training.
Existing TODO §2 — now informed by v0.1.x decisions. Mostly engineering, not research.
| infra | notes |
|---|---|
| dataset registry (`datasets.toml`) | `(verb, stage, corpus)` keyed; checksum + license tag; attribution metadata required (CC-BY / PSF / PostgreSQL) |
| tokenizer registry | per D-008; base + per-verb extensions versioned together |
| eval lineage (DuckDB per D-010) | store (model SHA, eval set SHA, config, result). Cross-pattern with hexa-codex |
| serving handoff | per D-011; GGUF first, MLX next; documented conversion CLI |
| license-clean CI gate | scanner from v0.1.2 promoted to required check; attribution validator added |
| synth pipeline | principle × idiom + linter-autofix pair generator (~5-10× expansion of Tier A/B sources) |
| DPO data pipeline | linter-driven (clippy/ruff/golangci) + optional CodeReviewer / SE pairs (D-016 outcome) |
| anti-corpus filter | pollution exclusion enforcer (medium/dev.to/hashnode post-2023, geeksforgeeks/quora/blogspot all) |
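The anti-corpus rule in the last row has two tiers: some sources are excluded outright, others only from a cutoff date onward. A minimal sketch of that rule follows. The concrete domain names, the reading of "post-2023" as 2023-01-01 onward, and the choice to exclude undated pages from the cutoff tier are all assumptions; `is_polluted` is a hypothetical name.

```python
from datetime import date
from typing import Optional
from urllib.parse import urlparse

# Assumption: domain spellings inferred from the source names in the table.
ALWAYS_EXCLUDED = ("geeksforgeeks.org", "quora.com", "blogspot.com")
EXCLUDED_FROM_CUTOFF = ("medium.com", "dev.to", "hashnode.com", "hashnode.dev")
# Assumption: "post-2023" is read as "published on or after 2023-01-01".
CUTOFF = date(2023, 1, 1)


def _matches(host: str, domains: tuple) -> bool:
    """True if host is one of the domains or a subdomain of one."""
    return any(host == d or host.endswith("." + d) for d in domains)


def is_polluted(url: str, published: Optional[date]) -> bool:
    """Decide whether a document must be kept out of the corpus."""
    host = (urlparse(url).hostname or "").lower()
    if _matches(host, ALWAYS_EXCLUDED):
        return True
    if _matches(host, EXCLUDED_FROM_CUTOFF):
        # Undated pages from these hosts are excluded conservatively.
        return published is None or published >= CUTOFF
    return False
```

Matching on the parsed hostname rather than on a substring of the URL avoids excluding unrelated hosts whose names merely contain one of the listed domains.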
- all 8 infra components have a working CLI surface
- one verb (code) runs an end-to-end synthetic SFT pass through the pipeline (small subset, no model quality bar yet)
Cost: multi-session engineering work. CI + storage budgets relevant. No training compute yet.
Existing TODO §4 unchanged in spirit; gating bars updated with v0.1.x findings.
| Gate | Bar |
|---|---|
| license audit | zero GPL/AGPL/NC/SSPL in pretrain mix (CI green) |
| HumanEval+ | ≥ DeepSeek-Coder-V2-Lite |
| hexa-eval | ≥ 80% pass |
| 5-NL eval | ≥ 70% pass cross-lang (EN/KR/CN/RU/JA) |
| DB-eval (Spider/BIRD + EXPLAIN + DDL) | ≥ 60% pass |
| firmware-eval (MCU-bench + linker) | ≥ 50% pass on Cortex-M target |
| frontend-eval (component + a11y + RSC) | ≥ 65% pass (axe 0 violations) |
| safety eval | off-domain refusal ≥ 95% (5-NL × adversarial matrix) |
| hexa-fidelity contract | 100% generated hexa code passes S0–S8 lint |
| handoff | hexa-codex serve E2E accepted |
| reproducibility | pretrain → SFT → DPO on 1× H100 box in ≤ 14 days |
| hardware tier | 7B Q5/Q6 GGUF runs on M4 Mini 16GB (inline use) |
| hexa-codex upstream contribution | ≥ 5 PRs landed in hexa-codex from forge findings; T4 empirical floor delivered for ≥ 2 F-CODEX-N falsifiers (F-CODEX-1 + F-CODEX-2 default targets — measurement window overlaps SFT/inference dev) |
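The percentage-based bars in the table above lend themselves to a simple automated check. The sketch below covers only those rows. The license audit, the relative HumanEval+ bar, the axe-violation condition on frontend-eval, the handoff, reproducibility, hardware tier, and upstream-contribution gates need their own checks. The names `MIN_PASS_RATE` and `failed_gates` are hypothetical.

```python
# Minimum pass rates (percent) copied from the gate table.
MIN_PASS_RATE = {
    "hexa-eval": 80.0,
    "5-NL eval": 70.0,
    "DB-eval": 60.0,
    "firmware-eval": 50.0,
    "frontend-eval": 65.0,
    "safety eval": 95.0,
    "hexa-fidelity contract": 100.0,
}


def failed_gates(measured: dict) -> list:
    """Return the gates whose measured pass rate is below the bar.

    A gate with no measurement counts as failed, so a missing eval run
    cannot silently pass acceptance.
    """
    return [gate for gate, bar in MIN_PASS_RATE.items()
            if measured.get(gate, 0.0) < bar]


if __name__ == "__main__":
    # Illustrative numbers only, not results.
    example = {"hexa-eval": 82.5, "5-NL eval": 69.0, "safety eval": 96.0}
    print(failed_gates(example))
```

Treating a missing measurement as a failure keeps the acceptance check strict by default.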
Out of scope: multimodal · prose chat · 14B target (laptop bar is 7B)
Cost: requires real training compute. Pre-flight all infra. 8× H100 for SFT window per §REQUIRES.
v0.1.1 spec consolidation
↓ (G-CONSOLIDATE: 9 decisions, paper-only)
v0.1.2 corpus audit + sampling
↓ (D-009 confirmed, D-016 resolved, real tokens measured)
v0.1.3 base model + tokenizer + serving
↓ (G-BASE: 5 decisions, eval-only)
v0.2.0 infrastructure
↓ (8 infra components, engineering, no training)
v1.0.0 first weights
(training compute, 13-gate acceptance)
Training compute is first required at v1.0.0. v0.1.x and v0.2.0 are paper / sampling / infra work, survivable on a laptop + sample storage.
| item | deferred to | reason |
|---|---|---|
| 14B target weights | v1.1.0+ | M4 Mini 16GB tier is 7B; 14B serves M4 Pro 24GB+ |
| Tier C T3-DB (Timescale/InfluxDB/etc) | v0.2.0 | smaller signal; revisit when ≥3 forge verbs exist |
| Z3 / CVC5 integration | NEVER | hexa-lang SPEC §10.1 — in-house prover only |
| Tracing GC support codegen | NEVER | hexa-lang SPEC §11 — permanent reject |
| LLM-judge synth | NEVER | Shumailov 2024 model-collapse risk |
| Anti-corpus from medium/dev.to post-2023 | NEVER | LLM-slop pollution |
| Korean compiler diagnostics | NEVER | hexa-lang SPEC §7 — Korean i18n permanently closed |
- Decisions ledger: `plan-decisions-pending.md`
- Coverage matrix: `plan-domain-coverage.md`
- Tier findings: `tier-a-findings.md` · `tier-b-findings.md` · `tier-c-findings.md` · `tier-e-findings.md`
- Recipe spec: `../docs/code-llm.md`
- Existing TODO buckets: `../../LEARNING_PROGRAMMING.md` §15 (absorbed TODO.md) — superseded for v0.1.x by this roadmap
- hexa-codex sister specs: see `../docs/code-llm.md` Cross-link policy