elizaOS
diff --git a/.gitmodules b/.gitmodules
Lines changed: 12 additions & 10 deletions
diff --git a/docs/porting/build-matrix.md b/docs/porting/build-matrix.md
Lines changed: 14 additions & 9 deletions
diff --git a/docs/porting/dflash-drafter-strategy.md b/docs/porting/dflash-drafter-strategy.md
Lines changed: 4 additions & 2 deletions
diff --git a/docs/porting/unified-fork-strategy.md b/docs/porting/unified-fork-strategy.md
Lines changed: 5 additions & 2 deletions
diff --git a/docs/porting/upstream-rebase-plan.md b/docs/porting/upstream-rebase-plan.md
Lines changed: 130 additions & 0 deletions
diff --git a/docs/training/optimization-pipeline.md b/docs/training/optimization-pipeline.md
Lines changed: 8 additions & 5 deletions
diff --git a/packages/app-core/scripts/aosp/compile-libllama.mjs b/packages/app-core/scripts/aosp/compile-libllama.mjs
Lines changed: 8 additions & 6 deletions
diff --git a/packages/inference/AGENTS.md b/packages/inference/AGENTS.md
Lines changed: 6 additions & 3 deletions
diff --git a/packages/inference/reports/porting/2026-05-11/e2e-loop-benchmark.md b/packages/inference/reports/porting/2026-05-11/e2e-loop-benchmark.md
Lines changed: 5 additions & 3 deletions
diff --git a/packages/training/pyproject.toml b/packages/training/pyproject.toml
Lines changed: 13 additions & 13 deletions
@@ -1,14 +1,16 @@
 [submodule "packages/inference/llama.cpp"]
+	# The single canonical llama.cpp checkout for the whole repo. This is the
+	# elizaOS/llama.cpp fork (@ v1.0.0-eliza, commit 08032d57): the unified
+	# fork with the milady kernels (Q4_POLAR / QJL1_256 / TBQ4_0 / TBQ3_0
+	# GGML types + Metal/Vulkan/CUDA kernels) and DFlash spec-decode. The
+	# host build (build-llama-cpp-dflash.mjs) + AOSP cross-compile
+	# (aosp/compile-libllama.mjs) default to this submodule; bun's postinstall
+	# (scripts/ensure-llama-cpp-submodule.mjs) initializes it. The fork is
+	# itself a llama.cpp fork, so it carries convert_hf_to_gguf.py /
+	# llama-quantize / llama-cli too — the training pipeline's plain Q4_K_M
+	# GGUF path uses the fork's tooling (there is no separate "stock upstream"
+	# submodule). build/ is gitignored by llama.cpp's own .gitignore so only
+	# the gitlink (commit SHA) is tracked.
 	path = packages/inference/llama.cpp
 	url = https://github.com/elizaOS/llama.cpp.git
 	branch = eliza/main
-[submodule "packages/training/vendor/llama.cpp"]
-	# Stock upstream llama.cpp pinned to a release tag (b6650). Used by the
-	# training pipeline's plain GGUF Q4_K_M path (convert_hf_to_gguf.py +
-	# llama-quantize + llama-cli). The Milady fork — Q4_POLAR/QJL1_256/TBQ
-	# GGML types — is the *other* submodule (packages/inference/llama.cpp).
-	# scripts/vendor_llama_cpp.sh inits + builds this; build/ is gitignored
-	# by llama.cpp's own .gitignore so only the gitlink (commit SHA) is tracked.
-	path = packages/training/vendor/llama.cpp
-	url = https://github.com/ggml-org/llama.cpp.git
-	shallow = true
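
After this hunk, `.gitmodules` should declare exactly one llama.cpp submodule. The following is a minimal sketch of a check for that, assuming a small hand-rolled parser is good enough. The names `parse_gitmodules` and `llama_cpp_submodules` are hypothetical and not part of the repo; a real check could instead shell out to `git config -f .gitmodules --list`.

```python
import re

# Matches a section header such as: [submodule "packages/inference/llama.cpp"]
SECTION_RE = re.compile(r'^\[submodule "(?P<name>[^"]+)"\]$')


def parse_gitmodules(text):
    """Parse .gitmodules text into {submodule name: {key: value}}."""
    modules = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comment lines
        match = SECTION_RE.match(line)
        if match:
            current = modules.setdefault(match.group("name"), {})
        elif current is not None and "=" in line:
            key, _, value = line.partition("=")
            current[key.strip()] = value.strip()
    return modules


def llama_cpp_submodules(modules):
    """Return the names of submodules whose path ends in llama.cpp."""
    return sorted(
        name
        for name, cfg in modules.items()
        if cfg.get("path", "").endswith("llama.cpp")
    )


# Sample mirroring the post-change file (comment shortened for brevity).
SAMPLE = """
[submodule "packages/inference/llama.cpp"]
	# The single canonical llama.cpp checkout for the whole repo.
	path = packages/inference/llama.cpp
	url = https://github.com/elizaOS/llama.cpp.git
	branch = eliza/main
"""

modules = parse_gitmodules(SAMPLE)
assert llama_cpp_submodules(modules) == ["packages/inference/llama.cpp"]
```

Run against the pre-change file, the same check would report two entries, which is the duplication this hunk removes.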
 
@@ -3,10 +3,11 @@
 > Per-cell status of every (platform, ABI, GPU-backend) combination
 > that ships a Milady on-device runtime artifact. The unified fork
 > ([`elizaOS/llama.cpp`](https://github.com/elizaOS/llama.cpp) @
-> `v0.3.0-milady`) is the authoritative source; per-cell artifacts
-> live under `~/.eliza/local-inference/bin/<target>/` for host
-> targets and under `apps/app/android/app/src/main/assets/agent/<abi>/`
-> for AOSP. See
+> `v1.0.0-eliza`, commit `08032d57`) is the authoritative source and
+> ships in-tree as the git submodule at `packages/inference/llama.cpp`;
+> per-cell artifacts live under `~/.eliza/local-inference/bin/<target>/`
+> for host targets and under
+> `apps/app/android/app/src/main/assets/agent/<abi>/` for AOSP. See
 > [`docs/porting/unified-fork-strategy.md`](./unified-fork-strategy.md)
 > for the per-technique branching scheme that produces these
 > artifacts,
@@ -35,11 +36,15 @@
 
 The verification commands assume:
 
-- `MILADY_LLAMA_CPP_REMOTE=https://github.com/elizaOS/llama.cpp` and
-  `MILADY_LLAMA_CPP_REF=v0.3.0-milady` (the W3-B fused-CPU release).
-- `~/.cache/milady-llama-cpp/<commit>` is the canonical checkout
-  cache used by `compile-libllama.mjs` (AOSP) and
-  `build-llama-cpp-dflash.mjs` (host).
+- The fork checkout is the in-repo submodule `packages/inference/llama.cpp`
+  (`elizaOS/llama.cpp @ v1.0.0-eliza`, commit `08032d57`) — `bun install`
+  inits it via `scripts/ensure-llama-cpp-submodule.mjs`. Both build scripts
+  (`compile-libllama.mjs` AOSP, `build-llama-cpp-dflash.mjs` host) default to
+  it; `ELIZA_DFLASH_LLAMA_CPP_REMOTE` / `_REF` (or `--cache-dir` / `--src-dir`)
+  force a standalone clone at `~/.cache/eliza-dflash/eliza-llama-cpp` instead.
+- Older example invocations below using a `~/.cache/...llama-cpp-v0.1.0`
+  directory name are illustrative of a standalone-clone layout; the current
+  default is the submodule path above.
 
 Symbols listed under "Expected exported symbols" are the Milady-side
 additions on top of stock llama.cpp; the upstream `llama_*` /
 
@@ -107,8 +107,10 @@ against the new checkpoint and re-stamped (training/AGENTS.md §2).
 
 ## DFlash vs. the Alternatives
 
-The fork (`~/.cache/eliza-dflash/milady-llama-cpp`) exposes several
-speculative paths via `--spec-type`: `draft` (vanilla draft model),
+The fork (the in-repo submodule `packages/inference/llama.cpp`, or the
+standalone clone at `~/.cache/eliza-dflash/eliza-llama-cpp` when the build
+scripts' override forces one) exposes several speculative paths via
+`--spec-type`: `draft` (vanilla draft model),
 `dflash` (the spiritbuun-branded draft path — *functionally identical to
 `draft`*, it just preserves the AOSP CLI spelling; see
 `common/speculative.cpp`), `eagle3`, and a family of `ngram_*` paths
 
@@ -28,9 +28,12 @@ retired. Stock desktop still runs `node-llama-cpp@3.18.1` — that's the remaini
 non-unified consumer; see §F for the migration plan. (`v1.0.0-eliza` is the same
 tree as the prior `v0.4.0-milady` / `v0.2.0-milady`-lineage tags, re-tagged on
 the elizaOS rename. A full rebase onto a recent upstream llama.cpp remains a
-follow-up — the conflict-prone surfaces are the quant-slot enums in
+deferred follow-up — **not** a blocker for any shipping feature; the b8198 base
+already carries `grammar_lazy` / `json_schema` / `response_format` /
+`prefill_assistant`. The conflict-prone surfaces are the quant-slot enums in
 `ggml-common.h` / `ggml.h` and upstream's incompatible redefinition of the
-`Q1_0` block layout.)
+`Q1_0` block layout — see [`upstream-rebase-plan.md`](./upstream-rebase-plan.md)
+for the full cost, conflict surface, trigger conditions, and sequencing.)
 
 **Original problem (resolved for the AOSP+host paths, kept for context):**
 Milady previously built against three different llama.cpp trees and a
 
@@ -0,0 +1,130 @@
+# Upstream rebase plan — `elizaOS/llama.cpp`
+
+> The single page of record for "when do we rebase the fork onto a recent
+> upstream `ggml-org/llama.cpp`, and what does that cost." Pairs with
+> [`unified-fork-strategy.md`](./unified-fork-strategy.md) (which fixes the
+> repo / branching scheme) and [`on-device-quantization-porting-plan.md`](./on-device-quantization-porting-plan.md)
+> (per-technique deliverables). Read this before opening a rebase PR.
+
+## TL;DR
+
+- **Structured output is NOT blocked.** The fork at
+  `elizaOS/llama.cpp @ v1.0.0-eliza` (commit `08032d57`, upstream base
+  `b8198`, ~March 2026) **already carries** `grammar_lazy`, `json_schema`,
+  `response_format`, and `prefill_assistant` in the split `tools/server/`
+  files (`server-task.cpp` / `server-common.cpp` / `server-context.cpp` /
+  `server-http.cpp`). The Eliza-1 structured-output path runs on the
+  current pin — no rebase is required for it. Anything in older docs/comments
+  saying "the fork must be rebased to get the structured-output features" or
+  "the fork is based on old b8198 lacking grammar_lazy" is stale; it
+  predates the `b8198`-based fork.
+- **A rebase onto current upstream IS still a real, deferred effort** — a
+  multi-engineer job with mandatory GPU + Metal hardware verification.
+  It is *not* on the critical path for any shipping Eliza-1 feature. It is
+  worth doing only when (a) there is a concrete upstream feature we want
+  (e.g. a newer quant kernel, a server fix), and (b) GPU/Metal runners are
+  available to re-verify the TurboQuant Q1_0 path on upstream's new block
+  layout.
+- The `v1.0.0-eliza` tag = the kernel-complete `v0.4.0-milady`/`v0.2.0-milady`
+  lineage tree, re-tagged for the org rename. A real newer rebase produces a
+  new `v1.x` tag.
+
+## Why the rebase is hard: the `Q1_0` block collision
+
+The fork composes TurboQuant onto a base where:
+
+- the fork's `block_q1_0` uses `QK1_0 = 32` — the TurboQuant CUDA and Metal
+  kernels (mmq / mmvq / vecdotq / the fused-attn path, plus the milady-kernels
+  `.metal` shaders) are written against the 32-element block;
+- the fork's `block_q1_0_g128` (the 128-grouped variant) is approximately
+  what upstream later shipped as its *new* `Q1_0`.
+
+Upstream `b9106`+ redefined `block_q1_0` with `QK1_0 = 128`. So a rebase is
+not a clean replay — it is a re-port:
+
+1. **Re-port TurboQuant's Q1_0 path onto upstream's 128-block design** and
+   re-verify it on real GPU hardware (CUDA and Metal). The TurboQuant
+   `mmq`/`mmvq`/`vecdotq` kernels and the Metal shaders all assume the
+   32-element layout; moving to 128 changes tiling, packing, and the
+   dequant inner loop. Bit-exact parity vs the reference + a model-backed
+   graph smoke is the acceptance bar — and that requires `nvcc` + an NVIDIA
+   card and an Apple-Silicon Metal box, neither of which CI has on free
+   runners (see `unified-fork-strategy.md` §G).
+2. **Adapt to upstream's `ggml-metal` / `ggml-cuda` restructure.** Upstream
+   has since split `ggml-metal.m` → `ggml-metal*.cpp` and reorganized the
+   `ggml-cuda/` tree; the milady kernels live under
+   `ggml/src/ggml-metal/milady-kernels/` and `ggml/src/ggml-cuda/{qjl,polarquant,turboquant,turbo-tcq}.cu(h)` and the fused-attn `.cu`, all of which have to be re-slotted into the
+   new layout and re-wired into the dispatcher.
+
+## Conflict surface (files that will fight you on rebase)
+
+- `ggml/src/ggml-common.h`, `ggml/include/ggml.h` — the milady quant-slot
+  enums (`TBQ3_0=43`, `TBQ4_0=44`, `QJL1_256=46`, `Q4_POLAR=47`) **and** the
+  `block_q1_0` / `block_q1_0_g128` definitions vs upstream's redefined
+  `Q1_0`. This is the central collision.
+- `ggml/src/ggml-quants.c`, `ggml/src/ggml-quants.h` — quantize/dequantize
+  rows for every milady type + the Q1_0 reference path.
+- `ggml/src/ggml-cuda/{mmq,convert,vecdotq,mmvq,fattn*}.cu(h)` plus the
+  milady CUDA kernels (`qjl.cu`, `polarquant.cu`, `turboquant.cu`,
+  `turbo-tcq.cu`, the fused-attn `.cu`).
+- `ggml/src/ggml-metal/ggml-metal*.cpp` + `ggml/src/ggml-metal/ggml-metal.metal`
+  and the `ggml-metal/milady-kernels/*.metal` shaders + dispatcher entries.
+- `gguf-py/gguf/constants.py` — the GGUF Python type table (`TBQ3_0`,
+  `TBQ4_0`, `QJL1_256`, `Q4_POLAR`) the converter and the `gguf_milady_apply.py`
+  shim grep for.
+- `include/llama.h` — re-exported types + `llama_context_params` (the
+  `flash_attn` bool → `flash_attn_type` enum drift bites the AOSP shim).
+- `tools/quantize/quantize.cpp`, `src/llama-quant.cpp`,
+  `src/llama-model-loader.cpp` — recognizing the new ftype names + loading
+  the milady block layouts.
+- `tools/server/server-{task,common,context,http}.cpp` — the structured-output
+  surface already ported once; an upstream rebase replays it against
+  whatever upstream's server refactor looks like at that point. (Not a
+  blocker — just more diff to reconcile.)
+
+## When to do it
+
+Trigger a rebase only when **both** are true:
+
+1. There is a concrete upstream change we want pulled in (a quant kernel, a
+   server fix, an MXFP4/NVFP4-class addition — see `unified-fork-strategy.md`
+   §E item 1, the only "free on rebase" win), AND
+2. GPU + Metal verification capacity exists (a `cuda-l4` / `rocm-gfx1100` /
+   `apple-m3-pro` runner, or a developer with the hardware) to re-verify the
+   TurboQuant Q1_0 path on the new 128-block layout before merge.
+
+Until then the `b8198`-based pin is the right answer: it carries every
+milady kernel, DFlash spec-decode, and the structured-output server surface,
+and it is hardware-verified at the levels recorded in
+`packages/inference/README.md`.
+
+## Sequencing (when it happens)
+
+1. New branch off `milady/main`; rebase onto the target upstream tag. Take
+   the conflicts in the order of the surface list above (`ggml-common.h` /
+   `ggml.h` first — resolving the `Q1_0` collision unblocks the rest).
+2. Re-port TurboQuant Q1_0 (CPU first, then CUDA, then Metal) onto upstream's
+   128-block layout. CPU parity (scalar + AVX2 + NEON) is the gate before
+   touching GPU.
+3. Re-slot the milady CUDA + Metal kernels into upstream's restructured
+   `ggml-cuda/` and `ggml-metal/` trees; re-wire the dispatcher.
+4. Re-reconcile the structured-output server patch (or confirm upstream now
+   carries it natively and drop our copy).
+5. Run the full CI matrix from `unified-fork-strategy.md` §G **plus** the
+   `kernel-verify-gpu` job. No green-GPU run, no merge.
+6. Tag `v1.x` (the new kernel-complete rebased tree); bump
+   `LLAMA_CPP_TAG`/`LLAMA_CPP_COMMIT`/`REF` in `build-llama-cpp-dflash.mjs`
+   and `compile-libllama.mjs`, the `min_llama_cpp_tag` in the training
+   manifest emitter, and `packages/inference/AGENTS.md` / this doc /
+   `unified-fork-strategy.md`.
+
+## See also
+
+- [`unified-fork-strategy.md`](./unified-fork-strategy.md) §A (current
+  state), §G (CI strategy), §H (migration order).
+- [`on-device-quantization-porting-plan.md`](./on-device-quantization-porting-plan.md)
+  — per-technique × per-platform status.
+- [`packages/inference/AGENTS.md`](../../packages/inference/AGENTS.md) — the
+  inference contract; the fork-source paragraph points here.
+- [`packages/inference/README.md`](../../packages/inference/README.md) —
+  the hardware-verification matrix that gates any kernel claim.
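
The `Q1_0` block collision described in the plan above comes down to block arithmetic. Here is a rough sketch, assuming (the plan does not confirm this) that each block stores one bit per weight plus a single 2-byte fp16 scale. Only the block sizes 32 and 128 come from the text; `block_bytes` and `row_bytes` are hypothetical helpers.

```python
SCALE_BYTES = 2  # ASSUMPTION: one fp16 scale per block; not stated in the plan


def block_bytes(qk):
    """Bytes per block: the assumed scale plus one bit per weight."""
    if qk % 8 != 0:
        raise ValueError("block size must be a multiple of 8")
    return SCALE_BYTES + qk // 8


def row_bytes(n_weights, qk):
    """Bytes needed to store one row of n_weights at block size qk."""
    if n_weights % qk != 0:
        raise ValueError("row length must be a multiple of the block size")
    return (n_weights // qk) * block_bytes(qk)


N = 4096  # illustrative row length, not taken from any model in the plan
fork_layout = row_bytes(N, 32)       # fork: QK1_0 = 32
upstream_layout = row_bytes(N, 128)  # upstream: QK1_0 = 128

print(fork_layout, upstream_layout)
```

Under these assumptions the same row has a different block count and a different byte stride in each layout, so a kernel that indexes 32-element blocks cannot read a 128-element tensor without being rewritten. That matches the plan's point that the rebase is a re-port rather than a replay.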
@@ -106,7 +106,6 @@ consumers know the V-cache config falls back to the framework default.
 
 ```bash
 HF_TOKEN=hf_xxx \
-LLAMA_CPP_DIR=$HOME/src/milady-llama.cpp \
 uv run python scripts/optimize_for_milady.py \
     --base-model elizaos/eliza-1-lite-0_6b \
     --output-dir checkpoints/eliza-1-lite \
@@ -120,10 +119,14 @@ Production runs need:
 
 - A GPU with CUDA for the TurboQuant calibration pass and (optionally)
   the QJL CUDA kernel build under `scripts/quantization/qjl/csrc/`.
-- A local checkout of `elizaOS/llama.cpp` at tag `v1.0.0-eliza` (the in-repo submodule `packages/inference/llama.cpp` already provides this)
-  (commit `08032d57e15574f2a7ca19fc3f29510c8673d590`) at
-  `$LLAMA_CPP_DIR`. The fork is the only place `convert_hf_to_gguf.py`
-  understands `--outtype q4_polar`.
+- A checkout of `elizaOS/llama.cpp` at tag `v1.0.0-eliza` (commit
+  `08032d57e15574f2a7ca19fc3f29510c8673d590`). The `packages/inference/llama.cpp`
+  submodule already provides this (`bun install` inits it via
+  `scripts/ensure-llama-cpp-submodule.mjs`), or a standalone clone at
+  `~/.cache/eliza-dflash/eliza-llama-cpp` when the build scripts' override
+  forces one; set `$LLAMA_CPP_DIR` to point at a different checkout. The
+  fork is the only place `convert_hf_to_gguf.py` understands
+  `--outtype q4_polar`.
 - An `HF_TOKEN` (or `HUGGINGFACE_HUB_TOKEN`) with write access to the
   `elizaos` HF org.
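
The production requirements above can be checked before a long run starts. This is a minimal preflight sketch, assuming `$LLAMA_CPP_DIR` takes precedence over the submodule path and `HF_TOKEN` is consulted before `HUGGINGFACE_HUB_TOKEN`; both orderings are assumptions. The helper names are hypothetical and this is not the logic of `optimize_for_milady.py`.

```python
import os
from pathlib import Path

# Submodule location named in the requirements list.
SUBMODULE_PATH = Path("packages/inference/llama.cpp")


def find_hf_token(env):
    """Return the first non-empty Hugging Face token variable, or None."""
    for name in ("HF_TOKEN", "HUGGINGFACE_HUB_TOKEN"):
        value = env.get(name, "").strip()
        if value:
            return value
    return None


def resolve_llama_cpp_dir(env, repo_root):
    """Prefer $LLAMA_CPP_DIR when set, else fall back to the submodule."""
    override = env.get("LLAMA_CPP_DIR", "").strip()
    if override:
        return Path(override).expanduser()
    return Path(repo_root) / SUBMODULE_PATH


def preflight(env, repo_root):
    """Collect human-readable problems; an empty list means ready to run."""
    problems = []
    if find_hf_token(env) is None:
        problems.append("set HF_TOKEN or HUGGINGFACE_HUB_TOKEN")
    converter = resolve_llama_cpp_dir(env, repo_root) / "convert_hf_to_gguf.py"
    if not converter.is_file():
        problems.append(f"missing {converter}")
    return problems


if __name__ == "__main__":
    for problem in preflight(os.environ, Path.cwd()):
        print("preflight:", problem)
```

The sketch does not verify the checked-out tag or commit, and it does not test for a CUDA device; those would need `git` and a GPU runtime respectively.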
 
 
@@ -34,14 +34,16 @@
 //
 // llama.cpp pin (matches plugins/plugin-aosp-local-inference/src/aosp-llama-adapter.ts):
 //   fork:   https://github.com/elizaOS/llama.cpp
-//   tag:    v0.4.0-milady          (milady/integration HEAD)
+//   tag:    v1.0.0-eliza           (the kernel-complete v0.4.0-milady tree,
+//                                   re-tagged on the elizaOS org rename)
 //   commit: 08032d57e15574f2a7ca19fc3f29510c8673d590
 //
-//   v0.4.0-milady adds W4-B CUDA QJL + PolarQuant Q4 + TBQ3_TCQ kernels
-//   on top of v0.3.0-milady. The CUDA paths only matter for the
-//   linux-x64-cuda host target (the AOSP arm64 path stays CPU-only),
-//   but the pin is shared so both AOSP and host build paths land on
-//   identical kernel sources.
+//   This tree adds the W4-B CUDA QJL + PolarQuant Q4 + TBQ3_TCQ kernels
+//   on top of the earlier milady-lineage tags. The CUDA paths only matter
+//   for the linux-x64-cuda host target (the AOSP arm64 path stays
+//   CPU-only), but the pin is shared so both AOSP and host build paths
+//   land on identical kernel sources. A rebase onto a newer upstream is a
+//   deferred effort — see docs/porting/upstream-rebase-plan.md.
 //
 //   v0.2.0-milady (subset of this pin) added DFlash speculative decoding
 //   CLI surface (--spec-type dflash, --draft-min-prob alias, n_drafted_total
 
@@ -25,9 +25,12 @@ upstream b8198. Both build paths consume it: `build-llama-cpp-dflash.mjs`
 the submodule checkout. `ELIZA_DFLASH_LLAMA_CPP_REMOTE` / `_REF` (or `--cache-dir`
 / `--src-dir`) still force a standalone clone for fork bisects. (`v1.0.0-eliza` is
 the same tree as the prior `v0.4.0-milady` tag, re-tagged on the elizaOS rename. A
-full rebase onto a recent upstream llama.cpp remains a follow-up — the
-conflict-prone files are the quant-slot enums in `ggml-common.h` / `ggml.h` and the
-`Q1_0` block layout, which upstream redefined incompatibly with the fork's.)
+full rebase onto a recent upstream llama.cpp remains a **deferred** follow-up — not
+a blocker for structured output (the b8198 base already has `grammar_lazy` /
+`json_schema` / `response_format` / `prefill_assistant`); the conflict-prone files
+are the quant-slot enums in `ggml-common.h` / `ggml.h` and the `Q1_0` block layout,
+which upstream redefined incompatibly with the fork's. Full cost / conflict surface
+/ trigger conditions: [`docs/porting/upstream-rebase-plan.md`](../../docs/porting/upstream-rebase-plan.md).)
 
 ---
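
The checkout selection described in the paragraph above can be sketched as a small decision function. This is an illustration in Python rather than the build scripts' own JavaScript, and it assumes `_REF` expands to `ELIZA_DFLASH_LLAMA_CPP_REF` and that setting either variable is enough to force the standalone clone. `pick_checkout` is a hypothetical name, and the `--cache-dir` / `--src-dir` flags are left out.

```python
from pathlib import Path

SUBMODULE = Path("packages/inference/llama.cpp")
STANDALONE = Path(".cache/eliza-dflash/eliza-llama-cpp")
# ASSUMPTION: `_REF` in the docs is shorthand for the second name below.
OVERRIDE_VARS = ("ELIZA_DFLASH_LLAMA_CPP_REMOTE", "ELIZA_DFLASH_LLAMA_CPP_REF")


def pick_checkout(env, repo_root, home):
    """Return the llama.cpp source directory a build would use."""
    if any(env.get(name) for name in OVERRIDE_VARS):
        # An override was requested: use the standalone clone location.
        return Path(home) / STANDALONE
    # Default: the in-repo submodule.
    return Path(repo_root) / SUBMODULE


default = pick_checkout({}, "/repo", "/home/dev")
bisect = pick_checkout(
    {"ELIZA_DFLASH_LLAMA_CPP_REF": "some-ref"}, "/repo", "/home/dev"
)
print(default)
print(bisect)
```

Passing `env`, `repo_root`, and `home` explicitly keeps the function free of global state, so both branches can be exercised without touching the real environment.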
 
 
@@ -222,9 +222,11 @@ targets).
   the generated text is off-topic (e.g. LaTeX) — the loop still exercises the
   full decode + DFlash + TTS path correctly; quality is a v2 (fine-tune)
   concern.
-- The DFlash drafter is a real GGUF but ≈ a copy of the target, so acceptance
-  ≈ 1.0; this is the right *shape* but not a meaningful acceptance number until
-  a trained drafter ships.
+- The DFlash drafter is a real GGUF but ≈ a copy of the target. In the e2e
+  bench (short n_predict, in-server DFlash loop) acceptance lands ~0.89–1.0; in
+  the standalone `llama-speculative-simple` eval (`-n 48`, `--draft-min/max
+  2/6`) it lands 0.87 (0.6B) / 0.55 (1.7B) — high-variance numbers off a
+  near-copy drafter, the right *shape* but not a trained-drafter figure.
 - The ASR GGUF is stand-in quality → round-trip WER ≈ 1.0. Recorded honestly.
 - Server peak RSS exceeds the manifest budget on both tiers because the fused
   process keeps every voice region resident — this is a real publish blocker
 
@@ -67,20 +67,20 @@ train = [
   # from ~2k to ~8k on the same 16 GB budget.
   "liger-kernel>=0.5.0",
   # GGUF Q4_K_M quantization (scripts/quantization/gguf-q4_k_m_apply.py).
-  # The wrapper prefers a vendored stock llama.cpp checkout under
-  # packages/training/vendor/llama.cpp (run scripts/vendor_llama_cpp.sh),
-  # but the llama-cpp-python wheel also ships a usable `gguf` python module
-  # so the HF→f16 GGUF convert step works without the vendored checkout.
-  # The Q4_K_M *quantize* step still needs the `llama-quantize` binary the
-  # vendor script builds. NOTE: the custom GGML types Q4_POLAR/QJL1_256/
-  # TurboQuant are NOT in this wheel — those need the elizaOS/llama.cpp
-  # fork via $LLAMA_CPP_DIR.
+  # The wrapper uses the in-repo llama.cpp fork submodule at
+  # packages/inference/llama.cpp (its convert_hf_to_gguf.py + a one-shot
+  # CPU cmake build of llama-quantize/llama-cli — see the script's
+  # _VENDOR_HINT), but the llama-cpp-python wheel also ships a usable
+  # `gguf` python module so the HF→f16 GGUF convert step works even
+  # without that build. The Q4_K_M *quantize* step still needs a real
+  # `llama-quantize` binary. NOTE: the custom GGML types Q4_POLAR/QJL1_256/
+  # TurboQuant are only in the fork — same submodule, or $LLAMA_CPP_DIR.
   "llama-cpp-python>=0.3.0",
-  # convert_hf_to_gguf.py (vendored stock llama.cpp) imports `gguf` AND
-  # `mistral_common` at module load — both must be installed or the
-  # HF→GGUF convert step dies on import. The vendor script also installs
-  # these from requirements/requirements-convert_hf_to_gguf.txt, but pin
-  # them here so `uv run --extra train` works standalone.
+  # convert_hf_to_gguf.py imports `gguf` AND `mistral_common` at module
+  # load — both must be installed or the HF→GGUF convert step dies on
+  # import. The fork's requirements/requirements-convert_hf_to_gguf.txt
+  # also installs these, but pin them here so `uv run --extra train`
+  # works standalone.
   "gguf>=0.10.0",
   "mistral_common>=1.8.3",
   # pytest is consumed by the pre-flight gate (scripts/preflight.sh) and