|
| 1 | +# Upstream rebase plan — `elizaOS/llama.cpp` |
| 2 | + |
| 3 | +> The single page of record for "when do we rebase the fork onto a recent |
| 4 | +> upstream `ggml-org/llama.cpp`, and what does that cost." Pairs with |
| 5 | +> [`unified-fork-strategy.md`](./unified-fork-strategy.md) (which fixes the |
| 6 | +> repo / branching scheme) and [`on-device-quantization-porting-plan.md`](./on-device-quantization-porting-plan.md) |
| 7 | +> (per-technique deliverables). Read this before opening a rebase PR. |
| 8 | +
|
| 9 | +## TL;DR |
| 10 | + |
| 11 | +- **Structured output is NOT blocked.** The fork at |
| 12 | + `elizaOS/llama.cpp @ v1.0.0-eliza` (commit `08032d57`, upstream base |
| 13 | + `b8198`, ~March 2026) **already carries** `grammar_lazy`, `json_schema`, |
| 14 | + `response_format`, and `prefill_assistant` in the split `tools/server/` |
| 15 | + files (`server-task.cpp` / `server-common.cpp` / `server-context.cpp` / |
| 16 | + `server-http.cpp`). The Eliza-1 structured-output path runs on the |
| 17 | + current pin — no rebase is required for it. Anything in older docs/comments |
| 18 | + saying "the fork must be rebased to get the structured-output features" or |
| 19 | + "the fork is based on old b8198 lacking grammar_lazy" is stale; it |
| 20 | + predates the `b8198`-based fork. |
| 21 | +- **A rebase onto current upstream IS still a real, deferred effort** — a |
| 22 | + multi-engineer job with mandatory GPU + Metal hardware verification. |
| 23 | + It is *not* on the critical path for any shipping Eliza-1 feature. It is |
| 24 | + worth doing only when (a) there is a concrete upstream feature we want |
| 25 | + (e.g. a newer quant kernel, a server fix), and (b) GPU/Metal runners are |
| 26 | + available to re-verify the TurboQuant Q1_0 path on upstream's new block |
| 27 | + layout. |
| 28 | +- The `v1.0.0-eliza` tag is the kernel-complete `v0.4.0-milady`/`v0.2.0-milady` |
| 29 | + lineage tree, re-tagged for the org rename. A rebase onto newer upstream |
| 30 | + produces a new `v1.x` tag. |
| 31 | + |
| 32 | +## Why the rebase is hard: the `Q1_0` block collision |
| 33 | + |
| 34 | +The fork composes TurboQuant onto a base where: |
| 35 | + |
| 36 | +- the fork's `block_q1_0` uses `QK1_0 = 32` — the TurboQuant CUDA and Metal |
| 37 | + kernels (mmq / mmvq / vecdotq / the fused-attn path, plus the milady-kernels |
| 38 | + `.metal` shaders) are written against the 32-element block; |
| 39 | +- the fork's `block_q1_0_g128` (the 128-grouped variant) is approximately |
| 40 | + what upstream later shipped as its *new* `Q1_0`. |
| 41 | + |
| 42 | +Upstream `b9106`+ redefined `block_q1_0` with `QK1_0 = 128`. So a rebase is |
| 43 | +not a clean replay — it is a re-port: |
| 44 | + |
| 45 | +1. **Re-port TurboQuant's Q1_0 path onto upstream's 128-block design** and |
| 46 | + re-verify it on real GPU hardware (CUDA and Metal). The TurboQuant |
| 47 | + `mmq`/`mmvq`/`vecdotq` kernels and the Metal shaders all assume the |
| 48 | + 32-element layout; moving to 128 changes tiling, packing, and the |
| 49 | + dequant inner loop. The acceptance bar is bit-exact parity with the |
| 50 | + reference plus a model-backed graph smoke test, which requires `nvcc` + an |
| 51 | + NVIDIA card and an Apple-Silicon Metal box, neither of which CI has on free |
| 52 | + runners (see `unified-fork-strategy.md` §G). |
| 53 | +2. **Adapt to upstream's `ggml-metal` / `ggml-cuda` restructure.** Upstream |
| 54 | + has since split `ggml-metal.m` → `ggml-metal*.cpp` and reorganized the |
| 55 | + `ggml-cuda/` tree; the milady kernels live under |
| 56 | + `ggml/src/ggml-metal/milady-kernels/`, `ggml/src/ggml-cuda/{qjl,polarquant,turboquant,turbo-tcq}.cu(h)`, and the fused-attn `.cu`, all of which have to be re-slotted into the |
| 57 | + new layout and re-wired into the dispatcher. |
| 58 | + |
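To make the collision concrete, the sketch below shows why a 32-element and a 128-element 1-bit block are not byte-compatible. It assumes a layout of one fp16 scale followed by one sign bit per weight. That layout, the mean-absolute-value scale, and the helper names `block_bytes` and `pack_block` are illustrative assumptions, not the fork's or upstream's actual struct definitions.

```python
import struct

def block_bytes(qk: int, scale_bytes: int = 2) -> int:
    """Bytes per block under the ASSUMED layout: one scale + one bit per element."""
    assert qk % 8 == 0, "block size must be a whole number of bytes"
    return scale_bytes + qk // 8

def pack_block(weights: list[float]) -> bytes:
    """Pack one block as fp16 scale (mean |w|) followed by LSB-first sign bits.

    Hypothetical encoding, for illustration only.
    """
    qk = len(weights)
    scale = sum(abs(w) for w in weights) / qk
    bits = bytearray(qk // 8)
    for i, w in enumerate(weights):
        if w >= 0:
            bits[i // 8] |= 1 << (i % 8)
    return struct.pack("<e", scale) + bytes(bits)

# The same 128 weights, encoded with the two block sizes.
weights = [(-1.0) ** i * 0.5 for i in range(128)]
as_qk32 = b"".join(pack_block(weights[i:i + 32]) for i in range(0, 128, 32))
as_qk128 = pack_block(weights)

print(len(as_qk32), len(as_qk128))                          # 24 18
print(8 * block_bytes(32) / 32, 8 * block_bytes(128) / 128)  # 1.5 1.125
```

Under these assumptions the same tensor occupies a different number of bytes, with scales at different offsets, so a kernel that strides over 32-element blocks cannot read data written in 128-element blocks. That is why the rebase is a re-port rather than a rename.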
| 59 | +## Conflict surface (files that will fight you on rebase) |
| 60 | + |
| 61 | +- `ggml/src/ggml-common.h`, `ggml/include/ggml.h` — the milady quant-slot |
| 62 | + enums (`TBQ3_0=43`, `TBQ4_0=44`, `QJL1_256=46`, `Q4_POLAR=47`) **and** the |
| 63 | + `block_q1_0` / `block_q1_0_g128` definitions vs upstream's redefined |
| 64 | + `Q1_0`. This is the central collision. |
| 65 | +- `ggml/src/ggml-quants.c`, `ggml/src/ggml-quants.h` — quantize/dequantize |
| 66 | + rows for every milady type + the Q1_0 reference path. |
| 67 | +- `ggml/src/ggml-cuda/{mmq,convert,vecdotq,mmvq,fattn*}.cu(h)` plus the |
| 68 | + milady CUDA kernels (`qjl.cu`, `polarquant.cu`, `turboquant.cu`, |
| 69 | + `turbo-tcq.cu`, the fused-attn `.cu`). |
| 70 | +- `ggml/src/ggml-metal/ggml-metal*.cpp` + `ggml/src/ggml-metal/ggml-metal.metal` |
| 71 | + and the `ggml-metal/milady-kernels/*.metal` shaders + dispatcher entries. |
| 72 | +- `gguf-py/gguf/constants.py` — the GGUF Python type table (`TBQ3_0`, |
| 73 | + `TBQ4_0`, `QJL1_256`, `Q4_POLAR`) the converter and the `gguf_milady_apply.py` |
| 74 | + shim grep for. |
| 75 | +- `include/llama.h` — re-exported types + `llama_context_params` (the |
| 76 | + `flash_attn` bool → `flash_attn_type` enum drift bites the AOSP shim). |
| 77 | +- `tools/quantize/quantize.cpp`, `src/llama-quant.cpp`, |
| 78 | + `src/llama-model-loader.cpp` — recognizing the new ftype names + loading |
| 79 | + the milady block layouts. |
| 80 | +- `tools/server/server-{task,common,context,http}.cpp` — the structured-output |
| 81 | + surface already ported once; an upstream rebase replays it against |
| 82 | + whatever upstream's server refactor looks like at that point. (Not a |
| 83 | + blocker — just more diff to reconcile.) |
| 84 | + |
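Before resolving the enum conflict by hand, it can help to check mechanically whether upstream has claimed any of the fork's type ids. The sketch below uses the slot values listed above. The `upstream` table is entirely hypothetical, and `find_slot_collisions` is an illustrative helper, not an existing script in the repo.

```python
from enum import IntEnum

class MiladyQuantType(IntEnum):
    # Slot values as listed in the conflict surface above.
    TBQ3_0 = 43
    TBQ4_0 = 44
    QJL1_256 = 46
    Q4_POLAR = 47

def find_slot_collisions(fork_slots, upstream_slots):
    """Return {id: (fork_name, upstream_name)} for ids claimed under different names."""
    by_id = {value: name for name, value in upstream_slots.items()}
    return {
        int(t): (t.name, by_id[int(t)])
        for t in fork_slots
        if int(t) in by_id and by_id[int(t)] != t.name
    }

# HYPOTHETICAL upstream table, for illustration only; the real one would be
# read from the target tag's type enum.
upstream = {"EXAMPLE_NEW_TYPE": 44, "ANOTHER_TYPE": 45}

print(find_slot_collisions(MiladyQuantType, upstream))
# {44: ('TBQ4_0', 'EXAMPLE_NEW_TYPE')}
```

Any non-empty result means existing GGUF files written with the fork's ids would be misread after the rebase unless the fork's types are renumbered or the files are re-converted.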
| 85 | +## When to do it |
| 86 | + |
| 87 | +Trigger a rebase only when **both** are true: |
| 88 | + |
| 89 | +1. There is a concrete upstream change we want pulled in (a quant kernel, a |
| 90 | + server fix, an MXFP4/NVFP4-class addition — see `unified-fork-strategy.md` |
| 91 | + §E item 1, the only "free on rebase" win), AND |
| 92 | +2. GPU + Metal verification capacity exists (a `cuda-l4` / `rocm-gfx1100` / |
| 93 | + `apple-m3-pro` runner, or a developer with the hardware) to re-verify the |
| 94 | + TurboQuant Q1_0 path on the new 128-block layout before merge. |
| 95 | + |
| 96 | +Until then the `b8198`-based pin is the right answer: it carries every |
| 97 | +milady kernel, DFlash spec-decode, and the structured-output server surface, |
| 98 | +and it is hardware-verified at the levels recorded in |
| 99 | +`packages/inference/README.md`. |
| 100 | + |
| 101 | +## Sequencing (when it happens) |
| 102 | + |
| 103 | +1. New branch off `milady/main`; rebase onto the target upstream tag. Take |
| 104 | + the conflicts in the order of the surface list above (`ggml-common.h` / |
| 105 | + `ggml.h` first — resolving the `Q1_0` collision unblocks the rest). |
| 106 | +2. Re-port TurboQuant Q1_0 (CPU first, then CUDA, then Metal) onto upstream's |
| 107 | + 128-block layout. CPU parity (scalar + AVX2 + NEON) is the gate before |
| 108 | + touching GPU. |
| 109 | +3. Re-slot the milady CUDA + Metal kernels into upstream's restructured |
| 110 | + `ggml-cuda/` and `ggml-metal/` trees; re-wire the dispatcher. |
| 111 | +4. Re-reconcile the structured-output server patch (or confirm upstream now |
| 112 | + carries it natively and drop our copy). |
| 113 | +5. Run the full CI matrix from `unified-fork-strategy.md` §G **plus** the |
| 114 | + `kernel-verify-gpu` job. No green-GPU run, no merge. |
| 115 | +6. Tag `v1.x` (the new kernel-complete rebased tree); bump |
| 116 | + `LLAMA_CPP_TAG`/`LLAMA_CPP_COMMIT`/`REF` in `build-llama-cpp-dflash.mjs` |
| 117 | + and `compile-libllama.mjs`, the `min_llama_cpp_tag` in the training |
| 118 | + manifest emitter, and `packages/inference/AGENTS.md` / this doc / |
| 119 | + `unified-fork-strategy.md`. |
| 120 | + |
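The CPU parity gate in step 2 is a bit-exact comparison, which is stricter than numeric equality. A minimal sketch of such a check follows, assuming both the reference and the re-ported path produce float32 outputs. The helper names are hypothetical, and this is not the project's actual verification harness.

```python
import struct

def f32_bits(x: float) -> int:
    """Raw IEEE-754 float32 bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def first_mismatch(reference, candidate):
    """Index of the first element whose float32 bits differ, or None if bit-exact.

    A length mismatch is reported at the index where the shorter sequence ends.
    """
    if len(reference) != len(candidate):
        return min(len(reference), len(candidate))
    for i, (r, c) in enumerate(zip(reference, candidate)):
        if f32_bits(r) != f32_bits(c):
            return i
    return None

reference = [0.5, -0.5, 0.0]
print(first_mismatch(reference, [0.5, -0.5, 0.0]))   # None
print(first_mismatch(reference, [0.5, -0.5, -0.0]))  # 2
```

Comparing bit patterns instead of using `==` catches differences such as `0.0` versus `-0.0`, which compare equal numerically but indicate that the two code paths did not perform the same arithmetic.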
| 121 | +## See also |
| 122 | + |
| 123 | +- [`unified-fork-strategy.md`](./unified-fork-strategy.md) §A (current |
| 124 | + state), §G (CI strategy), §H (migration order). |
| 125 | +- [`on-device-quantization-porting-plan.md`](./on-device-quantization-porting-plan.md) |
| 126 | + — per-technique × per-platform status. |
| 127 | +- [`packages/inference/AGENTS.md`](../../packages/inference/AGENTS.md) — the |
| 128 | + inference contract; the fork-source paragraph points here. |
| 129 | +- [`packages/inference/README.md`](../../packages/inference/README.md) — |
| 130 | + the hardware-verification matrix that gates any kernel claim. |