feat(llama-cpp-localai-paged): paged KV cache llama.cpp backend + cross-request prefix sharing + GB10 decode optimization [WIP] by localai-bot · Pull Request #10462 · mudler/LocalAI

localai-bot · 2026-06-23T09:05:26Z

Status: draft / WIP - opened to track ongoing GB10 enterprise-serving work. Large branch (kernel experiments + analysis + the shippable feature); will be curated before any merge.

What this is

Vendored, opt-in paged KV cache + cross-request prefix sharing for the llama.cpp backend, plus GB10 (consumer Blackwell, sm_121) decode-path optimization and the supporting analysis. All paged behaviour is gated by LLAMA_KV_PAGED (env) / the kv_paged server option and is off by default - stock builds are byte-identical.

Shippable feature pieces

backend/cpp/llama-cpp/patches/paged/0001-0011 - vendored llama.cpp patch series, applied behind the LLAMA_PAGED build flag (patches/paged/, default on; LLAMA_PAGED=off gives a clean upstream checkout). Isolated in prepare.sh + Makefile with a sentinel guard against double-apply.
grpc-server.cpp - kv_paged per-server option (0005) + cross-request prefix share wired into update_slots (0008).
core/backend/hardware_defaults.go, pkg/xsysinfo/gpu.go - hardware-aware default consolidation.

Key results (measured on DGX Spark / GB10, Qwen3-32B NVFP4)

Prefix sharing (RAG / system-prompt fan-out): the cross-request cache reaches the server - concurrent shared-prefix requests skip recompute, 15-25x burst-wall prefill reduction (K=16/32), reuse confirmed (ref_cnt=K, suffix-only prefill).
Decode kernel: an in-kernel block-table read (0009) eliminates the per-step gather regression; routing the GQA-grouped tile kernel by default (0011) brings paged decode to stock parity (within 1.8%), with the gap growing to -6.1% at 16k ctx. Output is token-correct (CPU byte-identical, GPU within the CUDA batch-shape non-determinism band).
Honest framing: the earlier "6x decode gap vs vLLM" was a measurement artifact (the pre-0009 gather regression + a warmup/server number). The real steady-state decode is GPU-bound at the LPDDR5x bandwidth wall (~50% irreducible weight-read floor), ~1.2x from vLLM at the kernel level. Remaining throughput headroom is at the scheduler/serving layer, not the kernel.

Analysis docs live under backend/cpp/llama-cpp/patches/paged/*.md and backend/cpp/llama-cpp/paged/*.md.

Not for merge as-is

This branch also contains banked W4A16/Marlin kernel experiments and NVFP4/MXFP4 quality analysis that informed the direction but are not part of the feature. Those will be dropped/split before merge.

Host-side paged-attention block manager ported faithfully from vLLM V1 (block_pool.py, kv_cache_utils.py, single_type_kv_cache_manager.py):
- KVCacheBlock + intrusive LRU FreeBlockQueue (O(1) middle removal)
- BlockPool: get_new_blocks / touch / free_blocks eviction ordering / cache_full_blocks / lazy eviction on reuse
- PagedKVManager: on-demand allocate, block_table, slot arithmetic (slot = block_id*block_size + offset), free
- Prefix caching: chained block hashing + find_longest_cache_hit (first-miss stop), enabling automatic cross-tenant prefix sharing
Pure C++17, zero ggml/llama.cpp dependency, unit-tested to vLLM behavioral parity (4/4 suites green). Parity is on algorithm/behavior, not hash bytes.
Phase 0 of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md. Phases 1-5 (ggml storage, gather-to-scratch read path, Gate 0 correctness, benchmark wins, prefix-share serving) follow.
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
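
The slot arithmetic and the chained-hash prefix lookup described in this commit can be illustrated with a minimal Python sketch. It is not the vendored C++ manager: the block size of 16, the SHA-256 hash, and every function name except find_longest_cache_hit are assumptions made for illustration (the commit itself notes parity is on behavior, not hash bytes).

```python
import hashlib
from typing import Dict, List

BLOCK_SIZE = 16  # assumed block size, for illustration only


def slot_for(block_table: List[int], token_index: int,
             block_size: int = BLOCK_SIZE) -> int:
    """Physical slot of a logical token: block_id * block_size + offset."""
    block_id = block_table[token_index // block_size]
    return block_id * block_size + token_index % block_size


def chained_block_hashes(tokens: List[int],
                         block_size: int = BLOCK_SIZE) -> List[str]:
    """Hash every FULL block together with its parent block's hash.

    Chaining means a block's hash identifies the whole prefix up to and
    including that block, not just the block's own tokens.
    """
    hashes: List[str] = []
    parent = ""
    for b in range(len(tokens) // block_size):
        chunk = tokens[b * block_size:(b + 1) * block_size]
        payload = parent + "|" + ",".join(map(str, chunk))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def find_longest_cache_hit(hashes: List[str],
                           cache: Dict[str, int]) -> List[int]:
    """Cached block ids for the longest matching prefix (first-miss stop)."""
    hit: List[int] = []
    for h in hashes:
        block_id = cache.get(h)
        if block_id is None:
            break  # stop at the first miss; later blocks cannot match
        hit.append(block_id)
    return hit
```

With a cache populated from one request, a second request that shares the first two blocks and then diverges gets two block ids back, while a request that differs in its first block gets none even if its later tokens are identical, because each hash depends on its parent.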

Validate the paged KV read/write path at the ggml-op level, driven by PagedKVManager:
- write: ggml_set_rows(pool, k_src, slot_mapping) scatter K rows by slot
- read: ggml_get_rows(pool, gather_idx) gather a seq's slots into contiguous scratch (the tensor an attention kernel consumes)
The test forces a non-contiguous, out-of-order physical block layout (allocate seqA+seqB, free seqA, reallocate seqC -> blocks [2,1,5]) and proves gather(write(x)) == x plus cross-sequence isolation in the shared pool. This de-risks the central question (does slot-addressed paged storage round-trip correctly through ggml) before the llama-graph integration.
Pool is statically allocated via ggml_backend_alloc_ctx_tensors, mirroring how llama.cpp allocates its KV cache. CPU backend, no new ggml op. Built against ggml from the vendored llama.cpp checkout.
Phase 1 of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
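
The round-trip property gather(write(x)) == x can be sketched in plain Python. The set_rows and get_rows functions below are list-based stand-ins, not the ggml ops; the block size of 4, the pool size, and seqB's block table [3, 4] are assumptions for illustration. Only seqC's layout [2, 1, 5] comes from the commit.

```python
BLOCK_SIZE = 4  # deliberately small; assumed for illustration


def slot_mapping(block_table, n_tokens, block_size=BLOCK_SIZE):
    """One physical slot per logical token, following the block table."""
    return [block_table[i // block_size] * block_size + i % block_size
            for i in range(n_tokens)]


def set_rows(pool, rows, slots):
    """Scatter: write each source row into the pool at its slot."""
    for row, slot in zip(rows, slots):
        pool[slot] = list(row)


def get_rows(pool, slots):
    """Gather: read a sequence's slots back into contiguous order."""
    return [list(pool[slot]) for slot in slots]


def round_trip_demo():
    pool = [[0.0, 0.0] for _ in range(8 * BLOCK_SIZE)]  # shared pool, 8 blocks
    seq_b = [[100.0 + i, 0.5] for i in range(8)]        # assumed blocks [3, 4]
    seq_c = [[float(i), -1.0] for i in range(12)]       # blocks [2, 1, 5]
    slots_b = slot_mapping([3, 4], len(seq_b))
    slots_c = slot_mapping([2, 1, 5], len(seq_c))
    set_rows(pool, seq_b, slots_b)
    set_rows(pool, seq_c, slots_c)
    # Round trip for seqC, and isolation: seqB is untouched by seqC's writes.
    return get_rows(pool, slots_c) == seq_c, get_rows(pool, slots_b) == seq_b
```

Both returned flags are true as long as the two block tables are disjoint, which is the isolation the allocator is responsible for.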

Retire the central numeric risk from the design: feeding gather-to-scratch KV (a sequence whose blocks are non-contiguous in the shared pool, [2,1,5]) into ggml's standard attention ops produces correct attention. Path under test: set_rows write -> get_rows gather (K and V) -> mul_mat(K,Q) -> soft_max_ext -> mul_mat(V^T, probs). Result is compared against an independent host-computed softmax attention over the same K/V/Q. Max abs error ~7.5e-08 (n_kv=48, d=8, n_q=4). This proves the paged read path is numerically sound on CPU with no new ggml op. Remaining: wire build_attn_paged into llama-graph.cpp and validate Gate 0 (token-identical greedy generation in a real model). Phase 2 (core) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
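
As a rough illustration of why the gathered path gives the same attention result, the sketch below scatters K/V into a pool using the block layout [2,1,5], gathers them back, and compares softmax attention on the gathered rows against attention on the original rows. It is pure Python with double precision, not the ggml graph; the block size of 16, the random inputs, and the function names are assumptions.

```python
import math
import random

BLOCK_SIZE = 16  # assumed block size, for illustration only


def attention(q, k_rows, v_rows):
    """Plain softmax attention for one query over the given K/V rows."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_rows]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, v_rows)) / total
            for j in range(len(v_rows[0]))]


def paged_vs_contiguous_error(n_kv=48, d=8, n_q=4,
                              block_table=(2, 1, 5), seed=0):
    rng = random.Random(seed)
    k = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_kv)]
    v = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_kv)]
    qs = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n_q)]

    slots = [block_table[i // BLOCK_SIZE] * BLOCK_SIZE + i % BLOCK_SIZE
             for i in range(n_kv)]
    n_slots = (max(block_table) + 1) * BLOCK_SIZE
    k_pool = [[0.0] * d for _ in range(n_slots)]
    v_pool = [[0.0] * d for _ in range(n_slots)]
    for i, s in enumerate(slots):        # scatter write by slot
        k_pool[s], v_pool[s] = k[i], v[i]
    k_g = [k_pool[s] for s in slots]     # gather into logical order
    v_g = [v_pool[s] for s in slots]

    err = 0.0
    for q in qs:
        ref = attention(q, k, v)
        got = attention(q, k_g, v_g)
        err = max(err, max(abs(a - b) for a, b in zip(ref, got)))
    return err
```

Here both paths run the same arithmetic on the same values, so the difference is zero. The sketch does not reproduce the ~7.5e-08 figure reported above, which comes from comparing ggml's ops against an independent host computation.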

Quantify the two multi-tenant wins that are properties of the host-side block model (vLLM-parity), independent of the in-model compute path:
WIN 1 concurrency capacity @ 512-block budget
  contiguous (reserve n_ctx/seq): 4 sequences
  paged (on-demand blocks): 37 sequences
  --> 9.2x more concurrent sequences
WIN 3 cross-tenant prefix sharing (32 tenants, 1024-tok shared prefix)
  prefix-cache OFF: 2176 physical blocks
  prefix-cache ON: 192 physical blocks
  --> 11.3x less KV memory
WIN 2 (throughput) is deliberately reported as PENDING: it requires the paged gather-read path wired into llama-graph.cpp (Gate 0) and is not measurable at the allocation layer. The win-1 baseline is per-sequence n_ctx reservation (stream mode); llama.cpp's unified cache already shares one pool, so the honest win there is on-demand sizing + prefix dedup.
Phase 3 (partial) of docs/superpowers/plans/2026-06-19-paged-attention-llamacpp.md.
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
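
The prefix-sharing block counts can be reproduced with simple arithmetic. The commit does not state the block size or the per-tenant suffix length; the values used below (16-token blocks, a 64-token private suffix per tenant) are assumptions chosen because they are consistent with the reported 2176 and 192 blocks. Function names are hypothetical.

```python
import math


def blocks_without_sharing(tenants, prefix_tokens, suffix_tokens, block_size):
    """Every tenant stores its own copy of prefix + suffix."""
    return tenants * math.ceil((prefix_tokens + suffix_tokens) / block_size)


def blocks_with_sharing(tenants, prefix_tokens, suffix_tokens, block_size):
    """Full prefix blocks are stored once; the rest is private per tenant."""
    shared = prefix_tokens // block_size     # only full blocks are shareable
    leftover = prefix_tokens % block_size    # partial prefix block stays private
    private = math.ceil((leftover + suffix_tokens) / block_size)
    return shared + tenants * private


# Assumed parameters: 16-token blocks, 64-token suffix per tenant.
off = blocks_without_sharing(32, 1024, 64, 16)
on = blocks_with_sharing(32, 1024, 64, 16)
ratio = off / on
```

Under these assumptions the shared prefix occupies 64 blocks once instead of 32 times, and the ratio off / on is about 11.3. The concurrency figure is the direct quotient 37 / 4 = 9.25.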

Capture verified state (P0 manager parity, P1 ggml write/gather, P2 attention numerics 7.5e-08, P3 capacity 9.2x + prefix-sharing 11.3x) and the exact remaining work: wire build_attn_paged into llama-graph.cpp and validate token-identical generation on Qwen3-0.6B (Gate 0), then win-2 throughput. Records the integration seams (create_memory, find_slot, get_k/get_v, build_attn, mask) and the honest caveats (unified cache already shares a pool; vLLM's classic kernel is deprecated) so the next session starts warm. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…KV placement Wire paged, non-contiguous fixed-size BLOCK placement into the real llama.cpp KV cache (find_slot), behind env LLAMA_KV_PAGED, and validate Gate 0 on a real GGUF: Qwen3-0.6B greedy generation is TOKEN-IDENTICAL to the contiguous cache while its KV is physically scattered across permuted blocks (cells 0-15, 144-159, 32-47, ...). Proven non-contiguous via LLAMA_KV_PAGED_DEBUG, not a silent fallback. This retires the correctness premise of paged attention IN THE MODEL (not just at the ggml-op level): attention is invariant to physical KV placement, because reads use per-cell pos/seq metadata for masking. The patch lives at patches/0001-paged-kv-block-placement.patch (against llama.cpp 0253fb21f). Scope: storage/placement layer, single sequence. Remaining (P4): the gather-read compute path (attend only a seq's own blocks) for the throughput win, and the multi-sequence driver. README updated with repro + status. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Captures the full dgx.casa investigation: Q8/F16/vLLM baselines, concurrency sweeps, paged-patch (no concurrency effect), nsys+code root-cause (MoE int8 MMQ on Ampere-class tensor cores = 74.5% compute, no FP8 path), and the lever plan.
Measured wins:
- Lever 1 (MXFP4 / Blackwell FP4 path): decode +50-66% over Q8, prefill plateau +66% (2200->3650). MXFP4 decode beats vLLM FP8 at B=1 (83 vs 48), near-parity B=8. Prefill still plateaus (fused-MoE-GEMM gap).
- Lever 2 (ubatch): saturates at 2048; ceiling is the kernel, not batch.
Designed (not built): Lever 3 fused FP4/FP8 MoE grouped GEMM, Lever 4 FP8 GEMM (needs ggml_mul_mat_ext scale plumbing), Lever 5 tcgen05 kernels, and the complete paged attention (on-demand alloc + gather-read + continuous batching + prefix sharing). Honest scope: each is multi-week kernel/systems work.
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

On NVIDIA Blackwell consumer GPUs (sm_120/121, incl. GB10/DGX Spark) a larger physical batch (n_ubatch) materially lifts MoE prefill throughput: measured on a GB10 with Qwen3-30B-A3B, it raises the prefill ceiling and saturates at ~2048. When a model config leaves `batch:` unset, EffectiveBatchSize now picks 2048 on Blackwell instead of 512; an explicit `batch:` always overrides. Detection is a shared, cached Go helper (xsysinfo.IsNVIDIABlackwell, nvidia-smi compute_cap >= 12). Logic is isolated in core/backend/hardware_defaults.go and applied at the common ModelOptions builder, so it covers the C++ llama.cpp backend too.
Measured (GB10, Qwen3-Coder-30B-A3B MXFP4): prefill ub512 2994 -> ub2048 3316 t/s; saturates past 2048. Also recorded in the DGX gap plan: 4-bit quant alone captures the decode win (Q4_K_M 93.5 >= MXFP4 86.4 t/s), MXFP4's only edge is prefill via Blackwell FP4 tensor cores.
Tests: hardware_defaults_internal_test.go; existing NBatch specs pinned to the no-Blackwell branch for determinism.
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
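
The default-selection rule can be restated as a short Python sketch. The real logic is the Go code named in the commit; the function names here, the representation of an unset `batch:` as None or 0, and the parsing of the compute capability string are all assumptions for illustration.

```python
from typing import Optional

DEFAULT_BATCH = 512      # default when no hardware-specific rule applies
BLACKWELL_BATCH = 2048   # default chosen on Blackwell per the commit


def is_nvidia_blackwell(compute_cap: str) -> bool:
    """True when a compute capability string such as '12.1' has major >= 12."""
    try:
        major = int(compute_cap.strip().split(".")[0])
    except ValueError:
        return False  # empty or unparsable output: treat as not Blackwell
    return major >= 12


def effective_batch_size(configured: Optional[int], blackwell: bool) -> int:
    """An explicit batch value always wins; otherwise pick by hardware."""
    if configured:  # None or 0 is treated as "unset" in this sketch
        return configured
    return BLACKWELL_BATCH if blackwell else DEFAULT_BATCH
```

For example, effective_batch_size(None, is_nvidia_blackwell("12.1")) selects the Blackwell default, while any explicit value is returned unchanged regardless of the GPU.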

Prefill doesn't scale with bigger single prompts (attention O(N^2)); real gap is batched MoE prefill (B=32: 27x vs vLLM, ~22 effective TFLOP/s). nsys pins Lever 3 target: mul_mat_q<MXFP4> MoE GEMM 37% + un-fused act-quant 8%; native FP4 MMA already engaged, inefficiency is the per-expert thin-tile scheduler. Q4_K_M matches MXFP4 on decode (decode win is generic 4-bit); MXFP4's only edge is prefill. Auto-ubatch=2048 on Blackwell shipped (PR #10411). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…m ggml issue draft Plan A (Lever 3): phased path to FP4 MoE GEMM parity — cheap tweaks, act-quant fusion, then the real lever (tcgen05/CUTLASS grouped GEMM), full-model FP4. Plan B (paged attention): on-demand pool, gather-read + Gate 0, continuous batching, prefix sharing; benchmark in memory-pressured/mixed-length regimes. Upstream issue draft: GB10 numbers, nsys profile, ruled-out config knobs, tcgen05 proposal. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

static_assert(nwarps*tile_C::I == mmq_y) locks nwarps=8 for mmq_y=128; can't raise occupancy without co-scaling mmq_y (blows Blackwell smem). MMQ kernel is not freely tunable -> parity needs the tcgen05/CUTLASS rewrite, not knobs. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…s from-scratch No tcgen05/CUTLASS grouped-GEMM MoE kernel exists upstream (merged/in-flight/ draft); CUTLASS not a dep; no fork has one; activation-quant gather already fused. Matching vLLM needs a from-scratch tcgen05 grouped GEMM (months, maintainers deferring to cuTile). No tractable patch closes the 27x. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…ttention Numbered patches under backend/cpp/llama-cpp/patches/ applied in order against the pinned LLAMA_VERSION (build hook in the llama.cpp: target). Each phase is one small, independently-buildable patch so the work rebases cleanly across llama.cpp bumps (anti-drift). README defines the series (0001 vendor manager -> 0006 prefix caching) + the regen workflow. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

First patch of the stacking series. Adds src/paged-kv-manager.{h,cpp} (the CPU-verified vLLM-parity block manager) + CMake entry. No behavior change. Generated against the pinned LLAMA_VERSION; applies clean. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…ical find_slot places a sequence's tokens at permuted non-contiguous blocks; greedy generation is token-identical to stock (verified on Qwen3-0.6B at the pin), branch confirmed firing. Default off. The placement substrate for the gather-read. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Every edit mapped (gather-index graph input mirroring k_idxs; gather K/V/mask by one aligned index; n_kv compaction; gated so stock stays byte-identical) with the token-identical gate and the known risks (mask transpose layout, v_trans). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

… single-stream first Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Prefill 6-48x behind and does NOT scale with B (kernel-bound, paging can't fix). Decode: we win at B=1; 2.5-3.7x behind at B>=8 - THAT concurrency gap is the engine's domain (0004 pool + 0005 continuous batching target it). Baseline for the series to improve on. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…is 54.6% MoE GEMM too Decode-dominated B=64 nsys: mul_mat_q<MXFP4> 54.6%, attention only 19.8%. Both phases are FP4-MoE-kernel-bound (Lever 3). The paged series cannot close the vLLM gap in either phase; its real value is capacity + prefix-sharing, not tok/s parity. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…Lever 3) The only work that closes the vLLM gap on Blackwell: mul_mat_q<MXFP4> is 37% prefill + 54.6% decode-B64 GPU time; paged attention can't touch it (proven). Scaffold (builds clean on GB10, default byte-identical): fp4-grouped-moe.{cuh,cu} entry + gated hook in ggml_cuda_mul_mat_id (env GGML_CUDA_FP4_GROUPED), always falls back to MMQ for now. Design doc has the CUTLASS/tcgen05 implementation phases + parity harness + the dense-path follow-up (#28). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…PP 7.6-32x) vLLM W4A16 vs llama Q4_K_M dense: prefill 7.6-32x behind (llama plateaus ~765, vLLM scales to 24.4k); decode ~parity at B=1 (weight-bandwidth-bound), 2.2x at B=64. Full NVFP4 (W4A4) hangs on this vLLM/GB10 stack - W4A16 used. Decision: the Lever-3 kernel track must ALSO deliver a non-grouped FP4 dense GEMM, not just the MoE grouped GEMM (dense GEMM is the simpler first kernel to land). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…n grouped) Benchmark confirms dense prefill 7.6-32x behind too, so the kernel track needs a non-grouped FP4 dense GEMM (simpler, land first) + the MoE grouped GEMM. Both share the e2m1 block-scaled collective; dense is grouped-with-one-group. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…ry flag lever exhausted Confirms parity (dense+MoE, both phases) is strictly the FP4 tensor-core kernel; no config/flag shortcut remains. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…line Researched: W4A4 hangs on GB10 because FlashInfer ships no FP4 cubins for sm_120/121 (all datacenter Sm100a); dense mm_fp4 is gated-off/returns-zeros on consumer Blackwell, and the FlashInfer FP4 autotuner spins on the first forward pass. Not a misconfig - dense W4A4 inference isn't validated on sm_121. W4A16 (4-bit weight / 16-bit act, Marlin) vs llama Q4_K_M is the correct apples-to- apples (same quant class) AND the fast path. Removed the misleading 'W4A4 would be faster / lower bound' framing. Sources: vllm #30163/#26381, flashinfer #2577/#3294, cutlass #3096. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Key corrections: (1) vLLM 24k is AGGREGATE; single-stream roofline ~3300 t/s (BF16) / 6600 (FP4). (2) GB10 is 1:1:2 BF16:INT8:FP4 - INT8 == BF16, only FP4 is 2x. (3) Measured: dense int8-MMQ at 21% of ceiling, MoE FP4-MMQ at ~5% - both EXIST, just untuned for Blackwell. Strategy: to MATCH vLLM, tune MMQ or build a Marlin-style W4A16 BF16 GEMM (FP4 NOT required); to BEAT, fix the existing FP4 MMA on sm_121 (build/miscompile, not greenfield). Dropped the tcgen05 grouped GEMM rewrite. Cheap next test: dense MXFP4 quant + existing FP4-MMA. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

… (~17% of ceiling) MXFP4 dense moves prefill off int8-MMQ onto the FP4-MMA path (existing kernel) for a free 1.44x - shippable as a Blackwell dense-quant recommendation. But it's ~17% of the FP4 roofline, so the FP4-MMA kernel is itself untuned: ~4-6x still in the kernel. Sharpens the target to TUNING the FP4-MMA (serves dense+MoE, only path to beat vLLM). Marlin-style W4A16 BF16 is the alt to match on the BF16 ceiling. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…ever Per-user decode is at parity without spec-dec (10.2 vs 11.7, bandwidth-bound). vLLM's per-user speed = speculative decoding (lossless, target-verified). GB10 is best-case (bandwidth-bound + idle compute); llama.cpp spec-dec measured 2.9x on dense Qwen2.5-32B. Qwen3-32B has no native MTP - use Qwen3-1.7B draft or EAGLE3 head. Recommendation: make spec-dec easy for dense >=14B on Blackwell (keeps Q4_K_M quality, no kernel). Prefill-kernel + continuous-batching are separate (TTFT / aggregate). Our own DGX run pending (box rebooted, llama-cli hangs). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Phase 1 (config, PR #10411, DONE): VRAM-scaled n_parallel + Blackwell batch. Phase 2: paged KV (PR #22569, ~9.5x concurrency). Phase 3: chunked prefill + n_batch/ubatch split. Phase 4: batched-GEMM kernel tuning. Phase 5: backend sampling. Cross-cutting: spec-dec for dense. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

… plan Decisive DGX experiment: rebuilt with -DGGML_CUDA_FORCE_CUBLAS (it's a compile #ifdef, not the runtime env we'd been setting - so prior 'cuBLAS no-op' tests never engaged it). Real result: cuBLAS is SLOWER than MMQ for dense Q4 (pp2048 690 vs 750) and runs an Ampere cutlass_80_tensorop kernel - CUDA-13 has no sm_121 GEMM, falls back to sm_80. So both MMQ and cuBLAS sit at ~46 TFLOP/s; no library shortcut to the 213 ceiling on GB10. Confirms a hand-tuned sm_120a kernel is required. Added the phased W4A16 Marlin-style implementation plan (P0 harness -> P5 enable) as the committed multi-week build; corrected the cuBLAS note. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

… skills Two .agents guides (indexed in AGENTS.md): - llama-cpp-localai-paged-backend.md: what the CUDA-only paged backend is, the patchset scope, the bit-exact gate, the manual pin-sync + weekly canary, the CUDA-only / stock-stays-pure invariants, and the Metal/SYCL/Vulkan follow-up scope. - vllm-parity-methodology.md: the decode-parity playbook (bit-exact gating, profile-don't-assume, both-engine ground-truth, per-lever A/B, recording rejected levers, multi-agent GPU orchestration). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…to docs/
The llama-cpp-localai-paged patches/ dir had accumulated docs, plots, a csv, dev .cpp harnesses, and a dead FP4-MoE kernel scaffold after an earlier git-mv. Restore the invariant that patches/ holds only the .patch series.
Moves:
- patches/paged/README.md -> README.md (canonical doc at the backend root)
- patches/paged/{PIN_SYNC_c299a92c,PAGED_BITEXACT_NOTE,LOCALAI_LLAMACPP_BACKEND_PLAN,UPSTREAM_LAYER2_SCOPE}.md, final_benchmark.csv, qwen36_*.png, paged-burst-bench.cpp, paged-reclaim-unit.cpp -> docs/
- patches/README.md -> docs/PATCH_MAINTENANCE.md (unique patch-regen recipe not in the canonical README)
Deletes:
- patches/BENCHMARKS.md (superseded by README section 4 + the dev-notes section)
- patches/kernel/ (dead FP4-MoE scaffold, never in the 0001-0030 apply glob, zero refs repo-wide)
Repoint every reference to the moved files: README internal links (docs/ + the .github links drop from 5x ../ to 3x ../), .agents/llama-cpp-localai-paged-backend.md, .github/scripts/paged-canary-apply.sh, .github/workflows/llama-cpp-paged-canary.yml, the wrapper Makefile, backend/cpp/llama-cpp/grpc-server.cpp, backend/index.yaml, docs/content/features/backends.md, gallery/index.yaml.
The build apply glob PAGED_PATCHES_DIR/0*.patch (PAGED_PATCHES_DIR := .../patches/paged) is unchanged and still resolves to the 28 patches.
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…rver link The c299a92c bump diverged 23 commits ahead of the stock llama-cpp pin. grpc-server.cpp is SHARED with the stock backend and tracks the stock pin; c299a92c's upstream server-API refactor pulled stream_* helpers into the headers grpc-server.cpp includes, whose definitions the stock-aligned build does not compile -> every paged variant failed to LINK (undefined reference to stream_aware_should_stop / stream_pipe_producer::cleanup / stream_session_attach_pipe). The bump was greedy-md5 bit-exact, but the bit-exact gate never exercises the full grpc-server build, so it slipped through. Revert LLAMA_VERSION to 9d5d882d (== stock pin, where the patches are bit-exact AND grpc-server links - the original DGX-proven baseline). Document the hard constraint in the Makefile, README, PIN_SYNC record, and the .agents guide: the paged pin must track the stock pin, and a pin-sync must pass the full CI grpc-server build, not only the bit-exact gate. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

The paged backend's llama.cpp pin was reverted from c299a92c back to 9d5d882d (== stock), so docs/PIN_SYNC_c299a92c.md (a blow-by-blow of the reverted sync) is dead weight. The pin-sync PROCESS stays documented in the three live places: the Makefile comment, README section 7 (Pin + maintenance policy), and .agents/llama-cpp-localai-paged-backend.md. Delete the doc and repoint every reference to it (Makefile, README, .agents, canary script + workflow) at README section 7. No functional paths change: the canary's patches-dir glob (patches/paged/0*.patch) is untouched. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…bf16-tau) Re-run the GB10/DGX-Spark llama-batched-bench matrix (dense q36-27b + MoE q36-35b-a3b, npl 8/32/64/128, -fa on -ngl 99 -npp 128 -ntg 128) so the CSV and README section 4 carry a single consistent set of llama numbers with all three configs: - stock: separately-built unpatched llama.cpp at this backend's exact pin 9d5d882d (toggling LLAMA_KV_PAGED on the patched binary does NOT reproduce stock - the SSM decode fusions are compiled in, not env-gated). - patched: paged binary, LLAMA_KV_PAGED=1 (+LLAMA_MOE_FORCE_GRAPHS=1 for MoE). - patched+bf16-tau: patched plus --ssm-bf16-tau 64 (opt-in, NOT bit-exact, ~91% same-top-p). final_benchmark.csv now has stock + patched + bf16-tau + vllm rows for both models at all four widths (the prior CSV had no stock and no bf16-tau rows). peak_gb is dropped: the GB10's unified LPDDR5x reports [N/A] to nvidia-smi and the bench does not print it, so per-run peak could not be captured this session. Patch series gives up to 2.46x (dense) / 2.26x (MoE) over true-stock; opt-in bf16-tau adds a further +3% to +17% on top of patched (growing with width). vLLM column is kept from the prior session (not re-run) and labeled as such. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…-in wins The DGX re-run showed toggling LLAMA_KV_PAGED on/off on the patched binary does NOT reproduce stock: the dominant SSM decode fusions are compiled in, not runtime-gated, so the toggle measures only the (here ~neutral) paged-KV part. True stock needs a separately-built unpatched binary at the same pin. Correct the methodology skill's per-lever discipline + apples-to-apples rule accordingly. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…overview Rebuild the two committed decode plots from the re-measured CSV and add a combined overview. Three series per the comparison that matters: llama.cpp (standard) vs vLLM vs LocalAI's llama.cpp patches; x-over-standard called out at npl128. bf16-tau stays out of the plot (it remains in the CSV + the README table as the opt-in row). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Per request, the plots now show all four series: llama.cpp (standard), vLLM, LocalAI's llama.cpp patches (bit-exact hero), and LocalAI's patches + bf16-tau (opt-in ceiling, +3% to +17% over the patches, ahead of vLLM at every dense width and MoE npl>=32). Subtitle flags bf16-tau as opt-in / not bit-exact. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…CUDA-13 only
The paged backend targets Blackwell sm_121a, which CUDA 12.0 cannot target at all, so the CUDA-12 variants were pointless. They were also broken: the cublas-12 / nvidia-l4t / arm64 build failed to compile paged-kv-manager.cpp ("no declaration matches ...", a ~10-function mismatch the older cuda-12-base gcc rejects). CUDA-13 compiles it fine (confirmed on GB10).
Removed (config-only, scoped to the paged backend):
- backend-matrix.yml: the two CUDA-12 paged rows (-gpu-nvidia-cuda-12-llama-cpp-localai-paged, -nvidia-l4t-arm64-llama-cpp-localai-paged)
- backend/index.yaml: CUDA-12 capability keys (nvidia-cuda-12, nvidia-l4t-cuda-12, nvidia-l4t) on both meta-backends, repointed default/nvidia to the cuda13 amd64 variant, and dropped the orphaned cuda12-* / nvidia-l4t-arm64-* variant definitions (latest + -development).
Kept CUDA-13 only: cuda13-llama-cpp-localai-paged (amd64) and cuda13-nvidia-l4t-arm64-llama-cpp-localai-paged (l4t arm64). Matrix tag-suffixes <-> index variant URIs form a clean 2:2 bijection.
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Fixes the cuda-13 amd64 / non-arm64 build, where size_t was used without including its header, <cstddef> (the arm64 cuda-13 toolchain pulls it in transitively; amd64/cuda-12 toolchains do not). Compile-only change, bit-exactness unaffected. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…ix amd64/non-arm64 build; compile-only)
Vendored paged headers used size_t / uintN_t without including <cstddef> / <cstdint>. The arm64 DGX toolchain provides them transitively so the build passed there, but amd64/older toolchains do not, failing the CI amd64 build one header at a time ('size_t' does not name a type -> cascade). paged-kv-manager.h was already fixed. This adds the missing includes to the remaining vendored headers at the point each is created/rewritten in the patch series so every src/paged*.h self-includes both:
* paged-attn.h (0003): add <cstddef> (had <cstdint>)
* paged-alloc.h (0007): add <cstddef> (had <cstdint>)
* paged-prefix-api.h (0007): add <cstddef> + <cstdint> (had only llama.h)
The .cpp units include their own paged header, so they inherit the includes transitively. Whole series still applies clean on the pinned llama.cpp. Compile-only change: no runtime behavior change, bit-exactness unaffected.
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…ve/restore (patch 0026)
The opt-in ssm_bf16_tau hybrid mode splits a gated-DeltaNet layer's recurrent SSM state into an f32 partition (s_l) and a bf16 partition (s_l_bf16). The recurrent state serialization paths (state_write_data / state_read_data) were never updated for the split: they read/wrote s_l using the FULL hparams.n_embd_s() (S_v*S_v*H) row width, but a split layer's s_l only holds S_v*S_v*n_f32, so the access overruns the smaller tensor (a ggml_backend tensor read out of bounds), and the bf16 fast-head partition was never persisted at all.
This is what broke high-concurrency serving with --ssm-bf16-tau: the server's context-checkpoint feature serializes per-sequence state via state_seq_get_data. With a checkpoint enabled, even a single request triggered the out-of-bounds read; at higher concurrency the cell range starts at a higher base slot so the overrun reaches further (hard abort in a debug build, silent state corruption then 1-token-then-EOS on restore in a release build). The static batched-bench never exercises save/restore so it did not catch it; the GDN decode kernel and per-head partition offsets were already correct (decode with checkpoints disabled is fine at N=8/16/32).
Fix: serialize the f32 partition and, when the layer is split, the bf16 partition right after it, each with its OWN row width (tensor ne[0]). head_slot is rebuilt deterministically at load (same model + tau), so it is not serialized. Non-split layers have ne[0] == n_embd_s() and no bf16 partition, so their on-disk format and behavior are byte-identical (the default f32 path and the bit-exact gate are unaffected).
Verified on GB10/DGX with Qwen3.6-35B-A3B-NVFP4 + --ssm-bf16-tau 64 via a continuous-batching llama-server: with context checkpoints enabled, N=8, N=16 and N=32 (slot reuse + restore) all now produce full coherent 128-token output and the server stays up; pre-fix the same config aborted on the first checkpoint.
Assisted-by: Claude:claude-opus-4-8[1m] [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
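
The shape of the fix, writing each partition with its own row width instead of the full-state width, can be sketched in Python. This is not llama.cpp's on-disk format: the flat-list buffer, the (row count, width) header, and the function names are hypothetical, and the bf16 partition width used in the usage note, S_v*S_v*(H - n_f32), is an assumption.

```python
from typing import List, Optional, Tuple

Rows = List[List[float]]


def write_state(s_l: Rows, s_l_bf16: Optional[Rows]) -> List[float]:
    """Serialize the f32 partition, then the bf16 partition if the layer is split.

    Each partition is written using its OWN row width, never the full-state width.
    """
    parts = [s_l] if s_l_bf16 is None else [s_l, s_l_bf16]
    buf: List[float] = []
    for rows in parts:
        width = len(rows[0])
        buf += [float(len(rows)), float(width)]  # hypothetical header
        for row in rows:
            if len(row) != width:
                raise ValueError("ragged partition")
            buf += row
    return buf


def read_state(buf: List[float], split: bool) -> Tuple[Rows, Optional[Rows]]:
    """Inverse of write_state; reads one partition, or two when split."""
    pos = 0
    parts: List[Rows] = []
    for _ in range(2 if split else 1):
        n_rows, width = int(buf[pos]), int(buf[pos + 1])
        pos += 2
        rows = [buf[pos + r * width: pos + (r + 1) * width]
                for r in range(n_rows)]
        pos += n_rows * width
        parts.append(rows)
    if pos != len(buf):
        raise ValueError("trailing or missing data")
    return parts[0], (parts[1] if split else None)
```

With S_v = 2, H = 4 and n_f32 = 1, the f32 rows are 4 wide and the bf16 rows 12 wide, while the full-state width is 16. Reading 16 values per row from the 4-wide partition is the kind of overrun the commit describes. A non-split layer writes a single partition, so its output is the same as before the split existed.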

master auto-bumped the stock llama-cpp pin 9d5d882d -> 0ed235ea and updated the shared grpc-server.cpp. The paged backend's pin must track the stock pin (the grpc-server.cpp is shared), so bump its LLAMA_VERSION to match. All 28 paged patches apply clean on 0ed235ea (verified against a fresh upstream clone). The bf16-tau state-serialization fix (patch 0026) is included. Bit-exact gate + full grpc-server build verification on GPU/CI to follow. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

… (+ROCm)
Add ACCELERATOR_PORTING_SCOPE.md, the umbrella scope for taking the paged backend's accelerator-portable wins off the CUDA family. It builds on (does not duplicate) UPSTREAM_LAYER2_SCOPE.md, which stays the GDN/SSM-fusion detail (benefit #1), and adds:
- Benefit #2 (paged KV in-kernel block-table flash-attn read, 0009-0011): new per-backend feasibility from source analysis of the Metal/SYCL/Vulkan flash-attn kernels. SYCL EASY (near line-for-line CUDA mirror), Metal EASY-MEDIUM (decode already routes to the vec kernel), Vulkan MEDIUM (the fast coopmat2 NVIDIA decode path cannot do the indexed read; push-constants are full). Universal constraint: only the vec/scalar decode kernel admits the per-cell indexed read, so route block-table ops onto vec (as CUDA's 0009-0010 dispatch guard already does) and leave the fast MM/coopmat2 path contiguous-only. This is the lever that flips paged KV from neutral-to-slightly-negative to non-negative off CUDA.
- Benefit #3 (decode-first scheduler, 0013/0016): confirmed a free portable win - host-side update_slots() policy, zero kernel work, runs on any accelerator as-is.
- Benefit #4 (NVFP4 FP4-MMA, 0017/0023/0025): out of scope (Blackwell only); flags the backend-agnostic analogues of the act-quant dedup and the graph-coverage lever without over-claiming a port.
- A ROCm note: ROCm rides the CUDA/HIP path (validate, don't re-port); FP4-MMA stays Blackwell-only.
Benefits #1 and #2 share the port shape and rank Metal->SYCL->Vulkan, so they bundle into one per-backend PR behind a shared ops-first PR. Cross-link added from UPSTREAM_LAYER2_SCOPE.md. All gates are test-backend-ops on-target (no Metal/SYCL/Vulkan/ROCm hardware here).
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

… pin 0ed235ea llama.cpp renamed the RPC tool target (tools/rpc/CMakeLists.txt: set(TARGET ggml-rpc-server)) at the 0ed235ea pin. master already updated the stock llama-cpp Makefile to match (--target ggml-rpc-server, cp bin/ggml-rpc-server); the paged backend's separate Makefile copy was left stale and its -grpc (RPC) variant failed with 'No rule to make target rpc-server' (grpc-server itself built to 100%). Mirror the stock rename in the paged Makefile. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…0000 flat, zero speed benefit)

The opt-in hybrid per-head bf16 SSM-state lever (ssm_bf16_tau, patch 0026) is removed from the llama-cpp-localai-paged patch series. Clean re-measurement after the decode fusions (0028 recurrent-state gather-fusion + 0029 block-table cache) landed shows it buys nothing: forcing ALL gated-DeltaNet heads to bf16 (tau=100000, the most aggressive setting) gives flat decode throughput, 780.6 vs 780.0 t/s. The mode engages but adds zero speed because it is subsumed by the fusions. The earlier "+12%" was measured before the fusions completed. bf16-tau was a precision trade (not bit-exact, ~91% same-top-p) plus extra bug surface and extra CUDA template-instantiation compile cost with no offsetting benefit.

Dependency check: no later patch (0028/0029/0030) depends on 0026. 0030's only mention is a description comment; its code keys off fused_gdn_ar/ch/auto_fgdn, which originate in 0018/0019/0021 (before 0026). The remaining series (0001-0025, 0028-0030) applies clean with git apply --check against the pin 0ed235ea2c17a19fc8238668653946721ed136fd. The Makefile applies the series by glob (patches/paged/0*.patch); the resulting gap at 0026 is tolerated (0005/0027 are already absent).

Removed:

- patches/paged/0026-qwen35-hybrid-perhead-ssm-state.patch
- the dead ssm_bf16_tau / ssm_hybrid_tau option handler in the shared grpc-server.cpp (it only set LLAMA_SSM_BF16_TAU, now a no-op the library no longer reads)
- the patched+bf16-tau benchmark columns and llama-patched-bf16tau rows (README + final_benchmark.csv), the ssm_bf16_tau option text in backend index.yaml, the gallery NOTE block, and the docs/features/backends.md mention.

The rejected-lever lesson is kept (why it was dropped: subsumed, tau=100000 flat) in the backend README section 5, the paged-backend agent guide, and the vLLM-parity methodology, so it is not re-tried. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

Design + plan for the #1 prefill lever: NVFP4 weight GEMM at large M, where MMQ (decode/M<=128-tuned, 1 CTA/SM, 128-col tile cap) is ~3.4x slower than vLLM's marlin/cutlass large-M path (~51% of the prefill gap). Recommends (a) dequant->bf16 cuBLAS routed by an M-threshold (dense first, MoE grouped-cuBLAS second); rejects (b) a from-scratch Marlin/FP4 kernel as a multi-week project. Key enabling finding: NVFP4->bf16 dequant kernels already exist, and NVFP4 is currently force-excluded from the tensor-core cuBLAS path (falls to f32 Sgemm) - relaxing that one guard is the pivot. Honest: bf16-cuBLAS banks ~60-75% of the GEMM gap, not full 68us/tok parity (bf16 TC peak ~half FP4). Design only - no kernel, no GPU run. Signed-off-by: Ettore Di Giacinto <mudler@localai.io> Assisted-by: Claude:claude-opus-4-8 [Claude Code]

Adds patch 0031 to the paged llama.cpp series: an FLA-style chunked parallel-scan prefill kernel for gated DeltaNet (the upstream gated_delta_net.cu "Add chunked kernel for even faster pre-fill" TODO). Scope: non-KDA scalar gate, f32 state, final-state-only, homogeneous. Bit-exact-benign (NEW per-path): test-backend-ops GATED_DELTA_NET 91/91 within the 1e-7 NMSE gate vs the CPU reference (patch adds 8 S_v=128 prefill cases: exact-multiple / tail / multi-seq / GQA / permuted); numpy prototype confirms f32 chunked-vs-sequential NMSE ~1e-13. OPT-IN, default OFF: GB10's 99KB dynamic-smem opt-in forces C=16 (the 128x128 f32 state is 64KB of the all-shared layout), pinning the kernel to 1 block/SM with serial dk-reductions. Measured ~761 t/s chunked vs ~971 t/s sequential (~22% slower) on q36-27b-nvfp4 prefill, so it defaults OFF (enable with GDN_CHUNK_MIN=<n>); the backend default is regression-free. Beating the 84.7%-of-peak sequential scan needs tensor-core matmuls / register-resident state with larger chunks (recorded in README section 5). Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
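The shared-memory and throughput figures in this commit can be checked with a few lines of arithmetic. This is a back-of-envelope sketch only: the helper names are illustrative, and "99KB" is read as 99 KiB, which is an assumption about the unit.

```python
# Back-of-envelope check of the figures quoted in the commit message.
# Helper names are illustrative; "99KB" is assumed to mean 99 KiB.

F32_BYTES = 4
KIB = 1024


def state_bytes(s_k: int, s_v: int) -> int:
    """Size in bytes of one f32 recurrent-state tile of shape s_k x s_v."""
    return s_k * s_v * F32_BYTES


def slowdown_pct(baseline_tps: float, candidate_tps: float) -> float:
    """Throughput loss of the candidate relative to the baseline, in percent."""
    return (baseline_tps - candidate_tps) / baseline_tps * 100.0


state = state_bytes(128, 128)      # the 128x128 f32 state
budget = 99 * KIB                  # dynamic shared-memory opt-in on GB10
left_for_chunk = budget - state    # what remains for per-chunk buffers

print("state KiB:", state // KIB)
print("left for chunk buffers KiB:", left_for_chunk // KIB)
print("chunked vs sequential slowdown %:", round(slowdown_pct(971.0, 761.0), 1))
```

With the state occupying 64 KiB of the budget, roughly a third of the opt-in remains for everything chunk-sized, which is consistent with the small chunk length the commit reports.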

…ign-only) Add DECODE_SERVING_SCOPE.md: the decode KERNEL is at parity in static batched-bench (~6.1 tok/s/seq ~ vLLM ~5.9 at npl128) but continuous serving through llama-server update_slots() drops to ~3.7 (-39%) while vLLM sustains ~5.9. Scope shows the gap is the scheduler/host loop, not the kernel. Root-cause hypothesis from source: continuous batching's batch-shape + seq-set churn breaks BOTH graph-reuse layers every step - llama-context can_reuse/allow_reuse (n_tokens + seq-set must match) and the CUDA ggml_cuda_graph update_required memcmp (ne/nb/data ptrs) - so the GPU idles while the host rebuilds + re-captures the graph and runs un-graphed set_inputs. vLLM avoids this with padded/bucketed decode shapes + piecewise CUDA graphs. Documents that the shipped scheduler patches (0008/0013/0016/0024/0025/0029) target prefill freezing + burst collapse, NOT decode-step graph reuse, which is why the serving gap survives them; notes the README s.5 'lever 2 graph coverage FLAT' verdict was static-regime and is reopened here for serving only. Ranks host-side, bit-exact-safe levers: S1 bucketed/padded decode-step shape for graph reuse, S2 double-buffer/overlap per-step host work, S3 graph-shape-stable scheduling (extend 0016). Specifies a Phase-0 profile to confirm host-bound before any build, reusing the in-tree [L5INSTR] hostproc/set_inputs/get_block_table timers, the 'graphs reused' perf counter, LLAMA_GRAPH_REUSE_DISABLE and nsys GPU-busy%, with vLLM ground-truthed at the same concurrency. No kernel code; no GPU run in this pass. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
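The root-cause hypothesis above can be illustrated with a toy model. This is a minimal sketch, not llama.cpp code: it assumes a step can reuse the previous graph only when its (n_tokens, sequence-id set) pair is unchanged, which is the condition the commit attributes to can_reuse/allow_reuse. The step traces and the `reused_steps` helper are invented for illustration.

```python
from typing import FrozenSet, List, Tuple

# One decode step, reduced to the two properties the commit says must match:
# the token count and the set of sequence ids in the batch.
Step = Tuple[int, FrozenSet[int]]


def reused_steps(steps: List[Step]) -> int:
    """Count steps whose shape key equals the previous step's key."""
    return sum(1 for prev, cur in zip(steps, steps[1:]) if prev == cur)


# Static batched-bench: the same four sequences decode on every step.
static_trace: List[Step] = [(4, frozenset({0, 1, 2, 3}))] * 6

# Continuous serving (hypothetical trace): sequences finish and new ones are admitted.
churn_trace: List[Step] = [
    (4, frozenset({0, 1, 2, 3})),
    (3, frozenset({0, 1, 2})),     # seq 3 finished
    (4, frozenset({0, 1, 2, 4})),  # seq 4 admitted
    (4, frozenset({0, 1, 2, 4})),  # unchanged, so this step could reuse
    (3, frozenset({0, 2, 4})),     # seq 1 finished
    (4, frozenset({0, 2, 4, 5})),  # seq 5 admitted
]

print("static reuse:", reused_steps(static_trace), "of", len(static_trace) - 1)
print("churn reuse:", reused_steps(churn_trace), "of", len(churn_trace) - 1)
```

In the toy trace, the same batch width (4) recurs several times, yet reuse still fails because the sequence-id set differs, which is why shape padding alone would not be expected to restore reuse under this condition.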

Scopes the follow-up recorded by patch 0031 + README section 5: replace the serial per-thread reductions of the chunked gated-DeltaNet prefill scan with mma.sync tensor-core matmuls and lift the 1-block/SM occupancy ceiling, the path that would beat the tuned sequential scan and close the GDN prefill bucket toward vLLM's ~2.5x-cheaper chunked scan. Confirmed (not assumed) the GB10/sm_121a tensor-core reality: consumer Blackwell (SM12x) has NO wgmma (Hopper-only) and NO tcgen05/TMEM (sm_100a data-center only); the usable path is the extended mma.sync family. So the kernel is a warp-synchronous mma.sync + cp.async design (reusing ggml's mma.cuh tiles), not a wgmma/TMA/tcgen05 design - patch 0031's 'mma/wgmma' shorthand reads as mma only on this part. Design: register-resident state frees the 64KB that forced C=16, admitting C=64 under the 99KB shared opt-in; tf32 inputs / f32 accumulate with a 3xtf32 precision ladder; decays/gamma/beta stay f32 outside the mma to preserve the bounded de-gating; A-inverse via blocked forward substitution (FLA UT transform) with mma off-diagonal coupling. Mechanism: chunking cuts state-BW ~Cx, mma absorbs the O(C^2) intra-chunk flops the serial 0031 could not. Honest: multi-week, high risk, no vendor kernel to route to on sm_121; gains beat the sequential scan and close most of the bucket but not full sm_100-class parity. KL-gate binding (NMSE likely fails at reduced precision). Phased: re-profile -> two-product PoC -> full intra-chunk + C=64 + reg-state -> occupancy/cp.async; opt-in default-OFF until A/B-proven. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

… 0033, default-off) Option (a) of PREFILL_GEMM_SCOPE.md: route large-M (prefill) NVFP4 dense weight GEMMs off the decode-tuned FP4-MMQ kernel onto the dequant->bf16 cuBLAS (nvjet) tensor-core path, wired via an M-threshold in ggml_cuda_should_use_mmq. Lands the validated, bit-exact-gated mechanism and records the honest GB10 result: it is a regression, so it ships default-off (== stock), mirroring the patch-0017 default-off discipline. Three-edit scaffold (no new kernel): should_use_mmq routes NVFP4+Blackwell+dense M>LLAMA_FP4_PREFILL_M to cuBLAS; op_mul_mat_cublas gains an NVFP4 branch that dequants the FP4 weights to a transient bf16 pool buffer (not cached - stays FP4-resident) and runs cublasGemmEx CUDA_R_16BF/COMPUTE_32F; ggml_get_to_bf16_cuda gains the NVFP4 case. Bit-exact gate PASS (benign): test-backend-ops MUL_MAT 1146/1146 + MUL_MAT_ID 806/806; the forced path (LLAMA_FP4_PREFILL_M=64) is green CUDA-vs-CPU at NVFP4 large-M shapes; greedy md5 on q36-27b is byte-identical to FP4-MMQ both for short prefill (5951a5b4, decode untouched) and for a >threshold prefill that exercises the bf16 path (5f3967df - no greedy argmax flips). Performance REGRESSES on GB10 (S_PP, q36-27b dense, A/B via env): M=512 958.99 -> 486.65 (-49%), M=1024 1013.65 -> 587.27 (-42%), M=2048 918.46 -> 649.42 (-29%). The scope premise (FP4-MMQ ~3% of FP4 peak at large M) is false here: FP4-MMQ beats bf16-cuBLAS because bf16 peak is ~half FP4 peak and the per-step weight dequant + 4x bf16 weight traffic (~8x total vs the FP4 read) dominate, only partially amortizing as M grows. Default-off keeps stock S_PP (966.98). Phase 2 (MoE grouped large-M) not implemented: it inherits the same bf16-peak<FP4-peak ceiling plus a per-expert dequant, so grouped bf16-cuBLAS would regress for the same reason; a real prefill GEMM win needs option (b), a native FP4-MMA large-M kernel. Full A/B in docs/PREFILL_GEMM_RESULTS.md. 
Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
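The routing rule and the A/B percentages in this commit can be restated compactly. The sketch below is illustrative Python, not the patched `ggml_cuda_should_use_mmq`: the function name `routes_to_cublas` is hypothetical, and treating an unset threshold as "never route" is an assumption about how default-off is represented.

```python
from typing import Optional


def routes_to_cublas(m: int, is_nvfp4: bool, is_blackwell: bool,
                     is_dense: bool, prefill_m: Optional[int]) -> bool:
    """Route to dequant->bf16 cuBLAS only for NVFP4 + Blackwell + dense and M > threshold.

    prefill_m=None models the default-off state (assumption): everything stays on FP4-MMQ.
    """
    if prefill_m is None:
        return False
    return is_nvfp4 and is_blackwell and is_dense and m > prefill_m


def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# S_PP (t/s) from the commit: FP4-MMQ baseline -> bf16-cuBLAS path, keyed by M.
ab_results = {512: (958.99, 486.65), 1024: (1013.65, 587.27), 2048: (918.46, 649.42)}

for m, (mmq, cublas) in ab_results.items():
    print(m, routes_to_cublas(m, True, True, True, prefill_m=64),
          round(pct_change(mmq, cublas)))
```

The loss shrinks as M grows, matching the commit's note that the dequant cost only partially amortizes; none of the measured shapes reaches break-even.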

…0040/0041)

Add the two decode-serving graph-reuse levers (validated on GB10) that close the host-bound serving gap (paged dropped to ~3.7 vs vLLM ~5.9 tok/s/seq in real continuous serving while tying it in static batched-bench).

- 0040 S1 paged decode-graph reuse: the paged decode inputs never overrode llm_graph_input_i::can_reuse (defaults false), so the host rebuilt the ggml graph on EVERY decode step (layer-A reuse 0%). Add a 256-bucketed-shape can_reuse + a live-mctx refresh from the owning attn input. Bit-exact (md5 byte-identical reuse on/off). Static batched-bench: paged reuse 0% -> 95.5%.
- 0041 S3 decode-shape-stable scheduling: keep co-batched prefill out of decode steps so the scheduler emits the reuse-stable pure-decode shape S1 can reuse. Default-off policy on top of 0016; bit-exact (per-stream independent).

S1+S3 together (128-client staggered serving, MoE Qwen3.6-35B-A3B-NVFP4): graph reuse 0% -> 72.2%, hostproc 15.98 -> 6.31 ms/step, decode 4.05 -> 5.52 tok/s/seq median (4.24 -> 5.96 mean, at vLLM's ~5.9). S1 alone is insufficient (13.8%); S3 is the multiplier. S2 (double-buffer set_inputs) dropped: Phase-0 put set_inputs at ~0.05 ms/step, so it has nothing to recover. README patch table + DECODE_SERVING_SCOPE.md updated with results and the padded/fixed-slot follow-up. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
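The idea behind a 256-bucketed can_reuse can be sketched as follows. This is not the code in patch 0040: which dimension the patch buckets is not stated in the commit, so the sketch assumes a single growing length (for example the KV extent the graph was built for), and the function names are hypothetical. It does not attempt to reproduce the measured 95.5%.

```python
from typing import List


def bucket(n: int, width: int = 256) -> int:
    """Round n up to the next multiple of width."""
    return ((n + width - 1) // width) * width


def can_reuse(prev_n: int, cur_n: int, width: int = 256) -> bool:
    """Two steps may share a graph when their lengths fall in the same bucket."""
    return bucket(prev_n, width) == bucket(cur_n, width)


def reuse_hits(lengths: List[int], width: int = 256) -> int:
    """Number of consecutive step pairs that stay inside one bucket."""
    return sum(can_reuse(a, b, width) for a, b in zip(lengths, lengths[1:]))


# A single stream whose length grows by one token per decode step (illustrative).
lengths = list(range(1, 1025))
hits = reuse_hits(lengths)
print("reusable steps:", hits, "of", len(lengths) - 1)

# Without bucketing (width=1) every step has a new shape and nothing is reusable.
print("reusable steps, exact shape:", reuse_hits(lengths, width=1))
```

Bucketing trades a small amount of over-provisioned shape for a graph that only has to be rebuilt when a bucket boundary is crossed.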

…n under paged KV FIX A (patch 0031 compose break): the chunked GDN prefill patch carried '#include <cuda_bf16.h>' and '#include <type_traits>' as CONTEXT lines, but those were introduced by the dropped bf16-tau patch 0026, so on the bf16-tau-free 0001-0030 base only '#include <cstdlib>' is present and 'git apply' failed. The same 0026 drop also shifted 0031's later hunks off their context (the ', hyb' kernel-launch arg, the 'STATE_BF16, HYBRID' template params, and the GDN_LAUNCH_ARGS list). Regenerated 0031 against a fresh pin(0ed235ea) + 0001-0030 tree: the chunked kernel now SELF-PROVIDES the cuda_bf16.h / type_traits includes (adds them, plus the climits it needs for INT_MAX) and the dispatch guard is the 2-param 'if constexpr (!KDA && !keep_rs_t)' form. Behaviour is unchanged: 0031 stays opt-in, default OFF (GDN_CHUNK_MIN), a recorded negative. The full 0001-0042 series now applies clean on 0ed235ea ('git apply --check' green for every patch). FIX B (patch 0041 S3 default): the decode-shape-stable scheduler defaulted OFF. Make it default ON whenever paged KV is active (LLAMA_KV_PAGED set), still overridable to off via LLAMA_PAGED_DECODE_STABLE=0. Minimal host-side change in update_slots(); re-exported from the dev tree, README 0041 row updated to match. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…cted The S1 section-(a) padded/fixed-slot decode shape (the scoped follow-up to push serving graph reuse from ~72% toward ~100%) was implemented in an isolated worktree off the committed S1/S3/tail base, built CUDA-only, and benched on GB10. Verdict: REJECTED. It is bit-exact and provably inert, but it regresses serving throughput at every concurrency and does not close the vLLM gap. Implementation (default-off, LLAMA_PAGED_PAD_DECODE): on a pure-decode step (n_prompt_budgeted == 0) emit a masked-inert dummy decode for every idle slot so n_tokens / n_seqs / n_seqs_unq / n_outputs and the seq-id set stay constant; a release()-side guard keeps a finished slot warm under padding. Each dummy is its own sequence (private recurrent state, per-stream paged attention, logits discarded), so it cannot perturb a real stream. Gates: single-seq greedy md5 bit-exact (dense 5951a5b4, paged-MoE 8cb0ce23). The literal per-stream ON-vs-OFF identity gate is unachievable - concurrent cuBLAS/FA decode is not bit-reproducible run-to-run even with padding off (OFF-vs-OFF diverging streams: dense 3/16, MoE 8/16). The achievable inertness gate passed: ON-vs-OFF per-stream prefix-agreement equals the OFF-vs-OFF noise floor exactly (MoE 0.940/0.940, dense 0.812/0.812), so the dummy slots leak nothing. Bench (MoE Qwen3.6-35B-A3B-NVFP4, GB10), burst decode tok/s/seq: n=8 S1+S3 28.16 / PAD 6.05 / vLLM 44.8; n=128 S1+S3 4.53 / PAD 4.32 / vLLM 6.87. Staggered aggregate tok/s: baseline (reuse 0%) 757.6, S1+S3 (reuse 72%) 763.3, PAD (reuse 38%) 558.0. 
Why it fails: (1) serving decode here is GPU-compute-bound, not host-rebuild-bound - baseline reuse 0% ~= S1+S3 reuse 72% on aggregate tok/s, so closing reuse buys ~nothing (the earlier 542->762 host-bound delta did not reproduce); (2) padding adds dummy-row compute proportional to pad_width - real_load, catastrophic at low load; (3) in continuous serving padding cannot hold a constant width (perpetual prefill churn) so reuse drops 72% -> 38%; (4) the completion-driven batch shrink padding prevents is itself a throughput win in a compute-bound regime. The residual burst gap is GPU-compute, which a host-side reuse lever cannot close. Patch series unchanged: this rejected lever is NOT added to patches/paged/. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
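Point (2) above, that dummy-row compute scales with pad_width - real_load, is easy to make concrete. The sketch below is a toy cost model, not a reproduction of the benchmark: the pad width of 128 is a hypothetical value chosen for illustration, and the model ignores everything except row count.

```python
def dummy_rows(pad_width: int, real_load: int) -> int:
    """Masked-inert rows emitted to hold the decode batch at pad_width."""
    return max(pad_width - real_load, 0)


def wasted_fraction(pad_width: int, real_load: int) -> float:
    """Share of the padded batch that is dummy work."""
    return dummy_rows(pad_width, real_load) / pad_width


PAD_WIDTH = 128  # hypothetical fixed slot count, not taken from the commit

for real_load in (8, 64, 128):
    print(real_load,
          dummy_rows(PAD_WIDTH, real_load),
          wasted_fraction(PAD_WIDTH, real_load))
```

Under this model, padding costs nothing at full load and dominates the batch at low load, which is the direction of the n=8 versus n=128 results reported above.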

…ression) Patch 0041 (LLAMA_PAGED_DECODE_STABLE) was made default-on-when-paged, but a measured end-to-end A/B proved that is a serving mistake. S3 defers prefill admission on the period-8 cadence, which delays prompt admission: 2.5x worse TTFT (60s vs 24s at N=256) and 20-29% lower end-to-end throughput, with no end-to-end win at any concurrency. Its apparent decode_agg gain was a metric artifact (faster per-step decode bought by starving prefill). Flip the s3_enabled default so an unset LLAMA_PAGED_DECODE_STABLE means OFF; the mechanism stays available as an explicit opt-in (LLAMA_PAGED_DECODE_STABLE=1) for decode-dominated, low-arrival traffic where TTFT is not a concern. The default now prefers prompt prefill admission for good TTFT. S1 (patch 0040) keeps shipping default-on; only S3's default changes. Re-exports patch 0041 (change folded into its source commit) and updates the README 0041 row plus the decode-serving narrative to record the A/B finding. Greedy md5 gate unchanged (single-sequence llama-completion path, not update_slots): paged MoE 8cb0ce23, dense 5951a5b4. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
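The default flip described here amounts to changing what an unset variable means. The following is a minimal sketch of that semantics in Python, not the patched update_slots() code: the function name is hypothetical, and accepting only the literal value "1" as on is an assumption (the real parser may accept other spellings).

```python
from typing import Mapping


def s3_enabled(env: Mapping[str, str]) -> bool:
    """Decode-shape-stable scheduling is opt-in: unset means OFF.

    Assumption: only the literal value "1" turns it on.
    """
    return env.get("LLAMA_PAGED_DECODE_STABLE") == "1"


def ttft_ratio(with_s3_s: float, without_s3_s: float) -> float:
    """How many times worse time-to-first-token is with S3 enabled."""
    return with_s3_s / without_s3_s


print(s3_enabled({"LLAMA_KV_PAGED": "1"}))                                     # paged, S3 unset
print(s3_enabled({"LLAMA_KV_PAGED": "1", "LLAMA_PAGED_DECODE_STABLE": "1"}))   # explicit opt-in
print(ttft_ratio(60.0, 24.0))                                                  # TTFT figures at N=256
```

Passing the environment in as a mapping rather than reading `os.environ` directly keeps the rule easy to test in isolation.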

…-on (0043); FP4-MMA W4A4 (0034) + Marlin W4A16 (0035) MoE-GEMM scaffolds default-off 0042 fuses the pre-norm residual add into RMSNorm (+0.5% prefill, bit-exact). 0043 makes the full-step MoE decode CUDA graph default-on (+2-4% decode, bit-exact; removes ~18x per-step host kernel re-issue, A/B-confirmed). 0034 (native FP4-MMA W4A4) and 0035 (Marlin-style W4A16 grouped MoE GEMM) are correct + bit-exact but regress vs the int8 FP4-MMQ in-backend on GB10 (bf16 MMA is ~half the int8 rate); shipped default-off as validated mechanisms and recorded negatives per the parity methodology. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…-engine profile-validated) Lever map records the full prefill/decode gap decomposition vs vLLM, the ranked levers, and the rejected dead ends. GDN build plan is the per-product mma mapping + A-inverse + occupancy design. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

…r paged KV (patch 0044)

Land the tensor-core forms of the chunked gated-DeltaNet prefill scan (0031) as a single GDN_TC-selected build and ship the M5 variant (full TC form-T solve + state-update mma) default-ON when LLAMA_KV_PAGED is set. The dispatch defaults GDN_TC=5 and GDN_CHUNK_MIN=64 under paged KV (both env-overridable; OFF/INT_MAX when not paged, so stock/non-paged stays regression-free). GDN_CHUNK_MIN is the per-call engage threshold and stays > 1 so decode (1 tok/call) keeps the sequential recurrence; 64 was tuned from a {1,32,64,128,256} sweep (32/64/128 all win on prefill, 256 barely fires because the MoE-prefill per-call count is < 256, 1 collapses decode S_TG ~25%).

Measured GB10, q36-35b-a3b-nvfp4, LLAMA_KV_PAGED=1 LLAMA_MOE_FORCE_GRAPHS=1, llama-batched-bench -ngl 99 -fa on -ntg 4 -npl 32:

-npp 512 S_PP 2208.96 -> 2286.5 t/s (+3.5%, mean of 3 interleaved A/B)
-npp 2048 S_PP 2021.5 -> 2379.8 t/s (+17.7%)

Decode S_TG unchanged (~399 vs ~397 t/s, within noise).

Bit-exactness (per-path greedy md5, n=48 --temp 0 --seed 1, paged): default-on == M5-forced == canonical on the gate prompt - MoE 8cb0ce23, dense 5951a5b4. test-backend-ops GATED_DELTA_NET 94/94 vs CPU with M5 forced (incl. multi-chunk up to n_tokens=256). On a long MoE prompt the default (M5 fires at >=64 tokens) and the sequential path agree word-for-word until one benign greedy token-flip; dense is byte-identical. The chunked scan is a NEW per-path result (different FP reduction order), NMSE-validated benign. CUDA-only, gencode arch=compute_121a,code=sm_121a (GB10 / sm_121a). README sections 3 (0044 row, 0031 superseded note) and 5 (dev-notes verdict) updated. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>
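The dispatch defaults and the reported gains can be summarized in a short sketch. This is illustrative Python, not the CUDA dispatch in patch 0044: representing GDN_TC "OFF" as 0, and treating the mere presence of LLAMA_KV_PAGED as "set", are both assumptions, and the function names are hypothetical.

```python
from typing import Mapping, Tuple

INT_MAX = 2**31 - 1


def gdn_defaults(env: Mapping[str, str]) -> Tuple[int, int]:
    """Return (GDN_TC, GDN_CHUNK_MIN) with paged-aware defaults, both env-overridable.

    Assumptions: GDN_TC=0 stands for OFF; any LLAMA_KV_PAGED value counts as paged.
    """
    paged = "LLAMA_KV_PAGED" in env
    tc = int(env.get("GDN_TC", 5 if paged else 0))
    chunk_min = int(env.get("GDN_CHUNK_MIN", 64 if paged else INT_MAX))
    return tc, chunk_min


def use_chunked_scan(n_tokens: int, chunk_min: int) -> bool:
    """The chunked scan engages once a call carries at least chunk_min tokens."""
    return n_tokens >= chunk_min


def gain_pct(before: float, after: float) -> float:
    """Throughput gain from before to after, in percent."""
    return (after - before) / before * 100.0


tc, chunk_min = gdn_defaults({"LLAMA_KV_PAGED": "1"})
print(tc, chunk_min)
print(use_chunked_scan(1, chunk_min), use_chunked_scan(512, chunk_min))
print(round(gain_pct(2208.96, 2286.5), 1), round(gain_pct(2021.5, 2379.8), 1))
```

Keeping the threshold above 1 is what leaves single-token decode calls on the sequential recurrence while prefill-sized calls take the chunked path.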

…negatives (default-off) nsys cross-engine decomposition: the MoE prefill 64% gap vs vLLM is engine plumbing, not the kernel (GPU 97% busy, 443 vs 197 us/tok). Three buckets: per-expert W4A4 M-fragmentation (58%), GDN scan (24%), f32<->bf16 casts (15%). Offline-repack (0045) and verbatim vLLM-marlin port both trail FP4-MMQ via wrapper overhead, kept default-off as recorded negatives. Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

mudler added 30 commits June 19, 2026 08:26

docs(paged): series status 0001/0002 done+verified; honest parity note

c4b4f3a

Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

docs(paged): refine 0003 plan - used-cell gather, per-ubatch rebuild,…

48fbb93

… single-stream first Assisted-by: Claude:opus-4.8 [Claude Code] Signed-off-by: Ettore Di Giacinto <mudler@localai.io>

mudler added 6 commits June 27, 2026 12:58

mudler marked this pull request as ready for review June 27, 2026 22:19

mudler added 23 commits June 27, 2026 22:20

localai-bot wants to merge 194 commits into master from worktree-feat+paged-attention

localai-bot commented Jun 23, 2026


3 participants
