Skip to content

Support chunked prefill for text-only causal LLMs#687

Closed
ncylich wants to merge 1 commit into
long-ctx-attnfrom
text-llm-chunked-prefill
Closed

Support chunked prefill for text-only causal LLMs#687
ncylich wants to merge 1 commit into
long-ctx-attnfrom
text-llm-chunked-prefill

Conversation

@ncylich

@ncylich ncylich commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Summary

Enables chunked prefill for plain text-only causal LLMs (qwen3 family). These previously transpiled to only a monolithic decoder + decoder_step, so the engine took the DIRECT_DECODER_STEP route and prefilled token-by-token (~27 tok/s). This emits the chunked-prefill component pipeline (lm_encoder_step, lm_encoder_text_chunk, decoder_media_step, decoder_prefill_chunk) via text-only embed/chunk adapters mirroring the multimodal qwen3.5 ones (minus vision), so the existing chunked-prefill engine route engages.

Result

  • ~5x faster prefill: 27 -> 130-300 tok/s (qwen3-0.6b @16k).
  • Identical generation output (verified: chat + NIAH retrieval match the token-by-token path). Shipped chunk size 512 — exact; smaller chunks accumulate int8 re-quant drift across chunk boundaries.

Safety

  • The default text transpile (decoder + decoder_step) and multimodal families are unchanged — the chunk pipeline is only emitted when those components are requested; transpiler default tests pass.
  • No engine change (the chunked-prefill route already supported text input).

Related

Existing chunked-prefill branches (v2-chunked-prefill, mtp-component-chunked-prefill) may overlap — this is a focused text-qwen3 addition; worth reconciling with that work.

Plain text causal LLM families (e.g. qwen3) previously transpiled to only
a monolithic decoder + decoder_step, so the engine took the
DIRECT_DECODER_STEP route and prefilled token-by-token (~27 tok/s). Emit
the chunked-prefill component pipeline (lm_encoder_step,
lm_encoder_text_chunk, decoder_media_step, decoder_prefill_chunk) for the
qwen3 family via text-only embed/chunk adapters that mirror the
multimodal qwen3.5 ones (minus vision), so the existing chunked-prefill
engine route engages. ~5x faster prefill (27 -> 130-300 tok/s) with
identical generation output. The default transpile (decoder +
decoder_step) and multimodal families are unchanged.

Signed-off-by: Noah Cylich <noah@cactuscompute.com>
ncylich added a commit that referenced this pull request Jun 4, 2026
Brings the text-only chunked-prefill adapters (Qwen3LMEncoderStep/TextChunk,
Qwen3EmbedsCausalLMPrefillChunk) into this branch so text qwen3 bundles can
transpile the chunked-prefill component pipeline (~5x faster prefill, identical
output), combined with this branch's RoPE-precompute + KV-compaction work.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
@ncylich ncylich closed this Jun 4, 2026
ncylich added a commit that referenced this pull request Jun 4, 2026
Builds on the merged #687 text chunked-prefill adapters: make the chunked
component pipeline (lm_encoder_step, lm_encoder_text_chunk, decoder_prefill_chunk,
decoder_media_step, decoder_step) the DEFAULT for text-only Qwen3ForCausalLM
conversions, instead of the monolithic decoder/decoder_step. Gated on the
Qwen3ForCausalLM architecture (the chunk adapters are qwen3-only); other
families and multimodal paths are unchanged. ~5x faster prefill, identical
output, and avoids the stale-bundle FP16-legalization crash on re-convert.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
ncylich added a commit that referenced this pull request Jun 4, 2026
Brings the text-only chunked-prefill adapters (Qwen3LMEncoderStep/TextChunk,
Qwen3EmbedsCausalLMPrefillChunk) into this branch so text qwen3 bundles can
transpile the chunked-prefill component pipeline (~5x faster prefill, identical
output), combined with this branch's RoPE-precompute + KV-compaction work.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
ncylich added a commit that referenced this pull request Jun 4, 2026
Builds on the merged #687 text chunked-prefill adapters: make the chunked
component pipeline (lm_encoder_step, lm_encoder_text_chunk, decoder_prefill_chunk,
decoder_media_step, decoder_step) the DEFAULT for text-only Qwen3ForCausalLM
conversions, instead of the monolithic decoder/decoder_step. Gated on the
Qwen3ForCausalLM architecture (the chunk adapters are qwen3-only); other
families and multimodal paths are unchanged. ~5x faster prefill, identical
output, and avoids the stale-bundle FP16-legalization crash on re-convert.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
jakmro added a commit that referenced this pull request Jun 8, 2026
…686)

* Support chunked prefill for text-only causal LLMs

Plain text causal LLM families (e.g. qwen3) previously transpiled to only
a monolithic decoder + decoder_step, so the engine took the
DIRECT_DECODER_STEP route and prefilled token-by-token (~27 tok/s). Emit
the chunked-prefill component pipeline (lm_encoder_step,
lm_encoder_text_chunk, decoder_media_step, decoder_prefill_chunk) for the
qwen3 family via text-only embed/chunk adapters that mirror the
multimodal qwen3.5 ones (minus vision), so the existing chunked-prefill
engine route engages. ~5x faster prefill (27 -> 130-300 tok/s) with
identical generation output. The default transpile (decoder +
decoder_step) and multimodal families are unchanged.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Add KeyDiff query-agnostic KV-cache compression

Compute-once-at-prefill KV compression: per (layer, kv-head), keep a
budget of tokens = attention sink + recent window + the most
distinctive middle tokens by the KeyDiff key-geometry score
(s_i = -cos(k_i, mean key)), then physically compact the survivors and
renumber their RoPE positions to a contiguous window. Two modes:
one-shot (compact once at end of prefill) and rolling bounded cache
(compact to target_len every time the cache reaches trigger_len).

Default OFF (exact no-op); enabled via config or the CACTUS_KV_COMPRESS /
CACTUS_KV_COMPRESS_ROLL env overrides. Supported on all-global-layer
models (Qwen3); subset compression on mixed/KV-shared architectures is
rejected with a warning (it would need per-layer positions). FP16 and
INT8 caches handled. 22 unit/integration tests cross-check the keep-set
math bit-for-bit against the reference and verify
compaction/renumber/dense-check/rolling.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Precompute RoPE cos/sin tables to fix long-context precision

The transpiled graph computed the RoPE angle (position * inv_freq) and ran it
through fp16, which cannot represent absolute positions > 2048 exactly. Past
~6k context the angle error reaches several radians and randomises cos/sin,
corrupting generation (verified against the MLX reference, which stays correct
through 16k).

Add an optimize_graph pass that precomputes cos/sin offline in fp64,
materialises them as fp16 constant tables, and gathers them by position id,
keeping the precision-critical angle off the runtime fp16 path with no new
kernels. Gated to the text decoder so it does not touch vision/audio encoder
rope. Handles models with multiple rope thetas (e.g. Gemma sliding vs global
layers) by resolving each cos/sin node's own inv_freq.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Fix chunked-prefill sliding-window KV cache corruption

run_chunked_prefill pads the final partial chunk up to chunk_size to fit the
fixed-size chunk graph. Global-attention layers trim the padding tokens from
the KV cache, but sliding-window layers do not: the padded chunk evicts real
recent tokens and shifts the entire sliding window by the padding count,
corrupting deep-context attention (e.g. Gemma needle retrieval past ~2k).

Disable tail padding when the prefill graph has a sliding-window KV cache and
process the tail token-by-token via run_step, which already produces an exact
cache. Detect via a new get_node_window_size accessor reading window_size off
the KV_CACHE_STATE nodes. Models without sliding-window caches (e.g. Qwen) are
unaffected.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Make rolling KV compaction the default for causal LLMs

KeyDiff compaction now defaults on (trigger 4096 -> target 2048) for causal-LM
generation, bounding the KV cache without an env var. CACTUS_KV_COMPRESS_AT
(trigger) and CACTUS_KV_COMPRESS_TO (target) override the defaults independently;
CACTUS_KV_COMPRESS_AT=0 disables. The one-shot budget-fraction mode (never shipped)
is removed. STT models are unaffected -- they transcribe via a separate API path
that never reaches the compaction hook; sliding-window models cleanly no-op via
the existing all-layers guard.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Compact global + re-rope sliding-window layers at compaction (engine-only)

Make rolling KV compaction work for Gemma-style hybrid models (interleaved
sliding-window + global layers) entirely in the engine, on a single position
frame, with no graph/transpiler change (works on existing bundles).

At each compaction, compress_kv_cache_keydiff now runs two passes over the
decoder cache states:
  - Pass 1 (global/full-attention layers): KeyDiff compact + renumber to
    0..B-1, as before.
  - Pass 2 (sliding-window layers): rotate the recent K rows [sink_size,
    current_seq_len) in place by -Δ (Δ = old frontier - B) using the layer's
    LOCAL rope theta, so the window tracks the renumbered global frontier.
    Sink rows stay fixed; V is never rotated. current_seq_len is untouched
    (graph eviction already bounds it).
RoPE is relative, so a uniform shift preserves all query·key offsets; a single
shared position_ids = B then serves global (renumbered) and sliding (shifted)
caches alike. The old all-global guard is replaced by global-subset compaction.

New pure free functions in kv_compress: rerope_recent_fp16/int8 and the int8
helper rotate_int8_row; compact_int8's dequant->rotate->requant is factored
into a shared requant_row (no behavior change). Tests T1-T6 cover the fp16/int8
re-rope, the int8 refactor (bitwise), the combined sliding+global rolling
invariant over 3 cycles, the local-theta requirement, and the no-op guards.

Verified: 17/17 free-function tests; model_loading/llm/vlm green; live Gemma
gemma-4-e2b-it stays coherent across ~8 in-decode compaction cycles (no
re-transpile); Qwen all-global path unchanged.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Default chunked prefill for text qwen3 conversions

Builds on the merged #687 text chunked-prefill adapters: make the chunked
component pipeline (lm_encoder_step, lm_encoder_text_chunk, decoder_prefill_chunk,
decoder_media_step, decoder_step) the DEFAULT for text-only Qwen3ForCausalLM
conversions, instead of the monolithic decoder/decoder_step. Gated on the
Qwen3ForCausalLM architecture (the chunk adapters are qwen3-only); other
families and multimodal paths are unchanged. ~5x faster prefill, identical
output, and avoids the stale-bundle FP16-legalization crash on re-convert.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Skip non-attention caches in KV compaction (fix LFM2/hybrid corruption)

compress_kv_cache_keydiff iterated all of decoder_->cache_states and classified
layers as compressible attention purely by the "sliding" substring in
layer_types. Hybrid models (LFM2: conv + gated-deltanet) bind their conv and
recurrent cache states into the SAME list, so those were KeyDiff-compacted as if
they were attention KV -> state corruption and out-of-bounds heap r/w (the
kv_heads/head_dim==0 guard is bypassed because conv metadata lands nonzero in
those header slots). Compaction is default-on, so this corrupted any hybrid model
past the trigger.

Fix: in both compaction passes, skip any cache state whose key node op type is
not KV_CACHE_STATE (the same op-type discrimination the engine already uses for
recurrent/conv caches in run_chunked_prefill). Conv and gated-deltanet caches are
now left untouched; pure-attention and sliding+global models are unaffected.

Verified: LFM2 (lfm2.5-vl-1.6b) now generates coherently with compaction firing
(was <|pad|> garbage at the trigger); Qwen all-global and Gemma sliding+global
unchanged; kv unit tests 17/17.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Re-rope KV-shared global source layers with global theta

Compaction pass 2 re-roped every non-compressible layer's recent keys with
the local (sliding) theta, assuming the complement of the compressible set is
all sliding-window layers. That is false for KV-sharing hybrids (Gemma): a
global layer that is a KV-shared source is excluded from compaction yet is not
sliding, so it was re-roped with the wrong theta -- desyncing the cache that
the shared global layers attend to.

Pick the re-rope theta from each layer's own type via is_sliding_layer (shared
with physical_compressible_layers' is_full check) instead of from
compressibility: sliding -> local theta, full-attention -> global theta.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Disable kv_compress flag on invalid rolling config

validate_kv_compress zeroed trigger/target on an invalid config but left
kv_compress=true, inconsistent with the explicit-disable path in
parse_kv_compress_override. Set kv_compress=false here too so both disable
paths leave the same state.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Remove NIAH KV-compress test fixtures and consumer test

The MLX-fixture consumer test (test_kv_compress.cpp) and its
generated fixtures were research artifacts. The self-contained
free-functions test covers the KV-compress math without them.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Trim excessive comments across KV-compress and rope-table changes

Cut comments that restate self-evident code; keep only non-obvious
WHY (banker's rounding edge case, per-layer re-rope theta source-layer
subtlety, fp16 rope-angle precompute rationale).

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Replace standalone rope-table test with one integrated optimize-pass test

The standalone suite depended on gitignored transpiled weight bundles
(skipped on CI). Replace it with a self-contained synthetic-graph test
in test_optimize_gemma4_attention.py asserting precompute_rope_tables
rewrites the runtime cos angle to an fp16 embedding-table lookup keyed
by the position input, with positions past 2048 representable.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Drop redundant keydiff_score formula comment

The s_i formula is already on the header declaration; keep only the
double-accumulation rationale.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Rename test_kv_compress_free_functions.cpp to test_kv_compress.cpp

The old test_kv_compress.cpp (the MLX-fixture consumer) was removed, freeing
the canonical name for the self-contained unit tests.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Shorten KV-compress comments

Trim multi-line comments to two focused lines and drop low-value restatements.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Hoist RoPE rotation table out of compaction row loops

rope_rotate_row recomputed pow(theta, -2i/d) and cos/sin for every row.
Factor a RopeRot (inv_freq via pow once, then cos/sin) so compaction and
re-rope build it once: compact/un-rope reuse one inv_freq across rows, and
the uniform-delta re-rope paths build the full cos/sin table a single time.
Public API and per-row math are unchanged (bit-identical; 17/17 tests green).

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Shorten RopeRot comment

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Share one un-rope table across a compaction's keep-set scoring

Keep-set scoring un-ropes every key row, but the un-rope rotation depends only
on (position, head_dim, theta) -- identical across all compressible layers
(all global, same theta) and all kv-heads. It was rebuilt per (layer, head).

Build the per-position cos/sin table once (unrope_table) and thread it through
keepsets_from_fp16/int8; compress_kv_cache_keydiff builds it on the first
compressible layer and reuses it. Keep-sets are bit-identical (17/17 tests).
A full 28-layer 4096->2048 compaction drops from ~650 ms to ~335 ms.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Vectorize int8 dequant in KV compaction

Extract the int8 dequant (int8 * per-group scale) that was duplicated across
compact_int8, rotate_int8_row, and the int8 keep-set fill into one dequant_row
helper, and add a float32x4 NEON path (widen int8 -> float, multiply by the
group scale). Bit-identical to scalar; ~1.5x on int8 keep-set scoring.

A g_use_simd toggle (kv_set_simd) selects the scalar fallback so the new
dequant_simd_matches_scalar test can compare both on a NEON build; non-ARM
builds compile only the scalar path. fp16 rope/keydiff were left scalar --
clang -O3 already auto-vectorizes those double loops.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Renumber compaction via table composition (drop per-survivor trig)

Compaction's renumber rotated each survivor by (rank - abs_pos), computing
cos/sin per survivor -- the dominant cost of compact. Since re-rope by +rank
is the conjugate of the shared un-rope table (same cos, negated sin), renumber
becomes: un-rope by -abs_pos (table[abs_pos]) then re-rope by +rank
(conjugate of table[rank]) -- two table lookups, no per-row trig. compact_fp16
and compact_int8 take the shared un-rope table model.cpp already builds; the
theta overloads build it from the kept indices for callers/tests.

Two rotations vs one differ by ~2 ULP double, well below fp16/int8 resolution,
so the stored cache is unchanged. A 28-layer 4096->2048 fp16 compaction drops
from ~153 ms to ~32 ms (full compaction ~327 -> ~203 ms). 18/18 tests pass.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Skip thinking-token cache strip on a compacted cache

remove_thinking_tokens uses absolute token positions as cache row indices, but
a rolling compaction renumbers rows to 0..B-1 and shrinks the cache. After a
compaction the absolute ranges no longer map, so the strip decremented
cache_total_seq_len_ while skipping the per-row memmove, desyncing the cache.

Track a cache_renumbered_ flag (set in compress_kv_cache_keydiff, cleared in
reset_cache) and early-return from remove_thinking_tokens when set; the caller
still erases processed_tokens, and the bounded thinking rows linger until the
next compaction evicts them. Only affects GEMMA4 + thinking after a compaction.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Preserve special tokens across KV-cache compaction

KeyDiff rolling compaction could evict structural special tokens (BOS,
end-of-turn, think delimiters) when they were not recent or distinctive,
degrading multi-turn conversations.

Force-keep them: a per-cache-row token map (cache_token_ids_) tracks the
text decode path and is gathered through each compaction with the
first compressed layer's head-0 keep-set (canonical view -- exact in
head 0, best-effort elsewhere). compress_kv_cache_keydiff unions the
positions whose token is special into Params::protect, which the
keep-set reserves just after the sink. Special ids come from the new
Tokenizer::special_token_ids() (captures turn/think tokens the config
parser misses). Guarded by cache_token_ids_ sync + no media; gated by
kv_compress_preserve_special (default on) / CACTUS_KV_PRESERVE_SPECIAL.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Tighten PR comments to non-obvious behavior only

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Grow the KV cache on demand and move it across the prefill handoff

Transpiled decoders bake max_cache_seq_len (e.g. 262144 for qwen3-vl)
and the runtime allocated the full buffer up front -- ~32 GB for a tiny
prompt, almost all of it zeroed and untouched.

Make the KV_CACHE_STATE buffer growable: it starts at 256 entries and
doubles (relocating the int8 scales region, preserving stored rows) up
to the baked ceiling, so RAM tracks actual occupancy. Sliding-window
layers keep a fixed window-sized buffer; only global layers grow.

The prefill->decode handoff now transfers buffer ownership (a cross-graph
std::move via steal_cache_buffer) instead of copying into a second
allocation, so both caches are never resident at once -- halving peak.
The padded-prefill path truncates the moved cache and re-runs the last
token instead of re-copying. copy_cache_states/ensure_cache_capacity are
removed as dead.

For qwen3-vl this takes a short prompt from ~32 GB to ~150 MB with
byte-identical output, and a 5k-token needle prompt peaks ~1.1 GB.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Use a media-specific default auto-prompt in the chat test tool

When an image/audio is attached without --prompt, the tool auto-sent the
generic "Describe the attached input.", which some VLMs (e.g. qwen3-vl)
degenerate into a repetition loop on. Pick the prompt by attached media
("Describe this image." / "Transcribe or respond to this audio.").

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Support full-context cache transpilation

Add explicit cache context sizing for cached component transpilation so decoder graphs can reserve the model context length instead of deriving capacity from the capture prompt. Gemma4, Qwen, and LFM cached decode builders now share the same config-driven resolver while preserving their existing capture-derived fallback sizes when no model context field is available.

For Gemma4, sliding-attention layers keep the configured sliding-window cache size while full-attention layers receive the requested full cache length. Add a retranspile path that reuses existing converted weights, wire the cache context option through the CLI and transpile entrypoint, and make lowering honor per-attention cache length metadata.

Add token-file input support for exact-token live generation and a context-scaling benchmark tool that records CSV/JSONL summaries and plots. Cover the new CLI plumbing, retranspile behavior, context parsing, and per-layer cache metadata with focused tests.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Size Qwen chunked-prefill caches from the full context too

Full-context transpilation wired the cache-length resolver into the
Qwen decoder_step path but left the chunked-prefill block
(decoder_prefill_chunk / decoder_media_step) on the old
max(1024, prompt+512) formula. So a re-transpile lifted decoder_step's
ceiling but the prefill components stayed at 1024, and prompts past
1024 tokens still overflowed the baked RoPE table during prefill.

Resolve those components through _max_cache_seq_len as well.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Supply cloud-handoff args in the chunked-bundle-flags run test

cmd_run forwards --confidence-threshold/--cloud-timeout-ms/--no-cloud-handoff
(features on this branch), but the full-context test's args Namespace omitted
them, raising AttributeError after the two were merged. Add the attributes.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Size LFM2 decoder_step cache from the full context too

The LFM2 decoder_step path was the last component still deriving
max_cache_seq_len from the capture prompt (max(1024, prompt+512))
rather than the model context; its chunked sibling already used the
resolver. Route it through _max_cache_seq_len so LFM2 reserves the full
context like the other families.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Carry max_position_embeddings in common graph meta for rope precompute

Full-context rope-table precompute raises when a component has neither
max_cache_seq_len nor max_position_embeddings in its meta. Non-cached
components (e.g. LFM2's full-context decoder) set neither, so re-transpiling
multimodal bundles failed at precompute. Seed common graph meta with the
model's max_position_embeddings as the rope-table fallback.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Don't assert precision in the cache move handoff

steal_cache_buffer asserted the destination and source buffers share a
precision, but the destination is the step node's not-yet-executed baked
buffer (INT8) while the source is the runtime prefill buffer, which is
FP16 under CACTUS_KV_CACHE_FP16. The move replaces the destination buffer
wholesale, so its precision follows the source; only op_type is invariant.
This aborted every fp16-cache run at the handoff.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Drop comments the code already expresses

Keep only the non-obvious rationale (the fp16 precision invariant in the
cache move, the padded-prefill re-run, the rope-table meta fallback).

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Trim kv_compress.h comments to the load-bearing math

Drop the declaration comments that restate the signature and tighten the
rest to one line, keeping only what the code can't express: the KeyDiff
score, the POST-RoPE rotation convention, the K/V renumber semantics, and
the per-family compressible-layer selection.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Bake rope tables only into the lowered graph, not the saved IR

optimize_graph baked the precomputed rope tables into the IR it returns,
so the component pipeline's saved optimized_ir.json carried them too --
which the rope-table tests can't read back (the values live in
graph.cactus, not the JSON). Gate precompute behind a flag and run it on
the deepcopy the component pipeline lowers, leaving the saved optimized IR
un-baked. graph.cactus is still baked; runtime is unchanged. Other
transpile paths keep baking via the default.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Fix conv/recurrent cache handoff and guard MLA-dim compaction

The prefill->decode cache move iterated key then value node ids; conv and
recurrent caches serialize one node as both, so the second steal_cache_buffer
moved the already-emptied source over the destination and blanked the buffer.
Dedup the move so a shared node is transferred once.

Also preflight-skip KeyDiff compaction when any compressible layer's V head
dim differs from its K head dim (MLA-style): compaction strides both by one
head_dim and would otherwise corrupt the value rows.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Preserve special tokens per-head across KV-cache compaction

Special-token protect was built from the head-0-anchored cache_token_ids_ and
applied as global positions, so after the first rolling compaction (which keeps
per-head-divergent rows) non-head-0 heads protected the wrong rows and could
drop a mid-context special across cycles.

Track special rows per (compressible layer, head) in a SpecialRowTracker,
rebuilt lazily from the still-accurate appended region of cache_token_ids_ and
remapped through each head's keep set after compaction. Feed a per-head protect
into keepsets so every head reserves its own specials; KeyDiff scoring stays
per-head. Add a specials-fit preflight that skips the pass rather than letting
a special be truncated, and disable tracking once an untracked compaction (e.g.
media) diverges the heads.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Don't pass stale per-head protect when tracking is disabled

keepsets received special_rows_.protect(li) unconditionally, so a compaction
with per-head tracking off (media, preserve off, or post-invalidate) would still
override Params::protect with stale tracker rows. Pass empty protect unless
per_head_protect, clear the rows on invalidate(), and test the Params::protect
fallback.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Trim per-head-protect comments to one line each

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Drop per-head-protect comments that restate the code

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Disable KV compaction for Gemma4 thinking mode

Compaction renumbers the cache, which the Gemma4 thinking-token strip can't
follow, so previous-turn thoughts leak once a compaction has run. Until the
strip is compaction-aware, suppress compaction for thinking requests (set in
do_prefill, checked in maybe_roll_compact); non-thinking and other models are
unaffected. Gemma4 is the only model that strips prior thoughts.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Suppress compaction before the first-token fast path too

The fast prefill_and_sample_first_token path (text-only, empty cache) bypasses
do_prefill, where compaction_suppressed_ was set, so Gemma4 thinking requests on
that path still compacted. Set the flag before the fast path as well.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* cleaned comment

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Shrink the KV cache after compaction to reclaim long-prefill capacity

The growable cache only ever grew to peak occupancy and was never reclaimed, so
a long prefill (moved into decode) left the buffer over-allocated at prompt
length even after compaction dropped occupancy to target. After compacting each
layer, shrink its buffer to the smallest power of two >= trigger, which holds
the decode occupancy oscillation [target, trigger] without re-growing. Factor a
shared resize_cache_buffer used by grow and the new shrink_cache_buffer.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Trim shrink/suppression comments to the load-bearing ones

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Keep Gemma-4 thinking in the KV cache like LiteRT

Stop stripping generated thinking for Gemma-4 so it persists across turns
as ordinary context (Qwen already does this).

- Remove the in-cache strip path: strip_thinking_from_cache,
  Model::remove_thinking_tokens, find_channel_token_ranges, and the
  now-unread cache_renumbered_ member/writes.
- Remove the compaction-suppression stopgap (compaction_suppressed_,
  set_compaction_suppressed, set sites, and the maybe_roll_compact guard)
  so compaction runs normally for thinking context.
- format_gemma4_style keeps assistant content verbatim (drop strip_channel)
  so prior-turn thinking stays in the rendered prompt.
- Rename strip_thinking_block -> partition_thinking_response: it now only
  splits generated text into (thinking, content) for the API output.
- Emit a non-user-facing context_response (raw assistant text before
  partitioning) in the result JSON for conversation-history persistence.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Persist Gemma-4 thinking in chat surfaces; rewrite thinking test

- chat.cpp and server.py store context_response (raw assistant text with
  thinking) as conversation history while still displaying the clean
  response, so thinking survives the next turn's re-render.
- Rewrite test_gemma4_thinking for the new contract: the formatter retains
  assistant channel/thinking content, the API response stays clean, and
  re-rendered multi-turn history prefix-matches the cache.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Fix review: clean OpenAI response content; drop stale v1 thinking test

server.py returned the raw context_response (channel tags) as public message
content; return the clean response (context_response is only for stateful chat
history). Delete the superseded v1 tests/test_gemma4_thinking.cpp, which still
asserted the removed strip contract; the active test lives in cactus-engine/tests.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Remove stale iOS harness reference to deleted thinking test

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Remove channel-token config orphaned by the thinking-strip deletion

find_channel_token_ranges was the only reader of channel_open/close_token_id;
drop the now-unused config fields and their parse. Revert the one-line
server.py refactor to the original inline form.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Reproduce Qwen thinking opener in history so KV cache reuses across turns

Qwen's chat template injects a thinking opener (<think>\n, or
<think>\n\n</think>\n\n when thinking is disabled) only in the
generation-prompt branch. The model continues from that opener, so the
opener tokens are stored in the KV cache as part of the assistant turn.
Re-rendering that turn as history did not reproduce the opener, so the
next turn's prompt diverged from the cached tokens at the assistant
boundary, prompt_context_matches failed, and every turn did a full
re-prefill. Emit the same opener per assistant history turn (selected
from whether the content already closed </think>) so the re-templated
history byte-matches the cache.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Add cache_context_length parameter to component specs functions

* Shorten keepsets_from_fp16 comment to one line

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Remove gemma4 context-scaling benchmark script

Standalone manual benchmark harness, not part of the test suite or
referenced anywhere; out of scope for the PR.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

* Consolidate gemma4 thinking tests into test_llm

Move the four gemma4 thinking tests into test_llm.cpp (next to the
prefill cache-reuse tests) and delete the standalone suite. The three
model-gated tests run only when the chosen model is Gemma4 and warn-skip
otherwise; partition_thinking_response is a pure unit test that always
runs. Share a load_gemma4_or_skip helper instead of repeating the gate.

Signed-off-by: Noah Cylich <noahcylich@gmail.com>

---------

Signed-off-by: Noah Cylich <noahcylich@gmail.com>
Co-authored-by: jakmro <kubamroz124@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant