Support chunked prefill for text-only causal LLMs #687
Closed
ncylich wants to merge 1 commit into
Plain text causal LLM families (e.g. qwen3) previously transpiled to only a monolithic decoder + decoder_step, so the engine took the DIRECT_DECODER_STEP route and prefilled token-by-token (~27 tok/s). Emit the chunked-prefill component pipeline (lm_encoder_step, lm_encoder_text_chunk, decoder_media_step, decoder_prefill_chunk) for the qwen3 family via text-only embed/chunk adapters that mirror the multimodal qwen3.5 ones (minus vision), so the existing chunked-prefill engine route engages. ~5x faster prefill (27 -> 130-300 tok/s) with identical generation output. The default transpile (decoder + decoder_step) and multimodal families are unchanged. Signed-off-by: Noah Cylich <noah@cactuscompute.com>
ncylich added a commit that referenced this pull request on Jun 4, 2026
Brings the text-only chunked-prefill adapters (Qwen3LMEncoderStep/TextChunk, Qwen3EmbedsCausalLMPrefillChunk) into this branch so text qwen3 bundles can transpile the chunked-prefill component pipeline (~5x faster prefill, identical output), combined with this branch's RoPE-precompute + KV-compaction work. Signed-off-by: Noah Cylich <noahcylich@gmail.com>
ncylich added a commit that referenced this pull request on Jun 4, 2026
Builds on the merged #687 text chunked-prefill adapters: make the chunked component pipeline (lm_encoder_step, lm_encoder_text_chunk, decoder_prefill_chunk, decoder_media_step, decoder_step) the DEFAULT for text-only Qwen3ForCausalLM conversions, instead of the monolithic decoder/decoder_step. Gated on the Qwen3ForCausalLM architecture (the chunk adapters are qwen3-only); other families and multimodal paths are unchanged. ~5x faster prefill, identical output, and avoids the stale-bundle FP16-legalization crash on re-convert. Signed-off-by: Noah Cylich <noahcylich@gmail.com>
jakmro added a commit that referenced this pull request on Jun 8, 2026
…686) * Support chunked prefill for text-only causal LLMs Plain text causal LLM families (e.g. qwen3) previously transpiled to only a monolithic decoder + decoder_step, so the engine took the DIRECT_DECODER_STEP route and prefilled token-by-token (~27 tok/s). Emit the chunked-prefill component pipeline (lm_encoder_step, lm_encoder_text_chunk, decoder_media_step, decoder_prefill_chunk) for the qwen3 family via text-only embed/chunk adapters that mirror the multimodal qwen3.5 ones (minus vision), so the existing chunked-prefill engine route engages. ~5x faster prefill (27 -> 130-300 tok/s) with identical generation output. The default transpile (decoder + decoder_step) and multimodal families are unchanged. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Add KeyDiff query-agnostic KV-cache compression Compute-once-at-prefill KV compression: per (layer, kv-head), keep a budget of tokens = attention sink + recent window + the most distinctive middle tokens by the KeyDiff key-geometry score (s_i = -cos(k_i, mean key)), then physically compact the survivors and renumber their RoPE positions to a contiguous window. Two modes: one-shot (compact once at end of prefill) and rolling bounded cache (compact to target_len every time the cache reaches trigger_len). Default OFF (exact no-op); enabled via config or the CACTUS_KV_COMPRESS / CACTUS_KV_COMPRESS_ROLL env overrides. Supported on all-global-layer models (Qwen3); subset compression on mixed/KV-shared architectures is rejected with a warning (it would need per-layer positions). FP16 and INT8 caches handled. 22 unit/integration tests cross-check the keep-set math bit-for-bit against the reference and verify compaction/renumber/dense-check/rolling. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Precompute RoPE cos/sin tables to fix long-context precision The transpiled graph computed the RoPE angle (position * inv_freq) and ran it through fp16, which cannot represent absolute positions > 2048 exactly. 
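The fp16 limit mentioned above can be checked directly. The sketch below is only an illustration, not code from this PR: it round-trips integer positions through IEEE 754 half precision using Python's `struct` module. The helper names `to_fp16` and `fp16_spacing` are hypothetical. For the highest-frequency RoPE pair, where inv_freq is 1, the angle equals the position, so the position rounding error is also the angle error in radians.

```python
import math
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_spacing(x: float) -> float:
    """Gap between adjacent fp16 values near x (normal range only)."""
    exponent = math.floor(math.log2(abs(x)))
    # fp16 stores 10 explicit mantissa bits, so the step is 2^(exponent - 10).
    return 2.0 ** (exponent - 10)


if __name__ == "__main__":
    for pos in (2047, 2048, 2049, 4097, 6001, 16001):
        stored = to_fp16(float(pos))
        print(pos, stored, "spacing:", fp16_spacing(float(pos)))
```

For example, `to_fp16(2049.0)` returns `2048.0`, and around position 6000 adjacent fp16 values are 4 apart, so a runtime angle can be off by up to 2 radians before any trigonometry runs. Precomputing the tables offline in higher precision avoids this because only the final cos/sin values, which lie in [-1, 1], are stored as fp16.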
Past ~6k context the angle error reaches several radians and randomises cos/sin, corrupting generation (verified against the MLX reference, which stays correct through 16k). Add an optimize_graph pass that precomputes cos/sin offline in fp64, materialises them as fp16 constant tables, and gathers them by position id, keeping the precision-critical angle off the runtime fp16 path with no new kernels. Gated to the text decoder so it does not touch vision/audio encoder rope. Handles models with multiple rope thetas (e.g. Gemma sliding vs global layers) by resolving each cos/sin node's own inv_freq. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Fix chunked-prefill sliding-window KV cache corruption run_chunked_prefill pads the final partial chunk up to chunk_size to fit the fixed-size chunk graph. Global-attention layers trim the padding tokens from the KV cache, but sliding-window layers do not: the padded chunk evicts real recent tokens and shifts the entire sliding window by the padding count, corrupting deep-context attention (e.g. Gemma needle retrieval past ~2k). Disable tail padding when the prefill graph has a sliding-window KV cache and process the tail token-by-token via run_step, which already produces an exact cache. Detect via a new get_node_window_size accessor reading window_size off the KV_CACHE_STATE nodes. Models without sliding-window caches (e.g. Qwen) are unaffected. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Make rolling KV compaction the default for causal LLMs KeyDiff compaction now defaults on (trigger 4096 -> target 2048) for causal-LM generation, bounding the KV cache without an env var. CACTUS_KV_COMPRESS_AT (trigger) and CACTUS_KV_COMPRESS_TO (target) override the defaults independently; CACTUS_KV_COMPRESS_AT=0 disables. The one-shot budget-fraction mode (never shipped) is removed. 
STT models are unaffected -- they transcribe via a separate API path that never reaches the compaction hook; sliding-window models cleanly no-op via the existing all-layers guard. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Compact global + re-rope sliding-window layers at compaction (engine-only) Make rolling KV compaction work for Gemma-style hybrid models (interleaved sliding-window + global layers) entirely in the engine, on a single position frame, with no graph/transpiler change (works on existing bundles). At each compaction, compress_kv_cache_keydiff now runs two passes over the decoder cache states: - Pass 1 (global/full-attention layers): KeyDiff compact + renumber to 0..B-1, as before. - Pass 2 (sliding-window layers): rotate the recent K rows [sink_size, current_seq_len) in place by -Δ (Δ = old frontier - B) using the layer's LOCAL rope theta, so the window tracks the renumbered global frontier. Sink rows stay fixed; V is never rotated. current_seq_len is untouched (graph eviction already bounds it). RoPE is relative, so a uniform shift preserves all query·key offsets; a single shared position_ids = B then serves global (renumbered) and sliding (shifted) caches alike. The old all-global guard is replaced by global-subset compaction. New pure free functions in kv_compress: rerope_recent_fp16/int8 and the int8 helper rotate_int8_row; compact_int8's dequant->rotate->requant is factored into a shared requant_row (no behavior change). Tests T1-T6 cover the fp16/int8 re-rope, the int8 refactor (bitwise), the combined sliding+global rolling invariant over 3 cycles, the local-theta requirement, and the no-op guards. Verified: 17/17 free-function tests; model_loading/llm/vlm green; live Gemma gemma-4-e2b-it stays coherent across ~8 in-decode compaction cycles (no re-transpile); Qwen all-global path unchanged. 
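The claim that a uniform shift preserves all query·key offsets can be sanity-checked numerically. This is a minimal sketch, not the engine's `rerope_recent_fp16` code: `rope_rotate` is a hypothetical helper that pairs adjacent dimensions (2i, 2i+1) and uses theta = 10000, both of which are assumptions about the layout rather than facts from this PR.

```python
import math


def rope_rotate(vec, position, theta=10000.0):
    """Apply RoPE at `position`, pairing adjacent dims (2i, 2i+1) (assumed layout)."""
    d = len(vec)
    out = list(vec)
    for i in range(d // 2):
        angle = position * theta ** (-2.0 * i / d)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * c - y * s
        out[2 * i + 1] = x * s + y * c
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def shifted_scores_match(q, k, q_pos, k_pos, delta, tol=1e-9):
    """True if shifting both positions by -delta leaves the q.k score unchanged."""
    before = dot(rope_rotate(q, q_pos), rope_rotate(k, k_pos))
    after = dot(rope_rotate(q, q_pos - delta), rope_rotate(k, k_pos - delta))
    return abs(before - after) < tol


def rerope_in_place_matches(k, k_pos, delta, tol=1e-9):
    """Rotating an already-roped key by -delta equals roping it at k_pos - delta."""
    roped = rope_rotate(k, k_pos)
    shifted = rope_rotate(roped, -delta)
    direct = rope_rotate(k, k_pos - delta)
    return all(abs(a - b) < tol for a, b in zip(shifted, direct))
```

The second helper mirrors the in-place rotation by -Δ described for pass 2: the stored key never needs to be un-roped first, because rotations in each 2-D pair compose additively.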
Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Default chunked prefill for text qwen3 conversions Builds on the merged #687 text chunked-prefill adapters: make the chunked component pipeline (lm_encoder_step, lm_encoder_text_chunk, decoder_prefill_chunk, decoder_media_step, decoder_step) the DEFAULT for text-only Qwen3ForCausalLM conversions, instead of the monolithic decoder/decoder_step. Gated on the Qwen3ForCausalLM architecture (the chunk adapters are qwen3-only); other families and multimodal paths are unchanged. ~5x faster prefill, identical output, and avoids the stale-bundle FP16-legalization crash on re-convert. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Skip non-attention caches in KV compaction (fix LFM2/hybrid corruption) compress_kv_cache_keydiff iterated all of decoder_->cache_states and classified layers as compressible attention purely by the "sliding" substring in layer_types. Hybrid models (LFM2: conv + gated-deltanet) bind their conv and recurrent cache states into the SAME list, so those were KeyDiff-compacted as if they were attention KV -> state corruption and out-of-bounds heap r/w (the kv_heads/head_dim==0 guard is bypassed because conv metadata lands nonzero in those header slots). Compaction is default-on, so this corrupted any hybrid model past the trigger. Fix: in both compaction passes, skip any cache state whose key node op type is not KV_CACHE_STATE (the same op-type discrimination the engine already uses for recurrent/conv caches in run_chunked_prefill). Conv and gated-deltanet caches are now left untouched; pure-attention and sliding+global models are unaffected. Verified: LFM2 (lfm2.5-vl-1.6b) now generates coherently with compaction firing (was <|pad|> garbage at the trigger); Qwen all-global and Gemma sliding+global unchanged; kv unit tests 17/17. 
Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Re-rope KV-shared global source layers with global theta Compaction pass 2 re-roped every non-compressible layer's recent keys with the local (sliding) theta, assuming the complement of the compressible set is all sliding-window layers. That is false for KV-sharing hybrids (Gemma): a global layer that is a KV-shared source is excluded from compaction yet is not sliding, so it was re-roped with the wrong theta -- desyncing the cache that the shared global layers attend to. Pick the re-rope theta from each layer's own type via is_sliding_layer (shared with physical_compressible_layers' is_full check) instead of from compressibility: sliding -> local theta, full-attention -> global theta. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Disable kv_compress flag on invalid rolling config validate_kv_compress zeroed trigger/target on an invalid config but left kv_compress=true, inconsistent with the explicit-disable path in parse_kv_compress_override. Set kv_compress=false here too so both disable paths leave the same state. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Remove NIAH KV-compress test fixtures and consumer test The MLX-fixture consumer test (test_kv_compress.cpp) and its generated fixtures were research artifacts. The self-contained free-functions test covers the KV-compress math without them. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Trim excessive comments across KV-compress and rope-table changes Cut comments that restate self-evident code; keep only non-obvious WHY (banker's rounding edge case, per-layer re-rope theta source-layer subtlety, fp16 rope-angle precompute rationale). Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Replace standalone rope-table test with one integrated optimize-pass test The standalone suite depended on gitignored transpiled weight bundles (skipped on CI). 
Replace it with a self-contained synthetic-graph test in test_optimize_gemma4_attention.py asserting precompute_rope_tables rewrites the runtime cos angle to an fp16 embedding-table lookup keyed by the position input, with positions past 2048 representable. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Drop redundant keydiff_score formula comment The s_i formula is already on the header declaration; keep only the double-accumulation rationale. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Rename test_kv_compress_free_functions.cpp to test_kv_compress.cpp The old test_kv_compress.cpp (the MLX-fixture consumer) was removed, freeing the canonical name for the self-contained unit tests. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Shorten KV-compress comments Trim multi-line comments to two focused lines and drop low-value restatements. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Hoist RoPE rotation table out of compaction row loops rope_rotate_row recomputed pow(theta, -2i/d) and cos/sin for every row. Factor a RopeRot (inv_freq via pow once, then cos/sin) so compaction and re-rope build it once: compact/un-rope reuse one inv_freq across rows, and the uniform-delta re-rope paths build the full cos/sin table a single time. Public API and per-row math are unchanged (bit-identical; 17/17 tests green). Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Shorten RopeRot comment Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Share one un-rope table across a compaction's keep-set scoring Keep-set scoring un-ropes every key row, but the un-rope rotation depends only on (position, head_dim, theta) -- identical across all compressible layers (all global, same theta) and all kv-heads. It was rebuilt per (layer, head). Build the per-position cos/sin table once (unrope_table) and thread it through keepsets_from_fp16/int8; compress_kv_cache_keydiff builds it on the first compressible layer and reuses it. Keep-sets are bit-identical (17/17 tests). 
A full 28-layer 4096->2048 compaction drops from ~650 ms to ~335 ms. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Vectorize int8 dequant in KV compaction Extract the int8 dequant (int8 * per-group scale) that was duplicated across compact_int8, rotate_int8_row, and the int8 keep-set fill into one dequant_row helper, and add a float32x4 NEON path (widen int8 -> float, multiply by the group scale). Bit-identical to scalar; ~1.5x on int8 keep-set scoring. A g_use_simd toggle (kv_set_simd) selects the scalar fallback so the new dequant_simd_matches_scalar test can compare both on a NEON build; non-ARM builds compile only the scalar path. fp16 rope/keydiff were left scalar -- clang -O3 already auto-vectorizes those double loops. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Renumber compaction via table composition (drop per-survivor trig) Compaction's renumber rotated each survivor by (rank - abs_pos), computing cos/sin per survivor -- the dominant cost of compact. Since re-rope by +rank is the conjugate of the shared un-rope table (same cos, negated sin), renumber becomes: un-rope by -abs_pos (table[abs_pos]) then re-rope by +rank (conjugate of table[rank]) -- two table lookups, no per-row trig. compact_fp16 and compact_int8 take the shared un-rope table model.cpp already builds; the theta overloads build it from the kept indices for callers/tests. Two rotations vs one differ by ~2 ULP double, well below fp16/int8 resolution, so the stored cache is unchanged. A 28-layer 4096->2048 fp16 compaction drops from ~153 ms to ~32 ms (full compaction ~327 -> ~203 ms). 18/18 tests pass. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Skip thinking-token cache strip on a compacted cache remove_thinking_tokens uses absolute token positions as cache row indices, but a rolling compaction renumbers rows to 0..B-1 and shrinks the cache. 
After a compaction the absolute ranges no longer map, so the strip decremented cache_total_seq_len_ while skipping the per-row memmove, desyncing the cache. Track a cache_renumbered_ flag (set in compress_kv_cache_keydiff, cleared in reset_cache) and early-return from remove_thinking_tokens when set; the caller still erases processed_tokens, and the bounded thinking rows linger until the next compaction evicts them. Only affects GEMMA4 + thinking after a compaction. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Preserve special tokens across KV-cache compaction KeyDiff rolling compaction could evict structural special tokens (BOS, end-of-turn, think delimiters) when they were not recent or distinctive, degrading multi-turn conversations. Force-keep them: a per-cache-row token map (cache_token_ids_) tracks the text decode path and is gathered through each compaction with the first compressed layer's head-0 keep-set (canonical view -- exact in head 0, best-effort elsewhere). compress_kv_cache_keydiff unions the positions whose token is special into Params::protect, which the keep-set reserves just after the sink. Special ids come from the new Tokenizer::special_token_ids() (captures turn/think tokens the config parser misses). Guarded by cache_token_ids_ sync + no media; gated by kv_compress_preserve_special (default on) / CACTUS_KV_PRESERVE_SPECIAL. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Tighten PR comments to non-obvious behavior only Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Grow the KV cache on demand and move it across the prefill handoff Transpiled decoders bake max_cache_seq_len (e.g. 262144 for qwen3-vl) and the runtime allocated the full buffer up front -- ~32 GB for a tiny prompt, almost all of it zeroed and untouched. Make the KV_CACHE_STATE buffer growable: it starts at 256 entries and doubles (relocating the int8 scales region, preserving stored rows) up to the baked ceiling, so RAM tracks actual occupancy. 
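The growth policy described above (start at 256 entries, double, stop at the baked ceiling) can be sketched as a capacity calculation. This is an assumption-laden illustration: `grown_capacity` is a hypothetical name, and the real engine also relocates the int8 scales region and preserves stored rows, which this sketch does not model.

```python
def grown_capacity(current: int, needed: int, ceiling: int, initial: int = 256) -> int:
    """Return the buffer capacity (in cache entries) after growing to fit `needed`.

    Capacity starts at `initial`, doubles until it fits, and is clamped to
    `ceiling` (the max_cache_seq_len baked into the transpiled graph).
    """
    if needed > ceiling:
        raise ValueError("requested rows exceed the baked cache ceiling")
    capacity = max(current, initial)
    while capacity < needed:
        capacity = min(capacity * 2, ceiling)
    return capacity
```

With a ceiling of 262144, a 5000-row prompt lands on a capacity of 8192 instead of the full ceiling, which is the effect behind the memory drop reported for short prompts.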
Sliding-window layers keep a fixed window-sized buffer; only global layers grow. The prefill->decode handoff now transfers buffer ownership (a cross-graph std::move via steal_cache_buffer) instead of copying into a second allocation, so both caches are never resident at once -- halving peak. The padded-prefill path truncates the moved cache and re-runs the last token instead of re-copying. copy_cache_states/ensure_cache_capacity are removed as dead. For qwen3-vl this takes a short prompt from ~32 GB to ~150 MB with byte-identical output, and a 5k-token needle prompt peaks ~1.1 GB. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Use a media-specific default auto-prompt in the chat test tool When an image/audio is attached without --prompt, the tool auto-sent the generic "Describe the attached input.", which some VLMs (e.g. qwen3-vl) degenerate into a repetition loop on. Pick the prompt by attached media ("Describe this image." / "Transcribe or respond to this audio."). Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Support full-context cache transpilation Add explicit cache context sizing for cached component transpilation so decoder graphs can reserve the model context length instead of deriving capacity from the capture prompt. Gemma4, Qwen, and LFM cached decode builders now share the same config-driven resolver while preserving their existing capture-derived fallback sizes when no model context field is available. For Gemma4, sliding-attention layers keep the configured sliding-window cache size while full-attention layers receive the requested full cache length. Add a retranspile path that reuses existing converted weights, wire the cache context option through the CLI and transpile entrypoint, and make lowering honor per-attention cache length metadata. Add token-file input support for exact-token live generation and a context-scaling benchmark tool that records CSV/JSONL summaries and plots. 
Cover the new CLI plumbing, retranspile behavior, context parsing, and per-layer cache metadata with focused tests. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Size Qwen chunked-prefill caches from the full context too Full-context transpilation wired the cache-length resolver into the Qwen decoder_step path but left the chunked-prefill block (decoder_prefill_chunk / decoder_media_step) on the old max(1024, prompt+512) formula. So a re-transpile lifted decoder_step's ceiling but the prefill components stayed at 1024, and prompts past 1024 tokens still overflowed the baked RoPE table during prefill. Resolve those components through _max_cache_seq_len as well. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Supply cloud-handoff args in the chunked-bundle-flags run test cmd_run forwards --confidence-threshold/--cloud-timeout-ms/--no-cloud-handoff (features on this branch), but the full-context test's args Namespace omitted them, raising AttributeError after the two were merged. Add the attributes. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Size LFM2 decoder_step cache from the full context too The LFM2 decoder_step path was the last component still deriving max_cache_seq_len from the capture prompt (max(1024, prompt+512)) rather than the model context; its chunked sibling already used the resolver. Route it through _max_cache_seq_len so LFM2 reserves the full context like the other families. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Carry max_position_embeddings in common graph meta for rope precompute Full-context rope-table precompute raises when a component has neither max_cache_seq_len nor max_position_embeddings in its meta. Non-cached components (e.g. LFM2's full-context decoder) set neither, so re-transpiling multimodal bundles failed at precompute. Seed common graph meta with the model's max_position_embeddings as the rope-table fallback. 
Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Don't assert precision in the cache move handoff steal_cache_buffer asserted the destination and source buffers share a precision, but the destination is the step node's not-yet-executed baked buffer (INT8) while the source is the runtime prefill buffer, which is FP16 under CACTUS_KV_CACHE_FP16. The move replaces the destination buffer wholesale, so its precision follows the source; only op_type is invariant. This aborted every fp16-cache run at the handoff. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Drop comments the code already expresses Keep only the non-obvious rationale (the fp16 precision invariant in the cache move, the padded-prefill re-run, the rope-table meta fallback). Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Trim kv_compress.h comments to the load-bearing math Drop the declaration comments that restate the signature and tighten the rest to one line, keeping only what the code can't express: the KeyDiff score, the POST-RoPE rotation convention, the K/V renumber semantics, and the per-family compressible-layer selection. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Bake rope tables only into the lowered graph, not the saved IR optimize_graph baked the precomputed rope tables into the IR it returns, so the component pipeline's saved optimized_ir.json carried them too -- which the rope-table tests can't read back (the values live in graph.cactus, not the JSON). Gate precompute behind a flag and run it on the deepcopy the component pipeline lowers, leaving the saved optimized IR un-baked. graph.cactus is still baked; runtime is unchanged. Other transpile paths keep baking via the default. 
Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Fix conv/recurrent cache handoff and guard MLA-dim compaction The prefill->decode cache move iterated key then value node ids; conv and recurrent caches serialize one node as both, so the second steal_cache_buffer moved the already-emptied source over the destination and blanked the buffer. Dedup the move so a shared node is transferred once. Also preflight-skip KeyDiff compaction when any compressible layer's V head dim differs from its K head dim (MLA-style): compaction strides both by one head_dim and would otherwise corrupt the value rows. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Preserve special tokens per-head across KV-cache compaction Special-token protect was built from the head-0-anchored cache_token_ids_ and applied as global positions, so after the first rolling compaction (which keeps per-head-divergent rows) non-head-0 heads protected the wrong rows and could drop a mid-context special across cycles. Track special rows per (compressible layer, head) in a SpecialRowTracker, rebuilt lazily from the still-accurate appended region of cache_token_ids_ and remapped through each head's keep set after compaction. Feed a per-head protect into keepsets so every head reserves its own specials; KeyDiff scoring stays per-head. Add a specials-fit preflight that skips the pass rather than letting a special be truncated, and disable tracking once an untracked compaction (e.g. media) diverges the heads. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Don't pass stale per-head protect when tracking is disabled keepsets received special_rows_.protect(li) unconditionally, so a compaction with per-head tracking off (media, preserve off, or post-invalidate) would still override Params::protect with stale tracker rows. Pass empty protect unless per_head_protect, clear the rows on invalidate(), and test the Params::protect fallback. 
Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Trim per-head-protect comments to one line each Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Drop per-head-protect comments that restate the code Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Disable KV compaction for Gemma4 thinking mode Compaction renumbers the cache, which the Gemma4 thinking-token strip can't follow, so previous-turn thoughts leak once a compaction has run. Until the strip is compaction-aware, suppress compaction for thinking requests (set in do_prefill, checked in maybe_roll_compact); non-thinking and other models are unaffected. Gemma4 is the only model that strips prior thoughts. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Suppress compaction before the first-token fast path too The fast prefill_and_sample_first_token path (text-only, empty cache) bypasses do_prefill, where compaction_suppressed_ was set, so Gemma4 thinking requests on that path still compacted. Set the flag before the fast path as well. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * cleaned comment Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Shrink the KV cache after compaction to reclaim long-prefill capacity The growable cache only ever grew to peak occupancy and was never reclaimed, so a long prefill (moved into decode) left the buffer over-allocated at prompt length even after compaction dropped occupancy to target. After compacting each layer, shrink its buffer to the smallest power of two >= trigger, which holds the decode occupancy oscillation [target, trigger] without re-growing. Factor a shared resize_cache_buffer used by grow and the new shrink_cache_buffer. 
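The shrink target described above, the smallest power of two >= trigger, can be computed with integer bit tricks. A minimal sketch with hypothetical names follows; the rule that a buffer already below that size is left alone is an assumption, not something stated in the commit.

```python
def shrunk_capacity(trigger: int) -> int:
    """Smallest power of two that is >= trigger."""
    if trigger < 1:
        raise ValueError("trigger must be positive")
    return 1 << (trigger - 1).bit_length()


def capacity_after_compaction(current_capacity: int, trigger: int) -> int:
    """Shrink an over-allocated buffer; never grow one in the shrink step (assumed)."""
    return min(current_capacity, shrunk_capacity(trigger))
```

With the default trigger of 4096, a buffer left at 16384 entries by a long prefill would drop to 4096, which still covers the decode occupancy range [target, trigger] without another grow.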
Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Trim shrink/suppression comments to the load-bearing ones Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Keep Gemma-4 thinking in the KV cache like LiteRT Stop stripping generated thinking for Gemma-4 so it persists across turns as ordinary context (Qwen already does this). - Remove the in-cache strip path: strip_thinking_from_cache, Model::remove_thinking_tokens, find_channel_token_ranges, and the now-unread cache_renumbered_ member/writes. - Remove the compaction-suppression stopgap (compaction_suppressed_, set_compaction_suppressed, set sites, and the maybe_roll_compact guard) so compaction runs normally for thinking context. - format_gemma4_style keeps assistant content verbatim (drop strip_channel) so prior-turn thinking stays in the rendered prompt. - Rename strip_thinking_block -> partition_thinking_response: it now only splits generated text into (thinking, content) for the API output. - Emit a non-user-facing context_response (raw assistant text before partitioning) in the result JSON for conversation-history persistence. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Persist Gemma-4 thinking in chat surfaces; rewrite thinking test - chat.cpp and server.py store context_response (raw assistant text with thinking) as conversation history while still displaying the clean response, so thinking survives the next turn's re-render. - Rewrite test_gemma4_thinking for the new contract: the formatter retains assistant channel/thinking content, the API response stays clean, and re-rendered multi-turn history prefix-matches the cache. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Fix review: clean OpenAI response content; drop stale v1 thinking test server.py returned the raw context_response (channel tags) as public message content; return the clean response (context_response is only for stateful chat history). 
Delete the superseded v1 tests/test_gemma4_thinking.cpp, which still asserted the removed strip contract; the active test lives in cactus-engine/tests. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Remove stale iOS harness reference to deleted thinking test Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Remove channel-token config orphaned by the thinking-strip deletion find_channel_token_ranges was the only reader of channel_open/close_token_id; drop the now-unused config fields and their parse. Revert the one-line server.py refactor to the original inline form. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Reproduce Qwen thinking opener in history so KV cache reuses across turns Qwen's chat template injects a thinking opener (<think>\n, or <think>\n\n</think>\n\n when thinking is disabled) only in the generation-prompt branch. The model continues from that opener, so the opener tokens are stored in the KV cache as part of the assistant turn. Re-rendering that turn as history did not reproduce the opener, so the next turn's prompt diverged from the cached tokens at the assistant boundary, prompt_context_matches failed, and every turn did a full re-prefill. Emit the same opener per assistant history turn (selected from whether the content already closed </think>) so the re-templated history byte-matches the cache. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Add cache_context_length parameter to component specs functions * Shorten keepsets_from_fp16 comment to one line Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Remove gemma4 context-scaling benchmark script Standalone manual benchmark harness, not part of the test suite or referenced anywhere; out of scope for the PR. Signed-off-by: Noah Cylich <noahcylich@gmail.com> * Consolidate gemma4 thinking tests into test_llm Move the four gemma4 thinking tests into test_llm.cpp (next to the prefill cache-reuse tests) and delete the standalone suite. 
The three model-gated tests run only when the chosen model is Gemma4 and warn-skip otherwise; partition_thinking_response is a pure unit test that always runs. Share a load_gemma4_or_skip helper instead of repeating the gate. Signed-off-by: Noah Cylich <noahcylich@gmail.com> --------- Signed-off-by: Noah Cylich <noahcylich@gmail.com> Co-authored-by: jakmro <kubamroz124@gmail.com>
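The KeyDiff keep-set described in the squashed message above (attention sink + recent window + the most distinctive middle tokens, scored by s_i = -cos(k_i, mean key)) can be sketched for a single (layer, kv-head). This is an illustration under stated assumptions, not the kv_compress implementation: the keys are taken to be already un-roped, the mean is taken over all cached keys, and the function names and parameters are hypothetical.

```python
import math


def keydiff_scores(keys):
    """Score each key as s_i = -cos(k_i, mean key); higher means more distinctive."""
    dim = len(keys[0])
    mean = [sum(k[j] for k in keys) / len(keys) for j in range(dim)]
    mean_norm = math.sqrt(sum(m * m for m in mean))
    scores = []
    for k in keys:
        k_norm = math.sqrt(sum(x * x for x in k))
        denom = k_norm * mean_norm
        cos = sum(x * m for x, m in zip(k, mean)) / denom if denom > 0 else 0.0
        scores.append(-cos)
    return scores


def keep_set(keys, budget, sink, recent):
    """Return sorted row indices to keep: sink rows, recent rows, then top-scoring middle rows."""
    n = len(keys)
    if budget >= n:
        return list(range(n))
    if sink + recent > budget:
        raise ValueError("sink + recent must fit inside the budget")
    forced = set(range(sink)) | set(range(n - recent, n))
    scores = keydiff_scores(keys)
    middle = [i for i in range(n) if i not in forced]
    middle.sort(key=lambda i: scores[i], reverse=True)
    chosen = forced | set(middle[: budget - len(forced)])
    return sorted(chosen)
```

After selection, the engine compacts the surviving rows and renumbers their RoPE positions to a contiguous window; that step is not shown here.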
Summary
Enables chunked prefill for plain text-only causal LLMs (qwen3 family). These previously transpiled to only a monolithic decoder + decoder_step, so the engine took the DIRECT_DECODER_STEP route and prefilled token-by-token (~27 tok/s). This emits the chunked-prefill component pipeline (lm_encoder_step, lm_encoder_text_chunk, decoder_media_step, decoder_prefill_chunk) via text-only embed/chunk adapters mirroring the multimodal qwen3.5 ones (minus vision), so the existing chunked-prefill engine route engages.
Result
~5x faster prefill (27 -> 130-300 tok/s) with identical generation output.
Safety
The default transpile (decoder + decoder_step) and multimodal families are unchanged: the chunk pipeline is only emitted when those components are requested, and the transpiler default tests pass.
Related
Existing chunked-prefill branches (v2-chunked-prefill, mtp-component-chunked-prefill) may overlap. This is a focused text-qwen3 addition; it is worth reconciling with that work.
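To make the difference between the two prefill routes concrete, here is a rough sketch of how a prompt might be split into graph invocations. It is not the engine's run_chunked_prefill: `plan_prefill` and its tuple layout are hypothetical. The tail handling follows the squashed commit message on this page (the final partial chunk is padded up to chunk_size, except on sliding-window caches, where the tail is processed token-by-token).

```python
def plan_prefill(num_tokens: int, chunk_size: int, has_sliding_window: bool):
    """Return a list of (kind, real_tokens, padding) invocations for a prompt.

    kind is "chunk" for a fixed-size chunk graph call and "step" for a
    single-token decoder step.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be positive")
    full, tail = divmod(num_tokens, chunk_size)
    plan = [("chunk", chunk_size, 0)] * full
    if tail:
        if has_sliding_window:
            # Padding would shift the sliding window, so step through the tail.
            plan += [("step", 1, 0)] * tail
        else:
            # Pad the partial chunk to fit the fixed-size chunk graph.
            plan.append(("chunk", tail, chunk_size - tail))
    return plan
```

For a 10-token prompt with a chunk size of 4, the plan has 3 chunk calls on an all-global model, compared with 10 single-token steps on the DIRECT_DECODER_STEP route.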