Merged
58 commits
60ef7ba
Support chunked prefill for text-only causal LLMs
ncylich Jun 3, 2026
d2a12bd
Add KeyDiff query-agnostic KV-cache compression
ncylich Jun 3, 2026
10cbac8
Precompute RoPE cos/sin tables to fix long-context precision
ncylich Jun 4, 2026
135a584
Fix chunked-prefill sliding-window KV cache corruption
ncylich Jun 4, 2026
6820973
Make rolling KV compaction the default for causal LLMs
ncylich Jun 4, 2026
d84b08b
Compact global + re-rope sliding-window layers at compaction (engine-…
ncylich Jun 4, 2026
b17c723
Merge #687: chunked prefill for text-only causal LLMs
ncylich Jun 4, 2026
80a129c
Default chunked prefill for text qwen3 conversions
ncylich Jun 4, 2026
20b596a
Skip non-attention caches in KV compaction (fix LFM2/hybrid corruption)
ncylich Jun 4, 2026
7427f59
Merge origin/main into kv-compress-keydiff
ncylich Jun 4, 2026
5463bd5
Re-rope KV-shared global source layers with global theta
ncylich Jun 4, 2026
de45aae
Disable kv_compress flag on invalid rolling config
ncylich Jun 4, 2026
5dbe288
Remove NIAH KV-compress test fixtures and consumer test
ncylich Jun 4, 2026
f603b4c
Trim excessive comments across KV-compress and rope-table changes
ncylich Jun 4, 2026
7574363
Replace standalone rope-table test with one integrated optimize-pass …
ncylich Jun 4, 2026
58842d2
Drop redundant keydiff_score formula comment
ncylich Jun 4, 2026
0211413
Rename test_kv_compress_free_functions.cpp to test_kv_compress.cpp
ncylich Jun 4, 2026
28ab27e
Shorten KV-compress comments
ncylich Jun 4, 2026
10a923b
Hoist RoPE rotation table out of compaction row loops
ncylich Jun 4, 2026
db17e67
Shorten RopeRot comment
ncylich Jun 4, 2026
9f384c0
Share one un-rope table across a compaction's keep-set scoring
ncylich Jun 4, 2026
8a20978
Vectorize int8 dequant in KV compaction
ncylich Jun 4, 2026
304b865
Renumber compaction via table composition (drop per-survivor trig)
ncylich Jun 4, 2026
d16cf73
Skip thinking-token cache strip on a compacted cache
ncylich Jun 4, 2026
2a61035
Preserve special tokens across KV-cache compaction
ncylich Jun 5, 2026
4c8127d
Tighten PR comments to non-obvious behavior only
ncylich Jun 5, 2026
4390d58
Grow the KV cache on demand and move it across the prefill handoff
ncylich Jun 6, 2026
659c22f
Use a media-specific default auto-prompt in the chat test tool
ncylich Jun 6, 2026
8627813
Support full-context cache transpilation
ncylich May 28, 2026
85da4e5
Size Qwen chunked-prefill caches from the full context too
ncylich Jun 6, 2026
7f0e854
Supply cloud-handoff args in the chunked-bundle-flags run test
ncylich Jun 6, 2026
aed4967
Size LFM2 decoder_step cache from the full context too
ncylich Jun 6, 2026
7309a1f
Carry max_position_embeddings in common graph meta for rope precompute
ncylich Jun 6, 2026
6ef5056
Don't assert precision in the cache move handoff
ncylich Jun 6, 2026
eaabcf8
Drop comments the code already expresses
ncylich Jun 6, 2026
553db06
Trim kv_compress.h comments to the load-bearing math
ncylich Jun 6, 2026
37e5af4
Bake rope tables only into the lowered graph, not the saved IR
ncylich Jun 6, 2026
88cf4e7
Fix conv/recurrent cache handoff and guard MLA-dim compaction
ncylich Jun 6, 2026
40a5547
Preserve special tokens per-head across KV-cache compaction
ncylich Jun 6, 2026
f8a9446
Don't pass stale per-head protect when tracking is disabled
ncylich Jun 6, 2026
6cd45c8
Trim per-head-protect comments to one line each
ncylich Jun 6, 2026
f849932
Drop per-head-protect comments that restate the code
ncylich Jun 6, 2026
19a657d
Disable KV compaction for Gemma4 thinking mode
ncylich Jun 6, 2026
5f864bb
Suppress compaction before the first-token fast path too
ncylich Jun 6, 2026
1f98ccf
cleaned comment
ncylich Jun 6, 2026
c391c53
Shrink the KV cache after compaction to reclaim long-prefill capacity
ncylich Jun 6, 2026
59e8742
Trim shrink/suppression comments to the load-bearing ones
ncylich Jun 6, 2026
0bfe84d
Keep Gemma-4 thinking in the KV cache like LiteRT
ncylich Jun 7, 2026
524fe9c
Persist Gemma-4 thinking in chat surfaces; rewrite thinking test
ncylich Jun 7, 2026
48f270d
Fix review: clean OpenAI response content; drop stale v1 thinking test
ncylich Jun 7, 2026
2b33371
Remove stale iOS harness reference to deleted thinking test
ncylich Jun 7, 2026
dfbefff
Remove channel-token config orphaned by the thinking-strip deletion
ncylich Jun 7, 2026
ac780c9
Reproduce Qwen thinking opener in history so KV cache reuses across t…
ncylich Jun 7, 2026
2e8a525
Merge remote-tracking branch 'origin/main' into kv-compress-keydiff
ncylich Jun 8, 2026
ab128ba
Add cache_context_length parameter to component specs functions
jakmro Jun 8, 2026
211e5a2
Shorten keepsets_from_fp16 comment to one line
ncylich Jun 8, 2026
4ea057e
Remove gemma4 context-scaling benchmark script
ncylich Jun 8, 2026
b374390
Consolidate gemma4 thinking tests into test_llm
ncylich Jun 8, 2026
1 change: 1 addition & 0 deletions cactus-engine/CMakeLists.txt
@@ -15,6 +15,7 @@ set(ENGINE_SOURCES
src/sp.cpp
src/constraints.cpp
src/model.cpp
+src/kv_compress.cpp
src/model_npu.cpp
src/engine_image.cpp
src/index.cpp
25 changes: 2 additions & 23 deletions cactus-engine/src/complete.cpp
@@ -77,23 +77,6 @@ std::vector<ToolConstraintSpec> build_tool_constraint_specs(const std::vector<To
return specs;
}

-void strip_thinking_from_cache(CactusModelHandle* handle,
-const std::vector<uint32_t>& generated_tokens,
-size_t prompt_len) {
-const auto& cfg = handle->model->get_config();
-uint32_t open_id = cfg.channel_open_token_id;
-uint32_t close_id = cfg.channel_close_token_id;
-auto ranges = find_channel_token_ranges(generated_tokens, prompt_len,
-open_id, close_id);
-if (ranges.empty()) return;
-
-handle->model->remove_thinking_tokens(ranges);
-for (auto it = ranges.rbegin(); it != ranges.rend(); ++it) {
-auto start = handle->processed_tokens.begin() + it->first;
-handle->processed_tokens.erase(start, start + it->second);
-}
-}
-
void setup_tool_constraints(CactusModelHandle* handle, const std::vector<ToolFunction>& tools,
bool force_tools, float& temperature) {
if (!force_tools || tools.empty()) return;
@@ -918,10 +901,6 @@ int cactus_complete(
handle->model->clear_tool_constraints();
}

-if (prompt.model_type == Config::ModelType::GEMMA4 && prompt.options.enable_thinking_if_supported && !generated_tokens.empty()) {
-strip_thinking_from_cache(handle, generated_tokens, prompt.tokens.size());
-}
-
auto end_time = std::chrono::high_resolution_clock::now();
double total_time_ms = std::chrono::duration_cast<std::chrono::microseconds>(end_time - start_time).count() / 1000.0;

@@ -939,7 +918,7 @@ int cactus_complete(
std::string thinking_text;
if (prompt.model_type == Config::ModelType::GEMMA4 || prompt.options.enable_thinking_if_supported) {
std::string stripped_content;
-strip_thinking_block(regular_response, thinking_text, stripped_content);
+partition_thinking_response(regular_response, thinking_text, stripped_content);
regular_response = stripped_content;
if (!prompt.options.enable_thinking_if_supported) {
thinking_text.clear();
@@ -982,7 +961,7 @@ int cactus_complete(
std::string result = construct_response_json(primary_response, primary_function_calls, time_to_first_token,
total_time_ms, prefill_tps, decode_tps, prompt_tokens,
completion_tokens, confidence, handoff_succeeded,
-thinking_text);
+thinking_text, {}, response_text);

if (result.length() >= buffer_size) {
handle_error_response("Response buffer too small", response_buffer, buffer_size);
38 changes: 34 additions & 4 deletions cactus-engine/src/engine.h
@@ -11,6 +11,7 @@
#include <limits>

#include "cactus_graph.h"
+#include "kv_compress.h"

class CactusGraph;

@@ -172,6 +173,15 @@ struct Config {
std::vector<std::string> layer_types;
size_t conv_L_cache = 0;

+// Rolling bounded KV compaction (default ON, 4096 -> 2048). Override at runtime with
+// CACTUS_KV_COMPRESS_AT (trigger) / CACTUS_KV_COMPRESS_TO (target); CACTUS_KV_COMPRESS_AT=0 disables.
+bool kv_compress = true;
+float kv_compress_recent_frac = 0.30f;
+uint32_t kv_compress_sink = 4;
+int32_t kv_compress_trigger_len = 4096;
+int32_t kv_compress_target_len = 2048;
+bool kv_compress_preserve_special = true;
+
uint32_t altup_num_inputs = 4;
uint32_t laurel_rank = 64;
static constexpr uint32_t UNSET_U32 = UINT32_MAX;
@@ -219,15 +229,16 @@ struct Config {
uint32_t audio_fft_length = 1024;
uint32_t audio_token_id = 0;
bool audio_fft_overdrive = false;
-uint32_t channel_open_token_id = 100;
-uint32_t channel_close_token_id = 101;

static bool is_gemma_family(ModelType t) {
return t == ModelType::GEMMA || t == ModelType::GEMMA3N || t == ModelType::GEMMA4;
}

bool from_json(const std::string& json_path);
std::string to_json() const;
+// Disable rolling unless 0 < target < trigger (when trigger > 0).
+void validate_kv_compress();
+bool parse_kv_compress_override(const char* trigger_env, const char* target_env);
};
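
The validation rule and environment override declared in this hunk could look roughly like the sketch below. It is only an illustration of the stated rule (rolling stays enabled only when 0 < target < trigger, and a trigger of 0 disables it); `KvCompressSettings`, `parse_len`, `apply_env_override`, and `validate_settings` are hypothetical names, and the parsing behavior for malformed values is an assumption, since the PR's `kv_compress.cpp` is not shown here.

```cpp
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <limits>

// Hypothetical stand-in for the kv_compress fields of Config (not the PR's code).
struct KvCompressSettings {
    bool enabled = true;
    int32_t trigger_len = 4096;
    int32_t target_len = 2048;
};

// Parse a non-negative decimal integer. Returns false (leaving `out` untouched)
// when the text is null, empty, malformed, negative, or out of range.
inline bool parse_len(const char* text, int32_t& out) {
    if (text == nullptr || *text == '\0') return false;
    errno = 0;
    char* end = nullptr;
    const long value = std::strtol(text, &end, 10);
    if (errno != 0 || *end != '\0') return false;
    if (value < 0 || value > std::numeric_limits<int32_t>::max()) return false;
    out = static_cast<int32_t>(value);
    return true;
}

// Assumed override semantics: each variable replaces its field only when it parses.
// Returns true if at least one field was overridden.
inline bool apply_env_override(KvCompressSettings& s, const char* trigger_env,
                               const char* target_env) {
    const bool got_trigger = parse_len(trigger_env, s.trigger_len);
    const bool got_target = parse_len(target_env, s.target_len);
    return got_trigger || got_target;
}

// Disable rolling compaction unless 0 < target < trigger.
// A trigger of 0 therefore disables it as well.
inline void validate_settings(KvCompressSettings& s) {
    const bool valid = s.trigger_len > 0 && s.target_len > 0 &&
                       s.target_len < s.trigger_len;
    if (!valid) s.enabled = false;
}
```

A caller would presumably pass `std::getenv("CACTUS_KV_COMPRESS_AT")` and `std::getenv("CACTUS_KV_COMPRESS_TO")` as the two strings and then run the validation step once, so an inverted pair such as trigger 1024 and target 2048 turns compaction off rather than producing a cache that grows on every compaction.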


@@ -316,6 +327,7 @@ class Tokenizer {
virtual uint32_t get_unk_token() const = 0;
virtual uint32_t get_bos_token() const = 0;
virtual uint32_t get_eos_token() const = 0;
+virtual std::unordered_set<uint32_t> special_token_ids() const { return {}; }
virtual bool has_chat_template() const { return has_chat_template_; }
std::string get_default_stop_sequence() const;

@@ -370,6 +382,11 @@ class BPETokenizer : public Tokenizer {
uint32_t get_unk_token() const override { return unk_token_id_; }
uint32_t get_bos_token() const override { return bos_token_id_; }
uint32_t get_eos_token() const override { return eos_token_id_; }
+std::unordered_set<uint32_t> special_token_ids() const override {
+std::unordered_set<uint32_t> ids;
+for (const auto& kv : special_tokens_) ids.insert(kv.second);
+return ids;
+}

private:
std::unordered_map<std::string, uint32_t> token_to_id_;
@@ -422,6 +439,11 @@ class SPTokenizer : public Tokenizer {
uint32_t get_unk_token() const override { return unk_token_id_; }
uint32_t get_bos_token() const override { return bos_token_id_; }
uint32_t get_eos_token() const override { return eos_token_id_; }
+std::unordered_set<uint32_t> special_token_ids() const override {
+std::unordered_set<uint32_t> ids;
+for (const auto& kv : special_tokens_) ids.insert(kv.second);
+return ids;
+}

private:
struct TrieNode {
@@ -618,9 +640,13 @@ class Model {
bool load_npu_vision_encoder(const std::string& model_path);
bool has_npu_vision_encoder() const { return npu_vision_encoder_ != nullptr; }

-void remove_thinking_tokens(const std::vector<std::pair<size_t, size_t>>& ranges);
void compact_kv_cache() {}

+void compress_kv_cache_keydiff(const cactus::kvcompress::Params& params);
+void maybe_roll_compact();
+std::vector<size_t> compressible_layers() const;
+void apply_kv_compress_env_override();
+
void set_tool_constraints(const std::vector<ToolConstraintSpec>& tools);
void clear_tool_constraints();
void update_tool_constraints(uint32_t token_id);
@@ -685,7 +711,8 @@ class Model {
void copy_component_outputs_to_chunk_inputs(const Component& source, Component& target, size_t token_index);
void copy_component_outputs_to_chunk_inputs_range(const Component& source, Component& target, size_t token_offset);
bool cache_states_compatible(const Component& source, const Component& target) const;
-void copy_cache_states(const Component& source, Component& target, size_t logical_current = std::numeric_limits<size_t>::max());
+void move_cache_states(Component& source, Component& target, size_t logical_current = std::numeric_limits<size_t>::max());
+void set_cache_current_len(Component& comp, size_t len);
void reset_component_cache_states(Component& comp);
size_t component_chunk_tokens(const Component& comp, const std::string& input_name) const;
size_t component_output_tokens(const Component& comp, const std::string& output_name) const;
@@ -748,6 +775,9 @@ class Model {
std::unique_ptr<Tokenizer> tokenizer_;
bool initialized_ = false;
size_t cache_total_seq_len_ = 0;
+std::vector<uint32_t> cache_token_ids_; // token id per cache row (canonical head-0 view)
+std::unordered_set<uint32_t> special_ids_; // special-token ids force-kept during compaction
+cactus::kvcompress::SpecialRowTracker special_rows_; // per-(layer,head) special rows for compaction protect
size_t cache_max_seq_len_ = 4096;
size_t last_logit_position_ = 0;
double last_prefill_cache_copy_ms_ = 0.0;
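
To make the interaction of the new members concrete, here is a minimal sketch of how a keep-set might be chosen for one compaction pass. It is not the PR's algorithm: `select_keep_rows` is a hypothetical helper, the per-row scores are taken as given (the KeyDiff scoring itself lives in `kv_compress.h`, which is not shown), and reading `kv_compress_recent_frac` as a fraction of the target length is an assumption. What the diff does state is that the first `kv_compress_sink` rows, a recent window, and special-token rows are relevant to compaction, and that special-token ids are force-kept.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical helper, not part of the PR. Returns the sorted cache-row indices to keep.
// scores[i]    : importance of row i, higher means more worth keeping (assumed given)
// token_ids[i] : token id stored in row i (mirrors cache_token_ids_)
// Forced rows (sink, recent window, special tokens) are always kept, so the result
// can exceed target_len when many rows are forced.
inline std::vector<std::size_t> select_keep_rows(
        const std::vector<float>& scores,
        const std::vector<uint32_t>& token_ids,
        const std::unordered_set<uint32_t>& special_ids,
        std::size_t target_len, std::size_t sink, float recent_frac) {
    const std::size_t n = scores.size();
    std::vector<char> keep(n, 0);
    std::size_t kept = 0;
    auto force = [&](std::size_t i) {
        if (!keep[i]) { keep[i] = 1; ++kept; }
    };

    if (n <= target_len) {
        for (std::size_t i = 0; i < n; ++i) force(i);  // nothing to evict
    } else {
        // 1. Attention-sink rows at the start of the cache.
        for (std::size_t i = 0; i < std::min(sink, n); ++i) force(i);
        // 2. Most recent rows (assumed: recent_frac of the target budget).
        const std::size_t recent = std::min(
            n, static_cast<std::size_t>(recent_frac * static_cast<float>(target_len)));
        for (std::size_t i = n - recent; i < n; ++i) force(i);
        // 3. Rows holding special tokens.
        for (std::size_t i = 0; i < n && i < token_ids.size(); ++i) {
            if (special_ids.count(token_ids[i]) != 0) force(i);
        }
        // 4. Fill the remaining budget with the highest-scoring rows.
        std::vector<std::size_t> rest;
        for (std::size_t i = 0; i < n; ++i) {
            if (!keep[i]) rest.push_back(i);
        }
        std::stable_sort(rest.begin(), rest.end(),
                         [&](std::size_t a, std::size_t b) { return scores[a] > scores[b]; });
        for (std::size_t i : rest) {
            if (kept >= target_len) break;
            force(i);
        }
    }

    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < n; ++i) {
        if (keep[i]) out.push_back(i);
    }
    return out;
}
```

Returning the survivors in ascending row order keeps the original token order, which matters because the commit list indicates that surviving rows are renumbered and re-roped after compaction. The per-(layer, head) tracking suggested by `special_rows_` would apply a selection like this once per head rather than once per cache.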