Commit 60f70ef

Authors: mialso, olyasir, github-code-quality[bot], gianni-cor
QVAC-16769 feat[bc]: tool calls chaining compact (#1379)
* feat: anchored tools placement for multi-round tool chains

  Replace tools-at-end placement with anchored placement: tools are positioned after the last user message and stay in the KV cache across chain rounds instead of being removed and re-added each round.

  Changes:
  - Template: anchor tools after last user message (two-pass Jinja2)
  - PostInfer: keep tools when output contains <tool_call>, remove only when chain completes (no tool call in output)
  - Boundary tracking: recordToolBoundary sets anchor once, preserves across chain rounds
  - Streaming: capture output when toolsAtEnd is active for tool call detection
  - Stats: forward nPastBeforeTools, firstMsgTokens, toolsTrimmed
  - Generation prompt: treat role "tool" same as "user" for add_generation_prompt (fixes empty response on tool chain continuation)

* fix: prevent output duplication in streaming mode with toolsAtEnd

  Use captured output only for internal tool call detection, don't set it as the return value when streaming. Prevents the JobRunner from queuing the full text again after it was already streamed token by token, which caused the SDK to see every tool call twice.

* fix: avoid unnecessary string copy for non-tool completions

  Move captured output construction inside the toolsAtEnd guard so non-tool completions pay zero string overhead. Only the oss.str() call and tool_call detection happen when dynamic tools are active.

* fix: context sliding with tools_at_end corrupts tool boundary tracking

  When context sliding occurs with tools_at_end enabled, the nPastBeforeTools boundary was not adjusted after token discard. This left stale tool tokens in the KV cache, causing incorrect trim after generation.

  Changes:
  - Limit discard to conversation-only region (never eat tool tokens)
  - Adjust nPastBeforeTools after sliding by the discard delta
  - Reset DynamicToolsState in fallback discard path
  - Applied to both TextLlmContext and MtmdLlmContext
  - Add regression test for sliding during generation with large tools

* refactor: extract sliding helpers into DynamicToolsState, harden edge cases

  - Extract clampDiscard() and adjustAfterSlide() into DynamicToolsState to eliminate 4x duplicated clamping/adjustment blocks
  - Remove redundant std::max(safeLimit, 0): guard already ensures > 0
  - Add discard == 0 early return in applyContextDiscard to skip no-op KV cache operations
  - Guard fallback reset() with toolsAtEnd() check for consistency
  - Add comment explaining eval vs generation fallback asymmetry
  - Use n_predict=-2 (fill context) in test to guarantee sliding

* test: update sliding test for anchored tools behavior

  With anchored tools, postInfer keeps tools in cache when the model produces <tool_call> in output. Update the sliding regression test to check toolsTrimmed stat instead of assuming tools are always removed after generation.

* test: two-phase sliding test verifies adjustAfterSlide

  Replace single-phase sliding test with two-phase comparison:
  - Phase 1 (baseline): large context, n_predict=0 → no sliding. Records nPastBeforeTools as the original anchor.
  - Phase 2 (sliding): small context, n_predict=-2 → sliding fires. After trim, nPastBeforeTools must be less than baseline.

  Without adjustAfterSlide: both phases have equal nPastBeforeTools → FAIL.
  With adjustAfterSlide: phase 2 anchor is smaller → PASS.

* test: exact sliding anchor assertion with session and clamped discard

  Three-phase test using session cache:
  - Phase 1: init session (small firstMsgTokens)
  - Phase 2: baseline (large context, n_predict=0, records anchor)
  - Phase 3: sliding (small context, n_predict=-2, sliding fires)

  Simulates per-slide clamped discard (min(nDiscarded, safeLimit)) and asserts slideNPBT == expectedNPBT with exact values. Verifies adjustAfterSlide reduces anchor by the correct amount per slide.

* test: add unclamped sliding test with long conversation

  Second sliding test with longer user message and smaller n_discarded (20). Verifies at least 1 slide discards the full n_discarded amount (unclamped). Both tests simulate per-slide clamped discard and assert exact nPastBeforeTools values.

* test: use n_discarded=100 with long conversation for unclamped sliding

  Longer user message (~300 tokens) ensures the conversation region exceeds n_discarded=100. Each slide discards the full 100 tokens without clamping. Simpler and more direct than using small n_discarded.

* fix: don't add generation prompt on system-only prefill

  When nPast=0 and the only message is a system prompt (role=system), don't set add_generation_prompt=true. This was adding a stale <|im_start|>assistant token to the cache that the model would see as an empty assistant turn before the actual user message. Now check the actual last message role instead of hardcoding true. Saves 3 tokens in the cache prefix.

* chore: remove debug prompt logging

* chore: add debug log for tokenizeChat generation prompt flag

  Logs nPast, lastRole, nMsgs, nTools, addGenPrompt at DEBUG verbosity. Helps diagnose issues with stale generation prompt in cache.

* (fix) llamacpp-llm: "tool" role generate prompt tests
* (fix) llamacpp-llm: no "think" blocks in assistant history
* (internal) llamacpp-llm: test qwen3 dynamic tools template
* (chore) llamacpp-llm: upgrade package version

* fix: skip dispatch validation when called via workflow_call

  The Validate Dispatch Inputs step fails when the mobile integration workflow is invoked via workflow_call from a workflow_dispatch parent, because github.event.inputs.package is empty in that context.

* fix: align prebuild download path with verify step in LLM mobile workflow

  Prebuilds are downloaded to runner.temp/qvac-lib-infer-llamacpp-llm but the verify step looked in runner.temp/prebuilds-download, so prebuilds were never found.

* (internal) llamacpp-llm: runtimeDebugStats internal method
* (chore) llamacpp-llm: tools_at_end rename to tools_compact
* (improvement) llamacpp-llm: tools_compact feature docs
* (chore) llamacpp-llm: fix test
* (chore) llamacpp-llm: rename, cleanup, tests assertions
* (internal) llamacpp-llm: improve tests
* (internal) llamacpp-llm: reduce test flakiness with 0 temp
* (internal) llamacpp-llm: test rename
* (internal) llamacpp-llm: generate tests correct
* (internal) llamacpp-llm: improve sliding ctx tests
* (chore) llamacpp-llm: version bump
* (chore) llamacpp-llm: clang-format
* (fix) llamacpp-llm: qwen3 template perf and debug null guard
* (chore) llamacpp-llm: discard tokens warning
* (chore) llamacpp-llm: reuse getStatValue at tests
* (fix) llamacpp-llm: first msg sliding guard
* (improvement) llamacpp-llm: tools_compact require tools always
* (chore) llamacpp-llm: fix linter
* (fix) llamacpp-llm: guard regression, integration tests
* (internal) llamacpp-llm: remove over-defensive checks, fix test
* (chore) llamacpp-llm: cleanup linter and unused tests
* refactoring: anchored tools structured (#1658)
* (doc) llamacpp-llm: structure proposal
* (doc) llamacpp-llm: refactoring plan
* (internal) llamacpp-llm: extract tools compact controller from llm contexts
* (internal) llamacpp-llm: extract shared context slider for text and mtmd
* (internal) llamacpp-llm: ContextSlider testable, more tests
* (internal) llamacpp-llm: migrate tools compact coverage to deterministic unit tests
* (chore) llamacpp-llm: follow up minor fixes
* (internal) llamacpp-llm: improve multi-model portability
* (internal) llamacpp-llm: decouple ChatTemplateUtils
* (internal) llamacpp-llm: tools_compact contract, tests
* (internal) llamacpp-llm: ToolsCompactController tests and comments
* (doc) llamacpp-llm: tools_compact refine verify
* (internal) llamacpp-llm: tools compact profile resolution improved
* (chore) llamacpp-llm: clang format
* (chore) llamacpp-llm: tools-compact test improved
* (chore) llamacpp-llm: test condition check style

  Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

* (chore) llamacpp-llm: bump version, remove nested namespace
* (chore) llamacpp-llm: changelog improved
* (chore) llamacpp-llm: cleanup, test tool token count comment
* (chore) llamacpp-llm: tests useless conditional

  Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>

* (chore) llamacpp-llm: tests refactor and remove redundant
* (chore) llamacpp-llm: deduplicate cache management tests, context slider edge coverage
* (chore) llamacpp-llm: clang format
* (fix) llamacpp-llm: ToolsCompact tools_calls check
* (internal) llamacpp-llm: oss string handle optimization
* (internal) llamacpp-llm: compute user msg index at cpp
* Revert "(internal) llamacpp-llm: compute user msg index at cpp"

  This reverts commit 872eb47.

* (internal) llamacpp-llm: qwen3 dynamic template loop perf improved
* (chore) llamacpp-llm: clang format

---------

Co-authored-by: olyasir <sirkinolya@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Co-authored-by: gianni-cor <gianfranco.cordella@tether.io>
1 parent 8531289 commit 60f70ef

38 files changed

Lines changed: 3171 additions & 1337 deletions
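
Reviewer note: the commit message says PostInfer keeps tools while the output contains `<tool_call>` and trims them only when the chain completes. A minimal sketch of that decision follows. The names `PostInferAction` and `decidePostInfer` are hypothetical, and the actual controller API does not appear in this excerpt.

```cpp
#include <cassert>
#include <string>

// Hypothetical names for illustration only; not taken from the diff.
enum class PostInferAction { KeepTools, TrimTools };

// Keep tool tokens cached while the model is still calling tools;
// trim them once the output contains no <tool_call> tag.
inline PostInferAction decidePostInfer(const std::string& output) {
  const bool stillCalling = output.find("<tool_call>") != std::string::npos;
  return stillCalling ? PostInferAction::KeepTools
                      : PostInferAction::TrimTools;
}
```

A plain substring check is the simplest reading of the commit message; the real detection may be stricter, for example by validating the tag's payload.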

packages/qvac-lib-infer-llamacpp-llm/CHANGELOG.md

Lines changed: 56 additions & 0 deletions

@@ -1,5 +1,61 @@
 # Changelog
 
+## [0.17.0] - 2026-04-21
+
+### Changed
+
+#### `tools_at_end` renamed to `tools_compact`
+
+**Breaking**: The `tools_at_end` configuration option has been renamed to `tools_compact`. The old key is no longer recognized.
+
+#### Anchored tool placement for multi-round tool chains
+
+Tools are now anchored after the **last user message** (via a two-pass Jinja2 template that tracks `last_user_idx`) instead of being appended at the very end of the prompt. The tool boundary is set once on the first round and preserved across chain rounds, so tools stay in the KV cache while the model is still calling tools. Trimming now only happens when the chain completes (output contains no `<tool_call>` tag), instead of after every turn.
+
+This eliminates redundant tokenize → eval → trim cycles during multi-round tool chains and matches the model's expected prompt layout more closely.
+
+#### `<think>` blocks stripped from assistant history
+
+The Qwen3 tools-dynamic template no longer re-injects `<think>…</think>` reasoning blocks into assistant history. Prior assistant messages are replayed with the thinking content stripped, which reduces token waste and avoids the model treating stale reasoning as context.
+
+#### `tools_compact` prompt-shape validation tightened
+
+`tools_compact` now validates prompt layout before inference and fails fast with `InvalidArgument` for malformed inputs (for example: required tools omitted, non-contiguous tool block, tools not attached to the last user/tool anchor, or tools not placed at the end).
+
+### Fixed
+
+#### Context sliding with `tools_compact` could corrupt tool boundary tracking
+
+When context sliding (token discard) occurred during generation or prefill with `tools_compact` enabled, the `nPastBeforeTools` boundary could become stale. This caused post-generation trim to remove the wrong tail region and could leave tool tokens in the KV cache across turns.
+
+Sliding is now centralized through `ContextSlider` + `ToolsCompactController`:
+- `clampDiscard()` caps discard so sliding never crosses into protected tool tokens
+- `onSlide()` keeps `nPastBeforeTools` aligned after each slide
+- Fallback full-wipe paths reset controller state to avoid stale boundaries
+- Applied consistently in both `TextLlmContext` and `MtmdLlmContext`
+
+#### Output duplication in streaming mode with `tools_compact`
+
+In streaming mode the captured output buffer was being returned as the final result, causing the SDK to see every token twice (once streamed, once in the result). The captured buffer is now used only for internal `<tool_call>` detection.
+
+#### Generation prompt added on system-only prefill
+
+When `nPast=0` and the only message was a system prompt, `add_generation_prompt` was hardcoded to `true`, injecting a stale `<|im_start|>assistant` token into the cache. Now checks the actual last message role.
+
+#### `"tool"` role not treated as turn-ending for generation prompt
+
+Messages with role `"tool"` (tool call results) were not triggering `add_generation_prompt`, causing empty responses on tool chain continuation. Now treated the same as `"user"` for generation prompt purposes.
+
+#### Empty chat message array now fails with `EmptyPrompt`
+
+`tokenizeChat()` now throws `StatusError(EmptyPrompt)` when called with no chat messages, making empty prompt handling explicit and consistent for both text and multimodal contexts.
+
+### Added
+
+- `runtimeDebugStats()` internal method on `LlamaModel` exposing `nPastBeforeTools`, `firstMsgTokens`, and `toolsTrimmed`
+- Comprehensive C++ unit tests for Qwen3 tools-dynamic template and cache management with tools_compact
+- Regression tests for context sliding with anchored tools: clamped discard, anchor updates after slide, unclamped sliding with long conversations, and sliding during generation
+
 ## [0.16.0] - 2026-04-14
 
 This release migrates the LLM addon off `BaseInference` inheritance and the `WeightsProvider` download layer onto the composable `createJobHandler` + `exclusiveRunQueue` utilities from `@qvac/infer-base@^0.4.0`. The constructor signature is replaced with a single object whose `files.model` field is an ordered array of absolute paths and `files.projectionModel` is an optional absolute path for multimodal models. This is a breaking change — every caller must update.
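
Reviewer note: the two generation-prompt fixes in the 0.17.0 entry (system-only prefill, and `"tool"` treated like `"user"`) reduce to one rule on the last message role. The sketch below is an assumption-laden illustration. `ChatMsg` and `shouldAddGenerationPrompt` are hypothetical names, and the real `tokenizeChat()` throws `EmptyPrompt` on an empty array rather than returning `false`.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a chat message; only the role matters here.
struct ChatMsg {
  std::string role;
  std::string content;
};

// Decide add_generation_prompt from the actual last role instead of
// hardcoding true. "tool" results end a turn just like "user" messages.
inline bool shouldAddGenerationPrompt(const std::vector<ChatMsg>& msgs) {
  if (msgs.empty()) {
    return false;  // simplification: the changelog says the real code throws
  }
  const std::string& lastRole = msgs.back().role;
  return lastRole == "user" || lastRole == "tool";
}
```

With this rule a system-only prefill no longer caches an empty assistant turn, while a chain continuation that ends on a tool result still prompts the model to answer.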

packages/qvac-lib-infer-llamacpp-llm/CMakeLists.txt

Lines changed: 4 additions & 0 deletions

@@ -64,11 +64,13 @@ endif()
 list(APPEND ADDON_SOURCES
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/AsyncWeightsLoader.cpp
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/CacheManager.cpp
+  ${PROJECT_SOURCE_DIR}/addon/src/model-interface/ContextSlider.cpp
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/LlamaLazyInitializeBackend.cpp
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/LlamaModel.cpp
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/LlamaFinetuningHelpers.cpp
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/MtmdLlmContext.cpp
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/TextLlmContext.cpp
+  ${PROJECT_SOURCE_DIR}/addon/src/model-interface/ToolsCompactController.cpp
   ${PROJECT_SOURCE_DIR}/addon/src/model-interface/ModelMetadata.cpp
   ${PROJECT_SOURCE_DIR}/addon/src/utils/LoggingMacros.cpp
   ${PROJECT_SOURCE_DIR}/addon/src/utils/BackendSelection.cpp
@@ -110,10 +112,12 @@ if(BUILD_CLI)
     ${PROJECT_SOURCE_DIR}/addon/src/cli/cli_tool.cc
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/AsyncWeightsLoader.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/CacheManager.cpp
+    ${PROJECT_SOURCE_DIR}/addon/src/model-interface/ContextSlider.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/LlamaLazyInitializeBackend.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/LlamaModel.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/MtmdLlmContext.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/TextLlmContext.cpp
+    ${PROJECT_SOURCE_DIR}/addon/src/model-interface/ToolsCompactController.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/ModelMetadata.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/utils/LoggingMacros.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/utils/BackendSelection.cpp

packages/qvac-lib-infer-llamacpp-llm/README.md

Lines changed: 2 additions & 2 deletions

@@ -173,7 +173,7 @@ const config = {
 | presence_penalty | float | 0 | Presence penalty for sampling |
 | frequency_penalty | float | 0 | Frequency penalty for sampling |
 | tools | `"true"` or `"false"` | `"false"` | Enable tool calling with jinja templating |
-| tools_at_end | `"true"` or `"false"` | `"false"` | Place tools at end of prompt ([details](./docs/tools-at-end.md)) |
+| tools_compact | `"true"` or `"false"` | `"false"` | Compact tool tokens from KV cache between turns ([details](./docs/tools-compact.md)) |
 | verbosity | 0 – 3 (0=ERROR, 1=WARNING, 2=INFO, 3=DEBUG) | 0 | Logging verbosity level |
 | n_discarded | integer | 0 | Tokens to discard in sliding window context |
 | main-gpu | integer, `"integrated"`, or `"dedicated"` || GPU selection for multi-GPU systems |
@@ -315,7 +315,7 @@ npm run quickstart
 - [LoRA Finetuning Pause/Resume](./examples/finetune/simple-lora-finetune-pause-resume.js) – Pause and resume finetuning.
 - [LoRA Inference](./examples/simple-lora-inference.js) – Inference with a finetuned LoRA adapter.
 - [Smart Home Finetune Showcase](./examples/finetune/showcase/smart-home-finetune.js) – Train a smart home tool-calling specialist, then [evaluate](./examples/finetune/showcase/smart-home-finetuned-test.js) baseline vs finetuned.
-- [Bench Tools Placement](./examples/benchToolsPlacement.js) – Benchmarks standard vs `tools_at_end` placement across multi-turn conversations.
+- [Bench Tools Placement](./examples/benchToolsPlacement.js) – Benchmarks standard vs `tools_compact` placement across multi-turn conversations.
 - [Test Tool Removal](./examples/testToolRemoval.js) – Demonstrates dynamic tool addition and removal between turns.
 
 ## OCR with Vision-Language Models

packages/qvac-lib-infer-llamacpp-llm/addon/src/model-interface/CacheManager.cpp

Lines changed: 3 additions & 9 deletions

@@ -29,17 +29,11 @@ bool CacheManager::isFileInitialized(const std::filesystem::path& path) {
 }
 
 bool CacheManager::handleCache(
-    std::vector<common_chat_msg>& chatMsgs,
-    std::vector<common_chat_tool>& tools, const std::string& inputPrompt,
-    std::function<
-        std::pair<std::vector<common_chat_msg>, std::vector<common_chat_tool>>(
-            const std::string&)>
-        formatPrompt,
+    ParsedPromptPayload& parsedPrompt, const std::string& inputPrompt,
+    std::function<ParsedPromptPayload(const std::string&)> formatPrompt,
     const std::string& cacheKey) {
 
-  auto formatted = formatPrompt(inputPrompt);
-  chatMsgs = std::move(formatted.first);
-  tools = std::move(formatted.second);
+  parsedPrompt = formatPrompt(inputPrompt);
 
   if (cacheKey.empty()) {
     if (hasActiveCache()) {
packages/qvac-lib-infer-llamacpp-llm/addon/src/model-interface/CacheManager.hpp

Lines changed: 9 additions & 6 deletions
Original file line numberDiff line numberDiff line change
@@ -8,21 +8,24 @@
88
#include <llama.h>
99

1010
#include "LlmContext.hpp"
11+
#include "ToolsCompactController.hpp"
1112
#include "common/chat.h"
1213

14+
struct ParsedPromptPayload {
15+
std::vector<common_chat_msg> chatMsgs;
16+
std::vector<common_chat_tool> tools;
17+
PromptLayout layout;
18+
};
19+
1320
class CacheManager {
1421
public:
1522
CacheManager(
1623
LlmContext* llmContext, llama_pos configuredNDiscarded,
1724
std::function<void(bool)> resetStateCallback);
1825

1926
bool handleCache(
20-
std::vector<common_chat_msg>& chatMsgs,
21-
std::vector<common_chat_tool>& tools, const std::string& inputPrompt,
22-
std::function<std::pair<
23-
std::vector<common_chat_msg>, std::vector<common_chat_tool>>(
24-
const std::string&)>
25-
formatPrompt,
27+
ParsedPromptPayload& parsedPrompt, const std::string& inputPrompt,
28+
std::function<ParsedPromptPayload(const std::string&)> formatPrompt,
2629
const std::string& cacheKey = "");
2730

2831
bool loadCache();
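
Reviewer note: the `handleCache()` change swaps a `std::pair` return plus two out-parameters for a single named payload. The sketch below shows the same shape with stand-in types, since `common_chat_msg`, `common_chat_tool`, and `PromptLayout` come from headers not shown here. Every `Demo*` name and the `toolsPresent` field are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Stand-ins for common_chat_msg, common_chat_tool and PromptLayout.
struct DemoMsg { std::string role; std::string content; };
struct DemoTool { std::string name; };
struct DemoLayout { bool toolsPresent = false; };  // field is hypothetical

// One named struct replaces std::pair<msgs, tools> and leaves room for
// extra data (here: the layout) without changing every signature again.
struct DemoPayload {
  std::vector<DemoMsg> chatMsgs;
  std::vector<DemoTool> tools;
  DemoLayout layout;
};

using DemoFormatter = std::function<DemoPayload(const std::string&)>;

// Mirrors the new handleCache() shape: one out-parameter, one assignment.
inline void parseInto(DemoPayload& parsed, const std::string& input,
                      const DemoFormatter& format) {
  parsed = format(input);
}

// Toy formatter used only to exercise the sketch.
inline DemoPayload demoFormat(const std::string& input) {
  DemoPayload p;
  p.chatMsgs.push_back({"user", input});
  p.tools.push_back({"get_weather"});
  p.layout.toolsPresent = !p.tools.empty();
  return p;
}
```

The apparent benefit of the struct over the pair is that call sites read `parsed.tools` instead of `formatted.second`, and a new field such as `layout` travels with the payload.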

packages/qvac-lib-infer-llamacpp-llm/addon/src/model-interface/ContextSlider.cpp

Lines changed: 137 additions & 0 deletions

@@ -0,0 +1,137 @@
+#include "ContextSlider.hpp"
+
+#include "ToolsCompactController.hpp"
+#include "common/common.h"
+#include "qvac-lib-inference-addon-cpp/Logger.hpp"
+#include "utils/LoggingMacros.hpp"
+
+using namespace qvac_lib_inference_addon_cpp::logger;
+
+namespace {
+class ContextSliderOps final : public IContextSliderOps {
+public:
+  llama_pos nCtx(llama_context* lctx) const override {
+    return static_cast<llama_pos>(llama_n_ctx(lctx));
+  }
+
+  ContextSliderMemoryHandle memory(llama_context* lctx) const override {
+    return llama_get_memory(lctx);
+  }
+
+  void seqRm(
+      ContextSliderMemoryHandle mem, llama_seq_id seqId, llama_pos startPos,
+      llama_pos endPos) const override {
+    llama_memory_seq_rm(mem, seqId, startPos, endPos);
+  }
+
+  void seqAdd(
+      ContextSliderMemoryHandle mem, llama_seq_id seqId, llama_pos startPos,
+      llama_pos endPos, llama_pos delta) const override {
+    llama_memory_seq_add(mem, seqId, startPos, endPos, delta);
+  }
+};
+} // namespace
+
+const IContextSliderOps& defaultContextSliderOps() {
+  static const ContextSliderOps ops;
+  return ops;
+}
+
+ContextSlideOutcome trySlidePrefill(
+    llama_context* lctx, llama_pos nPast, llama_pos firstMsgTokens,
+    llama_pos nTokensToAppend, llama_pos nDiscarded,
+    ToolsCompactController& tools, const IContextSliderOps& ops) {
+
+  const auto nCtx = ops.nCtx(lctx);
+
+  // Check if sliding is needed
+  if (nPast + nTokensToAppend < nCtx) {
+    return {ContextSlideOutcome::Kind::NotNeeded, nPast, 0};
+  }
+
+  // Clamp discard so it never eats into tool tokens
+  llama_pos discard = tools.clampDiscard(nDiscarded, firstMsgTokens);
+  llama_pos leftTokens = nPast - firstMsgTokens - discard;
+
+  // Try partial slide
+  if (leftTokens >= 0 && discard > 0 &&
+      nPast + nTokensToAppend - discard < nCtx) {
+    auto mem = ops.memory(lctx);
+    ops.seqRm(mem, 0, firstMsgTokens, firstMsgTokens + discard);
+    ops.seqAdd(mem, 0, firstMsgTokens + discard, nPast, -discard);
+    llama_pos newNPast = nPast - discard;
+    tools.onSlide(discard, firstMsgTokens);
+    return {ContextSlideOutcome::Kind::Slid, newNPast, discard};
+  }
+
+  // Fallback: wipe everything after the first message
+  if (leftTokens < 0 && firstMsgTokens + nTokensToAppend < nCtx &&
+      nDiscarded > 0) {
+    auto mem = ops.memory(lctx);
+    ops.seqRm(mem, 0, firstMsgTokens, nPast);
+    llama_pos wiped = nPast - firstMsgTokens;
+    if (tools.enabled()) {
+      tools.reset();
+    }
+    return {ContextSlideOutcome::Kind::FullWipe, firstMsgTokens, wiped};
+  }
+
+  // Cannot free enough space
+  return {ContextSlideOutcome::Kind::Overflow, nPast, 0};
+}
+
+ContextSlideOutcome trySlideGeneration(
+    llama_context* lctx, llama_pos nPast, llama_pos firstMsgTokens,
+    llama_pos nDiscarded, ToolsCompactController& tools,
+    const IContextSliderOps& ops) {
+
+  const auto nCtx = ops.nCtx(lctx);
+
+  // Check if sliding is needed (need room for 1 more token)
+  if (nPast + 1 <= nCtx || nDiscarded == 0) {
+    return {ContextSlideOutcome::Kind::NotNeeded, nPast, 0};
+  }
+
+  // Clamp discard so it never eats into tool tokens
+  llama_pos discard = tools.clampDiscard(nDiscarded, firstMsgTokens);
+
+  // Handle degenerate boundary case
+  if (discard == 0 && tools.degenerateBoundary(firstMsgTokens)) {
+    QLOG_IF(
+        Priority::WARNING,
+        string_format(
+            "[ContextSlider] tools_compact anchor equals first message "
+            "boundary "
+            "(nPastBeforeTools=%d, firstMsgTokens=%d) while context is full; "
+            "resetting tool boundary before retry\n",
+            tools.anchor(),
+            firstMsgTokens));
+    tools.reset();
+    discard = tools.clampDiscard(nDiscarded, firstMsgTokens);
+  }
+
+  // If still cannot discard, return NotNeeded (caller handles overflow)
+  if (discard == 0) {
+    QLOG_IF(
+        Priority::WARNING,
+        string_format(
+            "[ContextSlider] context is full but cannot discard tokens "
+            "(nPast=%d, nCtx=%d, nDiscarded=%d, firstMsgTokens=%d, "
+            "nPastBeforeTools=%d, toolsCompact=%s)\n",
+            nPast,
+            nCtx,
+            nDiscarded,
+            firstMsgTokens,
+            tools.anchor(),
+            tools.enabled() ? "true" : "false"));
+    return {ContextSlideOutcome::Kind::NotNeeded, nPast, 0};
+  }
+
+  // Perform the slide
+  auto mem = ops.memory(lctx);
+  ops.seqRm(mem, 0, firstMsgTokens, firstMsgTokens + discard);
+  ops.seqAdd(mem, 0, firstMsgTokens + discard, nPast, -discard);
+  llama_pos newNPast = nPast - discard;
+  tools.onSlide(discard, firstMsgTokens);
+  return {ContextSlideOutcome::Kind::Slid, newNPast, discard};
+}
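
Reviewer note: the branch order in `trySlidePrefill` is easier to check as a pure function. The sketch below copies the four conditions from the diff and takes the already-clamped `discard` as an input, so it needs neither llama.cpp nor `ToolsCompactController`. `SlideKind` and `classifyPrefill` are hypothetical names introduced for illustration.

```cpp
#include <cassert>

enum class SlideKind { NotNeeded, Slid, FullWipe, Overflow };

// Pure restatement of the prefill decision; `discard` is the value that
// clampDiscard() would have returned for `nDiscarded`.
inline SlideKind classifyPrefill(int nCtx, int nPast, int firstMsgTokens,
                                 int nTokensToAppend, int nDiscarded,
                                 int discard) {
  if (nPast + nTokensToAppend < nCtx) {
    return SlideKind::NotNeeded;  // enough room, nothing to do
  }
  const int leftTokens = nPast - firstMsgTokens - discard;
  if (leftTokens >= 0 && discard > 0 &&
      nPast + nTokensToAppend - discard < nCtx) {
    return SlideKind::Slid;  // partial slide frees enough space
  }
  if (leftTokens < 0 && firstMsgTokens + nTokensToAppend < nCtx &&
      nDiscarded > 0) {
    return SlideKind::FullWipe;  // drop everything after the first message
  }
  return SlideKind::Overflow;  // caller is expected to throw
}
```

One consequence worth noting: a partial slide that is possible but too small (for example `discard = 10` when 76 tokens must be freed) falls through to `Overflow` rather than `FullWipe`, because the full-wipe branch requires `leftTokens < 0`.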

packages/qvac-lib-infer-llamacpp-llm/addon/src/model-interface/ContextSlider.hpp

Lines changed: 80 additions & 0 deletions

@@ -0,0 +1,80 @@
+#pragma once
+
+#include <llama.h>
+
+class ToolsCompactController;
+
+using ContextSliderMemoryHandle =
+    decltype(llama_get_memory(static_cast<llama_context*>(nullptr)));
+
+/// Small indirection layer around llama context/memory operations.
+///
+/// This makes ContextSlider testable without requiring a real llama_context.
+struct IContextSliderOps {
+  virtual ~IContextSliderOps() = default;
+  virtual llama_pos nCtx(llama_context* lctx) const = 0;
+  virtual ContextSliderMemoryHandle memory(llama_context* lctx) const = 0;
+  virtual void seqRm(
+      ContextSliderMemoryHandle mem, llama_seq_id seqId, llama_pos startPos,
+      llama_pos endPos) const = 0;
+  virtual void seqAdd(
+      ContextSliderMemoryHandle mem, llama_seq_id seqId, llama_pos startPos,
+      llama_pos endPos, llama_pos delta) const = 0;
+};
+
+/// Returns the default llama-backed ops implementation.
+const IContextSliderOps& defaultContextSliderOps();
+
+/// Outcome of a sliding-window operation on the KV cache.
+struct ContextSlideOutcome {
+  enum class Kind {
+    NotNeeded, // Context had enough room; no slide performed
+    Slid,      // Successfully discarded tokens via partial slide
+    FullWipe,  // Fallback: wiped everything after firstMsgTokens (prefill only)
+    Overflow,  // Could not free enough space; caller should throw
+  };
+
+  Kind kind = Kind::NotNeeded;
+  llama_pos newNPast = 0;  // Updated nPast after the slide
+  llama_pos discarded = 0; // Number of tokens actually discarded
+};
+
+/// Attempts to slide the context window during prefill (eval) phase.
+///
+/// This handles the case where adding nTokensToAppend would overflow the
+/// context. It tries to discard tokens from the middle (after firstMsgTokens)
+/// while respecting the tools_compact anchor via ToolsCompactController.
+///
+/// On success (Slid or FullWipe), the KV cache has been modified and newNPast
+/// reflects the new position. On NotNeeded, no action was taken. On Overflow,
+/// the caller should throw a context overflow error.
+///
+/// @param lctx The llama context for KV cache operations
+/// @param nPast Current token position in the context
+/// @param firstMsgTokens Number of tokens in the first message (protected)
+/// @param nTokensToAppend Number of tokens about to be appended
+/// @param nDiscarded Maximum tokens the caller allows to discard
+/// @param tools Controller for tools_compact anchor management
+/// @return ContextSlideOutcome describing what happened and the new state
+ContextSlideOutcome trySlidePrefill(
+    llama_context* lctx, llama_pos nPast, llama_pos firstMsgTokens,
+    llama_pos nTokensToAppend, llama_pos nDiscarded,
+    ToolsCompactController& tools,
+    const IContextSliderOps& ops = defaultContextSliderOps());
+
+/// Attempts to slide the context window during generation phase.
+///
+/// This handles the case where generating one more token would overflow the
+/// context. Unlike prefill, there is no FullWipe fallback during generation.
+/// If sliding cannot free space, returns NotNeeded with no action.
+///
+/// @param lctx The llama context for KV cache operations
+/// @param nPast Current token position in the context
+/// @param firstMsgTokens Number of tokens in the first message (protected)
+/// @param nDiscarded Maximum tokens the caller allows to discard
+/// @param tools Controller for tools_compact anchor management
+/// @return ContextSlideOutcome describing what happened and the new state
+ContextSlideOutcome trySlideGeneration(
+    llama_context* lctx, llama_pos nPast, llama_pos firstMsgTokens,
+    llama_pos nDiscarded, ToolsCompactController& tools,
+    const IContextSliderOps& ops = defaultContextSliderOps());
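
Reviewer note: the header says `IContextSliderOps` exists so the slider can be tested without a real `llama_context`. The sketch below shows how a recording fake could verify the remove-then-shift sequence. It uses a simplified interface with plain `int` positions and non-const methods so the fake can record calls, which differs from the const methods in the diff. `ISlideOps`, `RecordingOps`, `SlideCall`, and `slideOnce` are all hypothetical names.

```cpp
#include <cassert>
#include <vector>

// Simplified stand-in for IContextSliderOps (no llama.h types).
struct ISlideOps {
  virtual ~ISlideOps() = default;
  virtual void seqRm(int startPos, int endPos) = 0;
  virtual void seqAdd(int startPos, int endPos, int delta) = 0;
};

// One recorded call: op is 'R' for seqRm or 'A' for seqAdd.
struct SlideCall {
  char op;
  int start;
  int end;
  int delta;
};

// Test double that records calls instead of touching a KV cache.
class RecordingOps final : public ISlideOps {
public:
  void seqRm(int startPos, int endPos) override {
    calls.push_back({'R', startPos, endPos, 0});
  }
  void seqAdd(int startPos, int endPos, int delta) override {
    calls.push_back({'A', startPos, endPos, delta});
  }
  std::vector<SlideCall> calls;
};

// Same sequence as the "Perform the slide" step in the diff: remove
// [firstMsgTokens, firstMsgTokens + discard), then shift the tail left.
inline int slideOnce(ISlideOps& ops, int nPast, int firstMsgTokens,
                     int discard) {
  ops.seqRm(firstMsgTokens, firstMsgTokens + discard);
  ops.seqAdd(firstMsgTokens + discard, nPast, -discard);
  return nPast - discard;
}
```

A test built this way can assert the exact ranges passed to the cache operations, which is presumably what the "deterministic unit tests" mentioned in the commit message rely on.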
