diff --git a/packages/qvac-lib-infer-llamacpp-llm/CHANGELOG.md b/packages/qvac-lib-infer-llamacpp-llm/CHANGELOG.md
Lines changed: 56 additions & 0 deletions
diff --git a/packages/qvac-lib-infer-llamacpp-llm/CMakeLists.txt b/packages/qvac-lib-infer-llamacpp-llm/CMakeLists.txt
Lines changed: 4 additions & 0 deletions
diff --git a/packages/qvac-lib-infer-llamacpp-llm/README.md b/packages/qvac-lib-infer-llamacpp-llm/README.md
Lines changed: 2 additions & 2 deletions
diff --git a/packages/qvac-lib-infer-llamacpp-llm/addon/src/model-interface/CacheManager.cpp b/packages/qvac-lib-infer-llamacpp-llm/addon/src/model-interface/CacheManager.cpp
Lines changed: 3 additions & 9 deletions
diff --git a/packages/qvac-lib-infer-llamacpp-llm/addon/src/model-interface/CacheManager.hpp b/packages/qvac-lib-infer-llamacpp-llm/addon/src/model-interface/CacheManager.hpp
Lines changed: 9 additions & 6 deletions
diff --git a/packages/qvac-lib-infer-llamacpp-llm/addon/src/model-interface/ContextSlider.cpp b/packages/qvac-lib-infer-llamacpp-llm/addon/src/model-interface/ContextSlider.cpp
Lines changed: 137 additions & 0 deletions
diff --git a/packages/qvac-lib-infer-llamacpp-llm/addon/src/model-interface/ContextSlider.hpp b/packages/qvac-lib-infer-llamacpp-llm/addon/src/model-interface/ContextSlider.hpp
Lines changed: 80 additions & 0 deletions
@@ -1,5 +1,61 @@
 # Changelog
 
+## [0.17.0] - 2026-04-21
+
+### Changed
+
+#### `tools_at_end` renamed to `tools_compact`
+
+**Breaking**: The `tools_at_end` configuration option has been renamed to `tools_compact`. The old key is no longer recognized.
+
+#### Anchored tool placement for multi-round tool chains
+
+Tools are now anchored after the **last user message** (via a two-pass Jinja2 template that tracks `last_user_idx`) instead of being appended at the very end of the prompt. The tool boundary is set once on the first round and preserved across chain rounds, so tools stay in the KV cache while the model is still calling tools. Trimming now only happens when the chain completes (output contains no `<tool_call>` tag), instead of after every turn.
+
+This eliminates redundant tokenize → eval → trim cycles during multi-round tool chains and matches the model's expected prompt layout more closely.
+
+#### `<think>` blocks stripped from assistant history
+
+The Qwen3 tools-dynamic template no longer re-injects `<think>…</think>` reasoning blocks into assistant history. Prior assistant messages are replayed with the thinking content stripped, which reduces token waste and avoids the model treating stale reasoning as context.
+
+#### `tools_compact` prompt-shape validation tightened
+
+`tools_compact` now validates prompt layout before inference and fails fast with `InvalidArgument` for malformed inputs (for example: required tools omitted, non-contiguous tool block, tools not attached to the last user/tool anchor, or tools not placed at the end).
+
+### Fixed
+
+#### Context sliding with `tools_compact` could corrupt tool boundary tracking
+
+When context sliding (token discard) occurred during generation or prefill with `tools_compact` enabled, the `nPastBeforeTools` boundary could become stale. This caused post-generation trim to remove the wrong tail region and could leave tool tokens in the KV cache across turns.
+
+Sliding is now centralized through `ContextSlider` + `ToolsCompactController`:
+- `clampDiscard()` caps discard so sliding never crosses into protected tool tokens
+- `onSlide()` keeps `nPastBeforeTools` aligned after each slide
+- Fallback full-wipe paths reset controller state to avoid stale boundaries
+- Applied consistently in both `TextLlmContext` and `MtmdLlmContext`
+
+#### Output duplication in streaming mode with `tools_compact`
+
+In streaming mode the captured output buffer was being returned as the final result, causing the SDK to see every token twice (once streamed, once in the result). The captured buffer is now used only for internal `<tool_call>` detection.
+
+#### Generation prompt added on system-only prefill
+
+When `nPast=0` and the only message was a system prompt, `add_generation_prompt` was hardcoded to `true`, injecting a stale `<|im_start|>assistant` token into the cache. `add_generation_prompt` is now derived from the role of the actual last message.
+
+#### `"tool"` role not treated as turn-ending for generation prompt
+
+Messages with role `"tool"` (tool call results) were not triggering `add_generation_prompt`, causing empty responses on tool chain continuation. The `"tool"` role is now treated the same as `"user"` for generation prompt purposes.
+
+#### Empty chat message array now fails with `EmptyPrompt`
+
+`tokenizeChat()` now throws `StatusError(EmptyPrompt)` when called with no chat messages, making empty prompt handling explicit and consistent for both text and multimodal contexts.
+
+### Added
+
+- `runtimeDebugStats()` internal method on `LlamaModel` exposing `nPastBeforeTools`, `firstMsgTokens`, and `toolsTrimmed`
+- Comprehensive C++ unit tests for the Qwen3 tools-dynamic template and for cache management with `tools_compact`
+- Regression tests for context sliding with anchored tools: clamped discard, anchor updates after slide, unclamped sliding with long conversations, and sliding during generation
+
 ## [0.16.0] - 2026-04-14
 
 This release migrates the LLM addon off `BaseInference` inheritance and the `WeightsProvider` download layer onto the composable `createJobHandler` + `exclusiveRunQueue` utilities from `@qvac/infer-base@^0.4.0`. The constructor signature is replaced with a single object whose `files.model` field is an ordered array of absolute paths and `files.projectionModel` is an optional absolute path for multimodal models. This is a breaking change — every caller must update.
 
@@ -64,11 +64,13 @@ endif()
   list(APPEND ADDON_SOURCES
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/AsyncWeightsLoader.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/CacheManager.cpp
+    ${PROJECT_SOURCE_DIR}/addon/src/model-interface/ContextSlider.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/LlamaLazyInitializeBackend.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/LlamaModel.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/LlamaFinetuningHelpers.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/MtmdLlmContext.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/TextLlmContext.cpp
+    ${PROJECT_SOURCE_DIR}/addon/src/model-interface/ToolsCompactController.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/ModelMetadata.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/utils/LoggingMacros.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/utils/BackendSelection.cpp
@@ -110,10 +112,12 @@ if(BUILD_CLI)
     ${PROJECT_SOURCE_DIR}/addon/src/cli/cli_tool.cc
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/AsyncWeightsLoader.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/CacheManager.cpp
+    ${PROJECT_SOURCE_DIR}/addon/src/model-interface/ContextSlider.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/LlamaLazyInitializeBackend.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/LlamaModel.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/MtmdLlmContext.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/TextLlmContext.cpp
+    ${PROJECT_SOURCE_DIR}/addon/src/model-interface/ToolsCompactController.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/model-interface/ModelMetadata.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/utils/LoggingMacros.cpp
     ${PROJECT_SOURCE_DIR}/addon/src/utils/BackendSelection.cpp
 
@@ -173,7 +173,7 @@ const config = {
 | presence_penalty  | float                                       | 0                            | Presence penalty for sampling                         |
 | frequency_penalty | float                                       | 0                            | Frequency penalty for sampling                        |
 | tools             | `"true"` or `"false"`                       | `"false"`                    | Enable tool calling with jinja templating             |
-| tools_at_end      | `"true"` or `"false"`                       | `"false"`                    | Place tools at end of prompt ([details](./docs/tools-at-end.md)) |
+| tools_compact     | `"true"` or `"false"`                       | `"false"`                    | Compact tool tokens from KV cache between turns ([details](./docs/tools-compact.md)) |
 | verbosity         | 0 – 3 (0=ERROR, 1=WARNING, 2=INFO, 3=DEBUG) | 0                            | Logging verbosity level                               |
 | n_discarded       | integer                                     | 0                            | Tokens to discard in sliding window context           |
 | main-gpu          | integer, `"integrated"`, or `"dedicated"`   | —                            | GPU selection for multi-GPU systems                   |
@@ -315,7 +315,7 @@ npm run quickstart
 -   [LoRA Finetuning Pause/Resume](./examples/finetune/simple-lora-finetune-pause-resume.js) – Pause and resume finetuning.
 -   [LoRA Inference](./examples/simple-lora-inference.js) – Inference with a finetuned LoRA adapter.
 -   [Smart Home Finetune Showcase](./examples/finetune/showcase/smart-home-finetune.js) – Train a smart home tool-calling specialist, then [evaluate](./examples/finetune/showcase/smart-home-finetuned-test.js) baseline vs finetuned.
--   [Bench Tools Placement](./examples/benchToolsPlacement.js) – Benchmarks standard vs `tools_at_end` placement across multi-turn conversations.
+-   [Bench Tools Placement](./examples/benchToolsPlacement.js) – Benchmarks standard vs `tools_compact` placement across multi-turn conversations.
 -   [Test Tool Removal](./examples/testToolRemoval.js) – Demonstrates dynamic tool addition and removal between turns.
 
 ## OCR with Vision-Language Models
 
@@ -29,17 +29,11 @@ bool CacheManager::isFileInitialized(const std::filesystem::path& path) {
 }
 
 bool CacheManager::handleCache(
-    std::vector<common_chat_msg>& chatMsgs,
-    std::vector<common_chat_tool>& tools, const std::string& inputPrompt,
-    std::function<
-        std::pair<std::vector<common_chat_msg>, std::vector<common_chat_tool>>(
-            const std::string&)>
-        formatPrompt,
+    ParsedPromptPayload& parsedPrompt, const std::string& inputPrompt,
+    std::function<ParsedPromptPayload(const std::string&)> formatPrompt,
     const std::string& cacheKey) {
 
-  auto formatted = formatPrompt(inputPrompt);
-  chatMsgs = std::move(formatted.first);
-  tools = std::move(formatted.second);
+  parsedPrompt = formatPrompt(inputPrompt);
 
   if (cacheKey.empty()) {
     if (hasActiveCache()) {
 
@@ -8,21 +8,24 @@
 #include <llama.h>
 
 #include "LlmContext.hpp"
+#include "ToolsCompactController.hpp"
 #include "common/chat.h"
 
+struct ParsedPromptPayload {
+  std::vector<common_chat_msg> chatMsgs;
+  std::vector<common_chat_tool> tools;
+  PromptLayout layout;
+};
+
 class CacheManager {
 public:
   CacheManager(
       LlmContext* llmContext, llama_pos configuredNDiscarded,
       std::function<void(bool)> resetStateCallback);
 
   bool handleCache(
-      std::vector<common_chat_msg>& chatMsgs,
-      std::vector<common_chat_tool>& tools, const std::string& inputPrompt,
-      std::function<std::pair<
-          std::vector<common_chat_msg>, std::vector<common_chat_tool>>(
-          const std::string&)>
-          formatPrompt,
+      ParsedPromptPayload& parsedPrompt, const std::string& inputPrompt,
+      std::function<ParsedPromptPayload(const std::string&)> formatPrompt,
       const std::string& cacheKey = "");
 
   bool loadCache();
 
@@ -0,0 +1,137 @@
+#include "ContextSlider.hpp"
+
+#include "ToolsCompactController.hpp"
+#include "common/common.h"
+#include "qvac-lib-inference-addon-cpp/Logger.hpp"
+#include "utils/LoggingMacros.hpp"
+
+using namespace qvac_lib_inference_addon_cpp::logger;
+
+namespace {
+class ContextSliderOps final : public IContextSliderOps {
+public:
+  llama_pos nCtx(llama_context* lctx) const override {
+    return static_cast<llama_pos>(llama_n_ctx(lctx));
+  }
+
+  ContextSliderMemoryHandle memory(llama_context* lctx) const override {
+    return llama_get_memory(lctx);
+  }
+
+  void seqRm(
+      ContextSliderMemoryHandle mem, llama_seq_id seqId, llama_pos startPos,
+      llama_pos endPos) const override {
+    llama_memory_seq_rm(mem, seqId, startPos, endPos);
+  }
+
+  void seqAdd(
+      ContextSliderMemoryHandle mem, llama_seq_id seqId, llama_pos startPos,
+      llama_pos endPos, llama_pos delta) const override {
+    llama_memory_seq_add(mem, seqId, startPos, endPos, delta);
+  }
+};
+} // namespace
+
+const IContextSliderOps& defaultContextSliderOps() {
+  static const ContextSliderOps ops;
+  return ops;
+}
+
+ContextSlideOutcome trySlidePrefill(
+    llama_context* lctx, llama_pos nPast, llama_pos firstMsgTokens,
+    llama_pos nTokensToAppend, llama_pos nDiscarded,
+    ToolsCompactController& tools, const IContextSliderOps& ops) {
+
+  const auto nCtx = ops.nCtx(lctx);
+
+  // Check if sliding is needed
+  if (nPast + nTokensToAppend < nCtx) {
+    return {ContextSlideOutcome::Kind::NotNeeded, nPast, 0};
+  }
+
+  // Clamp discard so it never eats into tool tokens
+  llama_pos discard = tools.clampDiscard(nDiscarded, firstMsgTokens);
+  llama_pos leftTokens = nPast - firstMsgTokens - discard;
+
+  // Try partial slide
+  if (leftTokens >= 0 && discard > 0 &&
+      nPast + nTokensToAppend - discard < nCtx) {
+    auto mem = ops.memory(lctx);
+    ops.seqRm(mem, 0, firstMsgTokens, firstMsgTokens + discard);
+    ops.seqAdd(mem, 0, firstMsgTokens + discard, nPast, -discard);
+    llama_pos newNPast = nPast - discard;
+    tools.onSlide(discard, firstMsgTokens);
+    return {ContextSlideOutcome::Kind::Slid, newNPast, discard};
+  }
+
+  // Fallback: wipe everything after the first message
+  if (leftTokens < 0 && firstMsgTokens + nTokensToAppend < nCtx &&
+      nDiscarded > 0) {
+    auto mem = ops.memory(lctx);
+    ops.seqRm(mem, 0, firstMsgTokens, nPast);
+    llama_pos wiped = nPast - firstMsgTokens;
+    if (tools.enabled()) {
+      tools.reset();
+    }
+    return {ContextSlideOutcome::Kind::FullWipe, firstMsgTokens, wiped};
+  }
+
+  // Cannot free enough space
+  return {ContextSlideOutcome::Kind::Overflow, nPast, 0};
+}
+
+ContextSlideOutcome trySlideGeneration(
+    llama_context* lctx, llama_pos nPast, llama_pos firstMsgTokens,
+    llama_pos nDiscarded, ToolsCompactController& tools,
+    const IContextSliderOps& ops) {
+
+  const auto nCtx = ops.nCtx(lctx);
+
+  // Check if sliding is needed (need room for 1 more token)
+  if (nPast + 1 <= nCtx || nDiscarded == 0) {
+    return {ContextSlideOutcome::Kind::NotNeeded, nPast, 0};
+  }
+
+  // Clamp discard so it never eats into tool tokens
+  llama_pos discard = tools.clampDiscard(nDiscarded, firstMsgTokens);
+
+  // Handle degenerate boundary case
+  if (discard == 0 && tools.degenerateBoundary(firstMsgTokens)) {
+    QLOG_IF(
+        Priority::WARNING,
+        string_format(
+            "[ContextSlider] tools_compact anchor equals first message "
+            "boundary "
+            "(nPastBeforeTools=%d, firstMsgTokens=%d) while context is full; "
+            "resetting tool boundary before retry\n",
+            tools.anchor(),
+            firstMsgTokens));
+    tools.reset();
+    discard = tools.clampDiscard(nDiscarded, firstMsgTokens);
+  }
+
+  // If still cannot discard, return NotNeeded (caller handles overflow)
+  if (discard == 0) {
+    QLOG_IF(
+        Priority::WARNING,
+        string_format(
+            "[ContextSlider] context is full but cannot discard tokens "
+            "(nPast=%d, nCtx=%d, nDiscarded=%d, firstMsgTokens=%d, "
+            "nPastBeforeTools=%d, toolsCompact=%s)\n",
+            nPast,
+            nCtx,
+            nDiscarded,
+            firstMsgTokens,
+            tools.anchor(),
+            tools.enabled() ? "true" : "false"));
+    return {ContextSlideOutcome::Kind::NotNeeded, nPast, 0};
+  }
+
+  // Perform the slide
+  auto mem = ops.memory(lctx);
+  ops.seqRm(mem, 0, firstMsgTokens, firstMsgTokens + discard);
+  ops.seqAdd(mem, 0, firstMsgTokens + discard, nPast, -discard);
+  llama_pos newNPast = nPast - discard;
+  tools.onSlide(discard, firstMsgTokens);
+  return {ContextSlideOutcome::Kind::Slid, newNPast, discard};
+}
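
Reviewer note: `ToolsCompactController.cpp` is not part of this excerpt, so the exact rules behind `clampDiscard()` and `onSlide()` are not visible here. The sketch below is one plausible reading of the changelog text and of the degenerate-boundary warning above, not the PR's implementation. `AnchorSketch`, `kUnset`, and the clamping rule itself are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical, simplified stand-in for ToolsCompactController. The real
// class is not shown in this diff and may behave differently.
class AnchorSketch {
public:
  static constexpr std::int32_t kUnset = -1;

  explicit AnchorSketch(std::int32_t anchor = kUnset) : anchor_(anchor) {}

  std::int32_t anchor() const { return anchor_; }

  // Forget the tool boundary (assumed to mirror the reset on full wipe).
  void reset() { anchor_ = kUnset; }

  // Assumed rule: only tokens in [firstMsgTokens, anchor) may be discarded,
  // so the request is capped at the size of that window.
  std::int32_t clampDiscard(std::int32_t requested,
                            std::int32_t firstMsgTokens) const {
    if (anchor_ == kUnset) {
      return requested; // no tool boundary to protect
    }
    const std::int32_t window =
        std::max<std::int32_t>(0, anchor_ - firstMsgTokens);
    return std::min(requested, window);
  }

  // Assumed rule: removing `discard` tokens below the anchor shifts the
  // anchor left by the same amount.
  void onSlide(std::int32_t discard, std::int32_t firstMsgTokens) {
    if (anchor_ == kUnset) {
      return;
    }
    anchor_ -= discard;
    assert(anchor_ >= firstMsgTokens);
  }

private:
  std::int32_t anchor_;
};
```

Under these assumptions, an anchor at 20 with `firstMsgTokens` of 8 allows at most 12 tokens to be discarded. After that slide the anchor sits at 8, the next clamp returns 0, and that is the situation the degenerate-boundary branch in `trySlideGeneration` resets before retrying.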
@@ -0,0 +1,80 @@
+#pragma once
+
+#include <llama.h>
+
+class ToolsCompactController;
+
+using ContextSliderMemoryHandle =
+    decltype(llama_get_memory(static_cast<llama_context*>(nullptr)));
+
+/// Small indirection layer around llama context/memory operations.
+///
+/// This makes ContextSlider testable without requiring a real llama_context.
+struct IContextSliderOps {
+  virtual ~IContextSliderOps() = default;
+  virtual llama_pos nCtx(llama_context* lctx) const = 0;
+  virtual ContextSliderMemoryHandle memory(llama_context* lctx) const = 0;
+  virtual void seqRm(
+      ContextSliderMemoryHandle mem, llama_seq_id seqId, llama_pos startPos,
+      llama_pos endPos) const = 0;
+  virtual void seqAdd(
+      ContextSliderMemoryHandle mem, llama_seq_id seqId, llama_pos startPos,
+      llama_pos endPos, llama_pos delta) const = 0;
+};
+
+/// Returns the default llama-backed ops implementation.
+const IContextSliderOps& defaultContextSliderOps();
+
+/// Outcome of a sliding-window operation on the KV cache.
+struct ContextSlideOutcome {
+  enum class Kind {
+    NotNeeded, // Context had enough room; no slide performed
+    Slid,      // Successfully discarded tokens via partial slide
+    FullWipe,  // Fallback: wiped everything after firstMsgTokens (prefill only)
+    Overflow,  // Could not free enough space; caller should throw
+  };
+
+  Kind kind = Kind::NotNeeded;
+  llama_pos newNPast = 0;  // Updated nPast after the slide
+  llama_pos discarded = 0; // Number of tokens actually discarded
+};
+
+/// Attempts to slide the context window during prefill (eval) phase.
+///
+/// This handles the case where adding nTokensToAppend would overflow the
+/// context. It tries to discard tokens from the middle (after firstMsgTokens)
+/// while respecting the tools_compact anchor via ToolsCompactController.
+///
+/// On success (Slid or FullWipe), the KV cache has been modified and newNPast
+/// reflects the new position. On NotNeeded, no action was taken. On Overflow,
+/// the caller should throw a context overflow error.
+///
+/// @param lctx           The llama context for KV cache operations
+/// @param nPast          Current token position in the context
+/// @param firstMsgTokens Number of tokens in the first message (protected)
+/// @param nTokensToAppend Number of tokens about to be appended
+/// @param nDiscarded     Maximum tokens the caller allows to discard
+/// @param tools          Controller for tools_compact anchor management
+/// @return ContextSlideOutcome describing what happened and the new state
+ContextSlideOutcome trySlidePrefill(
+    llama_context* lctx, llama_pos nPast, llama_pos firstMsgTokens,
+    llama_pos nTokensToAppend, llama_pos nDiscarded,
+    ToolsCompactController& tools,
+    const IContextSliderOps& ops = defaultContextSliderOps());
+
+/// Attempts to slide the context window during generation phase.
+///
+/// This handles the case where generating one more token would overflow the
+/// context. Unlike prefill, there is no FullWipe fallback during generation.
+/// If sliding cannot free space, returns NotNeeded with no action.
+///
+/// @param lctx           The llama context for KV cache operations
+/// @param nPast          Current token position in the context
+/// @param firstMsgTokens Number of tokens in the first message (protected)
+/// @param nDiscarded     Maximum tokens the caller allows to discard
+/// @param tools          Controller for tools_compact anchor management
+/// @return ContextSlideOutcome describing what happened and the new state
+ContextSlideOutcome trySlideGeneration(
+    llama_context* lctx, llama_pos nPast, llama_pos firstMsgTokens,
+    llama_pos nDiscarded, ToolsCompactController& tools,
+    const IContextSliderOps& ops = defaultContextSliderOps());
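
Reviewer note: the `seqRm` + `seqAdd` pair used by both slide functions can be checked without a real `llama_context`, which is the point of the `IContextSliderOps` indirection. The sketch below models the KV cache as a plain list of token positions. `FakeKvCache`, `makeCache`, and `slideSketch` are hypothetical helpers, not part of the PR or of llama.cpp, and the range semantics assume non-negative bounds with half-open `[p0, p1)` intervals.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the KV cache of sequence 0: one entry per
// cached token, holding that token's position.
struct FakeKvCache {
  std::vector<std::int32_t> positions;

  // Drops every token whose position lies in [p0, p1).
  void seqRm(std::int32_t p0, std::int32_t p1) {
    positions.erase(
        std::remove_if(
            positions.begin(), positions.end(),
            [=](std::int32_t p) { return p >= p0 && p < p1; }),
        positions.end());
  }

  // Shifts every token whose position lies in [p0, p1) by delta.
  void seqAdd(std::int32_t p0, std::int32_t p1, std::int32_t delta) {
    for (auto& p : positions) {
      if (p >= p0 && p < p1) {
        p += delta;
      }
    }
  }
};

// Builds a cache holding positions 0 .. nPast-1.
FakeKvCache makeCache(std::int32_t nPast) {
  FakeKvCache kv;
  for (std::int32_t i = 0; i < nPast; ++i) {
    kv.positions.push_back(i);
  }
  return kv;
}

// Same two-step slide as in the diff: remove `discard` tokens right after
// the protected first message, then shift the tail left to close the gap.
std::int32_t slideSketch(FakeKvCache& kv, std::int32_t nPast,
                         std::int32_t firstMsgTokens, std::int32_t discard) {
  assert(discard > 0 && firstMsgTokens + discard <= nPast);
  kv.seqRm(firstMsgTokens, firstMsgTokens + discard);
  kv.seqAdd(firstMsgTokens + discard, nPast, -discard);
  return nPast - discard;
}
```

With `nPast = 10`, `firstMsgTokens = 3`, and `discard = 4`, positions 3 to 6 are removed and positions 7 to 9 move down to 3 to 5, leaving a contiguous cache of six tokens and a new `nPast` of 6.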