---

## Pattern 7: External Triggers Over Agent Discipline

### What It Is

The principle that any behavior which depends on an agent *remembering* to do something on its own will eventually fail. Every recurring behavior that matters must be driven by an external trigger — a cron timer, an event hook, a file watcher — never by the agent deciding to do it.

### Why It Works

Agents don't form habits. This is the single most important operational insight from running autonomous agents in production, and it's the meta-principle that justifies Patterns 1, 3, 4, and 5.
A cron timer fires every time. An event trigger fires when its condition is met. An agent *deciding* to check something, reset something, or report something? It works for a while. Maybe a day. Maybe a week. Then it silently stops, and you'll never know exactly when or why. The agent didn't rebel. It didn't refuse. It just... drifted past it. The context shifted, the priority changed, the prompt got crowded, and the behavior dropped out of the window.
This isn't a bug to fix. It's a property of non-deterministic systems. The same variability that makes LLMs useful — their flexibility, creativity, ability to handle novel situations — is what makes them unreliable for recurring discipline. You can't prompt your way out of this. You can't fine-tune it away. You design around it.

### Implementation Levels

**Level 1 — Timers for Critical Behaviors:**
Session resets, health check-ins, memory archival, status reports — anything that must happen on a schedule gets a cron job or systemd timer. The agent doesn't choose when. The timer fires, the agent responds.
**Level 2 — Event Triggers for Reactive Behaviors:**
File size exceeding a threshold triggers session cleanup. A git push triggers a sync. A failed health check triggers a restart. The agent doesn't monitor — the infrastructure does, and pokes the agent (or acts without it) when something needs attention.
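
The Level 2 idea fits in a few lines. A sketch of a size-triggered check, with a made-up threshold (the real watcher is [session-cleanup.sh](../../tools/session-cleanup.sh)):

```python
import os

# Hypothetical threshold -- the real value lives in session-cleanup.sh.
MAX_SESSION_BYTES = 512 * 1024

def check_session(path: str) -> bool:
    """Return True if the session file exceeds the threshold and needs rotation.

    The point is that this runs from a cron job or file watcher: the agent
    itself never decides when to look.
    """
    try:
        return os.path.getsize(path) > MAX_SESSION_BYTES
    except FileNotFoundError:
        return False  # no session file yet, nothing to rotate
```

The infrastructure calls this on its own schedule and rotates the session when it returns `True`; the agent is only informed after the fact.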
**Level 3 — Zero Reliance on Agent-Initiated Discipline:**
The system is designed so that no critical behavior requires the agent to "remember" anything between sessions. Identity comes from a file the agent reads, not from what it recalls. Coordination comes from git state, not from what agents told each other. Accountability comes from a supervisor's ping, not from the agent's conscience.

### Watch Out For

- **The temptation to trust it after it works for a while.** Your agent will reliably self-report for days. You'll think "maybe I don't need the timer." You do. The moment you remove it is the moment drift begins, and you won't notice for another week.
- **Confusing capability with reliability.** An agent *can* check its own session size. It *can* decide to reset. It *can* remember to follow up. "Can" is not "will, every time, forever." Design for "will."
- **Hope is not an ops strategy.** If the sentence starts with "the agent should..." and doesn't end with "...because this timer/trigger forces it," the behavior will drop.

### This Toolkit's Implementation

This pattern is embedded throughout the other patterns rather than having a single implementation:

- [Pattern 1 (Supervision)](../../multi-agent/patterns/PATTERNS.md) — Android-18 supervisor fires on a timer, not on agent initiative
- [Pattern 4 (Session Lifecycle)](../../multi-agent/patterns/PATTERNS.md) — session cleanup runs on file-size triggers, not on agent self-awareness
- [Pattern 5 (Memory Stratification)](../../multi-agent/patterns/PATTERNS.md) — Memory Shepherd resets on a timer, not on the agent deciding it's time
- [session-cleanup.sh](../../tools/session-cleanup.sh) — file-size watcher that triggers session rotation
- [ai-health-monitor.sh](../../tools/ai-health-monitor.sh) — cron-driven health checks, not agent self-monitoring

---
## Applying These Patterns
You don't need all seven. Start with what hurts most:
- **Agents keep crashing?** Start with Pattern 6 (Self-Healing) and Pattern 4 (Session Lifecycle).
- **Agents lose context between sessions?** Start with Pattern 2 (Workspace-as-Brain) and Pattern 5 (Memory Stratification).
- **Agents wander off task?** Start with Pattern 3 (Mission Governance) and Pattern 1 (Supervision).
- **Agents are unreliable at recurring tasks?** Start with Pattern 7 (External Triggers) — then audit every behavior that depends on agent initiative.
- **Building a multi-agent system?** Start with Pattern 1 (Supervision) and Pattern 3 (Mission Governance), then layer in the rest.

The patterns compose well. Each addresses a different failure mode, and each is independent — you can implement Pattern 2 without Pattern 1, or Pattern 6 without Pattern 3. But together, they form a complete safety stack for autonomous agent operations.

---

What changed when we could actually see where the money was going.

---

## Before: Flying Blind
We were running three AI agents 24/7 across two GPU servers with zero cost visibility. Per-turn costs were in the range of 5-6 cents, with some agents spiking to roughly 8 cents on heavy turns. We didn't know that at the time — there was nothing showing us.
The symptoms were indirect:
- API rate limit hits at unexpected times
- Monthly cloud bills that felt high but couldn't be attributed to specific agents or tasks
- No way to tell if a session was productive or if an agent was spinning in a loop
- Context windows silently growing, inflating per-turn costs with every message

The $79/day (Android-17) and $113/day (Todd) snapshots from early operation were discovered *after* Token Spy was running — before that, we had no per-agent breakdown at all.

## What Token Spy Revealed

Once per-turn token counts, cost breakdowns, and session sizes were visible on a live dashboard, the problems became obvious within hours:
**1. Model mismatch.** Agents were using frontier cloud models for tasks that a local 32B model handles fine — code generation, structured output, tool calling. Moving heavy-volume work to local Qwen via vLLM eliminated those costs entirely.
**2. Session bloat.** Without visibility into session size, context windows grew unchecked. Larger context = more input tokens per turn = higher cost per turn, compounding over time. Once we could see session sizes in real time, we added automated resets (Pattern 4) that kept sessions lean.
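
The compounding is easy to see with toy numbers. A sketch, using an assumed per-token price and message size (illustrative values, not our actual rates):

```python
# Toy model of context growth. Prices and sizes are assumptions for illustration.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000  # $/token, assumed
TOKENS_PER_MESSAGE = 500                   # assumed average message size

def session_input_cost(turns: int, system_tokens: int = 2_000) -> float:
    """Total input cost if the full history is resent on every turn."""
    total = 0.0
    for turn in range(1, turns + 1):
        # Context = system prompt plus all prior messages.
        context = system_tokens + TOKENS_PER_MESSAGE * (turn - 1)
        total += context * PRICE_PER_INPUT_TOKEN
    return total
```

With these assumed numbers, one 200-turn session pays several times more for input than ten 20-turn sessions covering the same work, which is the economic case for the automated resets.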
**3. Cache underutilization.** Anthropic's prompt caching can significantly reduce input costs for repetitive system prompts and tool definitions. Token Spy showed cache hit rates per agent, revealing which agents were benefiting from caching and which were missing it due to prompt variability.
**4. Waste loops.** Agents occasionally enter loops — calling the same tool repeatedly, generating similar responses, or retrying failed operations. Without per-turn visibility, these loops burned tokens silently for hours. Token Spy's session timeline made them immediately visible.

## After: Roughly 3-4x Reduction

Within a few days of having visibility, effective per-turn costs dropped by approximately 3-4x through a combination of:
- Routing high-volume work to local inference (zero marginal cost)
- Controlling session sizes to reduce input token inflation
- Matching models to actual task complexity (frontier for reasoning, local for execution)
- Catching and killing waste loops early

The exact reduction varies by workload and agent. The direction is consistent: you make very different decisions when you can see what's happening.

## Takeaway

The cost problem was never about expensive models. It was about not knowing which costs were productive and which were waste. Token Spy didn't reduce costs by being clever — it reduced costs by making the data visible so that humans could make obvious decisions they couldn't make before.

---

*See [PRODUCT-SCOPE.md](PRODUCT-SCOPE.md) for the full feature roadmap and [PHASE1-ARCHITECTURE.md](PHASE1-ARCHITECTURE.md) for the technical design.*

---

When using vLLM as a backend for agent frameworks (OpenClaw, LangChain, custom loops), several parameters and response fields cause silent failures. No error, no warning — the agent just gets nothing back, or the framework chokes on unexpected fields. These were discovered through production debugging and are handled by the [vllm-tool-proxy](../../tools/vllm-tool-proxy.py).

### Problem: Streaming + Tool Calls Don't Mix

If the client sends `"stream": true` with tools present, vLLM streams tokens incrementally. But tool calls embedded in content (common with models that output tool JSON as text rather than structured `tool_calls`) can't be extracted from a stream mid-flight. The framework receives partial JSON fragments and fails silently.
**Fix:** Force `"stream": false` when `tools` are present. Extract tool calls from the complete response, then optionally re-wrap as SSE if the client expects streaming.
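
A sketch of that request-side rule, assuming OpenAI-style request bodies (the production version lives in [vllm-tool-proxy.py](../../tools/vllm-tool-proxy.py); the helper name is ours):

```python
def prepare_request(body: dict) -> dict:
    """Force non-streaming whenever tools are present, so tool calls can be
    extracted from a complete response instead of partial stream fragments."""
    body = dict(body)  # shallow copy: don't mutate the caller's request
    if body.get("tools"):
        body["stream"] = False
    return body
```

Requests without tools pass through untouched, so normal streaming chat is unaffected.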

### Problem: `stream_options` on Non-Streaming Requests

vLLM 0.14+ rejects requests that include `"stream_options"` when `"stream"` is `false`. The rejection is silent — the request either hangs or returns an empty response. Many frameworks send `stream_options` by default regardless of streaming mode.
**Fix:** Strip `"stream_options"` from the request body whenever `"stream"` is `false` or absent.
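
The corresponding sketch, again on an OpenAI-style request dict (the helper name is ours):

```python
def strip_stream_options(body: dict) -> dict:
    """Drop stream_options unless the request is actually streaming.

    vLLM rejects stream_options combined with stream=false, so the field
    is only safe to forward when stream is truthy.
    """
    body = dict(body)
    if not body.get("stream"):
        body.pop("stream_options", None)
    return body
```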

### Problem: Extra Response Fields Break Framework Parsers

vLLM returns fields that don't exist in the OpenAI spec. Frameworks that strictly validate response schemas fail silently when they encounter these. The problematic fields:

**Fix:** Strip all non-standard fields from the response before forwarding to the framework. Also ensure `tool_calls` is absent (not an empty list `[]`) when no tools were called — some frameworks treat `[]` as "tools were attempted and failed."
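
Since allow-listing is safer than chasing individual extra fields, a sketch of the response-side cleanup (the allowed-field set here is an assumption based on the OpenAI message schema, not an exhaustive list):

```python
# Keep only fields the OpenAI spec defines on a chat-completion message.
# Assumed set for illustration -- extend as the spec evolves.
ALLOWED_MESSAGE_FIELDS = {"role", "content", "tool_calls", "refusal"}

def sanitize_message(message: dict) -> dict:
    """Drop non-spec fields and omit tool_calls entirely when it is empty."""
    clean = {k: v for k, v in message.items() if k in ALLOWED_MESSAGE_FIELDS}
    if not clean.get("tool_calls"):
        clean.pop("tool_calls", None)  # absent, never []
    return clean
```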

### Problem: Tool Calls Returned as Text Content

Some models (notably GPT-OSS-120B, some Qwen configurations) output tool calls as plain text in the `content` field rather than as structured `tool_calls`. The response looks normal to vLLM but the framework sees no tool calls and falls back to treating it as a text response.

2. `{"name": "func", "arguments": {...}}` — bare JSON as content
3. Multi-line JSON — multiple tool calls on separate lines

**Fix:** Post-process the response: if `tool_calls` is empty but `content` contains parseable tool JSON in any of these formats, extract it, build proper `tool_calls` structures, and set `finish_reason` to `"tool_calls"`.
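
A sketch of that post-processing for the bare-JSON and multi-line forms above (IDs and structure follow the OpenAI `tool_calls` shape; a caller would still attach the result to the message and set `finish_reason` itself):

```python
import json
import uuid

def extract_tool_calls_from_text(content: str) -> list:
    """Parse bare-JSON tool calls out of assistant text, one per line.

    Returns a list of OpenAI-style tool_calls entries, or [] if nothing
    on any line parses as {"name": ..., "arguments": ...}.
    """
    calls = []
    for line in content.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "name" in obj and "arguments" in obj:
            calls.append({
                "id": f"call_{uuid.uuid4().hex[:8]}",
                "type": "function",
                "function": {
                    "name": obj["name"],
                    # OpenAI expects arguments as a JSON string, not a dict.
                    "arguments": json.dumps(obj["arguments"]),
                },
            })
    return calls
```

If this returns a non-empty list while the response's `tool_calls` is empty, the proxy swaps the text content for structured tool calls before forwarding.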

### Implementation

All of these fixes are implemented in [`tools/vllm-tool-proxy.py`](../../tools/vllm-tool-proxy.py) — a Flask proxy that sits between the agent framework and vLLM. It also includes a loop breaker (`MAX_TOOL_CALLS = 20`) to abort runaway tool-calling loops.
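
The loop breaker reduces to a per-conversation counter. A sketch of the idea (bookkeeping simplified; the real proxy tracks this per session):

```python
MAX_TOOL_CALLS = 20  # matches the limit mentioned above

class LoopBreaker:
    """Count consecutive tool-call turns and flag runaway loops."""

    def __init__(self, limit: int = MAX_TOOL_CALLS):
        self.limit = limit
        self.count = 0

    def record(self, finish_reason: str) -> bool:
        """Record one turn; return True if the conversation should abort."""
        if finish_reason == "tool_calls":
            self.count += 1
        else:
            self.count = 0  # a normal text turn resets the streak
        return self.count >= self.limit
```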