---

## Pattern 7: External Triggers Over Agent Discipline

### What It Is

The principle that any behavior which depends on an agent *remembering* to do something on its own will eventually fail. Every recurring behavior that matters must be driven by an external trigger — a cron timer, an event hook, a file watcher — never by the agent deciding to do it.

### Why It Works

Agents don't form habits. This is the single most important operational insight from running autonomous agents in production, and it's the meta-principle that justifies Patterns 1, 3, 4, and 5.
A cron timer fires every time. An event trigger fires when its condition is met. An agent *deciding* to check something, reset something, or report something? It works for a while. Maybe a day. Maybe a week. Then it silently stops, and you'll never know exactly when or why. The agent didn't rebel. It didn't refuse. It just... drifted past it. The context shifted, the priority changed, the prompt got crowded, and the behavior dropped out of the window.
This isn't a bug to fix. It's a property of non-deterministic systems. The same variability that makes LLMs useful — their flexibility, creativity, ability to handle novel situations — is what makes them unreliable for recurring discipline. You can't prompt your way out of this. You can't fine-tune it away. You design around it.

### Implementation Levels

**Level 1 — Timers for Critical Behaviors:**
Session resets, health check-ins, memory archival, status reports — anything that must happen on a schedule gets a cron job or systemd timer. The agent doesn't choose when. The timer fires, the agent responds.
**Level 2 — Event Triggers for Reactive Behaviors:**
File size exceeding a threshold triggers session cleanup. A git push triggers a sync. A failed health check triggers a restart. The agent doesn't monitor — the infrastructure does, and pokes the agent (or acts without it) when something needs attention.
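
The Level 2 idea fits in a few lines. A sketch of a size-triggered check, with a made-up threshold (the real watcher is [session-cleanup.sh](../../tools/session-cleanup.sh)):

```python
import os

# Hypothetical threshold -- the real value lives in session-cleanup.sh.
MAX_SESSION_BYTES = 512 * 1024

def check_session(path: str) -> bool:
    """Return True if the session file exceeds the threshold and needs rotation.

    The point is that this runs from a cron job or file watcher: the agent
    itself never decides when to look.
    """
    try:
        return os.path.getsize(path) > MAX_SESSION_BYTES
    except FileNotFoundError:
        return False  # no session file yet, nothing to rotate
```

The infrastructure calls this on its own schedule and rotates the session when it returns `True`; the agent is only informed after the fact.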
**Level 3 — Zero Reliance on Agent-Initiated Discipline:**
The system is designed so that no critical behavior requires the agent to "remember" anything between sessions. Identity comes from a file the agent reads, not from what it recalls. Coordination comes from git state, not from what agents told each other. Accountability comes from a supervisor's ping, not from the agent's conscience.

### Watch Out For

- **The temptation to trust it after it works for a while.** Your agent will reliably self-report for days. You'll think "maybe I don't need the timer." You do. The moment you remove it is the moment drift begins, and you won't notice for another week.
- **Confusing capability with reliability.** An agent *can* check its own session size. It *can* decide to reset. It *can* remember to follow up. "Can" is not "will, every time, forever." Design for "will."
- **Hope is not an ops strategy.** If the sentence starts with "the agent should..." and doesn't end with "...because this timer/trigger forces it," the behavior will drop.

### This Toolkit's Implementation

This pattern is embedded throughout the other patterns rather than having a single implementation:

- [Pattern 1 (Supervision)](../../multi-agent/patterns/PATTERNS.md) — Android-18 supervisor fires on a timer, not on agent initiative
- [Pattern 4 (Session Lifecycle)](../../multi-agent/patterns/PATTERNS.md) — session cleanup runs on file-size triggers, not on agent self-awareness
- [Pattern 5 (Memory Stratification)](../../multi-agent/patterns/PATTERNS.md) — Memory Shepherd resets on a timer, not on the agent deciding it's time
- [session-cleanup.sh](../../tools/session-cleanup.sh) — file-size watcher that triggers session rotation
- [ai-health-monitor.sh](../../tools/ai-health-monitor.sh) — cron-driven health checks, not agent self-monitoring

---
## Applying These Patterns
You don't need all seven. Start with what hurts most:
- **Agents keep crashing?** Start with Pattern 6 (Self-Healing) and Pattern 4 (Session Lifecycle).
- **Agents lose context between sessions?** Start with Pattern 2 (Workspace-as-Brain) and Pattern 5 (Memory Stratification).
- **Agents wander off task?** Start with Pattern 3 (Mission Governance) and Pattern 1 (Supervision).
- **Agents are unreliable at recurring tasks?** Start with Pattern 7 (External Triggers) — then audit every behavior that depends on agent initiative.
- **Building a multi-agent system?** Start with Pattern 1 (Supervision) and Pattern 3 (Mission Governance), then layer in the rest.

The patterns compose well. Each addresses a different failure mode, and each is independent — you can implement Pattern 2 without Pattern 1, or Pattern 6 without Pattern 3. But together, they form a complete safety stack for autonomous agent operations.

---

What changed when we could actually see where the money was going.

---

## Before: Flying Blind
We were running three AI agents 24/7 across two GPU servers with zero cost visibility. Per-turn costs were in the range of 5-6 cents, with some agents spiking to roughly 8 cents on heavy turns. We didn't know that at the time — there was nothing showing us.
The symptoms were indirect:
- API rate limit hits at unexpected times
- Monthly cloud bills that felt high but couldn't be attributed to specific agents or tasks
- No way to tell if a session was productive or if an agent was spinning in a loop
- Context windows silently growing, inflating per-turn costs with every message

The $79/day (Android-17) and $113/day (Todd) snapshots from early operation were discovered *after* Token Spy was running — before that, we had no per-agent breakdown at all.

## What Token Spy Revealed

Once per-turn token counts, cost breakdowns, and session sizes were visible on a live dashboard, the problems became obvious within hours:
**1. Model mismatch.** Agents were using frontier cloud models for tasks that a local 32B model handles fine — code generation, structured output, tool calling. Moving heavy-volume work to local Qwen via vLLM eliminated those costs entirely.
**2. Session bloat.** Without visibility into session size, context windows grew unchecked. Larger context = more input tokens per turn = higher cost per turn, compounding over time. Once we could see session sizes in real time, we added automated resets (Pattern 4) that kept sessions lean.
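
The compounding is easy to see with toy numbers. A sketch, using an assumed per-token price and message size (illustrative values, not our actual rates):

```python
# Toy model of context growth. Prices and sizes are assumptions for illustration.
PRICE_PER_INPUT_TOKEN = 3.00 / 1_000_000  # $/token, assumed
TOKENS_PER_MESSAGE = 500                   # assumed average message size

def session_input_cost(turns: int, system_tokens: int = 2_000) -> float:
    """Total input cost if the full history is resent on every turn."""
    total = 0.0
    for turn in range(1, turns + 1):
        # Context = system prompt plus all prior messages.
        context = system_tokens + TOKENS_PER_MESSAGE * (turn - 1)
        total += context * PRICE_PER_INPUT_TOKEN
    return total
```

With these assumed numbers, one 200-turn session pays several times more for input than ten 20-turn sessions covering the same work, which is the economic case for the automated resets.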
**3. Cache underutilization.** Anthropic's prompt caching can significantly reduce input costs for repetitive system prompts and tool definitions. Token Spy showed cache hit rates per agent, revealing which agents were benefiting from caching and which were missing it due to prompt variability.
**4. Waste loops.** Agents occasionally enter loops — calling the same tool repeatedly, generating similar responses, or retrying failed operations. Without per-turn visibility, these loops burned tokens silently for hours. Token Spy's session timeline made them immediately visible.

## After: Roughly 3-4x Reduction

Within a few days of having visibility, effective per-turn costs dropped by approximately 3-4x through a combination of:
- Routing high-volume work to local inference (zero marginal cost)
- Controlling session sizes to reduce input token inflation
- Matching models to actual task complexity (frontier for reasoning, local for execution)
- Catching and killing waste loops early

The exact reduction varies by workload and agent. The direction is consistent: you make very different decisions when you can see what's happening.

## Takeaway

The cost problem was never about expensive models. It was about not knowing which costs were productive and which were waste. Token Spy didn't reduce costs by being clever — it reduced costs by making the data visible so that humans could make obvious decisions they couldn't make before.

---

*See [PRODUCT-SCOPE.md](PRODUCT-SCOPE.md) for the full feature roadmap and [PHASE1-ARCHITECTURE.md](PHASE1-ARCHITECTURE.md) for the technical design.*

---

When using vLLM as a backend for agent frameworks (OpenClaw, LangChain, custom loops), several parameters and response fields cause silent failures. No error, no warning — the agent just gets nothing back, or the framework chokes on unexpected fields. These were discovered through production debugging and are handled by the [vllm-tool-proxy](../../tools/vllm-tool-proxy.py).

### Problem: Streaming + Tool Calls Don't Mix

If the client sends `"stream": true` with tools present, vLLM streams tokens incrementally. But tool calls embedded in content (common with models that output tool JSON as text rather than structured `tool_calls`) can't be extracted from a stream mid-flight. The framework receives partial JSON fragments and fails silently.
**Fix:** Force `"stream": false` when `tools` are present. Extract tool calls from the complete response, then optionally re-wrap as SSE if the client expects streaming.
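
A sketch of that request-side rule, assuming OpenAI-style request bodies (the production version lives in [vllm-tool-proxy.py](../../tools/vllm-tool-proxy.py); the helper name is ours):

```python
def prepare_request(body: dict) -> dict:
    """Force non-streaming whenever tools are present, so tool calls can be
    extracted from a complete response instead of partial stream fragments."""
    body = dict(body)  # shallow copy: don't mutate the caller's request
    if body.get("tools"):
        body["stream"] = False
    return body
```

Requests without tools pass through untouched, so normal streaming chat is unaffected.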

### Problem: `stream_options` on Non-Streaming Requests

vLLM 0.14+ rejects requests that include `"stream_options"` when `"stream"` is `false`. The rejection is silent — the request either hangs or returns an empty response. Many frameworks send `stream_options` by default regardless of streaming mode.
**Fix:** Strip `"stream_options"` from the request body whenever `"stream"` is `false` or absent.
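
The corresponding sketch, again on an OpenAI-style request dict (the helper name is ours):

```python
def strip_stream_options(body: dict) -> dict:
    """Drop stream_options unless the request is actually streaming.

    vLLM rejects stream_options combined with stream=false, so the field
    is only safe to forward when stream is truthy.
    """
    body = dict(body)
    if not body.get("stream"):
        body.pop("stream_options", None)
    return body
```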

### Problem: Extra Response Fields Break Framework Parsers

vLLM returns fields that don't exist in the OpenAI spec. Frameworks that strictly validate response schemas fail silently when they encounter these. The problematic fields:

**Fix:** Strip all non-standard fields from the response before forwarding to the framework. Also ensure `tool_calls` is absent (not an empty list `[]`) when no tools were called — some frameworks treat `[]` as "tools were attempted and failed."
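
Since allow-listing is safer than chasing individual extra fields, a sketch of the response-side cleanup (the allowed-field set here is an assumption based on the OpenAI message schema, not an exhaustive list):

```python
# Keep only fields the OpenAI spec defines on a chat-completion message.
# Assumed set for illustration -- extend as the spec evolves.
ALLOWED_MESSAGE_FIELDS = {"role", "content", "tool_calls", "refusal"}

def sanitize_message(message: dict) -> dict:
    """Drop non-spec fields and omit tool_calls entirely when it is empty."""
    clean = {k: v for k, v in message.items() if k in ALLOWED_MESSAGE_FIELDS}
    if not clean.get("tool_calls"):
        clean.pop("tool_calls", None)  # absent, never []
    return clean
```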

### Problem: Tool Calls Returned as Text Content

Some models (notably GPT-OSS-120B, some Qwen configurations) output tool calls as plain text in the `content` field rather than as structured `tool_calls`. The response looks normal to vLLM but the framework sees no tool calls and falls back to treating it as a text response.

2. `{"name": "func", "arguments": {...}}` — bare JSON as content
3. Multi-line JSON — multiple tool calls on separate lines

**Fix:** Post-process the response: if `tool_calls` is empty but `content` contains parseable tool JSON in any of these formats, extract it, build proper `tool_calls` structures, and set `finish_reason` to `"tool_calls"`.
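
A sketch of that post-processing for the bare-JSON and multi-line forms above (IDs and structure follow the OpenAI `tool_calls` shape; a caller would still attach the result to the message and set `finish_reason` itself):

```python
import json
import uuid

def extract_tool_calls_from_text(content: str) -> list:
    """Parse bare-JSON tool calls out of assistant text, one per line.

    Returns a list of OpenAI-style tool_calls entries, or [] if nothing
    on any line parses as {"name": ..., "arguments": ...}.
    """
    calls = []
    for line in content.splitlines():
        line = line.strip()
        if not line.startswith("{"):
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(obj, dict) and "name" in obj and "arguments" in obj:
            calls.append({
                "id": f"call_{uuid.uuid4().hex[:8]}",
                "type": "function",
                "function": {
                    "name": obj["name"],
                    # OpenAI expects arguments as a JSON string, not a dict.
                    "arguments": json.dumps(obj["arguments"]),
                },
            })
    return calls
```

If this returns a non-empty list while the response's `tool_calls` is empty, the proxy swaps the text content for structured tool calls before forwarding.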

### Implementation

All of these fixes are implemented in [`tools/vllm-tool-proxy.py`](../../tools/vllm-tool-proxy.py) — a Flask proxy that sits between the agent framework and vLLM. It also includes a loop breaker (`MAX_TOOL_CALLS = 20`) to abort runaway tool-calling loops.
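
The loop breaker reduces to a per-conversation counter. A sketch of the idea (bookkeeping simplified; the real proxy tracks this per session):

```python
MAX_TOOL_CALLS = 20  # matches the limit mentioned above

class LoopBreaker:
    """Count consecutive tool-call turns and flag runaway loops."""

    def __init__(self, limit: int = MAX_TOOL_CALLS):
        self.limit = limit
        self.count = 0

    def record(self, finish_reason: str) -> bool:
        """Record one turn; return True if the conversation should abort."""
        if finish_reason == "tool_calls":
            self.count += 1
        else:
            self.count = 0  # a normal text turn resets the streak
        return self.count >= self.limit
```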