|
1 | 1 | --- |
2 | | -description: "Six operational safety primitives that wrap every AgentOS LLM call: killswitch, cost guard, circuit breaker, stuck detection, action audit log. Prevent runaway loops, money fires, and zombie agents — independently or as one guard chain via wrapLLMCallback()." |
3 | | -keywords: [agent safety, llm circuit breaker, cost guard, stuck detector, agent killswitch, runaway agent, ai cost cap, agentos safety, operational guardrails] |
| 2 | +description: "Seven operational safety primitives that wrap every AgentOS LLM call: killswitch, cost guard, circuit breaker, provider health registry, stuck detection, action audit log. Prevent runaway loops, money fires, and zombie agents — independently or as one guard chain via wrapLLMCallback()." |
| 3 | +keywords: [agent safety, llm circuit breaker, provider health, llm fallback router, status-aware breaker, cost guard, stuck detector, agent killswitch, runaway agent, ai cost cap, agentos safety, operational guardrails] |
4 | 4 | --- |
5 | 5 |
|
6 | 6 | # Safety Primitives |
@@ -79,6 +79,71 @@ const stats = breaker.getStats(); |
79 | 79 | // { name: 'openai-api', state: 'closed', failureCount: 0, totalTripped: 0, ... } |
80 | 80 | ``` |
81 | 81 |
|
| 82 | +## LLMProviderHealthRegistry |
| 83 | + |
| 84 | +A status-aware, process-lifetime memory of LLM provider health, keyed by `providerId`. Wired into [`generateText`](https://github.com/framersai/agentos/blob/master/src/api/generateText.ts) and [`streamText`](https://github.com/framersai/agentos/blob/master/src/api/streamText.ts) so the next caller doesn't pay a full TLS round-trip to rediscover a provider that just returned `402 Insufficient Credits` or `401 Invalid API key`. |
| 85 | + |
| 86 | +The plain `CircuitBreaker` above uses a single failure-threshold + cooldown pair per instance — fine for one-off operations. The router needs **per-error-class** behavior: open immediately on a payment or auth failure, but require a streak before tripping on a 429 or 5xx. This is what the registry adds. |
| 87 | + |
| 88 | +| Error class | Threshold | Cooldown | |
| 89 | +| -------------------------- | --------- | -------- | |
| 90 | +| 402 insufficient credits | 1 fail | 5 min | |
| 91 | +| 401, 403 auth/forbidden | 1 fail | 30 min | |
| 92 | +| 429 rate limit | 3 fails | 30 s | |
| 93 | +| 5xx + unclassifiable | 5 fails | 60 s | |
| 94 | + |
| 95 | +The 5-minute window on 402 reflects operational reality: credits might get topped up while a batch job is in flight. The 30-minute window on 401/403 is longer because those failures usually require an env change plus a redeploy. 429 cooldowns are intentionally short because rate limits typically reset within a short, fixed window.
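
The policy table can be read as plain data plus a status-to-class mapping. The sketch below is illustrative only: `ErrorClass`, `BreakerPolicy`, `POLICIES`, and `classifyStatus` are hypothetical names, not the registry's actual exports. Only the thresholds and cooldowns are taken from the table.

```typescript
// Hypothetical names; only the numbers come from the policy table above.
type ErrorClass = 'payment' | 'auth' | 'rateLimit' | 'transient';

interface BreakerPolicy {
  threshold: number;  // consecutive failures before the breaker opens
  cooldownMs: number; // how long the breaker stays open once tripped
}

const POLICIES: Record<ErrorClass, BreakerPolicy> = {
  payment:   { threshold: 1, cooldownMs: 5 * 60_000 },  // 402
  auth:      { threshold: 1, cooldownMs: 30 * 60_000 }, // 401, 403
  rateLimit: { threshold: 3, cooldownMs: 30_000 },      // 429
  transient: { threshold: 5, cooldownMs: 60_000 },      // 5xx + unclassifiable
};

// Map an HTTP status (or its absence) to one of the four policy rows.
function classifyStatus(status: number | undefined): ErrorClass {
  if (status === 402) return 'payment';
  if (status === 401 || status === 403) return 'auth';
  if (status === 429) return 'rateLimit';
  return 'transient'; // 5xx and anything that cannot be classified
}
```

Keeping the policy as a lookup table means a later per-class override only has to replace entries in one object.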
| 96 | + |
| 97 | +### How the router uses it |
| 98 | + |
| 99 | +1. **Before the primary call**, `generateText` consults `globalLLMProviderHealth.isOpen(resolvedProviderId)`. If the breaker is open, it throws a synthetic `LLMProviderCircuitOpenError` with `httpStatus: 503`. The existing `isRetryableError` check recognizes that status and routes the call into the fallback chain. No network round-trip, no TLS handshake, no waste. |
| 100 | +2. **On a real provider error** (anything caught in the outer try/catch), `recordFailure(providerId, error)` classifies the error by HTTP status and either trips immediately (for 401/402/403) or increments the streak counter (for 429/5xx). |
| 101 | +3. **On success**, `recordSuccess(providerId)` resets the streak counter so a future transient failure starts fresh. A single success does NOT shorten an already-open cooldown: the breaker is open precisely because we want to stop probing for a window. |
| 102 | +4. **In the fallback chain loop**, every fallback entry is checked against `isOpen()` before its attempt. A dead chain entry is skipped instantly, so the loop reaches the first healthy provider after at most N constant-time in-memory checks rather than up to N failed network calls.
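
The steps above can be sketched as a small in-memory state machine. This is a simplified illustration, not the registry's real implementation: `TinyHealthRegistry` and `firstHealthy` are hypothetical names, the clock is injected for testability, and the threshold and cooldown are passed in directly instead of being derived from the error as the real `recordFailure(providerId, error)` does.

```typescript
// Hypothetical, simplified stand-in for the health registry.
class TinyHealthRegistry {
  private streak = new Map<string, number>();
  private openUntil = new Map<string, number>();
  private now: () => number;

  constructor(now: () => number = Date.now) {
    this.now = now;
  }

  // Step 1 and step 4: a pure in-memory check, no network I/O.
  isOpen(id: string): boolean {
    return (this.openUntil.get(id) ?? 0) > this.now();
  }

  // Step 2: count the failure and open the breaker once the streak hits the threshold.
  recordFailure(id: string, threshold: number, cooldownMs: number): void {
    const next = (this.streak.get(id) ?? 0) + 1;
    if (next >= threshold) {
      this.openUntil.set(id, this.now() + cooldownMs);
      this.streak.set(id, 0);
    } else {
      this.streak.set(id, next);
    }
  }

  // Step 3: reset the streak only; an already-open cooldown is left untouched.
  recordSuccess(id: string): void {
    this.streak.set(id, 0);
  }
}

// Step 4: walk the chain and return the first provider whose breaker is not open.
function firstHealthy(chain: string[], registry: TinyHealthRegistry): string | undefined {
  return chain.find((id) => !registry.isOpen(id));
}
```

A threshold of 1 reproduces the "trip immediately" behavior for payment and auth failures, while a higher threshold models the streak requirement for 429 and 5xx.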
| 103 | + |
| 104 | +### Error classification |
| 105 | + |
| 106 | +The registry reads HTTP status from three sources, in order: |
| 107 | + |
| 108 | +1. `[NNN] ...` prefix in `error.message` — the shape `OpenRouterProvider` decorates its errors with so downstream regex-based routing can find them. |
| 109 | +2. `error.statusCode` numeric property — `OpenRouterProviderError` sets this explicitly. |
| 110 | +3. `error.status` numeric property — the Anthropic and OpenAI SDK shape. |
| 111 | + |
| 112 | +If none of those resolves, the error is treated as the conservative transient class (5-failure threshold, 60 s cooldown). Better to under-protect on a one-off network blip than lock out a healthy provider. |
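
The three-source lookup order can be sketched as follows. `extractHttpStatus` is a hypothetical helper written for illustration, and the exact regex the registry uses is an assumption; only the order of sources (message prefix, then `statusCode`, then `status`) comes from the list above.

```typescript
// Hypothetical helper; mirrors the documented lookup order only.
function extractHttpStatus(error: unknown): number | undefined {
  if (typeof error !== 'object' || error === null) return undefined;
  const e = error as { message?: unknown; statusCode?: unknown; status?: unknown };

  // 1. "[NNN] ..." prefix in the message
  if (typeof e.message === 'string') {
    const match = /^\[(\d{3})\]/.exec(e.message);
    if (match) return Number(match[1]);
  }

  // 2. Numeric `statusCode` property
  if (typeof e.statusCode === 'number') return e.statusCode;

  // 3. Numeric `status` property
  if (typeof e.status === 'number') return e.status;

  // Nothing resolved: the caller falls back to the transient class.
  return undefined;
}
```

Returning `undefined` instead of guessing a status keeps the "unclassifiable" case explicit for whatever code maps statuses to policies.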
| 113 | + |
| 114 | +### Config |
| 115 | + |
| 116 | +The policy table above is currently hardcoded. A per-class config object could be exposed if a host needs to override it (e.g. a stricter 429 threshold for a low-quota account).
| 117 | + |
| 118 | +### Usage |
| 119 | + |
| 120 | +```typescript |
| 121 | +import { globalLLMProviderHealth, LLMProviderHealthRegistry } from '@framers/agentos'; |
| 122 | + |
| 123 | +// Read state for an admin / diagnostics endpoint |
| 124 | +const stats = globalLLMProviderHealth.getStats('openrouter'); |
| 125 | +if (stats?.state === 'open') { |
| 126 | + console.log( |
| 127 | + `OpenRouter circuit open; ${stats.cooldownRemainingMs}ms until close. ` + |
| 128 | + `Last status: ${stats.lastStatusCode}, total trips: ${stats.totalTrips}`, |
| 129 | + ); |
| 130 | +} |
| 131 | + |
| 132 | +// Manually reset after a credit top-up so the next call probes immediately |
| 133 | +globalLLMProviderHealth.reset('openrouter'); |
| 134 | + |
| 135 | +// Construct a private registry for a test |
| 136 | +const isolated = new LLMProviderHealthRegistry(); |
| 137 | +isolated.recordFailure('mock-provider', new Error('[402] Test')); |
| 138 | +expect(isolated.isOpen('mock-provider')).toBe(true); |
| 139 | +``` |
| 140 | + |
| 141 | +### Why a singleton |
| 142 | + |
| 143 | +Provider health is process-wide state. Two concurrent `generateText` calls inside the same Node process see the same OpenRouter: if one just discovered it's at 402, the other shouldn't redo the discovery. The `globalLLMProviderHealth` singleton is the natural granularity. Tests construct their own `LLMProviderHealthRegistry` instances to keep state isolated across cases. |
| 144 | + |
| 145 | +The registry is **ephemeral by design**: it lives in memory and resets on server restart. Persistent provider-health tracking would add complexity (Redis, write-through cache invalidation) for a problem the in-process singleton already solves for the dominant case: a long-running batch job hammering a degraded provider. |
| 146 | + |
82 | 147 | ## ActionDeduplicator |
83 | 148 |
|
84 | 149 | Hash-based recent action tracking with a configurable time window and LRU eviction. The caller computes the key string -- this class is intentionally generic. Use it to prevent duplicate votes, duplicate posts, or any repeated action within a window. |
|