Commit c8d1521

docs(safety): document LLMProviderHealthRegistry in SAFETY_PRIMITIVES + linter scrub
1 parent 88412ee commit c8d1521

6 files changed

Lines changed: 175 additions & 110 deletions

docs/safety/SAFETY_PRIMITIVES.md

Lines changed: 67 additions & 2 deletions
@@ -1,6 +1,6 @@
11
---
2-
description: "Six operational safety primitives that wrap every AgentOS LLM call: killswitch, cost guard, circuit breaker, stuck detection, action audit log. Prevent runaway loops, money fires, and zombie agents — independently or as one guard chain via wrapLLMCallback()."
3-
keywords: [agent safety, llm circuit breaker, cost guard, stuck detector, agent killswitch, runaway agent, ai cost cap, agentos safety, operational guardrails]
2+
description: "Seven operational safety primitives that wrap every AgentOS LLM call: killswitch, cost guard, circuit breaker, provider health registry, stuck detection, action audit log. Prevent runaway loops, money fires, and zombie agents — independently or as one guard chain via wrapLLMCallback()."
3+
keywords: [agent safety, llm circuit breaker, provider health, llm fallback router, status-aware breaker, cost guard, stuck detector, agent killswitch, runaway agent, ai cost cap, agentos safety, operational guardrails]
44
---
55

66
# Safety Primitives
@@ -79,6 +79,71 @@ const stats = breaker.getStats();
7979
// { name: 'openai-api', state: 'closed', failureCount: 0, totalTripped: 0, ... }
8080
```
8181

82+
## LLMProviderHealthRegistry
83+
84+
A status-aware, process-lifetime memory of LLM provider health, keyed by `providerId`. Wired into [`generateText`](https://github.com/framersai/agentos/blob/master/src/api/generateText.ts) and [`streamText`](https://github.com/framersai/agentos/blob/master/src/api/streamText.ts) so the next caller doesn't pay a full TLS round-trip to rediscover a provider that just returned `402 Insufficient Credits` or `401 Invalid API key`.
85+
86+
The plain `CircuitBreaker` above uses a single failure-threshold + cooldown pair per instance — fine for one-off operations. The router needs **per-error-class** behavior: open immediately on a payment or auth failure, but require a streak before tripping on a 429 or 5xx. This is what the registry adds.
87+
88+
| Error class | Threshold | Cooldown |
89+
| -------------------------- | --------- | -------- |
90+
| 402 insufficient credits | 1 fail | 5 min |
91+
| 401, 403 auth/forbidden | 1 fail | 30 min |
92+
| 429 rate limit | 3 fails | 30 s |
93+
| 5xx + unclassifiable | 5 fails | 60 s |
94+
95+
The 5-minute window on 402 reflects operational reality: credits might get topped up while a batch job is in flight. The 30-minute cooldown on 401/403 is longer because those failures usually require an env change plus a redeploy. The 429 cooldown is intentionally short because rate limits typically lift once the provider's rate-limit window rolls over.
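
As a rough illustration, the policy table can be expressed as a lookup keyed by error class. This is a minimal sketch, not the registry's actual source: `ErrorClass`, `BreakerPolicy`, `POLICIES`, `classifyStatus`, and `policyFor` are hypothetical names, and sending every status outside 401/402/403/429 (including other 4xx codes) to the transient class is an assumption about what "unclassifiable" covers.

```typescript
// Hypothetical names throughout; only the thresholds and cooldowns come from the table above.
type ErrorClass = 'payment' | 'auth' | 'rateLimit' | 'transient';

interface BreakerPolicy {
  threshold: number;  // consecutive failures before the breaker opens
  cooldownMs: number; // how long the breaker stays open once tripped
}

const POLICIES: Record<ErrorClass, BreakerPolicy> = {
  payment: { threshold: 1, cooldownMs: 5 * 60_000 },  // 402
  auth: { threshold: 1, cooldownMs: 30 * 60_000 },    // 401, 403
  rateLimit: { threshold: 3, cooldownMs: 30_000 },    // 429
  transient: { threshold: 5, cooldownMs: 60_000 },    // 5xx + unclassifiable
};

function classifyStatus(status: number | undefined): ErrorClass {
  if (status === 402) return 'payment';
  if (status === 401 || status === 403) return 'auth';
  if (status === 429) return 'rateLimit';
  // Assumption: everything else, including a missing status, is treated as transient.
  return 'transient';
}

function policyFor(status: number | undefined): BreakerPolicy {
  return POLICIES[classifyStatus(status)];
}
```

With this shape, a 402 maps to a one-failure threshold and a 300,000 ms cooldown, while an error with no recoverable status falls back to the five-failure, 60 s policy.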
96+
97+
### How the router uses it
98+
99+
1. **Before the primary call**, `generateText` consults `globalLLMProviderHealth.isOpen(resolvedProviderId)`. If the breaker is open, it throws a synthetic `LLMProviderCircuitOpenError` with `httpStatus: 503`. The existing `isRetryableError` check recognizes that status and routes the call into the fallback chain. No network round-trip, no TLS handshake, no waste.
100+
2. **On a real provider error** (anything caught in the outer try/catch), `recordFailure(providerId, error)` classifies the error by HTTP status and either trips immediately (for 401/402/403) or increments the streak counter (for 429/5xx).
101+
3. **On success**, `recordSuccess(providerId)` resets the streak counter so a future transient failure starts fresh. A single success does NOT shorten an already-open cooldown: the breaker is open precisely because we want to stop probing for a window.
102+
4. **In the fallback chain loop**, every fallback entry is checked against `isOpen()` before its attempt. A dead chain entry is skipped instantly, so the loop reaches the first healthy provider after at most N constant-time in-memory checks rather than up to N failed network calls.
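
The four steps above can be sketched as a single loop over the provider chain. This is a simplified illustration, not the code in `generateText`: `HealthRegistry`, `ProviderCall`, and `callWithFallback` are hypothetical names, the primary and fallbacks are folded into one array, and the sketch falls through to the next provider on any error, whereas the real router first consults `isRetryableError`.

```typescript
// Hypothetical interface: only the three registry calls the steps above mention.
interface HealthRegistry {
  isOpen(providerId: string): boolean;
  recordFailure(providerId: string, error: unknown): void;
  recordSuccess(providerId: string): void;
}

// Hypothetical stand-in for the real provider call.
type ProviderCall = (providerId: string) => Promise<string>;

async function callWithFallback(
  health: HealthRegistry,
  chain: string[], // primary first, then fallbacks in order
  call: ProviderCall,
): Promise<{ providerId: string; text: string }> {
  let lastError: unknown = new Error('no providers configured');

  for (const providerId of chain) {
    // Steps 1 and 4: skip an open breaker without touching the network.
    if (health.isOpen(providerId)) {
      lastError = new Error(`[503] circuit open for ${providerId}`);
      continue;
    }
    try {
      const text = await call(providerId);
      // Step 3: a success resets the failure streak for this provider.
      health.recordSuccess(providerId);
      return { providerId, text };
    } catch (error) {
      // Step 2: let the registry classify the error and update the breaker.
      health.recordFailure(providerId, error);
      lastError = error;
    }
  }

  // Every provider was either skipped or failed.
  throw lastError;
}
```

Taking the registry as a parameter keeps the loop testable with a private instance or a stub instead of the process-wide singleton.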
103+
104+
### Error classification
105+
106+
The registry reads HTTP status from three sources, in order:
107+
108+
1. `[NNN] ...` prefix in `error.message` — the shape `OpenRouterProvider` decorates its errors with so downstream regex-based routing can find them.
109+
2. `error.statusCode` numeric property — `OpenRouterProviderError` sets this explicitly.
110+
3. `error.status` numeric property — the Anthropic and OpenAI SDK shape.
111+
112+
If none of those resolves, the error is treated as the conservative transient class (5-failure threshold, 60 s cooldown). Better to under-protect on a one-off network blip than lock out a healthy provider.
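
The lookup order can be illustrated with a small helper. This is a sketch under assumptions rather than the registry's implementation: `extractHttpStatus` is a hypothetical name, and the exact regex (three digits in square brackets at the start of the message) is a guess at the `[NNN] ...` shape described above.

```typescript
// Hypothetical helper; mirrors the three-source lookup order described above.
function extractHttpStatus(error: unknown): number | undefined {
  if (typeof error !== 'object' || error === null) return undefined;
  const e = error as { message?: unknown; statusCode?: unknown; status?: unknown };

  // 1. "[NNN] ..." prefix in the message (assumed: exactly three digits at the start).
  if (typeof e.message === 'string') {
    const match = /^\[(\d{3})\]/.exec(e.message);
    if (match) return Number(match[1]);
  }

  // 2. Numeric `statusCode` property.
  if (typeof e.statusCode === 'number') return e.statusCode;

  // 3. Numeric `status` property.
  if (typeof e.status === 'number') return e.status;

  // Nothing resolved: the caller falls back to the conservative transient class.
  return undefined;
}
```

Returning `undefined` rather than a default status keeps the "unclassifiable" decision with the caller, which can then apply the 5-failure, 60 s policy.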
113+
114+
### Config
115+
116+
The policy table above is currently hardcoded. A per-class config object could be exposed if a host needs to override it (e.g. a stricter 429 threshold for a low-quota account).
117+
118+
### Usage
119+
120+
```typescript
121+
import { globalLLMProviderHealth, LLMProviderHealthRegistry } from '@framers/agentos';
122+
123+
// Read state for an admin / diagnostics endpoint
124+
const stats = globalLLMProviderHealth.getStats('openrouter');
125+
if (stats?.state === 'open') {
126+
  console.log(
127+
    `OpenRouter circuit open; ${stats.cooldownRemainingMs}ms until close. ` +
128+
    `Last status: ${stats.lastStatusCode}, total trips: ${stats.totalTrips}`,
129+
  );
130+
}
131+
132+
// Manually reset after a credit top-up so the next call probes immediately
133+
globalLLMProviderHealth.reset('openrouter');
134+
135+
// Construct a private registry for a test
136+
const isolated = new LLMProviderHealthRegistry();
137+
isolated.recordFailure('mock-provider', new Error('[402] Test'));
138+
expect(isolated.isOpen('mock-provider')).toBe(true);
139+
```
140+
141+
### Why a singleton
142+
143+
Provider health is process-wide state. Two concurrent `generateText` calls inside the same Node process see the same OpenRouter: if one just discovered it's at 402, the other shouldn't redo the discovery. The `globalLLMProviderHealth` singleton is the natural granularity. Tests construct their own `LLMProviderHealthRegistry` instances to keep state isolated across cases.
144+
145+
The registry is **ephemeral by design**: it lives in memory and resets on server restart. Persistent provider-health tracking would add complexity (Redis, write-through cache invalidation) for a problem the in-process singleton already solves for the dominant case: a long-running batch job hammering a degraded provider.
146+
82147
## ActionDeduplicator
83148

84149
Hash-based recent action tracking with a configurable time window and LRU eviction. The caller computes the key string -- this class is intentionally generic. Use it to prevent duplicate votes, duplicate posts, or any repeated action within a window.
