diff --git a/docs/GUARDIAN.md b/docs/GUARDIAN.md index 8e202de6e..4f5a45d4f 100644 --- a/docs/GUARDIAN.md +++ b/docs/GUARDIAN.md @@ -131,6 +131,61 @@ WantedBy=multi-user.target `Restart=always` ensures the guardian itself restarts if killed. +### In Production + +A production guardian protecting a 3-agent system monitors ~42 resources across +all four tiers, checking every 60 seconds. The resources include: + +- 3 agent gateway processes +- Tool proxy, vLLM inference engine +- Token Spy instances (one per cloud agent) +- Memory Shepherd timers +- Session cleanup timers +- Supervisor bot process +- ~30+ protected config and baseline files + +The guardian config file is a declarative "desired state" document — it lists +every file hash and service that should be running. Each check cycle compares +current state against desired state and takes corrective action. + +### Custom Health Checks + +Standard service monitoring (is the process running?) misses application-level +failures. Custom health checks catch patterns that `systemctl status` can't: + +**Example: GPU Storm Recovery** + +When multiple agents spawn sub-agents simultaneously, the GPU gets flooded. +One agent's requests start timing out, and its session gets stuck — but the +process is still "running" as far as systemd knows. + +A custom health check detects this: + +``` +1. Check agent's gateway logs for timeout errors +2. Check GPU queue depth — has the storm passed? +3. If BOTH: (agent had timeouts) AND (GPU load is now normal) + → Restart the stuck agent's gateway + → Agent comes back online within ~2 minutes +``` + +This pattern — "detect the specific failure condition AND confirm the root +cause has cleared" — prevents premature restarts that would fail for the same +reason. 
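As a sketch, the two-condition check might look like the following (the log path, timeout pattern, queue-depth threshold, and helper names are all hypothetical; adapt them to your gateway and inference engine):

```python
import re
import subprocess

# Hypothetical paths and thresholds -- adjust to your deployment.
GATEWAY_LOG = "/var/log/agents/agent-a-gateway.log"
TIMEOUT_PATTERN = re.compile(r"(timeout|deadline exceeded)", re.IGNORECASE)
NORMAL_QUEUE_DEPTH = 4  # requests waiting; the "storm has passed" threshold


def recent_timeouts(log_path: str, tail_lines: int = 200) -> bool:
    """Condition 1: did this agent's gateway log timeout errors recently?"""
    try:
        with open(log_path) as f:
            tail = f.readlines()[-tail_lines:]
    except FileNotFoundError:
        return False
    return any(TIMEOUT_PATTERN.search(line) for line in tail)


def gpu_queue_depth() -> int:
    """Condition 2: current inference queue depth. Stubbed here -- in
    practice, read it from your inference engine's metrics endpoint."""
    return 0  # placeholder


def check_and_recover(service: str) -> bool:
    """Restart the gateway only if it saw timeouts AND the storm has passed."""
    if recent_timeouts(GATEWAY_LOG) and gpu_queue_depth() <= NORMAL_QUEUE_DEPTH:
        subprocess.run(["systemctl", "restart", service], check=False)
        return True
    return False
```

The order of the two conditions matters: checking the logs first means a healthy agent costs one cheap file read per cycle, and the queue-depth probe only runs when a restart is actually on the table.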
+ +### Incremental Backups + +Beyond Guardian's config snapshots, run incremental server backups on a +separate timer (every 15 minutes): + +```bash +rsync -a --link-dest="$PREV_SNAPSHOT" "$SOURCE/" "$SNAPSHOT_DIR/" +``` + +Hardlinks mean unchanged files don't take extra space — hundreds of snapshots +fit in minimal disk space. If something goes catastrophically wrong at 2pm, roll +back to the 1:45pm state with a single command. + --- ## Autonomy Tiers @@ -238,14 +293,23 @@ Session Level (keeps agents running): System Level (keeps infrastructure intact): ├── Guardian — monitors services, auto-recovers failures + ├── Custom Health Checks — catches application-level failures (GPU storms, etc.) ├── Autonomy Tiers — explicit permission boundaries ├── Baseline Integrity — immutable + checksummed identity files └── Self-Modification Rule — never hot-work your own infrastructure + +Operational Level (keeps humans informed): + ├── Supervisor Agent — monitors team health, triggers resets, daily briefings + ├── Incremental Backups — 15-minute snapshots, point-in-time recovery + └── Background Automation — commit watchdog, codebase indexer, test generator ``` Session tools are documented in the main [README](../README.md), [TOKEN-SPY.md](TOKEN-SPY.md), and [memory-shepherd/README.md](../memory-shepherd/README.md). -This doc covers the system-level complement. +The supervisor pattern and background automation are in +[MULTI-AGENT-PATTERNS.md](MULTI-AGENT-PATTERNS.md) and +[OPERATIONAL-LESSONS.md](OPERATIONAL-LESSONS.md). +This doc covers the system-level layer. **The goal is defense in depth.** No single protection catches everything. The session watchdog catches context overflow but not infrastructure damage. diff --git a/docs/MULTI-AGENT-PATTERNS.md b/docs/MULTI-AGENT-PATTERNS.md index 91803d495..94536230b 100644 --- a/docs/MULTI-AGENT-PATTERNS.md +++ b/docs/MULTI-AGENT-PATTERNS.md @@ -294,3 +294,124 @@ files: Keep coordination files small and focused.
An agent reading STATUS.md should know in 10 seconds what's happening and what's blocked. + +--- + +## The Supervisor Pattern + +In a multi-agent system, one agent should be the manager — not writing code, +but keeping the team healthy and the human informed. This is a fundamentally +different role than the worker agents. + +### What a Supervisor Does + +``` +Every 15 minutes: + 1. Check git logs — is each agent making commits? + 2. Check session health — is anyone's context bloated? + 3. Check for stuck agents — has anyone been quiet too long? + 4. Post a situation report to the shared channel + 5. Trigger session resets for confused or bloated agents + +Daily (e.g., 6am): + 6. Gather health metrics, spending data, commit counts, error logs + 7. Compile a comprehensive briefing for the human operator + 8. Include: what happened, what broke, what it cost, what needs attention +``` + +### Why a Separate Agent + +The supervisor needs a different model and different priorities than the +workers: + +| Property | Worker Agents | Supervisor Agent | +|---|---|---| +| Primary model | Local (free, high volume) | Cloud (reliable, high judgment) | +| Core activity | Writing code, running tests | Monitoring, reporting, resetting | +| Failure mode | Gets stuck, loops, drifts | Must be reliable above all else | +| Autonomy | Tier 0-1 (mostly autonomous) | Tier 2 (speaks with operator authority) | +| Communication | Pushes code, posts to branches | DMs the human, posts ops reports | + +The supervisor should run on the most capable model you have — its job is +judgment, not volume. A cheap model that misses a stuck agent costs more than +an expensive model that catches it. 
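A minimal sketch of one check cycle's decision logic, assuming a 2-hour commit-stall threshold and a 256KB session limit (the `AgentStatus` shape and function names are hypothetical, not a real API):

```python
from dataclasses import dataclass

# Assumed thresholds -- tune per deployment.
STALL_SECONDS = 2 * 60 * 60       # no commits for 2+ hours -> investigate
SESSION_LIMIT_BYTES = 256 * 1024  # session size that warrants a reset


@dataclass
class AgentStatus:
    name: str
    last_commit_ts: float  # unix time of the agent's newest commit
    session_bytes: int     # current size of the agent's session file


def assess(agent: AgentStatus, now: float) -> list[str]:
    """One supervisor pass over a worker: return the actions it warrants."""
    actions = []
    if now - agent.last_commit_ts > STALL_SECONDS:
        actions.append(f"investigate {agent.name}: no commits for 2+ hours")
    if agent.session_bytes >= SESSION_LIMIT_BYTES:
        actions.append(f"reset {agent.name}: session at context limit")
    return actions
```

Each 15-minute cycle would run `assess` for every worker, post the combined report to the shared channel, and trigger resets for any agent flagged.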
+ +### Supervisor Responsibilities + +**Health monitoring:** +- Track commit frequency per agent (no commits for 2+ hours = investigate) +- Monitor session file sizes (approaching context limit = trigger reset) +- Watch for error patterns in logs (repeated timeouts = GPU contention) + +**Session management:** +- Trigger session resets when agents get confused or bloated +- The supervisor has authority to reset any worker agent's session +- Resets are non-destructive — Memory Shepherd restores the baseline + +**Daily briefing:** +- Compile 24h metrics: commits, costs, errors, uptime per agent +- Highlight anomalies: cost spikes, idle agents, repeated failures +- Include actionable items: "Agent X has been stuck since 3pm, needs + manual intervention" + +**Report cards:** +- Periodic assessment of each agent's effectiveness +- Are they completing tasks? Are they making mistakes? Are they idle? +- Feed back into baseline updates (see + [WRITING-BASELINES.md](../memory-shepherd/docs/WRITING-BASELINES.md)) + +### The Supervisor Is Protected + +The supervisor should be protected by the Guardian (see +[GUARDIAN.md](GUARDIAN.md)) at the same tier as the agent gateways. If the +supervisor goes down, nobody is watching the workers. + +The supervisor should NOT have write access to the same infrastructure the +workers use. Its job is to observe and command, not to modify configs or +restart services directly — that's the Guardian's job. + +--- + +## A Typical Hour + +Here's what a healthy multi-agent system looks like in practice: + +``` +:00 Agent A is designing a new API endpoint. Spawns 3 sub-agents on + the local GPU: one for the handler, one for tests, one for docs. + All three run simultaneously at $0. + +:02 Agent B picks up Agent A's PR. Runs integration checks. Posts + review comments on the branch. + +:05 Agent C is grinding through a refactoring task entirely on the + local model. Commits every few minutes. 
+ +:05 The commit watchdog (background) reviews Agent C's latest commit. + Posts "LGTM, no issues" to the shared channel. + +:10 Agent A's sub-agents finish. Handler, tests, and docs all written + in parallel. Agent A assembles the PR. + +:15 Supervisor checks in. Pulls git logs, checks session health. + Agent C's session is at 180KB — approaching the 256KB limit. + Supervisor posts: "Agent C session at 70%, will auto-reset at 100%." + +:20 Session watchdog fires. Agent C's session exceeds the threshold. + Watchdog deletes the session file, gateway creates a fresh one. + Agent C continues working — doesn't notice the swap. + +:30 Supervisor checks in again. All three agents active. Commits + flowing. No errors. Posts a green status report. + +:45 Agent B finishes integration tests. Merges Agent A's PR to main. + Agent C pulls latest, sees the new code. + +:60 Guardian runs its 60-second check. All services healthy, all + protected files intact. No action needed. +``` + +**The key insight:** Most of the time, nothing dramatic happens. The system +runs itself. The value of the supervisor, guardian, and watchdog is in the +5% of the time when something goes wrong — and it gets caught and fixed +automatically instead of silently degrading for hours. diff --git a/docs/OPERATIONAL-LESSONS.md b/docs/OPERATIONAL-LESSONS.md index 6c1cee64e..72945c99e 100644 --- a/docs/OPERATIONAL-LESSONS.md +++ b/docs/OPERATIONAL-LESSONS.md @@ -292,3 +292,90 @@ that they answer different questions: - Failed retries: Token Spy shows the cost of attempts, vLLM shows the load Both are useful. Neither replaces the other. + +--- + +## Background GPU Automation + +A GPU running local models for agents sits idle most of the time. Agents think +in bursts — a few seconds of computation, then minutes of silence while they +read files, run commands, or wait. Those idle cycles can do real work. + +### Commit Watchdog + +Every agent commit gets automatically reviewed by the local model. 
+ +``` +Every 5 minutes: + 1. Check for new commits from any agent + 2. For each new commit: pull the diff + 3. Send to local model: "Are there broken imports? Obvious bugs? + Security issues? Anything suspicious?" + 4. Post the review to the shared channel +``` + +At ~500 agent commits per day and ~5 seconds per review, this adds about 45 +minutes of GPU time daily. Free QA for every push. + +The reviews aren't deep architectural analysis — they're fast sanity checks. +Catching a broken import before it wastes another agent's time pays for itself +immediately. + +### Codebase Indexer + +Once a day (e.g., 5am before the morning briefing), walk the entire codebase: + +1. Split files into chunks +2. Generate embedding vectors for each chunk +3. Store in a vector database (e.g., Qdrant) +4. Content-hash files so unchanged files get skipped on subsequent runs + +This enables **semantic search** — agents can ask "find me the code that +handles authentication" instead of relying on keyword matching. The index +stays fresh because it rebuilds daily. + +### Test Generator + +When the commit watchdog detects new source files without corresponding test +files: + +1. Read the source file +2. Send to local model: "Write pytest-style test stubs covering happy path, + edge cases, and error handling" +3. Save to a staging area with a `# NEEDS REVIEW` header +4. Never commit automatically — these are starting points, not finished tests + +This turns idle GPU cycles into test coverage scaffolding. A human or agent +can refine the stubs, but the hard part — reading the code and thinking about +what to test — is already done. + +### Briefing Enrichment + +Before generating a daily briefing (see the supervisor pattern in +[MULTI-AGENT-PATTERNS.md](MULTI-AGENT-PATTERNS.md)), pass raw health data +through the local model for pre-analysis: + +- Error classification (transient vs. systemic) +- Root cause suggestions +- Trend detection (is cost increasing? are errors clustering?) 
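As a sketch, the pre-analysis step is mostly prompt assembly plus a cheap trend check that can run outside the model entirely (the metric shapes and function names here are hypothetical):

```python
import json


def build_preanalysis_prompt(metrics: dict) -> str:
    """Assemble the pre-analysis request sent to the local model.
    The model and transport are deployment-specific; this only shows
    the shape of the request."""
    return (
        "Classify each error as transient or systemic, suggest likely "
        "root causes, and flag any cost or error trends:\n"
        + json.dumps(metrics, indent=2)
    )


def cost_trending_up(daily_costs: list[float]) -> bool:
    """Trend detection done without the model: is the latest day's cost
    above the mean of the preceding days?"""
    if len(daily_costs) < 2:
        return False
    prev = daily_costs[:-1]
    return daily_costs[-1] > sum(prev) / len(prev)
```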
+ +This adds ~30 seconds of GPU time but makes the briefing significantly more +actionable than raw metrics. + +### GPU Duty Cycle + +With all four background systems running alongside agent workloads, expect +15-50% GPU utilization depending on agent activity. Not bad for cycles that +would otherwise be wasted. + +| System | GPU Time/Day | Trigger | +|---|---|---| +| Commit watchdog | ~45 min | Every 5 min (new commits) | +| Codebase indexer | ~15 min | Once daily (5am) | +| Test generator | ~10 min | On new files (via watchdog) | +| Briefing enrichment | ~1 min | Once daily (before briefing) | + +None of these block agent work — they run during idle windows. If an agent +needs the GPU, inference requests from the background systems simply queue +behind the agent's requests (vLLM's continuous batching handles this +transparently).
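To make the commit-watchdog loop described above concrete, here is a minimal sketch; `ask_model` and `post` are injected stand-ins for the local-model client and the shared-channel transport, both of which are deployment-specific:

```python
import subprocess


def new_commits(since_ref: str) -> list[str]:
    """Step 1: hashes of commits made since the last check (requires git)."""
    out = subprocess.run(
        ["git", "rev-list", f"{since_ref}..HEAD"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.split()


def review_prompt(diff: str) -> str:
    """Step 3: the sanity-check question sent to the local model."""
    return (
        "Are there broken imports? Obvious bugs? Security issues? "
        "Anything suspicious?\n\n" + diff
    )


def watchdog_cycle(shas: list[str], get_diff, ask_model, post) -> int:
    """One 5-minute cycle: review each new commit and post the results."""
    for sha in shas:
        verdict = ask_model(review_prompt(get_diff(sha)))
        post(f"review of {sha[:8]}: {verdict}")
    return len(shas)
```

A real deployment would wire `get_diff` to `git show --format= <sha>` and run the cycle on a 5-minute systemd timer; keeping the model client and chat transport as parameters is what makes the loop testable without a GPU.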