docs/GUARDIAN.md (65 additions, 1 deletion)

`Restart=always` ensures the guardian itself restarts if killed.

### In Production

A production guardian protecting a 3-agent system monitors ~42 resources across
all four tiers, checking every 60 seconds. The resources include:

- 3 agent gateway processes
- Tool proxy, vLLM inference engine
- Token Spy instances (one per cloud agent)
- Memory Shepherd timers
- Session cleanup timers
- Supervisor bot process
- 30+ protected config and baseline files

The guardian config file is a declarative "desired state" document — it lists
every file hash and service that should be running. Each check cycle compares
current state against desired state and takes corrective action.
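To make the idea concrete, here is a minimal sketch of what such a desired-state config might look like. The keys, unit names, and paths are illustrative assumptions, not the actual Guardian schema:

```yaml
# Illustrative desired-state config -- keys, names, and paths are hypothetical
check_interval_seconds: 60

services:
  - name: agent-a-gateway
    unit: agent-a-gateway.service
    restart_on_failure: true
  - name: vllm
    unit: vllm.service
    restart_on_failure: true

protected_files:
  - path: /etc/agents/baseline-a.md
    sha256: "<expected hash of the baseline file>"
    on_mismatch: restore_from_snapshot
```

Each cycle, the guardian walks this list: restart any service that isn't running, re-hash any protected file, and restore from snapshot on a mismatch.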

### Custom Health Checks

Standard service monitoring (is the process running?) misses application-level
failures. Custom health checks catch patterns that `systemctl status` can't:

**Example: GPU Storm Recovery**

When multiple agents spawn sub-agents simultaneously, the GPU gets flooded.
One agent's requests start timing out, and its session gets stuck — but the
process is still "running" as far as systemd knows.

A custom health check detects this:

```
1. Check agent's gateway logs for timeout errors
2. Check GPU queue depth — has the storm passed?
3. If BOTH: (agent had timeouts) AND (GPU load is now normal)
→ Restart the stuck agent's gateway
→ Agent comes back online within ~2 minutes
```

This pattern — "detect the specific failure condition AND confirm the root
cause has cleared" — prevents premature restarts that would fail for the same
reason.
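A sketch of that two-condition check in Python. The log format, queue-depth source, and thresholds here are assumptions for illustration, not the actual implementation:

```python
import re
from pathlib import Path

TIMEOUT_PATTERN = re.compile(r"timeout|deadline exceeded", re.IGNORECASE)

def agent_had_timeouts(log_path, window_lines=200):
    """Condition 1: did the agent's gateway log recent timeout errors?"""
    lines = Path(log_path).read_text().splitlines()[-window_lines:]
    return any(TIMEOUT_PATTERN.search(line) for line in lines)

def gpu_load_is_normal(queue_depth, max_normal_depth=4):
    """Condition 2: has the storm passed? (queue depth back to normal)"""
    return queue_depth <= max_normal_depth

def should_restart(log_path, queue_depth):
    # Restart only when BOTH hold: the failure happened AND its cause cleared.
    return agent_had_timeouts(log_path) and gpu_load_is_normal(queue_depth)
```

Note that neither condition alone triggers a restart: timeouts with a still-flooded GPU would just fail again, and a calm GPU with no timeouts needs no action.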

### Incremental Backups

Beyond Guardian's config snapshots, run incremental server backups on a
separate timer (every 15 minutes):

```bash
rsync -a --link-dest="$PREV_SNAPSHOT" "$SOURCE/" "$SNAPSHOT_DIR/"
```

Hardlinks mean unchanged files don't take extra space — hundreds of snapshots
fit in minimal disk space. If something goes catastrophically wrong at 2pm,
roll back to the 1:45pm state with a single command.
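The hardlink trick is what makes this cheap. A minimal Python illustration of the same `--link-dest` idea (rsync does this far more robustly; this sketch only shows why unchanged files cost nothing):

```python
import os
import shutil
from pathlib import Path

def snapshot(source, dest, prev=None):
    """Copy `source` into `dest`, hardlinking files unchanged since `prev`.

    All arguments are pathlib.Path directories; `prev` is the previous
    snapshot (or None for the first one).
    """
    for src_file in Path(source).rglob("*"):
        if not src_file.is_file():
            continue
        rel = src_file.relative_to(source)
        out = Path(dest) / rel
        out.parent.mkdir(parents=True, exist_ok=True)
        prev_file = Path(prev) / rel if prev else None
        if (prev_file is not None and prev_file.exists()
                and prev_file.stat().st_mtime == src_file.stat().st_mtime
                and prev_file.stat().st_size == src_file.stat().st_size):
            os.link(prev_file, out)      # unchanged: hardlink, zero extra space
        else:
            shutil.copy2(src_file, out)  # changed or new: real copy
```

Two snapshots of an unchanged file share one inode on disk; only changed files consume new space.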

---

## Autonomy Tiers
```
Session Level (keeps agents running):

System Level (keeps infrastructure intact):
├── Guardian — monitors services, auto-recovers failures
├── Custom Health Checks — catches application-level failures (GPU storms, etc.)
├── Autonomy Tiers — explicit permission boundaries
├── Baseline Integrity — immutable + checksummed identity files
└── Self-Modification Rule — never hot-work your own infrastructure

Operational Level (keeps humans informed):
├── Supervisor Agent — monitors team health, triggers resets, daily briefings
├── Incremental Backups — 15-minute snapshots, point-in-time recovery
└── Background Automation — commit watchdog, codebase indexer, test generator
```

Session tools are documented in the main [README](../README.md),
[TOKEN-SPY.md](TOKEN-SPY.md), and [memory-shepherd/README.md](../memory-shepherd/README.md).
The supervisor pattern and background automation are covered in
[MULTI-AGENT-PATTERNS.md](MULTI-AGENT-PATTERNS.md) and
[OPERATIONAL-LESSONS.md](OPERATIONAL-LESSONS.md).
This doc covers the system-level layer.

**The goal is defense in depth.** No single protection catches everything.
The session watchdog catches context overflow but not infrastructure damage.
docs/MULTI-AGENT-PATTERNS.md (121 additions)

Keep coordination files small and focused. An agent reading STATUS.md should
know in 10 seconds what's happening and what's blocked.

---

## The Supervisor Pattern

In a multi-agent system, one agent should be the manager — not writing code,
but keeping the team healthy and the human informed. This is a fundamentally
different role from that of the worker agents.

### What a Supervisor Does

```
Every 15 minutes:
1. Check git logs — is each agent making commits?
2. Check session health — is anyone's context bloated?
3. Check for stuck agents — has anyone been quiet too long?
4. Post a situation report to the shared channel
5. Trigger session resets for confused or bloated agents

Daily (e.g., 6am):
6. Gather health metrics, spending data, commit counts, error logs
7. Compile a comprehensive briefing for the human operator
8. Include: what happened, what broke, what it cost, what needs attention
```
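The 15-minute pass above can be sketched as a single per-agent check. The thresholds and session-file convention here are assumptions for illustration:

```python
import time
from pathlib import Path

SESSION_LIMIT_BYTES = 256 * 1024     # reset threshold (assumed)
STALE_COMMIT_SECONDS = 2 * 60 * 60   # 2h with no commits = investigate

def check_agent(name, session_file, last_commit_ts, now=None):
    """One supervisor pass over a single agent; returns a list of findings."""
    now = now or time.time()
    findings = []
    size = Path(session_file).stat().st_size
    if size >= SESSION_LIMIT_BYTES:
        findings.append(f"{name}: session bloated ({size} bytes), trigger reset")
    elif size >= 0.7 * SESSION_LIMIT_BYTES:
        findings.append(f"{name}: session at {size / SESSION_LIMIT_BYTES:.0%}")
    if now - last_commit_ts > STALE_COMMIT_SECONDS:
        hours = (now - last_commit_ts) / 3600
        findings.append(f"{name}: no commits for {hours:.1f}h, investigate")
    return findings
```

An empty findings list means a green status report; anything else gets posted to the shared channel, and a "trigger reset" finding is acted on.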

### Why a Separate Agent

The supervisor needs a different model and different priorities than the
workers:

| Property | Worker Agents | Supervisor Agent |
|---|---|---|
| Primary model | Local (free, high volume) | Cloud (reliable, high judgment) |
| Core activity | Writing code, running tests | Monitoring, reporting, resetting |
| Failure mode | Gets stuck, loops, drifts | Must be reliable above all else |
| Autonomy | Tier 0-1 (mostly autonomous) | Tier 2 (speaks with operator authority) |
| Communication | Pushes code, posts to branches | DMs the human, posts ops reports |

The supervisor should run on the most capable model you have — its job is
judgment, not volume. A cheap model that misses a stuck agent costs more than
an expensive model that catches it.

### Supervisor Responsibilities

**Health monitoring:**
- Track commit frequency per agent (no commits for 2+ hours = investigate)
- Monitor session file sizes (approaching context limit = trigger reset)
- Watch for error patterns in logs (repeated timeouts = GPU contention)

**Session management:**
- Trigger session resets when agents get confused or bloated
- The supervisor has authority to reset any worker agent's session
- Resets are non-destructive — Memory Shepherd restores the baseline

**Daily briefing:**
- Compile 24h metrics: commits, costs, errors, uptime per agent
- Highlight anomalies: cost spikes, idle agents, repeated failures
- Include actionable items: "Agent X has been stuck since 3pm, needs
manual intervention"

**Report cards:**
- Periodic assessment of each agent's effectiveness
- Are they completing tasks? Are they making mistakes? Are they idle?
- Feed back into baseline updates (see
[WRITING-BASELINES.md](../memory-shepherd/docs/WRITING-BASELINES.md))

### The Supervisor Is Protected

The supervisor should be protected by the Guardian (see
[GUARDIAN.md](GUARDIAN.md)) at the same tier as the agent gateways. If the
supervisor goes down, nobody is watching the workers.

The supervisor should NOT have write access to the same infrastructure the
workers use. Its job is to observe and command, not to modify configs or
restart services directly — that's the Guardian's job.

---

## A Typical Hour

Here's what a healthy multi-agent system looks like in practice:

```
:00 Agent A is designing a new API endpoint. Spawns 3 sub-agents on
the local GPU: one for the handler, one for tests, one for docs.
All three run simultaneously at $0.

:02 Agent B picks up Agent A's PR. Runs integration checks. Posts
review comments on the branch.

:05 Agent C is grinding through a refactoring task entirely on the
local model. Commits every few minutes.

:05 The commit watchdog (background) reviews Agent C's latest commit.
Posts "LGTM, no issues" to the shared channel.

:10 Agent A's sub-agents finish. Handler, tests, and docs all written
in parallel. Agent A assembles the PR.

:15 Supervisor checks in. Pulls git logs, checks session health.
Agent C's session is at 180KB — approaching the 256KB limit.
Supervisor posts: "Agent C session at 70%, will auto-reset at 100%."

:20 Session watchdog fires. Agent C's session exceeds the threshold.
Watchdog deletes the session file, gateway creates a fresh one.
Agent C continues working — doesn't notice the swap.

:30 Supervisor checks in again. All three agents active. Commits
flowing. No errors. Posts a green status report.

:45 Agent B finishes integration tests. Merges Agent A's PR to main.
Agent C pulls latest, sees the new code.

:60 Guardian runs its 60-second check. All services healthy, all
protected files intact. No action needed.
```

**The key insight:** Most of the time, nothing dramatic happens. The system
runs itself. The value of the supervisor, guardian, and watchdog is in the
5% of the time when something goes wrong — and it gets caught and fixed
automatically instead of silently degrading for hours.
docs/OPERATIONAL-LESSONS.md (87 additions)
…that they answer different questions:
- Failed retries: Token Spy shows the cost of attempts, vLLM shows the load

Both are useful. Neither replaces the other.

---

## Background GPU Automation

A GPU running local models for agents sits idle most of the time. Agents think
in bursts — a few seconds of computation, then minutes of silence while they
read files, run commands, or wait. Those idle cycles can do real work.

### Commit Watchdog

Every agent commit gets automatically reviewed by the local model.

```
Every 5 minutes:
1. Check for new commits from any agent
2. For each new commit: pull the diff
3. Send to local model: "Are there broken imports? Obvious bugs?
Security issues? Anything suspicious?"
4. Post the review to the shared channel
```

At ~500 agent commits per day and ~5 seconds per review, this adds about 45
minutes of GPU time daily. Free QA for every push.

The reviews aren't deep architectural analysis — they're fast sanity checks.
Catching a broken import before it wastes another agent's time pays for itself
immediately.
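A minimal sketch of the watchdog's plumbing. The prompt wording is illustrative, and the model call itself is left out; only the git commands and prompt assembly are shown:

```python
import subprocess

REVIEW_PROMPT = (
    "Review this commit diff. Are there broken imports? Obvious bugs? "
    "Security issues? Anything suspicious? Keep it short.\n\n{diff}"
)

def new_commit_shas(last_seen, repo="."):
    """SHAs of commits made since the watchdog's last pass."""
    out = subprocess.run(
        ["git", "rev-list", f"{last_seen}..HEAD"],
        cwd=repo, capture_output=True, text=True, check=True,
    )
    return out.stdout.split()

def commit_diff(sha, repo="."):
    """Full patch for a single commit."""
    return subprocess.run(
        ["git", "show", "--patch", sha],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout

def build_review_prompt(diff_text):
    """Wrap one commit's diff in the sanity-check prompt."""
    return REVIEW_PROMPT.format(diff=diff_text)
```

The loop is then: every 5 minutes, `new_commit_shas` → `commit_diff` → `build_review_prompt` → local model → post the reply to the shared channel.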

### Codebase Indexer

Once a day (e.g., 5am before the morning briefing), walk the entire codebase:

1. Split files into chunks
2. Generate embedding vectors for each chunk
3. Store in a vector database (e.g., Qdrant)
4. Content-hash files so unchanged files get skipped on subsequent runs

This enables **semantic search** — agents can ask "find me the code that
handles authentication" instead of relying on keyword matching. The index
stays fresh because it rebuilds daily.
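The chunking and hash-skip steps can be sketched as follows. Chunk size, overlap, and the embedding/Qdrant calls are omitted or assumed; this shows only the skip logic that makes daily rebuilds cheap:

```python
import hashlib
from pathlib import Path

def chunk_text(text, chunk_chars=1200, overlap=200):
    """Split a file into overlapping chunks for embedding."""
    chunks = []
    step = chunk_chars - overlap
    for start in range(0, max(len(text), 1), step):
        chunk = text[start:start + chunk_chars]
        if chunk:
            chunks.append(chunk)
    return chunks

def needs_reindex(path, seen_hashes):
    """Content-hash the file; skip it if the hash hasn't changed."""
    digest = hashlib.sha256(Path(path).read_bytes()).hexdigest()
    if seen_hashes.get(str(path)) == digest:
        return False
    seen_hashes[str(path)] = digest
    return True
```

Persist `seen_hashes` between runs (a JSON file is enough) and most of the nightly walk becomes hash checks rather than embedding calls.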

### Test Generator

When the commit watchdog detects new source files without corresponding test
files:

1. Read the source file
2. Send to local model: "Write pytest-style test stubs covering happy path,
edge cases, and error handling"
3. Save to a staging area with a `# NEEDS REVIEW` header
4. Never commit automatically — these are starting points, not finished tests

This turns idle GPU cycles into test coverage scaffolding. A human or agent
can refine the stubs, but the hard part — reading the code and thinking about
what to test — is already done.

### Briefing Enrichment

Before generating a daily briefing (see the supervisor pattern in
[MULTI-AGENT-PATTERNS.md](MULTI-AGENT-PATTERNS.md)), pass raw health data
through the local model for pre-analysis:

- Error classification (transient vs. systemic)
- Root cause suggestions
- Trend detection (is cost increasing? are errors clustering?)

This adds ~30 seconds of GPU time but makes the briefing significantly more
actionable than raw metrics.

### GPU Duty Cycle

With all four background systems running alongside agent workloads, expect
15-50% GPU utilization depending on agent activity. Not bad for cycles that
would otherwise be wasted.

| System | GPU Time/Day | Trigger |
|---|---|---|
| Commit watchdog | ~45 min | Every 5 min (new commits) |
| Codebase indexer | ~15 min | Once daily (5am) |
| Test generator | ~10 min | On new files (via watchdog) |
| Briefing enrichment | ~1 min | Once daily (before briefing) |

None of these block agent work — they run during idle windows. If an agent
needs the GPU, inference requests from the background systems simply queue
behind the agent's requests (vLLM's continuous batching handles this
transparently).