diff --git a/README.md b/README.md index a41896276..cdc2d84ec 100644 --- a/README.md +++ b/README.md @@ -37,6 +37,9 @@ Defines a `---` separator convention: everything above is operator-controlled id ### Architecture Docs Deep-dive documentation on how OpenClaw talks to vLLM, why the proxy exists, how session files work, and the five failure points that kill local setups. +### Operational Guides +Lessons learned from running agents 24/7, multi-agent coordination patterns, and infrastructure protection strategies — all discovered by persistent agents running on local hardware. See the [docs/](docs/) directory. + --- ## Quick Start @@ -268,7 +271,10 @@ LightHeart-OpenClaw/ ├── docs/ │ ├── SETUP.md # Full local setup guide │ ├── ARCHITECTURE.md # How it all fits together -│ └── TOKEN-SPY.md # Token Spy setup & API reference +│ ├── TOKEN-SPY.md # Token Spy setup & API reference +│ ├── OPERATIONAL-LESSONS.md # Hard-won lessons from 24/7 agent ops +│ ├── MULTI-AGENT-PATTERNS.md # Coordination, swarms, and reliability +│ └── GUARDIAN.md # Infrastructure protection & autonomy tiers └── LICENSE ``` diff --git a/docs/GUARDIAN.md b/docs/GUARDIAN.md new file mode 100644 index 000000000..8e202de6e --- /dev/null +++ b/docs/GUARDIAN.md @@ -0,0 +1,253 @@ +# Infrastructure Protection — Guardians, Autonomy Tiers, and Safety Nets + +Agents with filesystem access and shell execution can — and will — break their +own infrastructure. This doc covers patterns for preventing that: immutable +watchdogs, explicit permission tiers, and the self-modification problem. + +These patterns complement the session-level protections (session watchdog, +Memory Shepherd) with system-level protections. Session tools keep agents +*running*; these patterns keep agents from *breaking what they run on*. 
+ +--- + +## The Problem + +Persistent agents with tool access can: + +- Kill their own gateway process while debugging something else +- Modify configs they depend on (proxy, vLLM, systemd services) +- Fill disks with log output or generated files +- Restart services during active sessions, losing state +- Overwrite their own baseline files (the ones Memory Shepherd restores from) + +These aren't hypothetical. They happen when agents are resourceful — which is +exactly the behavior you want, applied to the wrong target. + +--- + +## The Guardian Pattern + +A guardian is a watchdog process that monitors critical infrastructure and +auto-recovers from failures. The key property: **agents cannot modify or +disable it.** + +### Design Principles + +1. **Runs as root** (or a privileged user the agent can't impersonate) +2. **Immutable** — `chattr +i` on the script file prevents modification +3. **Self-healing** — re-sets its own immutable flags if cleared +4. **Tiered monitoring** — not everything is equally critical +5. **Conservative recovery** — soft restart first, restore from backup only + after repeated failures + +### What It Monitors + +Organize monitored resources into tiers by criticality: + +| Tier | Resources | Recovery | +|---|---|---| +| 1 (Critical) | Agent gateway, API proxy | Auto-restart, alert on failure | +| 2 (Important) | vLLM inference, tool proxy | Auto-restart with backoff | +| 3 (Supporting) | Token Spy, monitoring, cron | Auto-restart, no alert | +| 4 (Auxiliary) | Log rotation, cleanup timers | Log warning, skip | + +### Recovery Strategy + +``` +Service fails + → Soft restart (systemctl restart) + → Wait 10s, check again + → If still down: soft restart #2 + → Wait 10s, check again + → If still down: soft restart #3 + → Wait 10s, check again + → If still down after 3 attempts: restore config from backup, restart + → Alert operator +``` + +Three-strike recovery prevents thrashing while giving transient failures a +chance to self-resolve. 
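The three-strike loop above sketches naturally into a shell function. This is a minimal illustration — `restore_from_backup` and `alert_operator` are hypothetical hooks you'd wire to your own snapshot and alerting setup:

```shell
#!/usr/bin/env bash
# Minimal three-strike recovery loop. restore_from_backup and
# alert_operator are hypothetical hooks -- adapt to your setup.

recover_service() {
  local service="$1" attempt
  for attempt in 1 2 3; do
    systemctl restart "$service"          # soft restart
    sleep 10                              # give it a chance to come up
    if systemctl is-active --quiet "$service"; then
      return 0                            # recovered on a soft restart
    fi
  done
  # Still down after three soft restarts: restore config and escalate.
  restore_from_backup "$service"
  systemctl restart "$service"
  alert_operator "$service: restored from backup after 3 failed soft restarts"
  return 1                                # signal that hard recovery was needed
}
```

Call it from the guardian's check cycle for each Tier 1–2 service that fails a health check; the nonzero return distinguishes hard recoveries from soft ones.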
+ +### Snapshot Generations + +Keep N generations of known-good configs (default: 5): + +``` +/var/lib/guardian/backups/ + ├── gen-1/ # Most recent known-good + ├── gen-2/ + ├── gen-3/ + ├── gen-4/ + └── gen-5/ # Oldest +``` + +When a service can't recover after 3 restarts, restore from the most recent +generation. If that fails, try the next. This handles the case where the +most recent "good" config was actually the one that broke things. + +### Making It Immutable + +```bash +# Install the guardian script +sudo cp guardian.sh /usr/local/bin/guardian.sh +sudo chmod 755 /usr/local/bin/guardian.sh + +# Make it immutable — even root needs to explicitly remove this flag +sudo chattr +i /usr/local/bin/guardian.sh + +# Verify +lsattr /usr/local/bin/guardian.sh +# Should show: ----i--------e-- /usr/local/bin/guardian.sh +``` + +**Why immutable matters:** An agent with sudo access or root shell can `rm` or +edit any file. `chattr +i` prevents modification even by root — the flag must +be explicitly removed first. The guardian re-sets its own flag on startup, +so even if an agent manages to clear it, the next check cycle restores it. + +To update the guardian: +```bash +sudo chattr -i /usr/local/bin/guardian.sh +sudo cp guardian-v2.sh /usr/local/bin/guardian.sh +sudo chattr +i /usr/local/bin/guardian.sh +``` + +### Systemd Integration + +```ini +[Unit] +Description=Infrastructure Guardian +After=network.target + +[Service] +Type=simple +ExecStart=/usr/local/bin/guardian.sh +Restart=always +RestartSec=10 + +[Install] +WantedBy=multi-user.target +``` + +`Restart=always` ensures the guardian itself restarts if killed. + +--- + +## Autonomy Tiers + +Tell agents explicitly what they can and can't do. The most effective pattern +is a tiered system — not a flat list of rules. 
+ +### The Tiers + +| Tier | Label | Examples | Rationale | +|---|---|---|---| +| 0 | **Just do it** | Read files, run tests, draft PRs, push to feature branches, research, claim work, update scratch notes | Low risk, high frequency. Asking permission for these wastes cycles. | +| 1 | **Peer review** | Config changes to local services, new tools before deploy, research conclusions before sharing | Medium risk. Another agent or a quick human check prevents mistakes. | +| 2 | **Escalate** | Production systems, external communications, spending money, irreversible actions, OpenClaw/vLLM config changes | High risk. Always requires human approval. | + +### Implementing Tiers in Baselines + +Add autonomy tiers to your agent's baseline (see +[WRITING-BASELINES.md](../memory-shepherd/docs/WRITING-BASELINES.md)): + +```markdown +## Autonomy Tiers + +**Tier 0 — Just do it:** Chat, research, experiments, repo pushes, +test runs, claiming work, opinions, scratch notes. + +**Tier 1 — Peer review:** Config changes, new tools, research +conclusions. Get a review from [reviewer agent] or a human. + +**Tier 2 — Escalate:** Production infrastructure, external comms, +money, anything irreversible. Always ask [human operator]. +``` + +The key is making tiers concrete with examples. "Be careful with production" +is Tier 2 phrased vaguely. "Never touch the production database without +explicit approval from the operator" is Tier 2 phrased usefully. + +### The Self-Modification Rule + +If an agent's code touches its **own** infrastructure, it must not modify +it directly: + +1. Spawn a dev environment (separate machine, container, or branch) +2. Make changes there +3. Test and validate +4. Promote to production only after verification + +**Why:** An agent that modifies the gateway it runs on can crash itself +mid-operation. There's no recovery from "I broke the thing that runs me." 
+ +This is the production hot-work lesson (see +[OPERATIONAL-LESSONS.md](OPERATIONAL-LESSONS.md)) formalized as a rule. + +--- + +## Baseline Integrity Protection + +Memory Shepherd's baseline files are critical — they define who each agent is +after every reset. If baselines get corrupted, agents get bad resets. + +### Immutable Baselines + +```bash +# Lock baseline files +sudo chattr +i memory-shepherd/baselines/*.md + +# To update, temporarily unlock +sudo chattr -i memory-shepherd/baselines/my-agent-MEMORY.md +vim memory-shepherd/baselines/my-agent-MEMORY.md +sudo chattr +i memory-shepherd/baselines/my-agent-MEMORY.md +``` + +### Checksum Validation + +```bash +# Generate checksums after writing baselines +sha256sum memory-shepherd/baselines/*.md > memory-shepherd/baselines/.checksums + +# Verify before each reset +sha256sum --check memory-shepherd/baselines/.checksums || echo "TAMPERING DETECTED" +``` + +Add the checksum verification to the Memory Shepherd workflow or as a +pre-reset hook. + +### Version Control + +Keep baselines in version control with the rest of the repo. 
This gives you: +- Full change history (who changed what, when) +- Rollback capability (`git checkout -- baselines/`) +- Diff visibility (`git diff` shows exactly what changed) +- Branch-based review for baseline updates + +--- + +## Combining Protections + +The full protection stack, from session level to system level: + +``` +Session Level (keeps agents running): + ├── Session Watchdog — prevents context overflow crashes + ├── Token Spy — monitors cost, auto-resets bloated sessions + └── Memory Shepherd — resets memory to baseline, prevents drift + +System Level (keeps infrastructure intact): + ├── Guardian — monitors services, auto-recovers failures + ├── Autonomy Tiers — explicit permission boundaries + ├── Baseline Integrity — immutable + checksummed identity files + └── Self-Modification Rule — never hot-work your own infrastructure +``` + +Session tools are documented in the main [README](../README.md), +[TOKEN-SPY.md](TOKEN-SPY.md), and [memory-shepherd/README.md](../memory-shepherd/README.md). +This doc covers the system-level complement. + +**The goal is defense in depth.** No single protection catches everything. +The session watchdog catches context overflow but not infrastructure damage. +The guardian catches service failures but not identity drift. Together, they +cover the full failure surface of persistent agents. diff --git a/docs/MULTI-AGENT-PATTERNS.md b/docs/MULTI-AGENT-PATTERNS.md new file mode 100644 index 000000000..91803d495 --- /dev/null +++ b/docs/MULTI-AGENT-PATTERNS.md @@ -0,0 +1,296 @@ +# Multi-Agent Patterns — Coordination, Reliability, and Swarms + +Patterns for running multiple agents together. Covers coordination protocols, +reliability through redundancy, sub-agent spawning, and the failure modes that +emerge when agents collaborate. + +These patterns were developed running 3+ persistent agents on local hardware. +They apply to any multi-agent setup — OpenClaw, cloud APIs, or mixed. 
+ +--- + +## Coordination: The Sync Protocol + +When multiple agents share a codebase, you need rules for who changes what and +when. Without them, agents overwrite each other's work, merge conflicts pile +up, and nobody knows what's current. + +### Branch-Based Review Pipeline + +``` +Agent A creates feature branch → builds → pushes + ↓ + Agent B reviews branch + ↓ ↓ + Approved Needs changes + ↓ ↓ + Agent B merges Agent A fixes, re-pushes + to main ↓ + ↓ Agent B re-reviews + Agent C validates + (integration test) +``` + +**Branch naming:** Use agent-identifiable prefixes: +- `agent-1/short-description` +- `agent-2/short-description` +- `reviewer/short-description` (rare — reviewers mostly review) + +### What Needs Review vs. What Doesn't + +| Needs Branch + Review | Goes Direct to Main | +|---|---| +| All code changes (.py, .js, .ts, .sh, .yaml) | Status updates, project boards | +| New tools or scripts | Research docs, notes | +| Product code | Daily logs, memory files | +| Infrastructure configs | Test results, benchmarks | + +The split is: **code and config through branches, docs and status direct to +main.** This keeps the review pipeline focused on changes that can break things. + +### Heartbeat Protocol + +For always-on agents, run a periodic sync (every 15-60 minutes): + +1. Pull latest from main +2. Check the project board for unclaimed work +3. Check for pending reviews from other agents +4. Check for handoffs or messages from siblings +5. Claim work, push results, update status + +The heartbeat prevents drift between agents and catches handoffs that would +otherwise sit idle. + +--- + +## Reliability Through Redundancy + +### The Math + +Single local model agents have inherent reliability limits. 
From empirical +testing: + +| Setup | Pattern | Success Rate | +|---|---|---| +| 1 agent | Single attempt | ~67-77% | +| 2 agents | Any-success (take first) | ~95% | +| 3 agents | 2-of-3 voting | ~93% | +| 5 agents | 3-of-5 voting | ~97% | + +**The simplest upgrade:** Spawn 2 agents on the same task, take the first +successful result. This takes reliability from ~70% to ~95% at 2x compute +cost — but on local hardware, compute is free. + +### When to Use Redundancy + +- **Critical tasks** where failure means manual intervention +- **Tasks with clear success criteria** (file exists, test passes, output matches) +- **Idempotent operations** where running twice causes no harm + +Don't use redundancy for: +- Tasks with side effects (sending emails, posting messages) +- Tasks that modify shared state (unless you handle conflicts) +- Exploratory tasks where "different answer" isn't "wrong answer" + +--- + +## Sub-Agent Spawning + +### Task Templates That Work + +The difference between a 30% and 90% success rate often comes down to how the +task is written. + +**High success (~90%):** + +``` +You are a [ROLE] agent. + +Complete ALL of these steps: + +1. Run: ssh user@192.168.0.100 "[COMMAND_1]" +2. Run: ssh user@192.168.0.100 "[COMMAND_2]" +3. Run: ssh user@192.168.0.100 "[COMMAND_3]" +4. Write ALL findings to: /absolute/path/to/output.md + +Include raw command outputs. Do not summarize or omit. +Do not stop until the file is written. +Reply "Done". Do not output JSON. Do not loop. +``` + +**What makes it work:** +1. Explicit commands (not "check the system" — actual commands to run) +2. Numbered steps (1, 2, 3 — not prose paragraphs) +3. Absolute file paths (not relative, not "save it somewhere") +4. Reinforcement ("do not stop until the file is written") +5. Stop prompt ("Reply Done. Do not output JSON. Do not loop.") +6. Single focus (one role, one objective) + +**Low success (~30-40%):** +- Indirect instructions: "SSH as: user@host" instead of "Run: ssh user@host ..." 
+- Ambiguous scope: "Document all security configuration" +- Multi-server tasks: "Check both server A and server B" +- Open-ended exploration: "Look around and report what you find" +- Complex conditional logic in a single task + +### When to Spawn vs. Do Directly + +**Rule of thumb:** If you can write the task as one clear sentence with no +"and then," it's spawn-able. + +| Spawn | Do Directly | +|---|---| +| Pure research, multiple independent questions | Needs tool execution with complex chains | +| Repetitive validation across artifacts | Time-sensitive, need it now | +| Document generation from clear templates | Complex multi-step workflows | +| Data gathering, parallel searches | Tasks requiring decisions mid-execution | + +### Resource Management + +Each sub-agent consumes GPU memory. On a single GPU: + +| GPU Load | Concurrent Agents | Recommendation | +|---|---|---| +| Light | 1-4 | Fast, reliable | +| Medium | 5-8 | Good throughput, optimal sweet spot | +| Heavy | 9-12 | Some queuing expected | +| Overloaded | 13+ | Timeouts likely | + +**Pre-spawn health check:** +```bash +# Check VRAM before spawning +curl localhost:9199/status | jq '.nodes[].vram_percent' +# If > 90%, defer spawning or use a lighter approach +``` + +**Timeouts are mandatory.** Without `runTimeoutSeconds`, local models can loop +indefinitely. Recommended values: + +| Task Complexity | Timeout | +|---|---| +| Simple (file write, single command) | 60s | +| Multi-step (3-5 actions) | 120s | +| Complex research | 180s | + +### Spawning Patterns + +**Pattern 1: Research Fan-Out** + +Spawn N agents, each with one focused question. Each writes findings to a +specific file. Coordinator aggregates. + +``` +Coordinator + ├── Agent 1: "What are the top 3 embedding models for code search?" + │ → writes to /tmp/research/embeddings.md + ├── Agent 2: "What vector databases support hybrid search?" + │ → writes to /tmp/research/vector-dbs.md + └── Agent 3: "What's the state of the art in code chunking?" 
     → writes to /tmp/research/chunking.md
+```
+
+**Constraint:** Each agent gets ONE question. Don't overload.
+
+**Pattern 2: Validation Sweep**
+
+Define validation criteria. Spawn one agent per artifact. Agents report
+pass/fail with specific issues.
+
+Good for: testing multiple configs, validating documentation accuracy,
+checking multiple endpoints.
+
+**Pattern 3: Document Generation**
+
+Define a template. Spawn agents with specific content assignments. Works well
+for API docs, how-to guides, research summaries.
+
+Fails for: docs requiring tool execution, cross-file coordination, or content
+that depends on other agents' output.
+
+### Anti-Patterns
+
+| Anti-Pattern | Why It Fails |
+|---|---|
+| Tool-heavy sub-agents | Local models output tool calls as plain text JSON |
+| Overloaded task scope | Too many objectives = shallow coverage on all of them |
+| Cross-agent dependencies | Sub-agents can't read each other's output mid-run |
+| Long-running complex chains | Multi-step workflows with decision points derail |
+
+---
+
+## Echo Chamber Prevention
+
+When multiple agents work together, they can amplify each other's assumptions.
+This is the most dangerous multi-agent failure mode because it looks like
+productive collaboration.
+
+### The Pattern
+
+1. Agent A claims something is working
+2. Agent B agrees without independent verification
+3. Agent C builds on the claim
+4. All three celebrate success
+5. Nobody checked if the files actually exist
+
+### The Protocol
+
+**One-Lead Rule:** For debugging sessions, one agent investigates. Others
+stand by. Multiple agents poking at the same problem simultaneously creates
+noise, not signal.
+
+**Verify Before Claiming:** "Works" means:
+- File exists on disk (not just "I wrote it")
+- End-to-end test passed (not just "it should work")
+- Output matches expectations (not just "no errors")
+
+**Red Flag — Rapid Fire:** If 3+ messages fly between agents in quick
+succession, everyone pauses.
Fast agreement without verification is a signal, +not progress. + +**Stop Means Stop:** When told to stop, acknowledge with ONE message, then +silence. Don't negotiate, don't add "one more thing." + +**Skepticism > Agreement:** Never "+1" without independent verification. +If Agent A says it works, Agent B should check independently before agreeing. + +--- + +## Division of Labor + +If you run both local and cloud models, formalize who does what: + +| Task Type | Assign To | Rationale | +|---|---|---| +| Testing, benchmarking, iteration | Local agent | Zero cost, unlimited retries | +| Large file analysis (>32K tokens) | Local agent | Large context at $0 | +| Code generation, boilerplate | Local agent | Volume work, low judgment | +| Integration testing | Cloud agent | Multi-system reasoning | +| Architecture, code review | Cloud agent | Nuance worth the cost | +| Complex debugging | Cloud agent | Error recovery, judgment calls | + +**The savings compound.** Each test run a local agent handles saves a cloud API +call. Over a day of development, this adds up to $50-100+ in saved API costs. + +For burn rate tracking, see [TOKEN-SPY.md](TOKEN-SPY.md). Token Spy shows +per-agent cost so you can verify the split is working. + +--- + +## Status & Coordination Files + +For teams of agents sharing a repo, establish conventions for coordination +files: + +| File | Purpose | Update Frequency | Max Size | +|---|---|---|---| +| `STATUS.md` | Who's doing what right now | Every heartbeat | ~100 lines | +| `PROJECTS.md` | Work board with ownership | When work changes | No limit | +| `MISSIONS.md` | North-star priorities | Rarely | Short | +| `memory/YYYY-MM-DD.md` | Daily log of what happened | Continuously | No limit | + +**STATUS.md** is ephemeral — it reflects current state only, not history. +**PROJECTS.md** is the work board — agents check it for unclaimed tasks. +**Daily logs** are the audit trail — what happened, when, and by whom. 
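For illustration, a STATUS.md following these conventions might look like this — agent names and tasks are hypothetical:

```markdown
# STATUS — current state only, no history

## agent-1
- Working on: agent-1/fix-proxy-timeouts (branch pushed, awaiting review)
- Blocked: no

## agent-2
- Working on: reviewing agent-1/fix-proxy-timeouts
- Blocked: no

## agent-3
- Working on: idle — checking PROJECTS.md for unclaimed work
- Blocked: waiting on vLLM restart (see today's daily log)
```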
+ +Keep coordination files small and focused. An agent reading STATUS.md should +know in 10 seconds what's happening and what's blocked. diff --git a/docs/OPERATIONAL-LESSONS.md b/docs/OPERATIONAL-LESSONS.md new file mode 100644 index 000000000..6c1cee64e --- /dev/null +++ b/docs/OPERATIONAL-LESSONS.md @@ -0,0 +1,294 @@ +# Operational Lessons — What We Learned Running Agents 24/7 + +Hard-won lessons from running persistent LLM agents on local hardware. These +aren't theoretical — they're from real incidents, real failures, and real fixes +discovered by the agents themselves. + +If you're running agents that stay up for hours or days, you'll eventually hit +most of these. Might as well learn from our mistakes. + +--- + +## Silent Failures + +### Parser Mismatch Is Silent + +Using the wrong `--tool-call-parser` doesn't produce an error. The model loads +fine, accepts requests, and returns responses — but tool calls come back as +plain text instead of structured JSON. + +**Symptom:** Agent seems to work but never actually executes tools. Content +field contains JSON-looking text instead of proper `tool_calls`. + +**Fix:** Match the parser to the model: + +| Model Family | Parser | +|---|---| +| Qwen3-Coder-Next | `qwen3_coder` | +| Qwen2.5-Coder | `hermes` | +| Qwen2.5 Instruct | `hermes` | +| Qwen3-8B/32B | `hermes` | + +The tool proxy (see [ARCHITECTURE.md](ARCHITECTURE.md)) catches some of these +as a safety net — it extracts tool calls from text content — but native parsing +is always more reliable. + +### Compat Flags Fail Silently + +vLLM doesn't reject unknown parameters — it silently ignores them. If you're +missing the `compat` block in `openclaw.json`, requests appear to succeed but +produce garbage or empty responses. + +See the README for the four critical compat flags. 
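At launch time, the parser is a vLLM flag. A hypothetical invocation for a Qwen2.5-Coder model — the model path is an example, and flag availability varies by vLLM version, so check `vllm serve --help`:

```bash
# Hypothetical launch line -- match --tool-call-parser to the model family
# per the table above. Flags vary across vLLM versions; verify with --help.
vllm serve Qwen/Qwen2.5-Coder-32B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```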
+ +--- + +## Session & Memory Management + +### Pre-Compaction Memory Flush + +Before the session watchdog or Token Spy resets a session (see +[TOKEN-SPY.md](TOKEN-SPY.md)), any durable memories need to be externalized. +Agents should: + +1. Write important findings to persistent files (daily logs, project docs) +2. Commit and push to version control +3. Only then allow the session to reset + +If your agent operates on a timer (heartbeat or cron), build the flush into +the schedule. Memory Shepherd (see [memory-shepherd/README.md](../memory-shepherd/README.md)) +handles the MEMORY.md reset cycle, but agents need to be taught to externalize +*before* the reset fires. + +**Tip:** Include a brief explanation of the memory system in your baseline so +the agent knows to externalize important findings. See +[WRITING-BASELINES.md](../memory-shepherd/docs/WRITING-BASELINES.md) for how. + +### Three-Tier Memory Persistence + +For agents running long enough to accumulate real knowledge, use three tiers: + +| Tier | Storage | Lifetime | Example | +|------|---------|----------|---------| +| Scratch notes | Below `---` in MEMORY.md | Until next reset (hours) | "PR #42 waiting on CI" | +| Daily logs | `memory/YYYY-MM-DD.md` | Days to weeks | "Found auth bug, fixed in commit abc123" | +| Permanent knowledge | Project repo, docs, baselines | Permanent | Architecture decisions, lessons learned | + +Scratch notes get archived by Memory Shepherd. Daily logs get reviewed and +distilled into permanent knowledge periodically. Nothing important should live +only in scratch notes. + +### Text > Brain + +Agents don't have persistent memory between sessions — only files persist. +"Mental notes" don't survive restarts. + +**Rule:** If it's worth remembering, write it to a file. If someone says +"remember this," write it to today's daily log. If you learn a lesson, write +it to the shared lessons file. 
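The flush can be a small shell helper called from the heartbeat before any reset. A sketch — the `memory/` layout matches the three-tier table above, but the path and commit conventions are assumptions:

```shell
#!/usr/bin/env bash
# Sketch of a pre-reset memory flush. Assumes a memory/ directory of
# daily logs inside a git repo; both are conventions, not requirements.

flush_memory() {
  local note="$1"
  local log="memory/$(date +%F).md"
  mkdir -p memory
  printf -- '- %s\n' "$note" >> "$log"    # 1. externalize to the daily log
  git add "$log"                          # 2. commit and push before the
  git commit --quiet -m "memory flush $(date +%F)" || true
  git push --quiet || true                #    reset fires
}
```

Only after the push succeeds should the session be allowed to reset — anything still in context alone is lost.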
+ +--- + +## Tool Calling Reliability + +### Making Local Models Use Tools + +Local models (Qwen, etc.) sometimes answer questions directly instead of using +provided tools. Two-layer fix: + +1. **Prompt layer:** Add explicit instructions: `"You MUST use the provided + tools. Do not answer directly. Always call a tool."` +2. **API layer:** The tool proxy injects stop tokens (`stop_token_ids: [151645]`) + to prevent runaway generation after tool calls. + +**Sampling settings that help:** + +```json +{ + "temperature": 0.1, + "top_p": 0.1 +} +``` + +Lower temperature reduces "creative" responses that skip tool use. + +### The Stop Prompt + +For sub-agent tasks, always end with a stop prompt: + +``` +Reply "Done". Do not output JSON. Do not loop. +``` + +Without this, local models often: +- Output tool calls as raw JSON text instead of structured calls +- Enter infinite loops repeating the same action +- Continue generating after completing the task + +The stop prompt is a safety net on top of the proxy's `MAX_TOOL_CALLS` limit +(see README for configuration). + +### Atomic Chains for Multi-Step Tasks + +Local models struggle with sequential tool chains (read file → transform → +write result). They conflate steps, loop, or skip actions. + +**Fix:** Break multi-step tasks into single-action agents: + +``` +Agent 1 (read file) → output → Agent 2 (write result) +``` + +Key principles: +1. **One action per agent** — read OR write, never both in sequence +2. **Pass data through spawn results** — not shared state +3. **Verify side effects, not output text** — check the file exists, not what + the agent said it did +4. 
**Include the stop prompt** in every sub-agent task + +**When to use atomic chains:** +- Multi-step file operations +- Read → transform → write pipelines +- Any task where local models loop on sequential tools + +--- + +## Production Safety + +### Never Hot-Work Production + +If your agent runs on the same server as its infrastructure (gateway, proxy, +vLLM), never modify that infrastructure while the agent is live. + +**What happens:** Multiple gateway processes competing for the same port. +Connection drops. "Pairing required" errors. Silent failures that look like +model problems but are actually process conflicts. + +**Rule:** Use a separate machine or container for testing changes. Promote to +production only after validation. If you only have one machine, stop the agent +before making infrastructure changes. + +This applies to: +- Gateway config changes +- Proxy updates +- vLLM restarts +- systemd service modifications + +### Docker Container → Host Networking + +If OpenClaw runs in Docker and needs to reach services on the host (vLLM, +proxy, Token Spy): + +- Use `172.17.0.1` (Docker bridge IP) instead of `127.0.0.1` in URLs +- Add firewall rules: `ufw allow from 172.17.0.0/16 to any port ` +- `localhost` inside a container refers to the container, not the host + +### Verify Before Claiming + +Status updates are not proof of completion. Agents sometimes report "done" +before verifying the work actually happened. + +**Rule:** Working tree state > status reports. + +Before declaring a task complete: +- Check `git status` — are files actually committed? +- Check `git log` — does the commit exist? +- Test the implementation — does it actually work? +- Check file existence — does the output file exist on disk? + +Premature completion claims waste time because the next agent in the chain +assumes the work is done. + +--- + +## Versioning & Rollback + +### Snapshot Before Experimenting + +Before any experiment on production infrastructure: + +1. 
**Map** everything that might be touched (be thorough) +2. **Capture state** to version control with a tag +3. **Push before changing** — no baseline = no rollback +4. When something breaks → `git diff` between versions to find the change + +```bash +git add -A && git commit -m "pre-experiment snapshot" +git tag -a v1.2.0 -m "Before proxy v5 experiment" +git push && git push --tags +``` + +Compare: `git diff v1.1.0 v1.2.0` +Rollback: `git checkout v1.1.0 -- path/to/file` + +The effort of tagging before changes is trivial. The cost of not having a +rollback point is hours of debugging. + +--- + +## Local Model Quirks + +### Sub-Agent Announcements Are Normal + +Local Qwen agents running under OpenClaw will sometimes announce "Research +complete" or similar status messages multiple times. This is normal OpenClaw +chaining behavior, not a bug. + +### Bash Syntax in Scripts + +When copying code between languages, watch for Python-isms in Bash: + +- **Wrong:** `"""` for docstrings in Bash (causes syntax errors) +- **Right:** `#` comments in Bash + +This seems obvious but catches agents that are primarily trained on Python. + +--- + +## Cost-Aware Task Allocation + +If you run both local and cloud models, allocate work by cost sensitivity: + +| Task Type | Best For | Why | +|---|---|---| +| Testing, benchmarking, iteration | Local model | Zero cost, unlimited retries | +| Large file analysis | Local model | 128K context at $0 | +| Code generation, boilerplate | Local model | High volume, low judgment | +| Architecture decisions | Cloud model | Complex reasoning worth the cost | +| Code review | Cloud model | Nuance and quality matter | +| Customer-facing output | Cloud model | Reliability and polish | + +Every testing task a local model handles saves cloud API credits. The savings +compound — a single day of local testing can save $50-100+ in API calls. 
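A cost-aware split doesn't need infrastructure — even a lookup is enough to make the default explicit. A sketch, with hypothetical task categories:

```shell
#!/usr/bin/env bash
# Hypothetical task router: volume work goes local, judgment work goes cloud.
# Categories are examples -- adapt to the table above.

route_task() {
  case "$1" in
    test|benchmark|iterate|codegen|bigfile) echo "local" ;;  # zero marginal cost
    architecture|review|debug|customer)     echo "cloud" ;;  # judgment worth paying for
    *)                                      echo "local" ;;  # default to the free tier
  esac
}
```

Defaulting unknown work to the local model keeps cloud spend opt-in rather than accidental.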
+ +For a more detailed division of labor in multi-agent setups, see +[MULTI-AGENT-PATTERNS.md](MULTI-AGENT-PATTERNS.md). + +### Burn Rate Awareness + +With Token Spy tracking costs (see [TOKEN-SPY.md](TOKEN-SPY.md)), establish a +baseline burn rate for your workload. Know what "normal" looks like so you can +spot anomalies: + +- Sudden cost spikes often mean an agent entered a retry loop +- Flat-zero cost on a cloud agent means it stopped working (not that it's efficient) +- Sub-agent spawns multiply cost — 10 parallel cloud sub-agents at $0.05 each = $0.50 per round + +--- + +## Monitoring: Two Different Questions + +If you run both Token Spy and a vLLM monitor (Prometheus/Grafana), understand +that they answer different questions: + +| Monitor | Question It Answers | Key Metrics | +|---|---|---| +| Token Spy (:9110) | How much did we spend? | Tokens, cost, session health | +| vLLM Monitor (:9115) | Is the GPU overloaded? | VRAM, queue depth, tokens/sec | + +**Why they diverge:** +- Local model runs: Token Spy shows $0, vLLM shows lots of tokens processed +- Cache hits: Token Spy shows reduced cost, vLLM shows no request at all +- Failed retries: Token Spy shows the cost of attempts, vLLM shows the load + +Both are useful. Neither replaces the other. diff --git a/memory-shepherd/docs/WRITING-BASELINES.md b/memory-shepherd/docs/WRITING-BASELINES.md index 9e878d969..f5ccd344f 100644 --- a/memory-shepherd/docs/WRITING-BASELINES.md +++ b/memory-shepherd/docs/WRITING-BASELINES.md @@ -52,6 +52,9 @@ The most effective pattern we've found is explicit autonomy tiers. Agents need t This eliminates the "should I ask or just do it?" hesitation that wastes cycles. +For a deeper dive into autonomy tiers and infrastructure protection, see +[GUARDIAN.md](../../docs/GUARDIAN.md). + ### 4. Capabilities and Tools Tell the agent what it can actually use. Agents that know their tools are dramatically more effective than ones guessing.