diff --git a/README.md b/README.md
index 51b9995cb..1744e865f 100644
--- a/README.md
+++ b/README.md
@@ -1,10 +1,29 @@
 # LightHeart OpenClaw
 
-**Keep your AI agents running. No matter what they do to themselves.**
-
-An open source operations toolkit for persistent LLM agents. Built for [OpenClaw](https://openclaw.io) but many components work with any agent framework or service stack.
-
-This toolkit is the infrastructure layer of a proven multi-agent architecture — the [OpenClaw Collective](COLLECTIVE.md) — where 3 AI agents coordinate autonomously on shared projects using local GPU hardware. The companion repository **Android-Labs** (private) is the proof of work: 3,464 commits from 3 agents over 8 days, producing three shipping products and 50+ technical research documents. These tools kept them running.
+**A methodology for building persistent AI agent teams that actually work.**
+
+Patterns, tools, and battle-tested operational knowledge for running AI agents
+that stay up for hours, days, and weeks — coordinating with each other, learning
+from failures, and shipping real software. Built from production experience
+running 3+ agents 24/7 on local hardware.
+
+About 70% of this repository is framework-agnostic. The patterns for identity,
+memory, coordination, autonomy, and observability apply to any agent system —
+Claude Code, LangChain, AutoGPT, custom agents, or anything else that runs long
+enough to accumulate state. The remaining 30% is a reference implementation
+using [OpenClaw](https://openclaw.io) and vLLM that demonstrates the patterns
+concretely.
+
+This is the infrastructure layer of a proven multi-agent architecture — the
+[OpenClaw Collective](COLLECTIVE.md) — where 3 AI agents coordinate
+autonomously on shared projects using local GPU hardware. The companion
+repository **Android-Labs** (private) is the proof of work: 3,464 commits from
+3 agents over 8 days, producing three shipping products and 50+ technical
+research documents. These tools kept them running.
+
+**Start here:** [docs/PHILOSOPHY.md](docs/PHILOSOPHY.md) — the conceptual
+foundation, five pillars, complete failure taxonomy, and a reading map based on
+what you're building.
 
 | Component | What it does | Requires OpenClaw? | Platform |
 |-----------|-------------|-------------------|----------|
@@ -20,42 +39,57 @@ This toolkit is the infrastructure layer of a proven multi-agent architecture 
 
 ## What's Inside
 
-### Session Watchdog
-A lightweight daemon that monitors `.jsonl` session files and automatically cleans up bloated ones before they hit the context ceiling. Runs on a timer, catches danger-zone sessions, deletes them, and removes their references from `sessions.json` so the gateway seamlessly creates fresh ones.
-
-**The agent doesn't even notice.** It just gets a clean context window mid-conversation. No more `Context overflow: prompt too large for the model` crashes.
-
-### vLLM Tool Call Proxy (v4)
-A transparent proxy between OpenClaw and vLLM that makes local model tool calling actually work. Handles SSE re-wrapping, tool call extraction from text, response cleaning, and loop protection.
+### The Methodology
 
-Without it, you get "No reply from agent" with 0 tokens. With it, your local agents just work.
+These docs capture what we learned running persistent agent teams. They apply to
+any framework.
 
-### Token Spy — API Cost & Usage Monitor
-A transparent API proxy that captures per-turn token usage, cost, latency, and session health for cloud model calls (Anthropic, OpenAI, Moonshot). Point your agent's `baseUrl` at Token Spy instead of the upstream API — it logs everything, then forwards requests and responses untouched, including SSE streams.
+| Doc | What It Covers |
+|-----|---------------|
+| [PHILOSOPHY.md](docs/PHILOSOPHY.md) | **Start here.** Five pillars of persistent agents, failure taxonomy, reading map, framework portability guide |
+| [WRITING-BASELINES.md](memory-shepherd/docs/WRITING-BASELINES.md) | How to define agent identity that survives resets and drift |
+| [MULTI-AGENT-PATTERNS.md](docs/MULTI-AGENT-PATTERNS.md) | Coordination protocols, reliability math, sub-agent spawning, echo chamber prevention, supervisor pattern |
+| [OPERATIONAL-LESSONS.md](docs/OPERATIONAL-LESSONS.md) | Silent failures, memory management, tool calling reliability, production safety, background GPU automation |
+| [GUARDIAN.md](docs/GUARDIAN.md) | Infrastructure protection, autonomy tiers, immutable watchdogs, defense in depth |
 
-Includes a real-time dashboard with session health cards, cost charts, token breakdown, and cumulative spend tracking. Can auto-kill sessions that exceed a configurable character limit. Works with any OpenAI-compatible or Anthropic API client.
+### The Reference Implementation (OpenClaw + vLLM)
 
-### Golden Configs
-Battle-tested `openclaw.json` and `models.json` templates with the critical `compat` block that prevents OpenClaw from sending parameters vLLM silently rejects. Getting these four flags wrong produces mysterious failures with no error messages — we figured them out so you don't have to.
+Working tools that implement the methodology. Use them directly or adapt the
+patterns to your stack.
 
-### Workspace Templates
-Starter personality files (`SOUL.md`, `IDENTITY.md`, `TOOLS.md`, `MEMORY.md`) that OpenClaw injects into every agent session. Customize your agent's personality, knowledge, and working memory.
+**Session Watchdog** — Monitors `.jsonl` session files and cleans up bloated
+ones before they hit the context ceiling. The agent doesn't notice — it just
+gets a clean context window mid-conversation.
 
-### Memory Shepherd
-Periodic memory reset for persistent LLM agents. Agents accumulate scratch notes in `MEMORY.md` during operation — Memory Shepherd archives those notes and restores the file to a curated baseline on a schedule. Keeps agents on-mission by preventing context drift, memory bloat, and self-modification of instructions.
+**vLLM Tool Call Proxy (v4)** — Transparent proxy between OpenClaw and vLLM
+that makes local model tool calling work. Handles SSE re-wrapping, tool call
+extraction from text, response cleaning, and loop protection.
 
-Defines a `---` separator convention: everything above is operator-controlled identity (rules, capabilities, pointers), everything below is agent scratch space that gets archived and cleared. See [memory-shepherd/README.md](memory-shepherd/README.md) for full documentation.
+**Token Spy** — Transparent API proxy that captures per-turn token usage, cost,
+latency, and session health for cloud model calls (Anthropic, OpenAI, Moonshot).
+Real-time dashboard with session health cards, cost charts, and auto-kill for
+sessions exceeding configurable limits. Works with any OpenAI-compatible or
+Anthropic API client.
 
-### Guardian
-Self-healing process watchdog for LLM infrastructure. Runs as a root systemd service that agents cannot kill or modify. Monitors processes, systemd services, Docker containers, and file integrity — automatically restoring from known-good backups when things break.
+**Memory Shepherd** — Periodic memory reset for persistent agents. Archives
+scratch notes and restores MEMORY.md to a curated baseline on a schedule.
+Defines the `---` separator convention: operator-controlled identity above,
+agent scratch space below.
 
-Supports tiered health checks (port listening, HTTP endpoints, custom commands, JSON validation), a recovery cascade (soft restart → backup restore → restart), generational backups with immutable flags, and restart delegation chains. Everything is config-driven via an INI file. See [guardian/README.md](guardian/README.md) for full documentation.
+**Guardian** — Self-healing process watchdog for LLM infrastructure. Runs as a
+root systemd service that agents cannot kill or modify. Monitors processes,
+systemd services, Docker containers, and file integrity — automatically
+restoring from known-good backups when things break. Supports tiered health
+checks, recovery cascades, and generational backups. See
+[guardian/README.md](guardian/README.md) for full documentation.
 
-### Architecture Docs
-Deep-dive documentation on how OpenClaw talks to vLLM, why the proxy exists, how session files work, and the five failure points that kill local setups.
+**Golden Configs** — Battle-tested `openclaw.json` and `models.json` with the
+critical `compat` block that prevents silent failures. Workspace templates for
+agent personality, identity, tools, and working memory.
 
-### Operational Guides
-Lessons learned from running agents 24/7, multi-agent coordination patterns, and infrastructure protection strategies — all discovered by persistent agents running on local hardware. See the [docs/](docs/) directory.
+**Architecture Docs** — How OpenClaw talks to vLLM, why the proxy exists, how
+session files work, and the five failure points that kill local setups.
+See [ARCHITECTURE.md](docs/ARCHITECTURE.md) and [SETUP.md](docs/SETUP.md).
 
 ---
 
@@ -350,6 +384,7 @@ LightHeart-OpenClaw/
 │   └── docs/
 │       └── HEALTH-CHECKS.md           # Health check & recovery reference
 ├── docs/
+│   ├── PHILOSOPHY.md                  # Start here — pillars, failures, reading map
 │   ├── SETUP.md                       # Full local setup guide
 │   ├── ARCHITECTURE.md                # How it all fits together
 │   ├── TOKEN-SPY.md                   # Token Spy setup & API reference
@@ -439,4 +474,4 @@ Apache 2.0 — see [LICENSE](LICENSE)
 
 ---
 
-Built by [Lightheart Labs](https://github.com/Light-Heart-Labs) and the [OpenClaw Collective](COLLECTIVE.md) from real production pain running autonomous AI agents on local hardware.
+Built from production experience by [Lightheart Labs](https://github.com/Light-Heart-Labs) and the [OpenClaw Collective](COLLECTIVE.md). The patterns were discovered by the agents. The docs were written by the agents. The lessons were learned the hard way.
diff --git a/docs/PHILOSOPHY.md b/docs/PHILOSOPHY.md
new file mode 100644
index 000000000..2bb866e4f
--- /dev/null
+++ b/docs/PHILOSOPHY.md
@@ -0,0 +1,353 @@
+# Building Persistent Agent Teams — Philosophy and Patterns
+
+This is the conceptual foundation for everything in this repository. Read this
+first. It explains the principles behind the tools, the failure modes they
+prevent, and how to apply these patterns to any agent framework — not just
+OpenClaw.
+
+The patterns documented here were discovered by running persistent AI agent
+teams on local hardware 24/7. Not simulated, not theoretical — three AI agents
+writing code, reviewing each other's work, and shipping software, supervised
+by a fourth. Every lesson in this repo was learned the hard way, often by the
+agents themselves.
+
+---
+
+## The Core Idea
+
+A persistent agent is not a chatbot. A chatbot processes a request and
+forgets. A persistent agent works across hours, days, and weeks. It
+accumulates knowledge, coordinates with other agents, and operates
+infrastructure it depends on.
+
+This changes everything about how you build and operate them.
+
+Chatbots fail gracefully — the conversation ends, the user starts over.
+Persistent agents fail silently — they drift from their role, corrupt their
+own configuration, exhaust their context window, or coordinate with other
+agents based on assumptions nobody verified. By the time you notice, the
+damage compounds.
+
+This repository is a methodology for preventing that. It includes a reference
+implementation using OpenClaw and vLLM, but the patterns apply to any
+framework: Claude Code, LangChain, AutoGPT, custom agents, or anything else
+that runs long enough to accumulate state.
+
+---
+
+## Five Pillars
+
+Every pattern in this repo maps to one of five principles. If you understand
+these, you understand the entire system.
+
+### 1. Identity — Agents need a constitution, not just a prompt
+
+A persistent agent's identity must survive session resets, context overflow,
+and the agent's own attempts to modify it. This means:
+
+- **Baselines** — A curated document defining who the agent is, what rules it
+  follows, and what tools it has. Stored above a `---` separator in MEMORY.md.
+  The agent can read it but must not modify it.
+- **Periodic resets** — Every few hours, scratch notes get archived and the
+  baseline is restored. The agent starts fresh but knows who it is.
+- **Operator control** — The human defines identity. The agent operates within
+  it. This separation is the most important architectural decision you'll make.
+
+Without identity preservation, agents drift. They accumulate stale context,
+adopt instructions from previous tasks, and gradually become something other
+than what you built. The drift is subtle — you won't notice for hours or days.
+
+**Deep dive:** [WRITING-BASELINES.md](../memory-shepherd/docs/WRITING-BASELINES.md)
+and [Memory Shepherd](../memory-shepherd/README.md)
+
+### 2. Knowledge — Three tiers of persistence, not one
+
+Agents need memory at three timescales, each with different durability:
+
+| Tier | What | Lifetime | Where |
+|------|------|----------|-------|
+| Scratch | Working notes for the current task | Hours (until next reset) | Below `---` in MEMORY.md |
+| Daily | What happened today, raw observations | Days to weeks | `memory/YYYY-MM-DD.md` |
+| Permanent | Architecture decisions, lessons learned | Forever | Project repo, baselines, docs |
+
+The critical insight: **nothing important should live only in scratch notes.**
+Agents must be taught to externalize knowledge upward through the tiers before
+a reset wipes their scratch space. Include an explanation of the memory system
+in the baseline itself — agents that understand their own memory lifecycle
+write better notes and preserve the right things.
+
+Without tiered persistence, agents either lose everything on reset (too
+aggressive) or accumulate unbounded state until they crash (too permissive).
+Three tiers give you the best of both: clean working memory AND durable
+learning.
+
+**Deep dive:** [OPERATIONAL-LESSONS.md](OPERATIONAL-LESSONS.md) (session and
+memory management section)
+
+### 3. Collaboration — Explicit protocols, not implicit coordination
+
+Multiple agents sharing a codebase will overwrite each other's work, amplify
+each other's assumptions, and celebrate phantom success — unless you give them
+explicit rules for coordination.
+
+The three protocols that matter:
+
+- **Branch-based review** — Code changes go through feature branches with
+  agent-identifiable prefixes. A separate agent reviews and merges. Docs and
+  status updates go direct to main.
+- **Heartbeat sync** — Every 15-60 minutes, each agent pulls latest, checks
+  for pending work and reviews, and updates its status. This prevents drift
+  between agents and catches handoffs that would otherwise sit idle.
+- **Echo chamber prevention** — When agents agree too fast, nobody is
+  verifying. The rule: one lead investigator for debugging, independent
+  verification before claiming success, pause when messages fly too fast.
+
+Without explicit protocols, multi-agent systems develop the same dysfunctions
+as human teams — except faster, because agents don't get tired and don't
+second-guess themselves.
+
+**Deep dive:** [MULTI-AGENT-PATTERNS.md](MULTI-AGENT-PATTERNS.md)
+
+### 4. Autonomy — Tiered permissions, not all-or-nothing trust
+
+Agents need to know exactly what they can do freely, what needs a second
+opinion, and what requires human approval. The tiers:
+
+| Tier | Rule | Examples |
+|------|------|---------|
+| 0 — Just do it | Low risk, high frequency | Read files, run tests, push to feature branches, write scratch notes |
+| 1 — Peer review | Medium risk | Config changes, new tools, research conclusions before sharing |
+| 2 — Escalate | High risk, irreversible | Production systems, external communications, spending money |
+
+The principle: **minimize the permission surface at each tier.** If an agent
+can do its job with Tier 0 permissions 90% of the time, it should rarely need
+to escalate. If it's constantly hitting Tier 2, either the tiers are wrong or
+the agent's role is too broad.
+
+Vague rules ("be careful") don't work. Specific rules do ("never push directly
+to main," "never modify another agent's MEMORY.md"). Write 5-7 hard
+boundaries, not 50 guidelines.
+
+**Deep dive:** [GUARDIAN.md](GUARDIAN.md) (autonomy tiers section)
+
+### 5. Observability — You can't manage what you can't measure
+
+Persistent agents fail in ways that are invisible without instrumentation:
+
+- Context fills silently until it overflows
+- Costs accumulate with no per-turn visibility
+- Agents get stuck but the process keeps running
+- Quality degrades gradually as context gets stale
+
+You need two kinds of monitoring:
+
+- **Cost monitoring** (Token Spy) — What are you spending? Per-turn, per-agent,
+  per-session. Catches retry loops, runaway costs, and dead agents (zero cost
+  on a cloud agent means it stopped working, not that it's efficient).
+- **Infrastructure monitoring** (Prometheus/Grafana) — Is the GPU overloaded?
+  Are services healthy? What's the queue depth? Catches resource contention,
+  service crashes, and capacity limits.
+
+These measure different things and will diverge. A local model shows $0 in
+Token Spy but heavy load in GPU metrics. A cached response shows reduced cost
+but no GPU activity. Both views are necessary.
+
+Add a **supervisor agent** — a meta-agent that monitors the team rather than
+doing work. It checks commit frequency, session health, and error patterns
+every 15 minutes, and sends the human a daily briefing. The supervisor is
+judgment, not volume — run it on the most capable model you have.
+
+**Deep dive:** [TOKEN-SPY.md](TOKEN-SPY.md), [OPERATIONAL-LESSONS.md](OPERATIONAL-LESSONS.md)
+(monitoring section), [MULTI-AGENT-PATTERNS.md](MULTI-AGENT-PATTERNS.md)
+(supervisor pattern)
+
+---
+
+## What Breaks and Why
+
+Every pattern in this repo exists because something broke. Here's the full
+failure taxonomy, organized by what goes wrong and what prevents it.
+
+### Session-Level Failures — Agents Stop Working
+
+| Failure | What Happens | Prevention |
+|---------|-------------|-----------|
+| Context overflow | Session grows until it exceeds the model's context window. Agent crashes with "prompt too large." | Session Watchdog monitors file size, transparently swaps in fresh sessions before overflow. |
+| Memory bloat | Scratch notes accumulate, degrading response quality. Agent confuses past and present tasks. | Memory Shepherd archives scratch notes and resets to baseline every 3 hours. |
+| Identity drift | Agent gradually shifts behavior as old context influences new decisions. Sometimes rewrites its own instructions. | Baseline separation (`---` contract). Operator-controlled identity above the line, agent scratch space below. |
+| Cost spike | Agent enters a retry loop or spawns too many cloud sub-agents. Burns through API budget. | Token Spy tracks per-turn cost. Auto-resets sessions that exceed character limits. |
+
+### Coordination Failures — Agents Fight Each Other
+
+| Failure | What Happens | Prevention |
+|---------|-------------|-----------|
+| Merge conflicts | Multiple agents modify the same files simultaneously. | Branch-based review protocol. Agent-prefixed branches. One merger. |
+| Echo chamber | Agents agree without verifying. Celebrate success when files don't exist. | One-lead rule. Independent verification. Pause on rapid-fire messages. |
+| State races | Two agents read the same status file, both claim the same task. | Heartbeat protocol with explicit claiming. STATUS.md as coordination point. |
+| Phantom completion | Agent reports "done" before verifying the work actually happened. | "Working tree state > status reports." Verify files exist, tests pass, commits landed. |
+
+### Infrastructure Failures — The System Breaks
+
+| Failure | What Happens | Prevention |
+|---------|-------------|-----------|
+| Service crash | Gateway, proxy, or vLLM goes down. Agent appears stuck. | Guardian watchdog monitors all services. 3-strike auto-recovery. |
+| Config corruption | Agent modifies a config it depends on. Silent failures. | Immutable files (`chattr +i`). Checksum validation. Guardian restores from backup. |
+| Self-sabotage | Agent kills its own gateway while debugging, or modifies the proxy it routes through. | Autonomy tiers. Self-modification rule: never hot-work your own infrastructure. |
+| GPU contention | Multiple agents flood the GPU with sub-agent requests. One agent's requests time out, session gets stuck. | Custom health checks detect timeout + cleared storm pattern. Guardian auto-restarts stuck gateways. |
+
+### Knowledge Failures — Lessons Get Lost
+
+| Failure | What Happens | Prevention |
+|---------|-------------|-----------|
+| Scratch notes wiped | Important findings live only in scratch space. Reset deletes them. | Three-tier persistence. Teach agents to externalize upward before resets. |
+| Baseline stale | Agent's role changes but baseline doesn't. Agent rediscovers context every reset. | Regular baseline review. If the agent keeps rediscovering the same thing, add it to the baseline. |
+| No shared learning | Each agent discovers the same problems independently. No collective memory. | Shared lessons file (append-only, date-attributed). Daily logs distilled into permanent knowledge. |
+
+---
+
+## Reading Map
+
+### "I'm building a single persistent agent"
+
+Start here, then read in order:
+
+1. [WRITING-BASELINES.md](../memory-shepherd/docs/WRITING-BASELINES.md) —
+   How to define your agent's identity
+2. [Memory Shepherd README](../memory-shepherd/README.md) — How memory resets work
+3. [SETUP.md](SETUP.md) — Getting the infrastructure running (OpenClaw-specific)
+4. [OPERATIONAL-LESSONS.md](OPERATIONAL-LESSONS.md) — What will go wrong
+   and how to fix it
+
+### "I'm running multiple agents together"
+
+Read the single-agent path first, then:
+
+5. [MULTI-AGENT-PATTERNS.md](MULTI-AGENT-PATTERNS.md) — Coordination,
+   swarms, redundancy, the supervisor pattern
+6. [GUARDIAN.md](GUARDIAN.md) — Infrastructure protection and autonomy tiers
+
+### "I want to understand the theory without building anything"
+
+Read this document, then:
+
+1. [WRITING-BASELINES.md](../memory-shepherd/docs/WRITING-BASELINES.md) —
+   The deepest treatment of identity and memory
+2. [MULTI-AGENT-PATTERNS.md](MULTI-AGENT-PATTERNS.md) — Coordination
+   theory, reliability math, and failure modes
+3. [GUARDIAN.md](GUARDIAN.md) — The philosophy of infrastructure
+   self-defense
+
+### "I need to fix something that's broken right now"
+
+Go to the [failure taxonomy](#what-breaks-and-why) above. Find your symptom.
+Follow the link to the prevention strategy.
+
+---
+
+## Using These Patterns With Other Frameworks
+
+About 70% of this repo is framework-agnostic. Here's what applies where.
+
+### Universal Patterns (any agent framework)
+
+These patterns work regardless of whether you use OpenClaw, Claude Code,
+LangChain, AutoGPT, or custom agents:
+
+| Pattern | Core Idea | Adapt By |
+|---------|-----------|----------|
+| **Baseline identity** | Agent has a constitution that survives resets | Store in system prompt, config file, or persistent state — whatever your framework reads on startup |
+| **Memory tiers** | Scratch / daily / permanent knowledge layers | Implement file-based persistence at each tier. The storage mechanism doesn't matter; the separation does. |
+| **Autonomy tiers** | Tiered permissions (do freely / peer review / escalate) | Encode in the agent's system prompt or baseline. No framework support needed — it's behavioral. |
+| **Branch-based review** | Code changes through feature branches with review gates | Use Git conventions. Framework-independent. |
+| **Heartbeat protocol** | Periodic sync, status checks, handoff detection | Implement as a cron job, scheduled task, or supervisor loop |
+| **Echo chamber prevention** | One lead, independent verification, pause on rapid-fire | Behavioral rules in agent baselines. No code required. |
+| **Supervisor agent** | Meta-agent that monitors team health and briefs the human | Any agent that can read logs, check file sizes, and send messages |
+| **Redundancy math** | Spawn 2 agents, take first success: 67% → 95% reliability | Any system that can run parallel tasks |
+| **Task templates** | Numbered steps, absolute paths, stop prompts | Universal prompt engineering. Works with any LLM. |
+| **Guardian / watchdog** | Immutable process that monitors and auto-recovers services | Bash script + systemd (or equivalent). Framework-independent. |
+| **Failure taxonomy** | Categorized failure modes with mapped preventions | Apply the categories to your system. The failures are universal. |
+
+### OpenClaw / vLLM Specific
+
+These solve problems unique to the OpenClaw + vLLM stack:
+
+| Component | What It Solves | Equivalent In Other Frameworks |
+|-----------|---------------|-------------------------------|
+| Tool proxy | OpenClaw streams; vLLM needs non-streaming for tool extraction | Not needed if your framework handles tool calling natively |
+| Session watchdog | Monitors `.jsonl` session files for size | Adapt to your framework's session storage format |
+| Compat block | Prevents OpenClaw from sending params vLLM rejects | Not needed for cloud APIs or frameworks with native vLLM support |
+| Token Spy | Transparent reverse proxy for API cost monitoring | Works with any OpenAI-compatible or Anthropic client — already portable |
+
+### Translation Guide
+
+**Claude Code agents:** Store baselines in CLAUDE.md or a persistent context
+file. Use the autonomy tiers as behavioral instructions in the system prompt.
+Run Memory Shepherd against whatever file your agent uses for persistent state.
+Token Spy already works with Claude's API via `ANTHROPIC_BASE_URL`.
+
+**LangChain / LlamaIndex agents:** Store baselines in the agent's
+initialization config. Implement the three-tier memory pattern using the
+framework's memory modules. The coordination protocols (branches, heartbeat)
+are Git-level, not framework-level.
+
+**Custom Python agents:** Store baselines in a config file loaded at startup.
+Implement Memory Shepherd's reset cycle as a function that reads the file,
+splits on `---`, archives the bottom, restores the top. The Guardian is a
+standalone bash script — it doesn't care what framework the agents use.
+
+---
+
+## How This Repo Is Organized
+
+```
+PHILOSOPHY.md (you are here)
+  │
+  ├── Identity & Memory
+  │     ├── memory-shepherd/README.md        — How memory resets work
+  │     └── memory-shepherd/docs/
+  │           └── WRITING-BASELINES.md       — How to define agent identity
+  │
+  ├── Coordination & Operations
+  │     ├── MULTI-AGENT-PATTERNS.md          — Sync, swarms, supervisor, reliability
+  │     └── OPERATIONAL-LESSONS.md           — Battle-tested lessons from 24/7 ops
+  │
+  ├── Infrastructure & Safety
+  │     └── GUARDIAN.md                      — Watchdogs, autonomy tiers, protection
+  │
+  └── Reference Implementation (OpenClaw + vLLM)
+        ├── README.md                        — Toolkit overview and quick start
+        ├── ARCHITECTURE.md                  — How OpenClaw talks to vLLM
+        ├── SETUP.md                         — Step-by-step local deployment
+        └── TOKEN-SPY.md                     — Cost monitoring setup and API
+```
+
+The top three sections are framework-agnostic. The reference implementation
+section is OpenClaw-specific but demonstrates the patterns concretely.
+
+---
+
+## The Meta-Lesson
+
+The single most important pattern in this repo isn't a tool or a script. It's
+this: **agents will find every edge case you didn't think of, and the only
+reliable documentation is the post-mortems they write after hitting those
+edges.**
+
+Build a shared lessons file. Make it append-only. Date every entry. Have
+agents write to it when they discover something the hard way. Review it
+periodically and promote the important lessons into baselines and permanent
+docs.
+
+The tools in this repo prevent the catastrophic failures. The lessons file
+captures everything else. Together, they compound — each week, the system gets
+a little more robust, the baselines get a little more complete, and the agents
+get a little more reliable.
+
+That's the real goal: not a system that never fails, but a system that learns
+from every failure and gets better.
+
+---
+
+Built from production experience by [Lightheart Labs](https://github.com/Light-Heart-Labs)
+and their AI agent team. The patterns were discovered by the agents. The docs
+were written by the agents. The lessons were learned the hard way.