Commit c63b1d5

Merge pull request #8 from Light-Heart-Labs/add-operational-guides
Add operational guides: lessons, multi-agent patterns, infrastructure…
2 parents 5258e40 + 52a0158 commit c63b1d5

5 files changed

Lines changed: 853 additions & 1 deletion

README.md

Lines changed: 7 additions & 1 deletion
@@ -54,6 +54,9 @@ Supports tiered health checks (port listening, HTTP endpoints, custom commands,
 ### Architecture Docs
 Deep-dive documentation on how OpenClaw talks to vLLM, why the proxy exists, how session files work, and the five failure points that kill local setups.
 
+### Operational Guides
+Lessons learned from running agents 24/7, multi-agent coordination patterns, and infrastructure protection strategies, all discovered by persistent agents running on local hardware. See the [docs/](docs/) directory.
+
 ---
 
 ## The Bigger Picture
@@ -349,7 +352,10 @@ LightHeart-OpenClaw/
 ├── docs/
 │   ├── SETUP.md                   # Full local setup guide
 │   ├── ARCHITECTURE.md            # How it all fits together
-│   └── TOKEN-SPY.md               # Token Spy setup & API reference
+│   ├── TOKEN-SPY.md               # Token Spy setup & API reference
+│   ├── OPERATIONAL-LESSONS.md     # Hard-won lessons from 24/7 agent ops
+│   ├── MULTI-AGENT-PATTERNS.md    # Coordination, swarms, and reliability
+│   └── GUARDIAN.md                # Infrastructure protection & autonomy tiers
 └── LICENSE
 ```

docs/GUARDIAN.md

Lines changed: 253 additions & 0 deletions
@@ -0,0 +1,253 @@
# Infrastructure Protection: Guardians, Autonomy Tiers, and Safety Nets

Agents with filesystem access and shell execution can, and will, break their
own infrastructure. This doc covers patterns for preventing that: immutable
watchdogs, explicit permission tiers, and the self-modification problem.

These patterns complement the session-level protections (session watchdog,
Memory Shepherd) with system-level protections. Session tools keep agents
*running*; these patterns keep agents from *breaking what they run on*.

---

## The Problem

Persistent agents with tool access can:

- Kill their own gateway process while debugging something else
- Modify configs they depend on (proxy, vLLM, systemd services)
- Fill disks with log output or generated files
- Restart services during active sessions, losing state
- Overwrite their own baseline files (the ones Memory Shepherd restores from)

These aren't hypothetical. They happen when agents are resourceful, which is
exactly the behavior you want, applied to the wrong target.

---

## The Guardian Pattern

A guardian is a watchdog process that monitors critical infrastructure and
auto-recovers from failures. The key property: **agents cannot modify or
disable it.**

### Design Principles

1. **Runs as root** (or a privileged user the agent can't impersonate)
2. **Immutable**: `chattr +i` on the script file prevents modification
3. **Self-healing**: re-sets its own immutable flag if it is cleared
4. **Tiered monitoring**: not everything is equally critical
5. **Conservative recovery**: soft restart first, restore from backup only
   after repeated failures

### What It Monitors

Organize monitored resources into tiers by criticality:

| Tier | Resources | Recovery |
|---|---|---|
| 1 (Critical) | Agent gateway, API proxy | Auto-restart, alert on failure |
| 2 (Important) | vLLM inference, tool proxy | Auto-restart with backoff |
| 3 (Supporting) | Token Spy, monitoring, cron | Auto-restart, no alert |
| 4 (Auxiliary) | Log rotation, cleanup timers | Log warning, skip |
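
One way to make the tier table machine-readable is a small lookup inside the guardian script. The sketch below is illustrative only: the service names (`agent-gateway`, `api-proxy`, and so on) and the action labels are placeholders invented for this example, not names taken from the project.

```bash
# Hypothetical tier map. Service names are placeholders; substitute your own units.
tier_for_service() {
    case "$1" in
        agent-gateway|api-proxy)    echo 1 ;;   # Critical
        vllm|tool-proxy)            echo 2 ;;   # Important
        token-spy|monitoring|cron)  echo 3 ;;   # Supporting
        logrotate|cleanup-timer)    echo 4 ;;   # Auxiliary
        *)                          return 1 ;; # Unknown service: caller decides
    esac
}

# Map a tier number to a recovery action label (labels are made up for this sketch).
recovery_for_tier() {
    case "$1" in
        1) echo "restart-and-alert" ;;
        2) echo "restart-with-backoff" ;;
        3) echo "restart-quietly" ;;
        4) echo "log-warning" ;;
        *) return 1 ;;
    esac
}
```

For example, `recovery_for_tier "$(tier_for_service vllm)"` prints `restart-with-backoff`. Keeping the map in one place means a new service only needs one line to get a recovery policy.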

### Recovery Strategy

```
Service fails
  → Soft restart (systemctl restart)
  → Wait 10s, check again
  → If still down: soft restart #2
  → Wait 10s, check again
  → If still down: soft restart #3
  → Wait 10s, check again
  → If still down after 3 attempts: restore config from backup, restart
  → Alert operator
```

Three-strike recovery prevents thrashing while giving transient failures a
chance to self-resolve.
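
The flow above can be sketched as a shell function. This is a minimal sketch under stated assumptions, not the project's `guardian.sh`: the hook names (`check_service`, `restart_service`, `restore_config`, `alert_operator`) and the `GUARDIAN_*` variables are hypothetical, and the default hooks assume systemd-managed services.

```bash
# Tunables (hypothetical names); defaults mirror the flow described above.
GUARDIAN_WAIT_SECS="${GUARDIAN_WAIT_SECS:-10}"
GUARDIAN_MAX_RESTARTS="${GUARDIAN_MAX_RESTARTS:-3}"

# Default hooks. Override these functions to adapt to your environment.
check_service()   { systemctl is-active --quiet "$1"; }
restart_service() { systemctl restart "$1"; }
restore_config()  { echo "restore_config: no restore step defined for $1" >&2; return 1; }
alert_operator()  { echo "ALERT: $*" >&2; }

# recover_service NAME
# Tries up to GUARDIAN_MAX_RESTARTS soft restarts, then restores config and alerts.
# Returns 0 if the service is healthy at the end, non-zero otherwise.
recover_service() {
    svc="$1"
    attempt=1
    while [ "$attempt" -le "$GUARDIAN_MAX_RESTARTS" ]; do
        restart_service "$svc"
        sleep "$GUARDIAN_WAIT_SECS"
        if check_service "$svc"; then
            return 0
        fi
        attempt=$((attempt + 1))
    done
    # All soft restarts failed: restore a known-good config, restart once more,
    # and tell a human either way.
    restore_config "$svc" && restart_service "$svc"
    alert_operator "$svc needed a config restore after $GUARDIAN_MAX_RESTARTS failed restarts"
    check_service "$svc"
}
```

Keeping the check, restart, and restore steps behind small functions makes the loop easy to exercise with stubs before pointing it at real services.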

### Snapshot Generations

Keep N generations of known-good configs (default: 5):

```
/var/lib/guardian/backups/
├── gen-1/    # Most recent known-good
├── gen-2/
├── gen-3/
├── gen-4/
└── gen-5/    # Oldest
```

When a service can't recover after 3 restarts, restore from the most recent
generation. If that fails, try the next. This handles the case where the
most recent "good" config was actually the one that broke things.
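
The fallback through generations could look roughly like the function below. It is a sketch, assuming each `gen-N/` directory holds a full copy of the config files and that the caller supplies a validation command. The function name, the `GUARDIAN_*` variables, and the validation hook are hypothetical.

```bash
GUARDIAN_BACKUP_DIR="${GUARDIAN_BACKUP_DIR:-/var/lib/guardian/backups}"
GUARDIAN_GENERATIONS="${GUARDIAN_GENERATIONS:-5}"

# restore_from_generations TARGET_DIR VALIDATE_CMD
# Copies gen-1, gen-2, ... into TARGET_DIR until VALIDATE_CMD TARGET_DIR succeeds.
# Prints the generation that worked; returns 1 if none did.
restore_from_generations() {
    target="$1"
    validate="$2"
    gen=1
    while [ "$gen" -le "$GUARDIAN_GENERATIONS" ]; do
        src="$GUARDIAN_BACKUP_DIR/gen-$gen"
        if [ -d "$src" ]; then
            # "src/." copies the directory contents, including dotfiles.
            cp -a "$src/." "$target/"
            if "$validate" "$target"; then
                echo "gen-$gen"
                return 0
            fi
        fi
        gen=$((gen + 1))
    done
    return 1
}
```

A validation command here might restart the service and run its health check. Without some check, the function cannot tell a broken generation from a good one.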

### Making It Immutable

```bash
# Install the guardian script
sudo cp guardian.sh /usr/local/bin/guardian.sh
sudo chmod 755 /usr/local/bin/guardian.sh

# Make it immutable: even root must explicitly remove this flag before editing
sudo chattr +i /usr/local/bin/guardian.sh

# Verify
lsattr /usr/local/bin/guardian.sh
# Should show the "i" flag, e.g.: ----i--------e-- /usr/local/bin/guardian.sh
# (the other attribute columns vary by filesystem and e2fsprogs version)
```

**Why immutable matters:** An agent with sudo access or a root shell can `rm` or
edit any ordinary file. `chattr +i` blocks modification and deletion even by
root until the flag is explicitly removed. The guardian re-sets its own flag on
startup and at each check cycle, so even if an agent manages to clear it, the
next cycle restores it.
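
A self-healing check along these lines is one way to re-apply the flag. The function names are hypothetical and this is a sketch rather than the guardian's actual code. It assumes `lsattr` and `chattr` from e2fsprogs and a filesystem that supports the immutable attribute.

```bash
# has_immutable_flag LSATTR_LINE
# Returns 0 if the attribute column (first field) of an lsattr line contains "i".
has_immutable_flag() {
    attrs="${1%% *}"   # keep everything before the first space
    case "$attrs" in
        *i*) return 0 ;;
        *)   return 1 ;;
    esac
}

# ensure_immutable FILE
# Re-applies the immutable flag if it is missing. Needs root to succeed.
ensure_immutable() {
    file="$1"
    if ! has_immutable_flag "$(lsattr -d "$file")"; then
        chattr +i "$file"
    fi
}
```

Calling `ensure_immutable /usr/local/bin/guardian.sh` at the top of each check cycle restores a cleared flag. It does not help if the script was already edited while the flag was off, which is why the checksum and backup layers still matter.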

To update the guardian:
```bash
sudo chattr -i /usr/local/bin/guardian.sh
sudo cp guardian-v2.sh /usr/local/bin/guardian.sh
sudo chattr +i /usr/local/bin/guardian.sh
```

### Systemd Integration

```ini
[Unit]
Description=Infrastructure Guardian
After=network.target

[Service]
Type=simple
ExecStart=/usr/local/bin/guardian.sh
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

`Restart=always` makes systemd restart the guardian if it exits or is killed;
`RestartSec=10` waits 10 seconds before each restart.

---

## Autonomy Tiers

Tell agents explicitly what they can and can't do. The most effective pattern
is a tiered system rather than a flat list of rules.

### The Tiers

| Tier | Label | Examples | Rationale |
|---|---|---|---|
| 0 | **Just do it** | Read files, run tests, draft PRs, push to feature branches, research, claim work, update scratch notes | Low risk, high frequency. Asking permission for these wastes cycles. |
| 1 | **Peer review** | Config changes to local services, new tools before deploy, research conclusions before sharing | Medium risk. Another agent or a quick human check prevents mistakes. |
| 2 | **Escalate** | Production systems, external communications, spending money, irreversible actions, OpenClaw/vLLM config changes | High risk. Always requires human approval. |

### Implementing Tiers in Baselines

Add autonomy tiers to your agent's baseline (see
[WRITING-BASELINES.md](../memory-shepherd/docs/WRITING-BASELINES.md)):

```markdown
## Autonomy Tiers

**Tier 0 (Just do it):** Chat, research, experiments, pushes to feature
branches, test runs, claiming work, opinions, scratch notes.

**Tier 1 (Peer review):** Config changes, new tools, research
conclusions. Get a review from [reviewer agent] or a human.

**Tier 2 (Escalate):** Production infrastructure, external comms,
money, anything irreversible. Always ask [human operator].
```

The key is making tiers concrete with examples. "Be careful with production"
is Tier 2 phrased vaguely. "Never touch the production database without
explicit approval from the operator" is Tier 2 phrased usefully.

### The Self-Modification Rule

If an agent's change touches its **own** infrastructure, it must not apply
that change directly. Instead:

1. Spawn a dev environment (separate machine, container, or branch)
2. Make changes there
3. Test and validate
4. Promote to production only after verification

**Why:** An agent that modifies the gateway it runs on can crash itself
mid-operation. There's no recovery from "I broke the thing that runs me."

This is the production hot-work lesson (see
[OPERATIONAL-LESSONS.md](OPERATIONAL-LESSONS.md)) formalized as a rule.
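
As a rough illustration of the stage, validate, promote sequence for a single config file, consider the sketch below. It uses a temporary directory as a stand-in for the dev environment, which is a deliberate simplification: a separate machine or container gives much stronger isolation. The function name and the validation hook are hypothetical.

```bash
# promote_config CANDIDATE LIVE VALIDATE_CMD
# Copies CANDIDATE into a staging directory, runs VALIDATE_CMD on the staged copy,
# and only replaces LIVE if validation passes.
promote_config() {
    candidate="$1"
    live="$2"
    validate="$3"
    staging="$(mktemp -d)" || return 1
    cp "$candidate" "$staging/candidate" || { rm -rf "$staging"; return 1; }
    if ! "$validate" "$staging/candidate"; then
        echo "validation failed; live config left untouched" >&2
        rm -rf "$staging"
        return 1
    fi
    # Write next to the live file, then rename, so readers never see a partial file.
    cp "$staging/candidate" "$live.new" && mv "$live.new" "$live"
    status=$?
    rm -rf "$staging"
    return "$status"
}
```

The point of the shape is that the live file is only touched on the last step, after the candidate has already passed whatever checks the operator trusts.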

---

## Baseline Integrity Protection

Memory Shepherd's baseline files are critical: they define who each agent is
after every reset. If baselines get corrupted, agents get bad resets.

### Immutable Baselines

```bash
# Lock baseline files
sudo chattr +i memory-shepherd/baselines/*.md

# To update, temporarily unlock
sudo chattr -i memory-shepherd/baselines/my-agent-MEMORY.md
vim memory-shepherd/baselines/my-agent-MEMORY.md
sudo chattr +i memory-shepherd/baselines/my-agent-MEMORY.md
```

### Checksum Validation

```bash
# Generate checksums after writing baselines
sha256sum memory-shepherd/baselines/*.md > memory-shepherd/baselines/.checksums

# Verify before each reset (run from the directory the checksums were generated in)
sha256sum --check memory-shepherd/baselines/.checksums || echo "TAMPERING DETECTED"
```

Add the checksum verification to the Memory Shepherd workflow or as a
pre-reset hook, and abort the reset when verification fails rather than only
printing a warning. The `*.md` glob used for locking does not match
`.checksums`, so lock that file as well or an agent can simply regenerate it.
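
A pre-reset hook could wrap the verification so that a mismatch blocks the reset. The sketch below is one possible shape, with hypothetical function names. Unlike the commands above, it stores file names relative to the baselines directory, so it does not depend on the caller's working directory.

```bash
# write_checksums DIR
# Records SHA-256 sums of DIR/*.md in DIR/.checksums, using names relative to DIR.
write_checksums() {
    ( cd "$1" && sha256sum -- *.md > .checksums )
}

# verify_baselines DIR
# Returns 0 only if every file listed in DIR/.checksums still matches.
# Returns 2 if the checksum file is missing.
verify_baselines() {
    dir="$1"
    if [ ! -f "$dir/.checksums" ]; then
        echo "no checksum file in $dir" >&2
        return 2
    fi
    # --quiet suppresses the per-file OK lines; failures are still reported.
    ( cd "$dir" && sha256sum --check --quiet .checksums )
}
```

A reset script could then start with `verify_baselines memory-shepherd/baselines || exit 1`. Note that this only covers files listed in `.checksums`; a baseline added later is unchecked until the sums are regenerated.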

### Version Control

Keep baselines in version control with the rest of the repo. This gives you:
- Full change history (who changed what, when)
- Rollback capability (`git checkout <hash> -- baselines/`)
- Diff visibility (`git diff` shows exactly what changed)
- Branch-based review for baseline updates

---

## Combining Protections

The full protection stack, from session level to system level:

```
Session Level (keeps agents running):
├── Session Watchdog: prevents context overflow crashes
├── Token Spy: monitors cost, auto-resets bloated sessions
└── Memory Shepherd: resets memory to baseline, prevents drift

System Level (keeps infrastructure intact):
├── Guardian: monitors services, auto-recovers failures
├── Autonomy Tiers: explicit permission boundaries
├── Baseline Integrity: immutable + checksummed identity files
└── Self-Modification Rule: never hot-work your own infrastructure
```

Session tools are documented in the main [README](../README.md),
[TOKEN-SPY.md](TOKEN-SPY.md), and [memory-shepherd/README.md](../memory-shepherd/README.md).
This doc covers the system-level complement.

**The goal is defense in depth.** No single protection catches everything.
The session watchdog catches context overflow but not infrastructure damage.
The guardian catches service failures but not identity drift. Together, they
cover the full failure surface of persistent agents.
