Commit 27aa87f
Merge pull request #4 from Light-Heart-Labs/add-guardian
Add Guardian — self-healing process watchdog for LLM infrastructure
2 parents 2f171b6 + c590c2f commit 27aa87f

8 files changed

Lines changed: 1834 additions & 0 deletions

File tree

README.md

Lines changed: 13 additions & 0 deletions
@@ -34,6 +34,11 @@ Periodic memory reset for persistent LLM agents. Agents accumulate scratch notes

Defines a `---` separator convention: everything above is operator-controlled identity (rules, capabilities, pointers), everything below is agent scratch space that gets archived and cleared.

### Guardian

Self-healing process watchdog for LLM infrastructure. Runs as a root systemd service that agents cannot kill or modify. Monitors processes, systemd services, Docker containers, and file integrity — automatically restoring from known-good backups when things break.

Supports tiered health checks (port listening, HTTP endpoints, custom commands, JSON validation), a recovery cascade (soft restart → backup restore → restart), generational backups with immutable flags, and restart delegation chains. Everything is config-driven via an INI file.

### Architecture Docs

Deep-dive documentation on how OpenClaw talks to vLLM, why the proxy exists, how session files work, and the five failure points that kill local setups.

@@ -247,6 +252,14 @@ LightHeart-OpenClaw/
│   ├── baselines/                 # Baseline MEMORY.md templates
│   └── docs/
│       └── WRITING-BASELINES.md   # Guide to writing effective baselines
├── guardian/                      # Self-healing process watchdog
│   ├── guardian.sh                # Config-driven watchdog script
│   ├── guardian.conf.example      # Sanitized example config
│   ├── guardian.service           # Systemd unit template
│   ├── install.sh                 # Installer (systemd + immutable flags)
│   ├── uninstall.sh               # Uninstaller
│   └── docs/
│       └── HEALTH-CHECKS.md       # Health check & recovery reference
├── docs/
│   ├── SETUP.md                   # Full local setup guide
│   ├── ARCHITECTURE.md            # How it all fits together

guardian/README.md

Lines changed: 384 additions & 0 deletions
Large diffs are not rendered by default.

guardian/docs/HEALTH-CHECKS.md

Lines changed: 347 additions & 0 deletions
@@ -0,0 +1,347 @@
# Health Checks Reference

How Guardian monitors resources and recovers from failures.

---

## The Tier System

Organize your config sections by dependency criticality. Guardian checks all sections every cycle, but tiers help you reason about what matters most:

| Tier | Purpose | Example |
|------|---------|---------|
| **Tier 1** | Agent lifeline — agent is dead without these | Gateways, core services |
| **Tier 2** | API routing — agent can't think without these | Proxies, API servers |
| **Tier 3** | Memory & context — agent loses state without these | Databases, config files |
| **Tier 4** | Helpers — degraded but alive without these | Monitoring, bots, utilities |

Guardian doesn't enforce tiers — they're an organizational convention. Every enabled section gets checked every cycle regardless of tier.

---

## Monitoring Types

### `process`

Matches a running process by command-line pattern (via `pgrep -f`). If the process isn't found, Guardian restarts it using `start_cmd`.

```ini
[my-api-server]
enabled=true
type=process
description=My API Server
process_match=uvicorn main:app.*8080
start_cmd=start.sh
start_dir=/opt/myapp
start_user=deploy
start_venv=/opt/myapp/venv
required_ports=8080
health_url=http://127.0.0.1:8080/health
protected_files=/opt/myapp/main.py,/opt/myapp/start.sh
max_soft_restarts=3
restart_grace=10
```

**Recovery flow:**

1. Kill existing process (`pkill -f`, then `pkill -9` after 2s)
2. Free required ports (`fuser -k`)
3. Activate virtualenv if `start_venv` is set
4. `cd` to `start_dir` if set
5. Run `start_cmd` as `start_user` via `nohup` (`.py` files use `python3`, others use `bash`)
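Steps 3-5 can be sketched as a small command builder. This is a hypothetical dry-run illustration (the `build_start_cmd` name and exact quoting are ours, not Guardian's; the real script executes the command instead of printing it):

```shell
#!/usr/bin/env bash
# Hypothetical sketch: assemble the launch command from a section's
# start_* keys. Guardian would execute this; here we just print it.
build_start_cmd() {
  local start_cmd="$1" start_user="$2" start_dir="$3" start_venv="$4"
  local prefix=""
  [ -n "$start_venv" ] && prefix="source $start_venv/bin/activate && "
  [ -n "$start_dir" ] && prefix="cd $start_dir && $prefix"
  local runner="bash"                        # .py files run under python3
  case "$start_cmd" in *.py) runner="python3" ;; esac
  echo "sudo -u $start_user nohup bash -c '${prefix}${runner} ${start_cmd}' &"
}

build_start_cmd "start.sh" "deploy" "/opt/myapp" "/opt/myapp/venv"
```

Note how `start_dir` and `start_venv` compose: the `cd` runs first, so relative paths inside `start_cmd` resolve against the working directory.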

### `systemd-user`

Monitors a systemd user service via `systemctl --user is-active`. Restarts via `systemctl --user restart`.

```ini
[my-gateway]
enabled=true
type=systemd-user
description=OpenClaw Gateway
service=openclaw-gateway.service
systemd_user=deploy
required_ports=3100
health_port=3100
protected_files=/home/deploy/.openclaw/openclaw.json
protected_service=/home/deploy/.config/systemd/user/openclaw-gateway.service
max_soft_restarts=3
restart_grace=15
skip_immutable=true
```

**Important:** `systemd_user` is required — Guardian needs to know which user's systemd instance to interact with.

### `docker`

Monitors a Docker container via `docker inspect`. Restarts via `docker restart`.

```ini
[my-database]
enabled=true
type=docker
description=Vector Database
container_name=qdrant-prod
required_ports=6333
health_url=http://127.0.0.1:6333/healthz
max_soft_restarts=3
restart_grace=15
```

### `file-integrity`

Validates that files exist, are non-empty, and optionally contain valid JSON with required keys and values. Recovery restores from backup.

```ini
[my-settings]
enabled=true
type=file-integrity
description=Application Settings
protected_files=/opt/myapp/settings.json,/opt/myapp/config.yaml
required_json_keys=session_limit,poll_interval
required_json_values=database.host=localhost
max_soft_restarts=1
restart_grace=5
```

---

## Config Key Reference

### Common Keys (all types)

| Key | Required | Default | Description |
|-----|----------|---------|-------------|
| `enabled` | yes | `false` | Must be `true` to monitor this section |
| `type` | yes | | `process`, `systemd-user`, `docker`, or `file-integrity` |
| `description` | no | section name | Human-readable description for logs |
| `max_soft_restarts` | no | `3` | Soft restarts before falling back to backup restore |
| `restart_grace` | no | `10` | Seconds to wait after a restart before rechecking |
| `restart_via` | no | | Delegate restart to another section (by section name) |
| `protected_files` | no | | Comma-separated file paths to back up and monitor |
| `protected_service` | no | | Systemd service file path to back up |
| `skip_immutable` | no | `false` | If `true`, don't set immutable flag on restored files |
| `required_ports` | no | | Comma-separated ports that must be listening |
| `health_url` | no | | URL that must return HTTP 2xx |
| `health_port` | no | | Port that must be listening (simpler than `required_ports`) |
| `health_cmd` | no | | Shell command that must exit 0 |

### Process-Specific Keys

| Key | Required | Default | Description |
|-----|----------|---------|-------------|
| `process_match` | yes | | Pattern for `pgrep -f` matching |
| `start_cmd` | yes | | Script/binary to start (`.py` → python3, else bash) |
| `start_dir` | no | | Working directory for `start_cmd` |
| `start_user` | no | `$(whoami)` | Unix user to run as. **Warning:** defaults to `root` when Guardian runs as a systemd service — always set this explicitly. |
| `start_venv` | no | | Python virtualenv path to activate before starting |

### Systemd-User Keys

| Key | Required | Default | Description |
|-----|----------|---------|-------------|
| `service` | yes | | Systemd service unit name |
| `systemd_user` | yes | | Unix user whose systemd instance to manage |

### Docker Keys

| Key | Required | Default | Description |
|-----|----------|---------|-------------|
| `container_name` | yes | | Docker container name |

### File-Integrity Keys

| Key | Required | Default | Description |
|-----|----------|---------|-------------|
| `required_json_keys` | no | | Comma-separated top-level keys that must exist in JSON files |
| `required_json_values` | no | | Comma-separated `path=value` pairs to validate in JSON files |

---

## The Recovery Cascade

When Guardian detects an unhealthy resource, it follows this sequence:

```
UNHEALTHY detected
│
├─ failure_count <= max_soft_restarts?
│    └─ YES → Soft restart (just restart the service/process)
│             Wait restart_grace seconds
│
└─ NO (soft restarts exhausted)
     └─ Restore protected_files from backup
        Restart the service/process
        Wait restart_grace seconds
        Reset failure counter to 0
```

**Failure tracking** is persistent across cycles. Each section gets a failure counter file in `/var/lib/guardian/state/`. When a resource recovers (detected healthy after being unhealthy), the counter resets to 0.
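The counter logic can be sketched as follows, assuming one plain-text counter file per section (the function names and `*.failures` suffix are illustrative, not Guardian's actual internals):

```shell
# Hypothetical sketch of per-section failure tracking. STATE_DIR and the
# *.failures naming are assumptions for illustration.
STATE_DIR="${STATE_DIR:-/var/lib/guardian/state}"

get_failures()   { cat "$STATE_DIR/$1.failures" 2>/dev/null || echo 0; }
record_failure() { echo "$(( $(get_failures "$1") + 1 ))" > "$STATE_DIR/$1.failures"; }
reset_failures() { echo 0 > "$STATE_DIR/$1.failures"; }

# Soft restart while the count is within budget, else restore from backup.
decide_action() {
  local section="$1" max="${2:-3}"
  if [ "$(get_failures "$section")" -le "$max" ]; then
    echo "soft-restart"
  else
    echo "restore-from-backup"
  fi
}
```

Because the counter lives on disk rather than in memory, the cascade survives Guardian itself being restarted mid-incident.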

### Example timeline for `process` / `systemd-user` / `docker`

With `max_soft_restarts=3`:

| Cycle | Status | Action |
|-------|--------|--------|
| 1 | Unhealthy | Soft restart #1 (just restart the service) |
| 2 | Unhealthy | Soft restart #2 |
| 3 | Unhealthy | Soft restart #3 |
| 4 | Unhealthy | Restore files from backup, then restart service |
| 5 | Healthy | Reset counter, take snapshot |

### How `file-integrity` differs

For `file-integrity` sections, the "restart" action IS a backup restore (there's no process to restart — the recovery is restoring the file). This means the soft restart and the backup-restore fallback both do the same thing: restore from backup.

In practice, use `max_soft_restarts=0` for file-integrity sections. This skips the redundant "soft restart" phase and goes straight to restore:

```ini
[my-config-files]
type=file-integrity
max_soft_restarts=0  # Go straight to restore — no process to "soft restart"
```

If you set `max_soft_restarts=3` on a file-integrity section, Guardian will restore from backup 3 times (one per cycle), then restore again on cycle 4. It's not harmful, but the extra attempts are redundant.

If the file-integrity section has a `restart_via` pointing to a process or service section, the cascade makes more sense — the soft restart phase restores the file and then restarts the delegated service:

```ini
[my-pinned-config]
type=file-integrity
protected_files=/opt/myapp/config.json
required_json_values=model.primary=gpt-4
restart_via=my-api-server  # After restoring the file, restart this service
max_soft_restarts=0
```

---

## File Integrity and JSON Validation

The `file-integrity` type goes beyond checking that files exist. It can validate JSON structure:

### `required_json_keys`

Checks that top-level keys exist in all `.json` files in `protected_files`:

```ini
required_json_keys=session_limit,poll_interval,agents
```

Guardian will flag files that are missing any of these keys.

### `required_json_values`

Pins specific values at dot-notation paths:

```ini
required_json_values=agents.defaults.model.primary=gpt-4,database.port=5432
```

Guardian traverses nested objects using the dot-separated path. If the actual value doesn't match the expected value, the file is considered corrupted and will be restored from backup.

This is useful for preventing agents from changing their own model configuration or other critical settings.
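A dot-path lookup like this can be sketched with a short `python3` helper. This is a hedged illustration: the `json_value_at` name is ours, and Guardian's actual parsing may differ.

```shell
# Hypothetical sketch: print the value at a dot-separated path inside a
# JSON file, so a caller can compare it against the pinned value.
json_value_at() {
  python3 - "$1" "$2" <<'PY'
import json, sys
data = json.load(open(sys.argv[1]))
for key in sys.argv[2].split("."):
    data = data[key]
print(data)
PY
}

# A mismatch would mark the file corrupted and trigger a restore, e.g.:
# [ "$(json_value_at config.json agents.defaults.model.primary)" = "gpt-4" ]
```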

---

## Restart Delegation (`restart_via`)

Sometimes a process is a child of another service. Instead of restarting the child directly, you want to restart the parent:

```ini
[ollama]
enabled=true
type=process
description=Ollama LLM Server
process_match=ollama serve
required_ports=11434
restart_via=my-gateway
```

When `ollama` is unhealthy, Guardian restarts `my-gateway` instead — which will respawn ollama as a child process. The delegation follows the `restart_via` chain (it can delegate multiple levels deep).
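Following the chain can be sketched as a small loop with a cycle guard. A hedged illustration: `get_restart_via` stands in for Guardian's config lookup and is not a real function of the script.

```shell
# Hypothetical sketch: walk restart_via links until a section with no
# delegate is found, guarding against accidental cycles in the config.
resolve_restart_target() {
  local section="$1" seen="" via
  while via=$(get_restart_via "$section"); [ -n "$via" ]; do
    case " $seen " in *" $via "*) break ;; esac   # cycle detected: stop here
    seen="$seen $section"
    section="$via"
  done
  echo "$section"
}
```

Whatever section the walk ends on is the one whose restart method (process kill/start, `systemctl --user restart`, or `docker restart`) actually runs.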

---

## Custom Health Commands (`health_cmd`)

For checks that don't fit the built-in patterns, use `health_cmd`:

```ini
health_cmd=/opt/myapp/health-check.sh
```

The command runs in bash. **Exit code 0 = healthy**, non-zero = unhealthy. The first 120 characters of stdout/stderr are included in the log message on failure.

Examples:

- Check a database connection: `health_cmd=pg_isready -h localhost`
- Validate a complex config: `health_cmd=python3 /opt/myapp/validate_config.py`
- Check disk space: `health_cmd=test $(df --output=pcent /data | tail -1 | tr -d '% ') -lt 90`
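For readability, a `health_cmd` one-liner can also be factored into a small script. A hedged sketch of the disk-space check (names, paths, and thresholds are illustrative; `df --output=pcent` is GNU coreutils):

```shell
#!/usr/bin/env bash
# Hypothetical health-check.sh: exit 0 = healthy, non-zero = unhealthy.

disk_usage_pct() {   # usage of a mount point, as a bare number (no % sign)
  df --output=pcent "$1" | tail -1 | tr -d '% '
}

check_disk() {       # healthy while usage is strictly below the threshold
  [ "$(disk_usage_pct "$1")" -lt "$2" ]
}

# A real script would end with a call such as:  check_disk /data 90
# and be wired up in guardian.conf as:  health_cmd=/opt/myapp/health-check.sh
```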

---

## Immutable Flag Management

Guardian uses Linux `chattr +i` (immutable attribute) to prevent agents from modifying critical files. When immutable is set, even root must explicitly clear it before writing.

### Default behavior

- Backup files are always immutable (root-owned, mode 600)
- Restored files get immutable set after restore
- Guardian's own files (`guardian.sh`, config, service file) are kept immutable via `check_self_integrity()`

### `skip_immutable=true`

Some files need to be writable by the application (e.g., config files that agents legitimately update). Set `skip_immutable=true` to prevent Guardian from setting the immutable flag after restore:

```ini
skip_immutable=true
```

The file is still backed up and can be restored — it just won't be locked after restoration.

### Editing protected files

To edit a file Guardian is protecting:

```bash
# 1. Clear the immutable flag
sudo chattr -i /etc/guardian/guardian.conf

# 2. Edit the file
sudo nano /etc/guardian/guardian.conf

# 3. Re-set the flag (or let Guardian do it on next cycle)
sudo chattr +i /etc/guardian/guardian.conf

# 4. Restart guardian to pick up config changes
sudo systemctl restart guardian
```

---

## Backup System

Guardian maintains generational backups of all `protected_files` and `protected_service` files.

### Snapshot behavior

- On each healthy cycle, Guardian compares the live file to the latest backup
- If the file has changed, it rotates existing backups and takes a new snapshot
- Backups are stored in `/var/lib/guardian/backups/<section-name>/`
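The compare-before-snapshot step can be sketched as a one-line predicate (hedged; the `needs_snapshot` name is ours, and `cmp -s` does a byte-for-byte comparison):

```shell
# Hypothetical sketch: snapshot only when the live file differs from the
# newest backup, or when no backup exists yet.
needs_snapshot() {
  local live="$1" backup="$2"
  [ ! -f "$backup" ] || ! cmp -s "$live" "$backup"
}
```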

### Generations

With `backup_generations=5`, backups are named:

```
settings.json     ← current (most recent healthy state)
settings.json.1   ← previous
settings.json.2   ← two versions ago
settings.json.3
settings.json.4
settings.json.5   ← oldest (deleted when a new snapshot is taken)
```
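The rotation itself can be sketched as follows (a hedged illustration; Guardian's actual implementation may differ):

```shell
# Hypothetical sketch of generation rotation: drop the oldest copy,
# shift the remaining generations up by one, move current to .1.
rotate_backups() {
  local f="$1" gens="${2:-5}" i
  rm -f "$f.$gens"
  for ((i = gens - 1; i >= 1; i--)); do
    [ -f "$f.$i" ] && mv "$f.$i" "$f.$((i + 1))"
  done
  [ -f "$f" ] && mv "$f" "$f.1"
  return 0
}
```

After rotating, a fresh copy of the live file would become the new un-numbered backup.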

### Restore behavior

When soft restarts are exhausted, Guardian restores all `protected_files` from the most recent backup (the un-numbered copy). File ownership is set to `start_user` or `systemd_user` from the section config.
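The restore step can be sketched as (hedged; the `restore_file` name and path layout are illustrative):

```shell
# Hypothetical sketch: copy the newest (un-numbered) backup over the live
# file, then hand ownership back to the section's configured user.
restore_file() {
  local live="$1" backup_dir="$2" owner="$3"
  local backup="$backup_dir/$(basename "$live")"
  [ -f "$backup" ] || return 1
  cp -p "$backup" "$live"
  chown "$owner" "$live" 2>/dev/null || true   # needs root for other users
}
```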
