|
| 1 | +# Health Checks Reference |
| 2 | + |
| 3 | +How Guardian monitors resources and recovers from failures. |
| 4 | + |
| 5 | +--- |
| 6 | + |
| 7 | +## The Tier System |
| 8 | + |
| 9 | +Organize your config sections by dependency criticality. Guardian checks every enabled section on every cycle, but tiers help you reason about what matters most: |
| 10 | + |
| 11 | +| Tier | Purpose | Example | |
| 12 | +|------|---------|---------| |
| 13 | +| **Tier 1** | Agent lifeline — agent is dead without these | Gateways, core services | |
| 14 | +| **Tier 2** | API routing — agent can't think without these | Proxies, API servers | |
| 15 | +| **Tier 3** | Memory & context — agent loses state without these | Databases, config files | |
| 16 | +| **Tier 4** | Helpers — degraded but alive without these | Monitoring, bots, utilities | |
| 17 | + |
| 18 | +Guardian doesn't enforce tiers — they're an organizational convention. Every enabled section gets checked every cycle regardless of tier. |
| 19 | + |
| 20 | +--- |
| 21 | + |
| 22 | +## Monitoring Types |
| 23 | + |
| 24 | +### `process` |
| 25 | + |
| 26 | +Matches a running process by command-line pattern (via `pgrep -f`). If the process isn't found, Guardian restarts it using `start_cmd`. |
| 27 | + |
| 28 | +```ini |
| 29 | +[my-api-server] |
| 30 | +enabled=true |
| 31 | +type=process |
| 32 | +description=My API Server |
| 33 | +process_match=uvicorn main:app.*8080 |
| 34 | +start_cmd=start.sh |
| 35 | +start_dir=/opt/myapp |
| 36 | +start_user=deploy |
| 37 | +start_venv=/opt/myapp/venv |
| 38 | +required_ports=8080 |
| 39 | +health_url=http://127.0.0.1:8080/health |
| 40 | +protected_files=/opt/myapp/main.py,/opt/myapp/start.sh |
| 41 | +max_soft_restarts=3 |
| 42 | +restart_grace=10 |
| 43 | +``` |
| 44 | + |
| 45 | +**Recovery flow:** |
| 46 | +1. Kill existing process (`pkill -f`, then `pkill -9` after 2s) |
| 47 | +2. Free required ports (`fuser -k`) |
| 48 | +3. Activate virtualenv if `start_venv` is set |
| 49 | +4. `cd` to `start_dir` if set |
| 50 | +5. Run `start_cmd` as `start_user` via `nohup` (`.py` files use `python3`, others use `bash`) |
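
Steps 3 to 5 amount to assembling one shell command. The sketch below is a dry run under stated assumptions, not Guardian's actual code: `build_start_command` is a hypothetical helper that only prints the command, the redirect to `/dev/null` is illustrative, and switching to `start_user` is left out.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run helper (not part of Guardian): prints the command
# that steps 3-5 would amount to, without executing anything.
build_start_command() {
    local start_cmd="$1" start_dir="${2:-}" start_venv="${3:-}"

    # Rule from step 5: .py files run under python3, everything else under bash.
    local interpreter="bash"
    if [[ "$start_cmd" == *.py ]]; then
        interpreter="python3"
    fi

    local cmd=""
    # Step 3: activate the virtualenv only when start_venv is set.
    if [[ -n "$start_venv" ]]; then
        cmd+="source ${start_venv}/bin/activate && "
    fi
    # Step 4: change directory only when start_dir is set.
    if [[ -n "$start_dir" ]]; then
        cmd+="cd ${start_dir} && "
    fi
    # Step 5: detach with nohup so the process outlives the caller.
    # Discarding output is an assumption made for this sketch.
    cmd+="nohup ${interpreter} ${start_cmd} >/dev/null 2>&1 &"

    printf '%s\n' "$cmd"
}

build_start_command start.sh /opt/myapp /opt/myapp/venv
```

The helper does not quote paths, so it is only suitable for illustrating the order of operations with values that contain no spaces.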
| 51 | + |
| 52 | +### `systemd-user` |
| 53 | + |
| 54 | +Monitors a systemd user service via `systemctl --user is-active`. Restarts via `systemctl --user restart`. |
| 55 | + |
| 56 | +```ini |
| 57 | +[my-gateway] |
| 58 | +enabled=true |
| 59 | +type=systemd-user |
| 60 | +description=OpenClaw Gateway |
| 61 | +service=openclaw-gateway.service |
| 62 | +systemd_user=deploy |
| 63 | +required_ports=3100 |
| 64 | +health_port=3100 |
| 65 | +protected_files=/home/deploy/.openclaw/openclaw.json |
| 66 | +protected_service=/home/deploy/.config/systemd/user/openclaw-gateway.service |
| 67 | +max_soft_restarts=3 |
| 68 | +restart_grace=15 |
| 69 | +skip_immutable=true |
| 70 | +``` |
| 71 | + |
| 72 | +**Important:** `systemd_user` is required — Guardian needs to know which user's systemd instance to interact with. |
| 73 | + |
| 74 | +### `docker` |
| 75 | + |
| 76 | +Monitors a Docker container via `docker inspect`. Restarts via `docker restart`. |
| 77 | + |
| 78 | +```ini |
| 79 | +[my-database] |
| 80 | +enabled=true |
| 81 | +type=docker |
| 82 | +description=Vector Database |
| 83 | +container_name=qdrant-prod |
| 84 | +required_ports=6333 |
| 85 | +health_url=http://127.0.0.1:6333/healthz |
| 86 | +max_soft_restarts=3 |
| 87 | +restart_grace=15 |
| 88 | +``` |
| 89 | + |
| 90 | +### `file-integrity` |
| 91 | + |
| 92 | +Validates that files exist, are non-empty, and optionally contain valid JSON with required keys and values. Recovery restores from backup. |
| 93 | + |
| 94 | +```ini |
| 95 | +[my-settings] |
| 96 | +enabled=true |
| 97 | +type=file-integrity |
| 98 | +description=Application Settings |
| 99 | +protected_files=/opt/myapp/settings.json,/opt/myapp/config.yaml |
| 100 | +required_json_keys=session_limit,poll_interval |
| 101 | +required_json_values=database.host=localhost |
| 102 | +max_soft_restarts=1 |
| 103 | +restart_grace=5 |
| 104 | +``` |
| 105 | + |
| 106 | +--- |
| 107 | + |
| 108 | +## Config Key Reference |
| 109 | + |
| 110 | +### Common Keys (all types) |
| 111 | + |
| 112 | +| Key | Required | Default | Description | |
| 113 | +|-----|----------|---------|-------------| |
| 114 | +| `enabled` | yes | `false` | Must be `true` to monitor this section | |
| 115 | +| `type` | yes | — | `process`, `systemd-user`, `docker`, or `file-integrity` | |
| 116 | +| `description` | no | section name | Human-readable description for logs | |
| 117 | +| `max_soft_restarts` | no | `3` | Soft restarts before falling back to backup restore | |
| 118 | +| `restart_grace` | no | `10` | Seconds to wait after a restart before rechecking | |
| 119 | +| `restart_via` | no | — | Delegate restart to another section (by section name) | |
| 120 | +| `protected_files` | no | — | Comma-separated file paths to back up and monitor | |
| 121 | +| `protected_service` | no | — | Systemd service file path to back up | |
| 122 | +| `skip_immutable` | no | `false` | If `true`, don't set immutable flag on restored files | |
| 123 | +| `required_ports` | no | — | Comma-separated ports that must be listening | |
| 124 | +| `health_url` | no | — | URL that must return HTTP 2xx | |
| 125 | +| `health_port` | no | — | Single port that must be listening (a simpler alternative to `required_ports`) | |
| 126 | +| `health_cmd` | no | — | Shell command that must exit 0 | |
| 127 | + |
| 128 | +### Process-Specific Keys |
| 129 | + |
| 130 | +| Key | Required | Default | Description | |
| 131 | +|-----|----------|---------|-------------| |
| 132 | +| `process_match` | yes | — | Pattern for `pgrep -f` matching | |
| 133 | +| `start_cmd` | yes | — | Script/binary to start (`.py` → python3, else bash) | |
| 134 | +| `start_dir` | no | — | Working directory for `start_cmd` | |
| 135 | +| `start_user` | no | `$(whoami)` | Unix user to run as. **Warning:** `$(whoami)` resolves to `root` when Guardian runs as a root systemd service, so always set this explicitly. |
| 136 | +| `start_venv` | no | — | Python virtualenv path to activate before starting | |
| 137 | + |
| 138 | +### Systemd-User Keys |
| 139 | + |
| 140 | +| Key | Required | Default | Description | |
| 141 | +|-----|----------|---------|-------------| |
| 142 | +| `service` | yes | — | Systemd service unit name | |
| 143 | +| `systemd_user` | yes | — | Unix user whose systemd instance to manage | |
| 144 | + |
| 145 | +### Docker Keys |
| 146 | + |
| 147 | +| Key | Required | Default | Description | |
| 148 | +|-----|----------|---------|-------------| |
| 149 | +| `container_name` | yes | — | Docker container name | |
| 150 | + |
| 151 | +### File-Integrity Keys |
| 152 | + |
| 153 | +| Key | Required | Default | Description | |
| 154 | +|-----|----------|---------|-------------| |
| 155 | +| `required_json_keys` | no | — | Comma-separated top-level keys that must exist in JSON files | |
| 156 | +| `required_json_values` | no | — | Comma-separated `path=value` pairs to validate in JSON files | |
| 157 | + |
| 158 | +--- |
| 159 | + |
| 160 | +## The Recovery Cascade |
| 161 | + |
| 162 | +When Guardian detects an unhealthy resource, it follows this sequence: |
| 163 | + |
| 164 | +``` |
| 165 | +UNHEALTHY detected |
| 166 | + │ |
| 167 | + ├─ failure_count <= max_soft_restarts? |
| 168 | + │ └─ YES → Soft restart (just restart the service/process) |
| 169 | + │ Wait restart_grace seconds |
| 170 | + │ |
| 171 | + └─ NO (soft restarts exhausted) |
| 172 | + └─ Restore protected_files from backup |
| 173 | + Restart the service/process |
| 174 | + Wait restart_grace seconds |
| 175 | + Reset failure counter to 0 |
| 176 | +``` |
| 177 | + |
| 178 | +**Failure tracking** is persistent across cycles. Each section gets a failure counter file in `/var/lib/guardian/state/`. When a resource recovers (detected healthy after being unhealthy), the counter resets to 0. |
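
The decision at the top of the cascade can be sketched in a few lines of bash. This is a minimal illustration, not Guardian's implementation: `STATE_DIR`, the `.failures` file suffix, and both function names are assumptions, and a temporary directory stands in for `/var/lib/guardian/state/`.

```bash
#!/usr/bin/env bash
# Sketch of the cascade decision. STATE_DIR and both function names are
# illustrative assumptions; a temp dir stands in for Guardian's state dir.
STATE_DIR="${STATE_DIR:-$(mktemp -d)}"

# Increment the section's persistent failure counter and print the new value.
bump_failure_count() {
    local section="$1" count=0
    local file="$STATE_DIR/$section.failures"
    if [[ -f "$file" ]]; then
        count="$(<"$file")"
    fi
    count=$((count + 1))
    printf '%s\n' "$count" > "$file"
    printf '%s\n' "$count"
}

# Print the action for one unhealthy cycle: "soft-restart" or "restore".
next_action() {
    local section="$1" max_soft_restarts="$2" count
    count="$(bump_failure_count "$section")"
    if (( count <= max_soft_restarts )); then
        echo "soft-restart"
    else
        # Soft restarts exhausted: restore from backup and reset the counter.
        echo 0 > "$STATE_DIR/$section.failures"
        echo "restore"
    fi
}

# Four consecutive unhealthy cycles with max_soft_restarts=3.
for cycle in 1 2 3 4; do
    echo "cycle $cycle: $(next_action my-api-server 3)"
done
```

With `max_soft_restarts=3` the loop prints three `soft-restart` lines followed by `restore`. With `max_soft_restarts=0` the first unhealthy cycle already yields `restore`.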
| 179 | + |
| 180 | +### Example timeline for `process` / `systemd-user` / `docker` |
| 181 | + |
| 182 | +With `max_soft_restarts=3`: |
| 183 | + |
| 184 | +| Cycle | Status | Action | |
| 185 | +|-------|--------|--------| |
| 186 | +| 1 | Unhealthy | Soft restart #1 (just restart the service) | |
| 187 | +| 2 | Unhealthy | Soft restart #2 | |
| 188 | +| 3 | Unhealthy | Soft restart #3 | |
| 189 | +| 4 | Unhealthy | Restore files from backup, then restart service | |
| 190 | +| 5 | Healthy | Reset counter, take snapshot | |
| 191 | + |
| 192 | +### How `file-integrity` differs |
| 193 | + |
| 194 | +For `file-integrity` sections, the "restart" action is itself a backup restore: there is no process to restart, so recovery means restoring the file. The soft restart and the backup-restore fallback therefore do the same thing: restore from backup. |
| 195 | + |
| 196 | +In practice, use `max_soft_restarts=0` for file-integrity sections. This skips the redundant "soft restart" phase and goes straight to restore: |
| 197 | + |
| 198 | +```ini |
| 199 | +[my-config-files] |
| 200 | +type=file-integrity |
| 201 | +max_soft_restarts=0 # Go straight to restore — no process to "soft restart" |
| 202 | +``` |
| 203 | + |
| 204 | +If you set `max_soft_restarts=3` on a file-integrity section, Guardian will restore from backup 3 times (one per cycle), then restore again on cycle 4. It's not harmful, but the extra attempts are redundant. |
| 205 | + |
| 206 | +If the file-integrity section has a `restart_via` pointing to a process or service section, recovery does more than rewrite the file. Guardian restores the file and then restarts the delegated service: |
| 207 | + |
| 208 | +```ini |
| 209 | +[my-pinned-config] |
| 210 | +type=file-integrity |
| 211 | +protected_files=/opt/myapp/config.json |
| 212 | +required_json_values=model.primary=gpt-4 |
| 213 | +restart_via=my-api-server # After restoring the file, restart this service |
| 214 | +max_soft_restarts=0 |
| 215 | +``` |
| 216 | + |
| 217 | +--- |
| 218 | + |
| 219 | +## File Integrity and JSON Validation |
| 220 | + |
| 221 | +The `file-integrity` type goes beyond checking that files exist. It can validate JSON structure: |
| 222 | + |
| 223 | +### `required_json_keys` |
| 224 | + |
| 225 | +Checks that top-level keys exist in all `.json` files in `protected_files`: |
| 226 | + |
| 227 | +```ini |
| 228 | +required_json_keys=session_limit,poll_interval,agents |
| 229 | +``` |
| 230 | + |
| 231 | +Guardian will flag files that are missing any of these keys. |
| 232 | + |
| 233 | +### `required_json_values` |
| 234 | + |
| 235 | +Pins specific values at dot-notation paths: |
| 236 | + |
| 237 | +```ini |
| 238 | +required_json_values=agents.defaults.model.primary=gpt-4,database.port=5432 |
| 239 | +``` |
| 240 | + |
| 241 | +Guardian traverses nested objects using the dot-separated path. If the actual value doesn't match the expected value, the file is considered corrupted and will be restored from backup. |
| 242 | + |
| 243 | +This is useful for preventing agents from changing their own model configuration or other critical settings. |
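
To make the `path=value` format concrete, here is a small sketch that splits a `required_json_values` string into pairs and, optionally, checks one pair against a file. The function names are hypothetical, and the use of `jq` for the lookup is an assumption made for illustration. Guardian's own parser may work differently.

```bash
#!/usr/bin/env bash
# Sketch: split a required_json_values string into (path, expected) pairs.
# All function names here are hypothetical, not Guardian internals.
parse_pins() {
    local spec="$1" pair path expected
    local -a pairs
    IFS=',' read -r -a pairs <<< "$spec"
    for pair in "${pairs[@]}"; do
        path="${pair%%=*}"       # text before the first "="
        expected="${pair#*=}"    # text after the first "="
        printf '%s -> %s\n' "$path" "$expected"
    done
}

# Look up a dot path with jq (assumes jq is installed; simple key names only).
json_value_at() {
    local file="$1" path="$2"
    jq -r ".${path}" "$file"
}

# Succeeds only when the value at the path equals the expected string.
check_pin() {
    local file="$1" path="$2" expected="$3"
    [[ "$(json_value_at "$file" "$path")" == "$expected" ]]
}

parse_pins "agents.defaults.model.primary=gpt-4,database.port=5432"
```

A dot path such as `agents.defaults.model.primary` names the key `primary` inside `model`, inside `defaults`, inside `agents`. Because the list is comma-separated, this sketch cannot express an expected value that itself contains a comma.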
| 244 | + |
| 245 | +--- |
| 246 | + |
| 247 | +## Restart Delegation (`restart_via`) |
| 248 | + |
| 249 | +Sometimes a process is a child of another service. Instead of restarting the child directly, you want to restart the parent: |
| 250 | + |
| 251 | +```ini |
| 252 | +[ollama] |
| 253 | +enabled=true |
| 254 | +type=process |
| 255 | +description=Ollama LLM Server |
| 256 | +process_match=ollama serve |
| 257 | +required_ports=11434 |
| 258 | +restart_via=my-gateway |
| 259 | +``` |
| 260 | + |
| 261 | +When `ollama` is unhealthy, Guardian restarts `my-gateway` instead, which respawns Ollama as a child process. Delegation follows the `restart_via` chain, so a restart can be delegated through several levels. |
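
Following a delegation chain can be sketched as a loop over a lookup table. The `RESTART_VIA` map, the function name, and the hop limit are assumptions made for this example. The source does not say whether Guardian guards against loops, so the guard here is a precaution of the sketch itself.

```bash
#!/usr/bin/env bash
# Sketch of following a restart_via chain (requires bash 4+ for
# associative arrays). Names and the hop limit are illustrative.
declare -A RESTART_VIA=(
    [ollama]="my-gateway"   # ollama delegates its restart to my-gateway
    [my-gateway]=""         # my-gateway restarts itself
)

resolve_restart_target() {
    local current="$1" hops=0 max_hops=10
    while [[ -n "${RESTART_VIA[$current]:-}" ]]; do
        current="${RESTART_VIA[$current]}"
        hops=$((hops + 1))
        # Guard against a cycle such as a -> b -> a.
        if (( hops > max_hops )); then
            echo "restart_via loop detected at $current" >&2
            return 1
        fi
    done
    printf '%s\n' "$current"
}

resolve_restart_target ollama
```

For the configuration above this prints `my-gateway`, the section that would actually be restarted.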
| 262 | + |
| 263 | +--- |
| 264 | + |
| 265 | +## Custom Health Commands (`health_cmd`) |
| 266 | + |
| 267 | +For checks that don't fit the built-in patterns, use `health_cmd`: |
| 268 | + |
| 269 | +```ini |
| 270 | +health_cmd=/opt/myapp/health-check.sh |
| 271 | +``` |
| 272 | + |
| 273 | +The command runs in bash. **Exit code 0 = healthy**, non-zero = unhealthy. The first 120 characters of stdout/stderr are included in the log message on failure. |
| 274 | + |
| 275 | +Examples: |
| 276 | +- Check a database connection: `health_cmd=pg_isready -h localhost` |
| 277 | +- Validate a complex config: `health_cmd=python3 /opt/myapp/validate_config.py` |
| 278 | +- Check disk space: `health_cmd=test $(df --output=pcent /data | tail -1 | tr -d '% ') -lt 90` |
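
The contract described above (exit code decides health, output is truncated for the log) can be sketched as follows. `run_health_cmd` is a hypothetical wrapper and the log wording is invented for illustration. Only the exit-code rule and the 120-character limit come from this document.

```bash
#!/usr/bin/env bash
# Sketch of the health_cmd contract. run_health_cmd is a hypothetical
# wrapper; the message format is made up for this example.
run_health_cmd() {
    local cmd="$1" output status
    # Run the command in bash, capturing stdout and stderr together.
    output="$(bash -c "$cmd" 2>&1)"
    status=$?
    if (( status == 0 )); then
        echo "healthy"
    else
        # Keep only the first 120 characters of the combined output.
        echo "unhealthy (exit $status): ${output:0:120}"
    fi
    return "$status"
}

run_health_cmd 'test -d /' || true
run_health_cmd 'echo "disk almost full" >&2; exit 2' || true
```

The first call prints `healthy`. The second prints `unhealthy (exit 2): disk almost full`, which shows why a short, descriptive message on failure makes the log easier to read.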
| 279 | + |
| 280 | +--- |
| 281 | + |
| 282 | +## Immutable Flag Management |
| 283 | + |
| 284 | +Guardian uses the Linux immutable attribute (`chattr +i`) to prevent agents from modifying critical files. When the flag is set, even root must explicitly clear it before writing. |
| 285 | + |
| 286 | +### Default behavior |
| 287 | + |
| 288 | +- Backup files are always immutable (root-owned, mode 600) |
| 289 | +- Restored files get immutable set after restore |
| 290 | +- Guardian's own files (`guardian.sh`, config, service file) are kept immutable via `check_self_integrity()` |
| 291 | + |
| 292 | +### `skip_immutable=true` |
| 293 | + |
| 294 | +Some files need to be writable by the application (e.g., config files that agents legitimately update). Set `skip_immutable=true` to prevent Guardian from setting the immutable flag after restore: |
| 295 | + |
| 296 | +```ini |
| 297 | +skip_immutable=true |
| 298 | +``` |
| 299 | + |
| 300 | +The file is still backed up and can be restored — it just won't be locked after restoration. |
| 301 | + |
| 302 | +### Editing protected files |
| 303 | + |
| 304 | +To edit a file Guardian is protecting: |
| 305 | + |
| 306 | +```bash |
| 307 | +# 1. Clear the immutable flag |
| 308 | +sudo chattr -i /etc/guardian/guardian.conf |
| 309 | + |
| 310 | +# 2. Edit the file |
| 311 | +sudo nano /etc/guardian/guardian.conf |
| 312 | + |
| 313 | +# 3. Re-set the flag (or let Guardian do it on next cycle) |
| 314 | +sudo chattr +i /etc/guardian/guardian.conf |
| 315 | + |
| 316 | +# 4. Restart guardian to pick up config changes |
| 317 | +sudo systemctl restart guardian |
| 318 | +``` |
| 319 | + |
| 320 | +--- |
| 321 | + |
| 322 | +## Backup System |
| 323 | + |
| 324 | +Guardian maintains generational backups of all `protected_files` and `protected_service` files. |
| 325 | + |
| 326 | +### Snapshot behavior |
| 327 | + |
| 328 | +- On each healthy cycle, Guardian compares the live file to the latest backup |
| 329 | +- If the file has changed, it rotates existing backups and takes a new snapshot |
| 330 | +- Backups are stored in `/var/lib/guardian/backups/<section-name>/` |
| 331 | + |
| 332 | +### Generations |
| 333 | + |
| 334 | +With `backup_generations=5`, backups are named: |
| 335 | + |
| 336 | +``` |
| 337 | +settings.json ← current (most recent healthy state) |
| 338 | +settings.json.1 ← previous |
| 339 | +settings.json.2 ← two versions ago |
| 340 | +settings.json.3 |
| 341 | +settings.json.4 |
| 342 | +settings.json.5 ← oldest (deleted when a new snapshot is taken) |
| 343 | +``` |
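
The compare-then-rotate behavior can be sketched as below. This is an illustration under assumptions, not Guardian's code: `snapshot_if_changed` is a hypothetical name, `generations` is assumed to be at least 1, and the sketch ignores ownership, file modes, and the immutable flag.

```bash
#!/usr/bin/env bash
# Sketch of the rotation scheme. snapshot_if_changed is a hypothetical
# helper that ignores ownership, permissions, and the immutable flag.
snapshot_if_changed() {
    local live="$1" backup_dir="$2" generations="$3" i
    local current
    current="$backup_dir/$(basename "$live")"

    # Nothing to do when the live file matches the latest backup.
    if [[ -f "$current" ]] && cmp -s "$live" "$current"; then
        return 0
    fi

    mkdir -p "$backup_dir"
    # Drop the oldest generation, then shift each numbered copy up by one.
    rm -f "$current.$generations"
    for (( i = generations - 1; i >= 1; i-- )); do
        if [[ -f "$current.$i" ]]; then
            mv "$current.$i" "$current.$((i + 1))"
        fi
    done
    # The old current copy becomes .1; the live file becomes the new current.
    if [[ -f "$current" ]]; then
        mv "$current" "$current.1"
    fi
    cp -p "$live" "$current"
}

work="$(mktemp -d)"
echo v1 > "$work/settings.json"
snapshot_if_changed "$work/settings.json" "$work/backups" 5
echo v2 > "$work/settings.json"
snapshot_if_changed "$work/settings.json" "$work/backups" 5
ls "$work/backups"   # lists settings.json and settings.json.1
```

After the two snapshots, `settings.json` in the backup directory holds `v2` and `settings.json.1` holds `v1`. Calling the function again without changing the live file leaves the backups untouched.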
| 344 | + |
| 345 | +### Restore behavior |
| 346 | + |
| 347 | +When soft restarts are exhausted, Guardian restores all `protected_files` from the most recent backup (the un-numbered copy). File ownership is set to `start_user` or `systemd_user` from the section config. |