Status: RFI complete; ready for implementation planning. GitHub issue: #2308 · Linear: RMN-256
ZeroClaw currently has strong component supervision and health snapshots, but it does not expose a formal agent lifecycle state model. This RFI defines a lifecycle FSM, transition contract, synchronization model, persistence posture, and migration path that can be implemented without changing existing daemon reliability behavior.
- `src/daemon/mod.rs` supervises gateway/channels/heartbeat/scheduler with restart backoff.
- `src/health/mod.rs` tracks per-component `status`, `last_ok`, `last_error`, and `restart_count`.
- `src/agent/session.rs` persists conversational history with memory/SQLite backends and TTL cleanup.
- `src/agent/loop_.rs` and `src/agent/agent.rs` provide bounded per-turn execution loops.
- No typed lifecycle enum for the agent runtime (or per-session runtime state).
- No validated transition guard rails (invalid transitions are not prevented centrally).
- Health state and lifecycle state are conflated (`ok`/`error` are not full lifecycle semantics).
- Persistence only covers health snapshots and conversation history, not lifecycle transitions.
- No single integration contract for daemon, channels, supervisor, and health endpoint consumers.
- `Created`: runtime object exists but not started.
- `Starting`: dependencies are being initialized.
- `Running`: normal operation, accepting and processing work.
- `Degraded`: still running but with elevated failure/restart signals.
- `Suspended`: intentionally paused (manual pause, e-stop, or maintenance gate).
- `Backoff`: recovering after crash/error; restart cooldown active.
- `Terminating`: graceful shutdown in progress.
- `Terminated`: clean shutdown completed.
- `Crashed`: unrecoverable failure after retry budget is exhausted.
```mermaid
stateDiagram-v2
    [*] --> Created
    Created --> Starting: daemon run/start
    Starting --> Running: init_ok
    Starting --> Backoff: init_fail
    Running --> Degraded: component_error_threshold
    Degraded --> Running: recovered
    Running --> Suspended: pause_or_estop
    Degraded --> Suspended: pause_or_estop
    Suspended --> Running: resume
    Backoff --> Starting: retry_after_backoff
    Backoff --> Crashed: retry_budget_exhausted
    Running --> Terminating: shutdown_signal
    Degraded --> Terminating: shutdown_signal
    Suspended --> Terminating: shutdown_signal
    Terminating --> Terminated: shutdown_complete
    Crashed --> Terminating: manual_stop
```
| From | Trigger | Guard | To | Action |
|---|---|---|---|---|
| `Created` | daemon start | config valid | `Starting` | emit lifecycle event |
| `Starting` | init success | all required components healthy | `Running` | clear restart streak |
| `Starting` | init failure | retry budget available | `Backoff` | increment restart streak |
| `Running` | component errors | restart streak >= threshold | `Degraded` | set degraded cause |
| `Degraded` | recovery success | error window clears | `Running` | clear degraded cause |
| `Running`/`Degraded` | pause/e-stop | operator or policy signal | `Suspended` | stop intake/execution |
| `Suspended` | resume | policy allows | `Running` | re-enable intake |
| `Backoff` | retry timer | retry budget available | `Starting` | start component init |
| `Backoff` | retry exhausted | no retries left | `Crashed` | emit terminal failure event |
| any non-terminal state | shutdown | signal received | `Terminating` | drain and stop workers |
| `Terminating` | done | all workers stopped | `Terminated` | persist final snapshot |
Add a dedicated lifecycle type in runtime/daemon scope:
```rust
enum AgentLifecycleState {
    Created,
    Starting,
    Running,
    Degraded { cause: String },
    Suspended { reason: String },
    Backoff { retry_in_ms: u64, attempt: u32 },
    Terminating,
    Terminated,
    Crashed { reason: String },
}
```

- Use a single `LifecycleRegistry` (`Arc<RwLock<...>>`) owned by the daemon runtime.
- Route all lifecycle writes through `transition(from, to, trigger)` with guard checks.
- Emit transition events from one place, then fan out to health snapshot and observability.
- Reject invalid transitions at runtime and log them as policy violations.
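A minimal sketch of such a registry, assuming the `AgentLifecycleState` enum above gains `#[derive(Debug, Clone, PartialEq)]`. `LifecycleRegistry` and its method names are illustrative, not shipped ZeroClaw APIs; here the `from` side of `transition(from, to, trigger)` is read from the registry under the write lock rather than passed in, and `eprintln!` stands in for the real observability fan-out:

```rust
use std::sync::{Arc, RwLock};

pub struct LifecycleRegistry {
    // Shared with daemon, channels, and the health endpoint.
    state: Arc<RwLock<AgentLifecycleState>>,
}

impl LifecycleRegistry {
    pub fn new() -> Self {
        Self {
            state: Arc::new(RwLock::new(AgentLifecycleState::Created)),
        }
    }

    pub fn current(&self) -> AgentLifecycleState {
        self.state.read().expect("lifecycle lock poisoned").clone()
    }

    /// Single write path: validate the (from, to) pair before swapping,
    /// so invalid transitions are rejected centrally.
    pub fn transition(&self, to: AgentLifecycleState, trigger: &str) -> Result<(), String> {
        let mut guard = self.state.write().expect("lifecycle lock poisoned");
        if !Self::is_valid(&*guard, &to) {
            // Policy violation: reject and report instead of mutating.
            return Err(format!(
                "invalid transition {:?} -> {:?} on trigger `{}`",
                *guard, to, trigger
            ));
        }
        let from = std::mem::replace(&mut *guard, to);
        // Single emission point: fan out to health snapshot and journal here.
        eprintln!("lifecycle: {:?} -> {:?} ({})", from, *guard, trigger);
        Ok(())
    }

    /// Encodes the transition table at the variant level; payloads
    /// (cause, reason, attempt) never affect validity.
    fn is_valid(from: &AgentLifecycleState, to: &AgentLifecycleState) -> bool {
        use AgentLifecycleState::*;
        matches!(
            (from, to),
            (Created, Starting)
                | (Starting, Running | Backoff { .. })
                | (Running, Degraded { .. })
                | (Degraded { .. }, Running)
                | (Running | Degraded { .. }, Suspended { .. })
                | (Suspended { .. }, Running)
                | (Backoff { .. }, Starting | Crashed { .. })
                | (
                    Created | Starting | Running | Degraded { .. }
                        | Suspended { .. } | Backoff { .. } | Crashed { .. },
                    Terminating
                )
                | (Terminating, Terminated)
        )
    }
}
```

Keeping the validity check as a single `matches!` over variant pairs makes the transition table and the code reviewable side by side.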
Decision: hybrid persistence.
- Runtime source of truth: in-memory lifecycle registry for low-latency transitions.
- Durable checkpoint: persisted lifecycle snapshot alongside `daemon_state.json`.
- Optional append-only transition journal (`lifecycle_events.jsonl`) for audit and forensics.
Rationale:
- In-memory state keeps current daemon behavior fast and simple.
- Persistent checkpoint enables status restoration after restart and improves operator clarity.
- Event journal is valuable for post-incident analysis without changing runtime control flow.
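A sketch of the persistence side, assuming `serde`/`serde_json` are available. Only `daemon_state.json` and `lifecycle_events.jsonl` come from this RFI; the snapshot filename `lifecycle_state.json` and all struct/field names are placeholders:

```rust
use std::fs::OpenOptions;
use std::io::Write;
use std::path::Path;

use serde::Serialize;

// Hypothetical snapshot shape; field names are illustrative.
#[derive(Serialize)]
struct LifecycleSnapshot<'a> {
    state: &'a str,
    since_unix_ms: u64,
    attempt: Option<u32>,
    retry_in_ms: Option<u64>,
}

// One journal record per transition.
#[derive(Serialize)]
struct LifecycleEvent<'a> {
    ts_unix_ms: u64,
    from: &'a str,
    to: &'a str,
    trigger: &'a str,
}

/// Durable checkpoint: write-then-rename so a crash mid-write never leaves
/// a truncated snapshot next to `daemon_state.json`.
fn persist_snapshot(dir: &Path, snap: &LifecycleSnapshot) -> std::io::Result<()> {
    let tmp = dir.join("lifecycle_state.json.tmp");
    std::fs::write(&tmp, serde_json::to_vec_pretty(snap)?)?;
    std::fs::rename(tmp, dir.join("lifecycle_state.json"))
}

/// Audit journal: append one JSON object per line; journal failures should
/// be logged but must not block the runtime transition itself.
fn append_event(dir: &Path, ev: &LifecycleEvent) -> std::io::Result<()> {
    let mut file = OpenOptions::new()
        .create(true)
        .append(true)
        .open(dir.join("lifecycle_events.jsonl"))?;
    let mut line = serde_json::to_vec(ev)?;
    line.push(b'\n');
    file.write_all(&line)
}
```

Snapshot writes can be throttled and journal appends pushed to a background task, which is how the I/O risk in the table below would be contained.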
- `src/daemon/mod.rs`: wrap supervisor start/failure/backoff/shutdown with explicit lifecycle transitions.
- `src/health/mod.rs`: expose lifecycle state in the health snapshot without replacing component-level health detail.
- `src/main.rs` (`status`, `restart`, e-stop surfaces): render lifecycle state and transition reason in CLI output.
- `src/channels/mod.rs` and channel workers: gate message intake when lifecycle is `Suspended`, `Terminating`, `Crashed`, or `Terminated` (see the sketch below).
- `src/agent/session.rs`: keep session history semantics unchanged; add an optional link from session to runtime lifecycle id.
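For the channel-gating item, a minimal predicate sketch; the helper name and placement are assumptions:

```rust
/// Intake gate a channel worker could consult before accepting a message.
/// `Suspended`, `Terminating`, `Crashed`, and `Terminated` stop intake;
/// `Degraded` deliberately keeps accepting work.
fn intake_allowed(state: &AgentLifecycleState) -> bool {
    use AgentLifecycleState::*;
    !matches!(
        state,
        Suspended { .. } | Terminating | Crashed { .. } | Terminated
    )
}
```

Keeping the predicate in one helper means channels, gateway, and scheduler all agree on the same gating rule instead of each re-encoding the state list.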
- Add lifecycle enum/registry and default transitions in daemon startup/shutdown.
- Include lifecycle state in health JSON output.
- Keep existing component health fields unchanged.
- Convert supervisor restart/error signals into lifecycle transitions.
- Add backoff metadata (`attempt`, `retry_in_ms`) to lifecycle snapshots.
- Enforce channel/gateway intake gating by lifecycle state.
- Surface lifecycle controls and richer status output in CLI.
- Persist snapshot and optional JSONL transition events.
- Add recovery behavior for daemon restart from persisted snapshot.
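For the restart-recovery item just above, one possible mapping from persisted snapshot to startup state. Treating the snapshot as advisory (never rehydrating straight into `Running`) is an assumption of this sketch, not a settled decision:

```rust
fn startup_state_from_snapshot(
    persisted: Option<AgentLifecycleState>,
) -> AgentLifecycleState {
    use AgentLifecycleState::*;
    match persisted {
        // An operator pause should survive a daemon restart.
        Some(Suspended { reason }) => Suspended { reason },
        // A terminal crash stays visible until manually cleared.
        Some(Crashed { reason }) => Crashed { reason },
        // Everything else re-enters the normal startup path.
        _ => Created,
    }
}
```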
- transition guard tests for all valid/invalid state pairs (sketched after this list).
- lifecycle-to-health serialization tests.
- persistence round-trip tests for snapshot and event journal.
- daemon startup failure -> backoff -> recovery path.
- repeated failure -> `Crashed` transition.
- suspend/resume behavior for channel intake and scheduler activity.
- component panic/exit simulation under supervisor.
- rapid restart storm protection and state consistency checks.
| Risk | Impact | Mitigation |
|---|---|---|
| Overlap between health and lifecycle semantics | Operator confusion | Keep both domains explicit and documented |
| Invalid transition bugs during rollout | Runtime inconsistency | Central transition API with guard checks |
| Excessive persistence I/O | Throughput impact | Snapshot throttling + async event writes |
| Channel behavior regressions on suspend | Message loss | Add intake gating tests and dry-run mode |
- State diagram and transition table documented.
- State representation and synchronization approach selected.
- Persistence strategy documented.
- Integration points and migration plan documented.