F1-3 Agent Lifecycle State Machine RFI (2026-03-01)

Status: RFI complete; ready for implementation planning.
GitHub issue: #2308
Linear: RMN-256

Summary

ZeroClaw currently has strong component supervision and health snapshots, but it does not expose a formal agent lifecycle state model. This RFI defines a lifecycle FSM, transition contract, synchronization model, persistence posture, and migration path that can be implemented without changing existing daemon reliability behavior.

Current-State Findings

Existing behavior that already works

src/daemon/mod.rs supervises gateway/channels/heartbeat/scheduler with restart backoff.
src/health/mod.rs tracks per-component status, last_ok, last_error, and restart_count.
src/agent/session.rs persists conversational history with memory/SQLite backends and TTL cleanup.
src/agent/loop_.rs and src/agent/agent.rs provide bounded per-turn execution loops.

Gaps blocking lifecycle consistency

No typed lifecycle enum for the agent runtime (or per-session runtime state).
No validated transition guard rails (invalid transitions are not prevented centrally).
Health state and lifecycle state are conflated (ok/error alone cannot express states such as Suspended or Backoff).
Persistence only covers health snapshots and conversation history, not lifecycle transitions.
No single integration contract for daemon, channels, supervisor, and health endpoint consumers.

Proposed Lifecycle Model

State definitions

Created: runtime object exists but not started.
Starting: dependencies are being initialized.
Running: normal operation, accepting and processing work.
Degraded: still running but with elevated failure/restart signals.
Suspended: intentionally paused (manual pause, e-stop, or maintenance gate).
Backoff: recovering after crash/error; restart cooldown active.
Terminating: graceful shutdown in progress.
Terminated: clean shutdown completed.
Crashed: unrecoverable failure after retry budget is exhausted.

State diagram

stateDiagram-v2
    [*] --> Created
    Created --> Starting: daemon run/start
    Starting --> Running: init_ok
    Starting --> Backoff: init_fail
    Running --> Degraded: component_error_threshold
    Degraded --> Running: recovered
    Running --> Suspended: pause_or_estop
    Degraded --> Suspended: pause_or_estop
    Suspended --> Running: resume
    Backoff --> Starting: retry_after_backoff
    Backoff --> Crashed: retry_budget_exhausted
    Running --> Terminating: shutdown_signal
    Degraded --> Terminating: shutdown_signal
    Suspended --> Terminating: shutdown_signal
    Terminating --> Terminated: shutdown_complete
    Crashed --> Terminating: manual_stop

Transition table

| From | Trigger | Guard | To | Action |
| --- | --- | --- | --- | --- |
| `Created` | daemon start | config valid | `Starting` | emit lifecycle event |
| `Starting` | init success | all required components healthy | `Running` | clear restart streak |
| `Starting` | init failure | retry budget available | `Backoff` | increment restart streak |
| `Running` | component errors | restart streak >= threshold | `Degraded` | set degraded cause |
| `Degraded` | recovery success | error window clears | `Running` | clear degraded cause |
| `Running`/`Degraded` | pause/e-stop | operator or policy signal | `Suspended` | stop intake/execution |
| `Suspended` | resume | policy allows | `Running` | re-enable intake |
| `Backoff` | retry timer | retry budget available | `Starting` | start component init |
| `Backoff` | retry exhausted | no retries left | `Crashed` | emit terminal failure event |
| non-terminal states | shutdown | signal received | `Terminating` | drain and stop workers |
| `Terminating` | done | all workers stopped | `Terminated` | persist final snapshot |
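
The table above can be encoded as a pure function over state pairs, which keeps the allowed set in one reviewable place. The following is a minimal sketch, not the planned implementation: `StateKind` and `is_valid_transition` are hypothetical names, the guard column (config validity, retry budget, and so on) is left out, and "non-terminal states" is read here as every state except `Terminating` and `Terminated`, which is an assumption.

```rust
/// Payload-free view of the lifecycle states, used only for transition checks.
/// (Hypothetical helper type; not part of the RFI.)
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum StateKind {
    Created,
    Starting,
    Running,
    Degraded,
    Suspended,
    Backoff,
    Terminating,
    Terminated,
    Crashed,
}

/// Returns true when `from -> to` corresponds to a row of the transition table.
/// Guards are not evaluated here; this only checks the shape of the transition.
pub fn is_valid_transition(from: StateKind, to: StateKind) -> bool {
    use StateKind::*;
    match (from, to) {
        (Created, Starting) => true,
        (Starting, Running) | (Starting, Backoff) => true,
        (Running, Degraded) | (Degraded, Running) => true,
        (Running, Suspended) | (Degraded, Suspended) => true,
        (Suspended, Running) => true,
        (Backoff, Starting) | (Backoff, Crashed) => true,
        (Terminating, Terminated) => true,
        // Assumption: shutdown is rejected once termination has begun or finished...
        (Terminating, Terminating) | (Terminated, Terminating) => false,
        // ...and accepted from every other state.
        (_, Terminating) => true,
        _ => false,
    }
}
```

Keeping the check free of side effects makes it straightforward to test exhaustively over all state pairs.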

Implementation Approach

State representation

Add a dedicated lifecycle type in runtime/daemon scope:

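/// Lifecycle state of the agent runtime.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AgentLifecycleState {
    Created,
    Starting,
    Running,
    /// Still running; `cause` records why the agent was marked degraded.
    Degraded { cause: String },
    /// Intentionally paused; `reason` records the pause, e-stop, or maintenance gate.
    Suspended { reason: String },
    /// Restart cooldown; `attempt` is the retry number, `retry_in_ms` the delay before the next retry.
    Backoff { retry_in_ms: u64, attempt: u32 },
    Terminating,
    Terminated,
    /// Retry budget exhausted; `reason` records the final failure.
    Crashed { reason: String },
}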

Synchronization model

Use a single LifecycleRegistry (Arc<RwLock<...>>) owned by the daemon runtime.
Route all lifecycle writes through transition(from, to, trigger) with guard checks.
Emit transition events from one place, then fan out to health snapshot and observability.
Reject invalid transitions at runtime and log them as policy violations.
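
The points above could be sketched roughly as follows. This is an illustration under stated assumptions rather than the intended API: the state set is reduced to five payload-free variants for brevity, `TransitionError` and `is_allowed` are hypothetical, `from` is treated as the caller's expected current state (a compare-and-set), and event fan-out to health and observability is omitted.

```rust
use std::sync::{Arc, RwLock};

/// Reduced state set for illustration only; the full enum has more variants.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Created,
    Starting,
    Running,
    Terminating,
    Terminated,
}

/// Hypothetical error type for rejected transitions.
#[derive(Debug, PartialEq, Eq)]
pub enum TransitionError {
    /// The registry was not in the state the caller expected.
    Stale { actual: State },
    /// The pair is not in the allowed transition set.
    Invalid { from: State, to: State },
}

/// Cheaply cloneable handle; all clones share one lifecycle state.
#[derive(Clone)]
pub struct LifecycleRegistry {
    inner: Arc<RwLock<State>>,
}

impl LifecycleRegistry {
    pub fn new() -> Self {
        Self { inner: Arc::new(RwLock::new(State::Created)) }
    }

    /// Reads the current state under a shared lock.
    pub fn current(&self) -> State {
        *self.inner.read().expect("lifecycle lock poisoned")
    }

    /// The single write path: checks the expected state and the allowed set
    /// while holding the write lock, so concurrent writers cannot interleave.
    pub fn transition(&self, from: State, to: State, trigger: &str) -> Result<(), TransitionError> {
        let mut current = self.inner.write().expect("lifecycle lock poisoned");
        if *current != from {
            return Err(TransitionError::Stale { actual: *current });
        }
        if !is_allowed(from, to) {
            // Rejected transitions are logged as policy violations.
            eprintln!("policy violation: {:?} -> {:?} (trigger: {})", from, to, trigger);
            return Err(TransitionError::Invalid { from, to });
        }
        *current = to;
        Ok(())
    }
}

/// Allowed transitions for the reduced state set above.
fn is_allowed(from: State, to: State) -> bool {
    matches!(
        (from, to),
        (State::Created, State::Starting)
            | (State::Starting, State::Running)
            | (State::Running, State::Terminating)
            | (State::Terminating, State::Terminated)
    )
}
```

Checking the expected state and the allowed set under the same write lock is what prevents two supervisors from both believing they moved the agent out of the same state.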

Persistence Decision

Decision: hybrid persistence.

Runtime source of truth: in-memory lifecycle registry for low-latency transitions.
Durable checkpoint: persisted lifecycle snapshot alongside daemon_state.json.
Optional append-only transition journal (lifecycle_events.jsonl) for audit and forensics.

Rationale:

In-memory state keeps current daemon behavior fast and simple.
Persistent checkpoint enables status restoration after restart and improves operator clarity.
Event journal is valuable for post-incident analysis without changing runtime control flow.
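
To make the journal idea concrete, here is a small sketch of writing one transition per line in JSON Lines form. The field names (`ts_ms`, `from`, `to`, `trigger`) and the `TransitionEvent` type are assumptions for illustration; the RFI does not define a journal schema. The JSON is formatted by hand to stay dependency-free, and a real implementation would more likely use a serialization library.

```rust
use std::io::{self, Write};

/// One lifecycle transition as it might appear in the journal (hypothetical fields).
pub struct TransitionEvent<'a> {
    pub ts_ms: u64,
    pub from: &'a str,
    pub to: &'a str,
    pub trigger: &'a str,
}

/// Escapes the characters that must not appear raw inside a JSON string.
fn json_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            '\r' => out.push_str("\\r"),
            '\t' => out.push_str("\\t"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

/// Appends one event as a single JSON line to any writer.
pub fn append_event<W: Write>(w: &mut W, ev: &TransitionEvent<'_>) -> io::Result<()> {
    writeln!(
        w,
        "{{\"ts_ms\":{},\"from\":\"{}\",\"to\":\"{}\",\"trigger\":\"{}\"}}",
        ev.ts_ms,
        json_escape(ev.from),
        json_escape(ev.to),
        json_escape(ev.trigger)
    )
}
```

Writing against a generic `Write` keeps the function testable with an in-memory buffer; for the on-disk journal the writer could be a file opened with `std::fs::OpenOptions::new().create(true).append(true)`.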

Integration Points

src/daemon/mod.rs
- wrap supervisor start/failure/backoff/shutdown with explicit lifecycle transitions.
src/health/mod.rs
- expose lifecycle state in health snapshot without replacing component-level health detail.
src/main.rs (status, restart, e-stop surfaces)
- render lifecycle state and transition reason in CLI output.
src/channels/mod.rs and channel workers
- gate message intake when lifecycle is Suspended, Terminating, Crashed, or Terminated.
src/agent/session.rs
- keep session history semantics unchanged; add optional link from session to runtime lifecycle id.
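
The intake gating rule for channel workers can be expressed as a single predicate on the lifecycle state. The sketch below follows the list given for src/channels/mod.rs literally; `accepts_intake` is a hypothetical method name, and whether `Created`, `Starting`, or `Backoff` should also gate intake is not specified by this RFI and is left open here.

```rust
/// Lifecycle state as proposed in the State representation section.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum AgentLifecycleState {
    Created,
    Starting,
    Running,
    Degraded { cause: String },
    Suspended { reason: String },
    Backoff { retry_in_ms: u64, attempt: u32 },
    Terminating,
    Terminated,
    Crashed { reason: String },
}

impl AgentLifecycleState {
    /// Returns false for the states in which channel workers should stop
    /// accepting new messages: Suspended, Terminating, Crashed, Terminated.
    /// All other states are treated as accepting (an assumption of this sketch).
    pub fn accepts_intake(&self) -> bool {
        !matches!(
            self,
            Self::Suspended { .. } | Self::Terminating | Self::Terminated | Self::Crashed { .. }
        )
    }
}
```

A channel worker would consult this predicate before dequeuing a message, so the gating policy lives next to the state type rather than in each channel.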

Migration Plan

Phase 1: Non-breaking state plumbing

Add lifecycle enum/registry and default transitions in daemon startup/shutdown.
Include lifecycle state in health JSON output.
Keep existing component health fields unchanged.

Phase 2: Supervisor transition wiring

Convert supervisor restart/error signals into lifecycle transitions.
Add backoff metadata (attempt, retry_in_ms) to lifecycle snapshots.
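
As an illustration of how the `attempt` and `retry_in_ms` metadata might be derived, the sketch below computes a capped exponential delay and signals when the retry budget is exhausted. The doubling schedule, the `RetryPolicy` fields, and the function name are all assumptions; this RFI does not specify the backoff formula, and the existing supervisor may use a different one.

```rust
/// Hypothetical retry policy; field names and values are illustrative only.
pub struct RetryPolicy {
    pub base_ms: u64,
    pub max_ms: u64,
    pub max_attempts: u32,
}

/// Returns the delay before retry number `attempt` (1-based), or `None`
/// when the retry budget is exhausted and the caller should move to Crashed.
/// Delay is base_ms * 2^(attempt - 1), capped at max_ms.
pub fn backoff_delay_ms(attempt: u32, policy: &RetryPolicy) -> Option<u64> {
    if attempt > policy.max_attempts {
        return None;
    }
    // Clamp the shift so the left shift on u64 cannot overflow.
    let shift = attempt.saturating_sub(1).min(63);
    let factor = 1u64 << shift;
    Some(policy.base_ms.saturating_mul(factor).min(policy.max_ms))
}
```

A `Some(delay)` result would populate `Backoff { retry_in_ms: delay, attempt }`, while `None` corresponds to the retry-exhausted row of the transition table.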

Phase 3: Intake gating + operator controls

Enforce channel/gateway intake gating by lifecycle state.
Surface lifecycle controls and richer status output in CLI.

Phase 4: Persistence + event journal

Persist snapshot and optional JSONL transition events.
Add recovery behavior for daemon restart from persisted snapshot.

Verification and Testing Plan

Unit tests

transition guard tests for all valid/invalid state pairs.
lifecycle-to-health serialization tests.
persistence round-trip tests for snapshot and event journal.

Integration tests

daemon startup failure -> backoff -> recovery path.
repeated failure -> Crashed transition.
suspend/resume behavior for channel intake and scheduler activity.

Chaos/failure tests

component panic/exit simulation under supervisor.
rapid restart storm protection and state consistency checks.

Risks and Mitigations

| Risk | Impact | Mitigation |
| --- | --- | --- |
| Overlap between health and lifecycle semantics | Operator confusion | Keep both domains explicit and documented |
| Invalid transition bugs during rollout | Runtime inconsistency | Central transition API with guard checks |
| Excessive persistence I/O | Throughput impact | Snapshot throttling and async event writes |
| Channel behavior regressions on suspend | Message loss | Add intake gating tests and dry-run mode |

Implementation Readiness Checklist

State diagram and transition table documented.
State representation and synchronization approach selected.
Persistence strategy documented.
Integration points and migration plan documented.
