Optio drives task and pod state through a Kubernetes-style reconciliation loop. A worker observes the world, runs a pure decision function, and applies a single typed action — gated by compare-and-swap so concurrent observers cannot trample each other.
This guide covers what the loop does, the configuration surface, and how to debug a decision.
Before v0.2.0, state transitions lived inside the workers that produced them: task-worker advanced provisioning and running, pr-watcher-worker advanced PR state, workflow-worker advanced standalone runs, repo-cleanup-worker failed crashed pods. Each was correct in isolation but together they had three structural problems:
- Lost events left rows stuck. A dropped websocket message or a crashed subscriber could strand a task in `provisioning` forever.
- No single decision point. Two workers could attempt overlapping transitions on the same row.
- Hard to test. Decisions were tangled with I/O, so most logic was exercised only through integration tests.
The reconciler centralizes the decision step. Workers still produce events and execute side effects (spawning pods, running agents, polling GitHub), but PR-driven transitions, auto-merge, auto-resume, review launch, stall and pod-death detection, and control-intent handling all flow through the reconciler.
Event source (state change, webhook, periodic resync)
        │
        ▼
reconcile-queue (BullMQ, dedup by `${kind}__${id}`)
        │
        ▼
reconcile-worker pops { kind: "repo" | "standalone", id }
        │
        ▼
buildWorldSnapshot(ref)
  - read run row, pod state, PR status, deps, capacity, heartbeat
  - run reads in parallel, record per-source errors
        │
        ▼
reconcileRepo(snapshot)        ← pure function, no I/O
reconcileStandalone(snapshot)  ← pure function, no I/O
        │
        ▼  (Action union)
executeAction(action, snapshot)
  - state mutations: CAS-update row, then delegate to
    taskService.transitionTask() for fan-out
  - side effects: enqueue agent jobs, call platform.merge,
    launch review subtask, etc.
        │
        ▼
Telemetry: reconcile.decision { kind, id, action, reason, outcome }
A snapshot is frozen — once read, the decision function only sees that point-in-time view, and the executor's CAS check refuses to write if the row moved underneath it. A failed CAS re-enqueues the job for a fresh pass rather than retrying with stale data.
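The snapshot, decide, and CAS-apply steps above can be sketched with an in-memory map standing in for the database. This is an illustration under stated assumptions, not Optio's implementation: `Row`, the `version` token, `buildSnapshot`, `decide`, `casUpdate`, and `applyAction` are hypothetical names, and the real executor would perform the compare-and-swap against the database row.

```typescript
// Hypothetical in-memory stand-in for a run row; `version` is an assumed CAS token.
type RunState = "queued" | "provisioning" | "running" | "failed";
interface Row { id: string; state: RunState; version: number }
interface Snapshot {
  readonly row: Readonly<Row>;
  readonly podPhase: "pending" | "running" | "error";
}

type Action =
  | { type: "transition"; to: RunState; reason: string }
  | { type: "noop"; reason: string };

type Outcome = "applied" | "stale" | "noop";

const rows = new Map<string, Row>();

// Freeze a copy so the decision function only sees a point-in-time view.
function buildSnapshot(id: string, podPhase: Snapshot["podPhase"]): Snapshot {
  const row = rows.get(id);
  if (!row) throw new Error(`unknown run ${id}`);
  return Object.freeze({ row: Object.freeze({ ...row }), podPhase });
}

// Pure decision: no I/O, the result depends only on the snapshot.
function decide(s: Snapshot): Action {
  if (s.row.state === "running" && s.podPhase === "error") {
    return { type: "transition", to: "failed", reason: "pod-error" };
  }
  return { type: "noop", reason: "in-sync" };
}

// Compare-and-swap: write only if the row is still at the version we read.
function casUpdate(id: string, expectedVersion: number, to: RunState): boolean {
  const current = rows.get(id);
  if (!current || current.version !== expectedVersion) return false;
  rows.set(id, { id, state: to, version: current.version + 1 });
  return true;
}

// "stale" means the row moved underneath the snapshot; the caller should
// re-enqueue and decide again from a fresh snapshot.
function applyAction(action: Action, s: Snapshot): Outcome {
  if (action.type === "noop") return "noop";
  return casUpdate(s.row.id, s.row.version, action.to) ? "applied" : "stale";
}
```

In this sketch a `stale` result never retries the old action. The caller builds a new snapshot and runs `decide` again, which mirrors the re-enqueue behaviour described above.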
Four run kinds, four state machines. The RunKind discriminator is "repo" | "standalone" | "pr-review" | "persistent-agent" (see packages/shared/src/reconcile/types.ts).
Repo Task states: pending, waiting_on_deps, queued, provisioning, running, needs_attention, pr_opened, completed, failed, cancelled.
The repo decision function evaluates capacity, dependency state, pod health, PR status (CI, review, merge), heartbeat staleness, and the per-task auto-merge / auto-review settings. Actions it can return:
| Action | When |
|---|---|
| `transition` | A state change is justified by the snapshot |
| `launchReview` | PR is open, CI passed, review agent is enabled and not yet running |
| `autoMergePr` | PR is approved, CI green, auto-merge enabled |
| `resumeAgent` | Reviewer requested changes, auto-resume enabled |
| `requeueForAgent` | Task is queued but capacity is now available |
| `patchStatus` | Non-state metadata needs to be written (e.g., heartbeat, attempts) |
| `deferWithBackoff` | A required read failed; defer and retry with exponential backoff |
| `noop` | The world matches the desired state |
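The PR-related rows of the table can be read as a small pure function over the snapshot. The sketch below is an illustration only: `PrView`, `RepoSnapshotSketch`, `RepoActionSketch`, and `decidePrAction` are hypothetical names, the field shapes are assumptions (the real `WorldSnapshot` and `Action` types live in `packages/shared/src/reconcile/types.ts`), and the rule ordering and the "no review yet" condition are guesses made for the example.

```typescript
// Assumed snapshot fields, chosen for illustration only.
interface PrView {
  open: boolean;
  ciPassed: boolean;
  review: "none" | "approved" | "changes_requested";
}

interface RepoSnapshotSketch {
  state: "running" | "pr_opened";
  pr: PrView | null;
  settings: { autoMerge: boolean; autoResume: boolean; reviewEnabled: boolean };
  reviewRunning: boolean;
}

// A subset of the actions in the table, modelled as a discriminated union.
type RepoActionSketch =
  | { type: "launchReview" }
  | { type: "autoMergePr" }
  | { type: "resumeAgent" }
  | { type: "noop" };

// Pure function: the same snapshot always yields the same action.
function decidePrAction(s: RepoSnapshotSketch): RepoActionSketch {
  const pr = s.pr;
  if (s.state !== "pr_opened" || pr === null || !pr.open) return { type: "noop" };

  // PR is approved, CI green, auto-merge enabled.
  if (pr.review === "approved" && pr.ciPassed && s.settings.autoMerge) {
    return { type: "autoMergePr" };
  }
  // Reviewer requested changes, auto-resume enabled.
  if (pr.review === "changes_requested" && s.settings.autoResume) {
    return { type: "resumeAgent" };
  }
  // PR is open, CI passed, review agent enabled and not yet running.
  // Requiring review === "none" is an assumption of this sketch.
  if (pr.review === "none" && pr.ciPassed && s.settings.reviewEnabled && !s.reviewRunning) {
    return { type: "launchReview" };
  }
  return { type: "noop" };
}
```

Because the function takes plain data and returns plain data, each row of the table can be checked with a literal snapshot object and no mocks.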
Standalone run states: queued, running, completed, failed. This machine is simpler: the decision function reads pod state and decides whether to enqueue the agent, transition, or back off.
PR review runs cover external PR reviews: code-review subtasks for PRs that aren't tied to a Repo Task. The decision function tracks pod state and review-agent completion, and re-runs on push events when configured. This kind was lifted out of the Repo Task pipeline so external PRs can be reviewed without going through the worktree flow.
The Persistent Agent state machine is cyclic rather than terminal: agents return to idle after each successful turn instead of moving to a terminal state.
States: idle, queued, provisioning, running, paused, failed, archived.
idle ── pending msg / intent ──▶ queued ──▶ provisioning ──▶ running
  ▲                                                              │
  └────────────────── turn halted (success) ─────────────────────┘
The Persistent Agent decision function (reconcile-persistent-agent.ts) considers: pending inbox messages, control intent (pause, resume, restart, archive), pod lifecycle mode (always-on / sticky / on-demand), pod warm-window (keep_warm_until), consecutive_failures against consecutive_failure_limit, and reconcile backoff. Actions it can return include enqueueTurn, provisionPod, markIdle, pausePod, archive, failPermanently, plus the standard patchStatus / deferWithBackoff / noop.
paused and failed require a manual resume control intent before the agent will act again. archived is terminal — the row is kept for history but no further turns are possible.
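The cyclic behaviour and the manual-resume rule above can be sketched as two small pure helpers. This is an illustration under assumptions, not the logic in `reconcile-persistent-agent.ts`: `AgentSketch`, `onTurnEnd`, and `canAct` are hypothetical names, and the sketch assumes that a failed turn below the limit returns the agent to idle and that reaching the limit (`>=`) moves it to failed.

```typescript
type AgentState =
  | "idle" | "queued" | "provisioning" | "running"
  | "paused" | "failed" | "archived";

// Assumed shape; field names mirror consecutive_failures / consecutive_failure_limit.
interface AgentSketch {
  state: AgentState;
  consecutiveFailures: number;
  consecutiveFailureLimit: number;
}

// A finished turn closes the cycle: success returns to idle and resets the counter.
function onTurnEnd(agent: AgentSketch, succeeded: boolean): AgentSketch {
  if (agent.state !== "running") return agent;
  if (succeeded) return { ...agent, state: "idle", consecutiveFailures: 0 };

  const failures = agent.consecutiveFailures + 1;
  // Assumption: hitting the limit fails the agent; below it, the agent idles.
  const state: AgentState = failures >= agent.consecutiveFailureLimit ? "failed" : "idle";
  return { ...agent, state, consecutiveFailures: failures };
}

// paused and failed need a manual resume intent; archived never acts again.
function canAct(agent: AgentSketch, intent: "resume" | null): boolean {
  if (agent.state === "archived") return false;
  if (agent.state === "paused" || agent.state === "failed") return intent === "resume";
  return true;
}
```

Keeping both helpers free of I/O follows the same design as the other decision functions: the inputs are the agent row and the control intent, and the output is a new value rather than a mutation.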
Several things enqueue a reconcile job:
- State-change events. `taskService.transitionTask`, `workflow-worker`'s `transitionRun` helper, and the persistent-agent worker's per-turn finalizer all fire `enqueueReconcile` after every successful transition. Anywhere the codebase changes a run's state, the reconciler is woken within milliseconds.
- PR poll updates. The `pr-watcher-worker` polls open PRs every 30s, writes refreshed `prState` / `prChecksStatus` / `prReviewStatus` / `prReviewComments` to the row, then enqueues a reconcile so the decision function sees the new PR data.
- Pod-health events. When `repo-cleanup-worker` detects a crashed or OOM-killed pod, it marks worktrees dirty and enqueues a reconcile for each affected task. The reconciler then observes `pod.phase=error` in the snapshot and fires the FAILED transition.
- Persistent Agent wakes. `wakeAgent()` (called from the inbox API, the inter-agent HTTP API, the workflow-trigger worker on `target_type='persistent_agent'`, and the cleanup worker on warm-window expiry) enqueues a reconcile so the decision function picks up new messages or intents.
- Control intents. UI/API actions that set `control_intent` (`cancel`, `retry`, `resume`, `restart`, `pause`, `archive`) enqueue a reconcile so the decision function applies the intent.
- Periodic resync. Every 5 minutes (configurable) the resync worker scans non-terminal runs across all four tables and enqueues each one. This is the safety net for any signal that gets lost.
The queue dedups by `${kind}__${id}`, so several producers enqueueing the same run collapse into one pending job and do not amplify load.
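The dedup behaviour can be modelled with a keyed in-memory queue. This sketch only illustrates the `${kind}__${id}` key; it does not use BullMQ, and `DedupQueue`, `enqueue`, and `pop` are hypothetical names rather than the helpers in `reconcile-queue.ts`.

```typescript
type RunKind = "repo" | "standalone" | "pr-review" | "persistent-agent";
interface RunRef { kind: RunKind; id: string }

// In-memory model of a queue that holds at most one pending job per run.
class DedupQueue {
  // Map preserves insertion order, so pop() is first-in, first-out.
  private pending = new Map<string, RunRef>();

  static key(ref: RunRef): string {
    return `${ref.kind}__${ref.id}`;
  }

  // Returns false when a job for the same run is already pending.
  enqueue(ref: RunRef): boolean {
    const k = DedupQueue.key(ref);
    if (this.pending.has(k)) return false;
    this.pending.set(k, ref);
    return true;
  }

  // Removes and returns the oldest pending job, if any.
  pop(): RunRef | undefined {
    const first = this.pending.entries().next();
    if (first.done) return undefined;
    const [k, ref] = first.value;
    this.pending.delete(k);
    return ref;
  }

  get size(): number {
    return this.pending.size;
  }
}
```

Once a job has been popped, the same run can be enqueued again, so a signal that arrives during processing still triggers a later pass in this model.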
| Env var | Default | Purpose |
|---|---|---|
| `OPTIO_RECONCILE_CONCURRENCY` | `4` | Parallel reconcile jobs |
| `OPTIO_RECONCILE_LOCK_MS` | `30000` | BullMQ job lock; hard kill for runaway jobs |
| `OPTIO_RECONCILE_RESYNC_INTERVAL` | `300000` (5 min) | Full sweep cadence for non-terminal runs |
| `OPTIO_STALL_THRESHOLD_MS` | `900000` (15 min) | Heartbeat staleness threshold the decision function uses |
| `OPTIO_MAX_AUTO_RESUMES` | `10` | Cap on `auto_resume_*` events between manual actions |
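A loader for these variables might look like the sketch below. The variable names and defaults come from the table; `ReconcileConfig`, `readInt`, and `loadReconcileConfig` are hypothetical, and rejecting malformed values with an error is an assumption about validation, not documented behaviour.

```typescript
type Env = Record<string, string | undefined>;

interface ReconcileConfig {
  concurrency: number;
  lockMs: number;
  resyncIntervalMs: number;
  stallThresholdMs: number;
  maxAutoResumes: number;
}

// Reads an integer env var, falling back to the documented default when unset.
function readInt(env: Env, name: string, fallback: number, min = 1): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const n = Number(raw);
  if (!Number.isInteger(n) || n < min) {
    throw new Error(`${name} must be an integer >= ${min}, got "${raw}"`);
  }
  return n;
}

function loadReconcileConfig(env: Env): ReconcileConfig {
  return {
    concurrency: readInt(env, "OPTIO_RECONCILE_CONCURRENCY", 4),
    lockMs: readInt(env, "OPTIO_RECONCILE_LOCK_MS", 30000),
    resyncIntervalMs: readInt(env, "OPTIO_RECONCILE_RESYNC_INTERVAL", 300000),
    stallThresholdMs: readInt(env, "OPTIO_STALL_THRESHOLD_MS", 900000),
    // Assumption: 0 is allowed here and would disable auto-resume.
    maxAutoResumes: readInt(env, "OPTIO_MAX_AUTO_RESUMES", 10, 0),
  };
}
```

In a Node process this would typically be called as `loadReconcileConfig(process.env)`; taking the environment as a parameter keeps the loader easy to test.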
The tasks, workflow_runs, pr_reviews, and persistent_agents tables each carry the same three reconcile columns (tasks and workflow_runs got them via migration 1776686400_reconcile_columns.sql; persistent_agents includes them in its own migration):
| Column | Type | Purpose |
|---|---|---|
| `control_intent` | `text` | Operator-set intent: `cancel`, `retry`, `resume`, `restart`; Persistent Agents also use `pause` and `archive` |
| `reconcile_backoff_until` | `timestamptz` | Defer further reconciliation until this time |
| `reconcile_attempts` | `integer` | Backoff exponent counter |

All three are nullable or default to zero, so no backfill was required.
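The two backoff columns work together: `reconcile_attempts` acts as the exponent and `reconcile_backoff_until` stores the resulting deadline. The sketch below shows one way that arithmetic could look. `nextBackoff` and `isBackedOff` are hypothetical helpers, and the 1 second base and 5 minute cap are assumed values chosen for the example, not Optio's settings.

```typescript
interface BackoffPatch {
  reconcile_attempts: number;
  reconcile_backoff_until: Date;
}

// Exponential backoff: delay = min(cap, base * 2^attempts).
// baseMs and capMs are illustrative defaults, not documented values.
function nextBackoff(
  attempts: number,
  now: Date,
  baseMs = 1000,
  capMs = 300000,
): BackoffPatch {
  const delayMs = Math.min(capMs, baseMs * Math.pow(2, attempts));
  return {
    reconcile_attempts: attempts + 1,
    reconcile_backoff_until: new Date(now.getTime() + delayMs),
  };
}

// A null deadline means the row is not deferred.
function isBackedOff(backoffUntil: Date | null, now: Date): boolean {
  return backoffUntil !== null && backoffUntil.getTime() > now.getTime();
}
```

With these assumed values the delays run 1 s, 2 s, 4 s, 8 s and so on, and stop growing once they reach the cap.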
| File | Role |
|---|---|
| `packages/shared/src/reconcile/types.ts` | `RunKind`, `RunRef`, `WorldSnapshot`, `Action` unions |
| `packages/shared/src/reconcile/reconcile-repo.ts` | Pure decision logic for repo runs |
| `packages/shared/src/reconcile/reconcile-standalone.ts` | Pure decision logic for standalone runs |
| `packages/shared/src/reconcile/reconcile-pr-review.ts` | Pure decision logic for external PR reviews |
| `packages/shared/src/reconcile/reconcile-persistent-agent.ts` | Pure decision logic for Persistent Agents |
| `apps/api/src/workers/reconcile-worker.ts` | BullMQ consumer + resync worker |
| `apps/api/src/services/reconcile-snapshot.ts` | Builds the frozen world view |
| `apps/api/src/services/reconcile-executor.ts` | CAS-gated mutations; delegates to `taskService` |
| `apps/api/src/services/reconcile-queue.ts` | Queue setup and dedup-aware enqueue helpers |
| `apps/api/src/db/migrations/1776686400_reconcile_columns.sql` | Schema columns for `tasks` + `workflow_runs` |
Decision logic is exhaustively tested in packages/shared/src/reconcile/*.test.ts. Because the decision functions are pure, every state machine edge case is covered without mocking I/O.
Each decision emits a reconcile.decision log line:
{ kind: "repo", id: "...", action: "transition", from: "running",
to: "pr_opened", reason: "pr-detected", outcome: "applied" }
outcome is one of applied, shadow, stale (CAS failed, re-enqueued), deferred (backoff), or error. To understand why the reconciler chose an action, find the matching reconcile.snapshot log immediately preceding it — it contains the inputs the decision function saw.
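When several reconcile jobs run in parallel, log lines from different runs can interleave, so it may help to match a decision to its snapshot by `kind` and `id` rather than by position alone. The helper below is a sketch over already-parsed log entries. `LogEntry`, its `event` field, and `snapshotFor` are assumptions made for illustration; the actual log schema may differ.

```typescript
// Assumed parsed-log shape; only `event`, `kind`, and `id` are relied upon.
interface LogEntry {
  event: string;
  kind: string;
  id: string;
  [field: string]: unknown;
}

// Walks backwards from a decision to the nearest snapshot for the same run.
function snapshotFor(logs: LogEntry[], decisionIndex: number): LogEntry | undefined {
  const decision = logs[decisionIndex];
  if (!decision || decision.event !== "reconcile.decision") return undefined;

  for (let i = decisionIndex - 1; i >= 0; i--) {
    const entry = logs[i];
    if (
      entry &&
      entry.event === "reconcile.snapshot" &&
      entry.kind === decision.kind &&
      entry.id === decision.id
    ) {
      return entry;
    }
  }
  return undefined;
}
```

The returned entry holds the inputs the decision function saw for that pass, which is usually enough to replay the decision locally against the pure function.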