Supervision & Message Passing

This document describes Opal's OTP supervision architecture, process lifecycle, message passing patterns, and the design rationale behind each decision.


Supervision Tree

Opal uses a per-session supervision tree so that every active session is a fully isolated unit: its own processes, its own failure domain, its own cleanup.

graph TD
    OpalSup["Opal.Supervisor<br/><i>:rest_for_one</i>"]
    AgentRegistry["Opal.Registry<br/><i>:unique - agent/session lookup</i>"]
    Registry["Opal.Events.Registry<br/><i>:duplicate - pubsub backbone</i>"]
    SessionSup["Opal.SessionSupervisor<br/><i>DynamicSupervisor :one_for_one</i>"]
    Stdio["Opal.RPC.Server<br/><i>optional, gated by :start_rpc</i>"]

    OpalSup --> AgentRegistry
    OpalSup --> Registry
    OpalSup --> SessionSup
    OpalSup -.-> Stdio

    subgraph session_a["Session &quot;a1b2c3&quot;"]
        SSA["SessionServer<br/><i>Supervisor :rest_for_one</i>"]
        TSA["Task.Supervisor<br/><i>tool execution</i>"]
        DSA["DynamicSupervisor<br/><i>sub-agents</i>"]
        SesA["Opal.Session<br/><i>persistence (optional)</i>"]
        AgentA["Opal.Agent<br/><i>agent loop</i>"]
        SSA --> TSA
        SSA --> DSA
        SSA --> SesA
        SSA --> AgentA
    end

    subgraph session_b["Session &quot;d4e5f6&quot;"]
        SSB["SessionServer<br/><i>Supervisor :rest_for_one</i>"]
        TSB["Task.Supervisor"]
        DSB["DynamicSupervisor"]
        AgentB["Opal.Agent"]
        SSB --> TSB
        SSB --> DSB
        SSB --> AgentB
    end

    SessionSup --> SSA
    SessionSup --> SSB

When sub-agents are active

Sub-agents are regular Opal.Agent processes started under the session's own DynamicSupervisor. They share the parent's Task.Supervisor for tool execution but cannot spawn further sub-agents (depth = 1).

graph TD
    SS["SessionServer &quot;a1b2c3&quot;<br/><i>:rest_for_one</i>"]
    TS["Task.Supervisor<br/><i>shared by parent + sub-agents</i>"]
    DS["DynamicSupervisor<br/><i>owns sub-agent processes</i>"]
    Ses["Opal.Session<br/><i>conversation tree (optional)</i>"]
    PA["Opal.Agent<br/><i>parent agent loop</i>"]
    SubA["Opal.Agent &quot;sub-x1y2z3&quot;<br/><i>sub-agent A</i>"]
    SubB["Opal.Agent &quot;sub-p4q5r6&quot;<br/><i>sub-agent B</i>"]

    SS --> TS
    SS --> DS
    SS --> Ses
    SS --> PA
    DS --> SubA
    DS --> SubB

Process Roles

Opal.Registry

A Registry with :unique keys used for looking up agent and session processes by session ID. Processes register with keys like {:agent, session_id} and {:session, session_id}, enabling direct lookup without walking supervisor children.
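A minimal sketch of this lookup pattern using only OTP's Registry (the registry name and key shapes follow the text above; this is not Opal's actual code):

```elixir
# Start a :unique-key registry under the doc's name.
{:ok, _} = Registry.start_link(keys: :unique, name: Opal.Registry)

# A process registers itself under a composite key...
session_id = "a1b2c3"
{:ok, _} = Registry.register(Opal.Registry, {:agent, session_id}, nil)

# ...and any other process can find it without walking supervisor children:
[{agent_pid, _value}] = Registry.lookup(Opal.Registry, {:agent, session_id})
true = agent_pid == self()
```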

Opal.Events.Registry

A Registry with :duplicate keys. Any process can subscribe to a session ID and receive events. This is the pubsub backbone; everything else is per-session. The registry never holds state; it simply routes messages.
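The subscribe/broadcast mechanics can be sketched with a :duplicate-key Registry and Registry.dispatch/3 (names follow the text above; a stand-in for the real implementation):

```elixir
# :duplicate keys allow many subscribers per session ID.
{:ok, _} = Registry.start_link(keys: :duplicate, name: Opal.Events.Registry)

# subscribe/1 boils down to registering the calling process under the ID...
session_id = "a1b2c3"
{:ok, _} = Registry.register(Opal.Events.Registry, session_id, nil)

# ...and broadcast/2 fans a plain term out to every subscriber via send/2:
Registry.dispatch(Opal.Events.Registry, session_id, fn subscribers ->
  for {pid, _value} <- subscribers do
    send(pid, {:opal_event, session_id, {:agent_start}})
  end
end)

# Subscribers receive the event as an ordinary message.
receive do
  {:opal_event, ^session_id, {:agent_start}} -> :ok
end
```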

Opal.RPC.Server (optional)

The stdio JSON-RPC transport. Started by default but can be disabled by setting config :opal, start_rpc: false, which is useful when embedding the core library as an SDK without stdio transport. See rpc.md for details.

Opal.SessionSupervisor

A DynamicSupervisor that acts as the container for all active sessions. When Opal.start_session/1 is called, a new SessionServer child is started here. When Opal.stop_session/1 is called, the entire SessionServer subtree is terminated: one call cleans up the agent, all running tools, all sub-agents, and the session store.
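The start/stop flow can be sketched against a plain DynamicSupervisor; SessionWorker below is a stand-in for the real SessionServer subtree:

```elixir
defmodule SessionWorker do
  use Agent  # Elixir's Agent (a trivial supervised child), not Opal.Agent

  def start_link(session_id), do: Agent.start_link(fn -> session_id end)
end

{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

# start_session/1 boils down to adding one child per session...
{:ok, pid} = DynamicSupervisor.start_child(sup, {SessionWorker, "a1b2c3"})

# ...and stop_session/1 to a single terminate_child/2 that tears the
# whole subtree down, tasks and all.
:ok = DynamicSupervisor.terminate_child(sup, pid)
false = Process.alive?(pid)
```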

Opal.SessionServer

A per-session Supervisor using the :rest_for_one strategy. Children are started in order:

  1. Task.Supervisor - executes tool calls as supervised tasks
  2. DynamicSupervisor - manages sub-agent processes
  3. Opal.Session - conversation persistence (optional, started when session: true)
  4. Opal.Agent - the agent loop

The :rest_for_one strategy means that if the Task.Supervisor or DynamicSupervisor crashes, the Agent (which depends on them) is restarted too, but a crash in the Agent does not take down the siblings started before it.

Each child is registered via Opal.Registry for discoverability:

| Process | Registry key |
| --- | --- |
| Task.Supervisor | {:tool_sup, session_id} |
| DynamicSupervisor | {:sub_agent_sup, session_id} |
| Session | {:session, session_id} |
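The shape of this setup, sketched with stand-in names (MyRegistry here; the real code uses Opal.Registry) and only the first two children:

```elixir
{:ok, _} = Registry.start_link(keys: :unique, name: MyRegistry)
session_id = "a1b2c3"

children = [
  # 1. tool execution tasks
  {Task.Supervisor,
   name: {:via, Registry, {MyRegistry, {:tool_sup, session_id}}}},
  # 2. sub-agent processes
  {DynamicSupervisor,
   strategy: :one_for_one,
   name: {:via, Registry, {MyRegistry, {:sub_agent_sup, session_id}}}}
  # 3. Opal.Session and 4. Opal.Agent would follow here
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :rest_for_one)

# Children are now discoverable by session-scoped key:
[{tool_sup, _}] = Registry.lookup(MyRegistry, {:tool_sup, session_id})
true = is_pid(tool_sup)
```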

Opal.Agent

A :gen_statem process that implements the core agent loop:

  1. Receive a user prompt (:call)
  2. Stream an LLM response via the configured Provider
  3. If the LLM returns tool calls → execute them via supervised tasks → loop to step 2
  4. If the LLM returns text only → broadcast agent_end → go idle

The Agent holds references to its session-local tool_supervisor and sub_agent_supervisor in its state; it never touches global process names.
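A compact, self-contained :gen_statem skeleton of that loop (illustrative only, not Opal's code; the LLM stream is simulated with a message to self):

```elixir
defmodule LoopAgent do
  @behaviour :gen_statem

  def start_link, do: :gen_statem.start_link(__MODULE__, [], [])

  @impl true
  def callback_mode, do: :handle_event_function

  @impl true
  def init([]), do: {:ok, :idle, %{}}

  @impl true
  # 1. a prompt arrives as a synchronous call while idle...
  def handle_event({:call, from}, {:prompt, _text}, :idle, data) do
    # 2. start streaming the LLM response (simulated here)
    send(self(), :stream_done)
    {:next_state, :streaming, data, [{:reply, from, %{queued: false}}]}
  end

  # 3./4. stream finished with text only -> go idle
  def handle_event(:info, :stream_done, :streaming, data) do
    {:next_state, :idle, data}
  end

  def handle_event({:call, from}, :get_state, state, data) do
    {:keep_state_and_data, [{:reply, from, {state, data}}]}
  end
end

{:ok, pid} = LoopAgent.start_link()
%{queued: false} = :gen_statem.call(pid, {:prompt, "hello"})
```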

Opal.Session

A GenServer backed by an ETS table that stores conversation messages in a tree structure (each message has a parent_id). Supports branching โ€” rewinding to any past message and forking the conversation. Persistence is via DETS.
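The tree structure can be sketched with a bare ETS table; rows here are {id, parent_id, content} (a stand-in schema, not Opal's actual layout):

```elixir
table = :ets.new(:messages, [:set, :public])

# Each message points at its parent; a nil parent_id marks the root.
:ets.insert(table, {1, nil, "user: hello"})
:ets.insert(table, {2, 1, "assistant: hi!"})
:ets.insert(table, {3, 1, "assistant: hello there"})  # a fork from message 1

# get_path-style walk: follow parent_id links from a leaf back to the root.
path = fn path, id, recur ->
  case :ets.lookup(table, id) do
    [{^id, nil, content}] -> [content | path]
    [{^id, parent, content}] -> recur.([content | path], parent, recur)
  end
end

# Two leaves, two branches of the same conversation:
["user: hello", "assistant: hello there"] = path.([], 3, path)
```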


Message Passing

Opal uses three distinct message passing patterns, each chosen for a specific purpose.

1. Synchronous Calls - State Access

sequenceDiagram
    participant Caller
    participant Agent

    Caller->>Agent: Agent.get_state/1 (:gen_statem.call)
    Agent-->>Caller: %Agent.State{}

Used for: Agent.get_state/1, Session.append/2, Session.get_path/1

These are synchronous server calls: the caller blocks until the server replies. Used when the caller needs a consistent snapshot of state.

Key design decision: Tool tasks never call Agent.get_state(agent_pid) during execution. Instead, the Agent snapshots its state into the tool execution context before dispatching tasks:

flowchart LR
    Agent["Agent<br/><i>(:executing_tools)</i>"] -- "snapshot state" --> Context["context =<br/>%{agent_state: ...}"]
    Context -- "start tasks" --> Tasks["Tool Tasks<br/><i>read from context</i>"]

2. Command Calls & Casts

sequenceDiagram
    participant Caller
    participant Agent

    Caller->>Agent: Agent.prompt/2 (:gen_statem.call)
    Agent-->>Caller: %{queued: false}
    Note right of Agent: begins turn...

Used for: Agent.prompt/2 (call), Agent.abort/1 (cast)

Prompts are synchronous calls that return %{queued: boolean}. The caller observes progress through events (pattern 3). This keeps the UI responsive, which is critical for interactive CLI and web UIs.

3. Registry PubSub - Event Broadcasting

flowchart LR
    Agent["Agent broadcasts<br/>{:message_delta, ...}"] --> Registry["Opal.Events.Registry<br/><i>session_id key<br/>duplicate keys</i>"]
    Registry --> SubA["Subscriber A<br/><i>CLI</i>"]
    Registry --> SubB["Subscriber B<br/><i>UI</i>"]
    Registry --> SubC["Subscriber C<br/><i>test</i>"]

Used for: all agent lifecycle events

The Agent (and tool tasks) call Opal.Events.broadcast(session_id, event). Every process that called Opal.Events.subscribe(session_id) receives the event as a regular Erlang message:

{:opal_event, session_id, event}

Event types:

| Event | Emitted when |
| --- | --- |
| {:agent_start} | Agent begins processing a prompt |
| {:message_delta, %{delta: text}} | Streaming text token from the LLM |
| {:thinking_delta, %{delta: text}} | Streaming thinking/reasoning token |
| {:turn_end, message, _results} | LLM turn complete |
| {:tool_execution_start, name, call_id, args, meta} | Tool begins executing |
| {:tool_execution_end, name, call_id, result} | Tool finished executing |
| {:agent_end, messages, usage} | Agent is done, returning to idle |
| {:error, reason} | Unrecoverable error occurred |
| {:sub_agent_event, parent_call_id, sub_id, event} | Forwarded event from a sub-agent |

This is built on OTP's Registry: no external dependencies, no message broker, no serialization overhead. Events are plain Erlang terms sent via send/2 under the hood.
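A subscriber is therefore just a process with a receive loop. The sketch below collects streamed deltas into a string, with the events sent by hand to stand in for real broadcasts:

```elixir
session_id = "a1b2c3"

# Simulate a broadcast stream: two text deltas, then agent_end.
for delta <- ["Hel", "lo"] do
  send(self(), {:opal_event, session_id, {:message_delta, %{delta: delta}}})
end
send(self(), {:opal_event, session_id, {:agent_end, [], %{}}})

# Accumulate deltas until the agent goes idle.
collect = fn acc, recur ->
  receive do
    {:opal_event, ^session_id, {:message_delta, %{delta: d}}} ->
      recur.(acc <> d, recur)

    {:opal_event, ^session_id, {:agent_end, _messages, _usage}} ->
      acc
  end
end

"Hello" = collect.("", collect)
```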

4. Sub-Agent Event Forwarding

Sub-agents broadcast events to their own session ID. The SubAgent tool subscribes to those events, collects the response, and re-broadcasts each event to the parent session tagged with the parent tool call ID and sub-agent ID:

sequenceDiagram
    participant SubAgent as Sub-Agent "sub-x1"
    participant Registry as Events Registry
    participant ToolTask as SubAgent Tool Task
    participant ParentReg as Parent Events Registry
    participant CLI as CLI Subscriber

    SubAgent->>Registry: broadcast(sub_id, event)
    Registry->>ToolTask: {:opal_event, "sub-x1", event}
    Note over ToolTask: re-broadcasts to parent
    ToolTask->>ParentReg: broadcast {:sub_agent_event, "call-42", "sub-x1", event}
    ParentReg->>CLI: {:sub_agent_event, "call-42", "sub-x1", {:message_delta, ...}}

This gives the parent session real-time observability into sub-agent activity without any direct process coupling. The CLI renders sub-agent events with a tree border (┌─ / │ / └─) to visually distinguish them from the parent.
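The forwarding step can be sketched with a single :duplicate-key Registry serving as both sessions' event bus (one process plays sub-agent, forwarder, and parent subscriber here; IDs follow the diagram):

```elixir
{:ok, _} = Registry.start_link(keys: :duplicate, name: Events)

parent_id = "a1b2c3"
sub_id = "sub-x1y2z3"
call_id = "call-42"

# Subscribe this process to both the parent and the sub-agent session.
for id <- [parent_id, sub_id], do: Registry.register(Events, id, nil)

broadcast = fn id, event ->
  Registry.dispatch(Events, id, fn subs ->
    for {pid, _} <- subs, do: send(pid, {:opal_event, id, event})
  end)
end

# A sub-agent event arrives on the sub-agent's own session ID...
broadcast.(sub_id, {:message_delta, %{delta: "hi"}})

receive do
  {:opal_event, ^sub_id, event} ->
    # ...and is re-broadcast to the parent session, tagged with both IDs.
    broadcast.(parent_id, {:sub_agent_event, call_id, sub_id, event})
end

receive do
  {:opal_event, ^parent_id, {:sub_agent_event, ^call_id, ^sub_id, e}} -> e
end
```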


Tool Execution

Tool calls are executed in parallel using Task.Supervisor.async_nolink: all tools in a batch are spawned concurrently, with results delivered via mailbox :info events, keeping the agent process non-blocking:

flowchart LR
    Agent["Agent<br/><i>:executing_tools</i>"] -- "async_nolink(tool) × N" --> TaskSup["Task.Supervisor<br/><i>per-session</i>"]
    TaskSup --> T1["Task: tool 1"]
    TaskSup --> T2["Task: tool 2"]
    TaskSup --> TN["Task: tool N"]
    T1 -- "handle_info {ref, result}" --> Agent
    T2 -- "handle_info {ref, result}" --> Agent
    TN -- "handle_info {ref, result}" --> Agent

Why async_nolink + handle_info?

  • async_stream_nolink would block the server loop until all tools finish, preventing abort/steer during execution.
  • async_nolink leaves tasks unlinked and delivers results as mailbox messages, so the Agent stays responsive to abort, steer, and other messages throughout.
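A self-contained sketch of the async_nolink dispatch/collect pattern (dummy tool functions, not Opal's ToolRunner):

```elixir
{:ok, sup} = Task.Supervisor.start_link()

# Spawn a batch of "tool calls" concurrently; none are linked to us.
tasks =
  for tool <- [:read, :grep, :ls] do
    task = Task.Supervisor.async_nolink(sup, fn -> {tool, :ok} end)
    {task.ref, tool}
  end
  |> Map.new()

# Successful tasks send {ref, result}; a crash delivers only the
# monitor's :DOWN message (handled separately in the real Agent).
collect = fn collect, pending, acc ->
  if map_size(pending) == 0 do
    acc
  else
    receive do
      {ref, result} when is_map_key(pending, ref) ->
        # Drop the monitor (and flush its :DOWN) once we have the result.
        Process.demonitor(ref, [:flush])
        collect.(collect, Map.delete(pending, ref), [result | acc])
    end
  end
end

results = collect.(collect, tasks, [])
```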

Why per-session Task.Supervisor?

  • Isolation: If session A's tool tasks are misbehaving, session B is unaffected.
  • Cleanup: Terminating the SessionServer automatically kills all running tool tasks for that session.
  • Observability: You can inspect Task.Supervisor.children(sup) to see what tools are currently running in a specific session.

Crash Recovery

When a tool task crashes, the Agent receives a {:DOWN, ref, :process, _pid, reason} message via handle_info. It converts this to an error tool result and continues:

def handle_info(
      {:DOWN, ref, :process, _pid, reason},
      %State{status: :executing_tools, pending_tool_tasks: tasks} = state
    ) when is_map_key(tasks, ref) do
  {_task, tc} = Map.fetch!(tasks, ref)
  Opal.Agent.ToolRunner.collect_result(ref, tc, {:error, "Tool crashed: #{inspect(reason)}"}, state)
  |> next()
end

This ensures the LLM always receives a tool_result message with the correct call_id โ€” even if the tool crashed. Without this, the LLM API rejects the request with "tool call must have a tool call ID".


Sub-Agent Architecture

Spawning

Sub-agents are started under the session's DynamicSupervisor:

DynamicSupervisor.start_child(
  state.sub_agent_supervisor,   # per-session, not global
  {Opal.Agent, opts}
)

They inherit the parent's config, provider, and working directory by default. Any of these can be overridden โ€” including the model (e.g., use a cheaper model for simple tasks).

Depth Enforcement

Sub-agents are limited to one level; no recursive spawning. This is enforced by simply excluding the Opal.Tool.SubAgent module from the sub-agent's tool list:

parent_tools = parent_state.tools -- [Opal.Tool.SubAgent]

No runtime depth counter needed. The sub-agent literally does not have the tool available, so the LLM cannot request it. Clean, declarative, zero overhead.

Tool Sharing

Sub-agents share the parent's Task.Supervisor for tool execution. This means:

  • Tool tasks from both the parent and sub-agents run under the same supervisor
  • Terminating the session cleans up all tool tasks (parent + sub-agents)
  • No need for sub-agents to have their own Task.Supervisor

Lifecycle

sequenceDiagram
    participant Parent as Parent Agent
    participant SubAgent as Sub-Agent

    Parent->>SubAgent: spawn_from_state(state, %{})
    SubAgent-->>Parent: {:ok, sub_pid}

    Parent->>SubAgent: Events.subscribe(sub_id)
    Parent->>SubAgent: Agent.prompt(sub_pid, text)

    loop Streaming & Tools
        SubAgent-->>Parent: {:opal_event, sub_id, ...}
        Note over Parent: forward to parent session
    end

    SubAgent-->>Parent: {:opal_event, sub_id, {:agent_end, _}}
    Parent->>SubAgent: SubAgent.stop(sub_pid)
    Note over SubAgent: ✗ terminated
    Note over Parent: return {:ok, result}

Failure Domains & Isolation

Session Isolation

Each session is a self-contained subtree. Failures in one session cannot propagate to another:

| Failure | Impact |
| --- | --- |
| Tool task crashes | Error result to LLM, agent continues |
| Sub-agent crashes | Tool returns error, parent continues |
| Agent state machine crashes | SessionServer restarts it (:rest_for_one) |
| Task.Supervisor crashes | Agent restarts too (:rest_for_one) |
| Entire SessionServer crashes | Only that session is lost |
| Events.Registry crashes | All sessions lose pubsub temporarily |

:rest_for_one Strategy

The SessionServer uses :rest_for_one: if a child crashes, all children started after it are restarted. The child order is:

flowchart TD
    A["1. Task.Supervisor"] -->|"if this crashes..."| B["2. DynamicSupervisor"]
    B -->|"...this restarts..."| C["3. Session (optional)"]
    C -->|"...this restarts..."| D["4. Agent"]

    style A fill:#f9f,stroke:#333
    style D fill:#bbf,stroke:#333

This guarantees the Agent never runs without a working Task.Supervisor. But if the Agent crashes, the supervisors and session store remain intact.

State Snapshot for Tools

Although the Agent is no longer blocked during tool execution (tools use async_nolink + handle_info), tools still receive a state snapshot in their execution context rather than calling back to the agent process. This avoids race conditions and keeps tool execution self-contained:

context = %{
  working_dir: state.working_dir,
  session_id: state.session_id,
  config: state.config,
  agent_pid: self(),          # for reference only, never call into it
  agent_state: state          # snapshot; tools read from this
}

Tools (including SubAgent) use context.agent_state instead of calling back to the Agent. The SubAgent tool uses spawn_from_state/2 (takes a state struct) rather than spawn/2 (takes a pid and calls get_state).


Design Rationale

Why per-session supervision trees?

Before: A single global Task.Supervisor handled all tool execution across all sessions. This had several problems:

  • No isolation between sessions
  • No way to cleanly shut down one session's tasks without affecting others
  • No way to inspect what a specific session is doing
  • Cleanup required manual tracking

After: Each session owns its entire process tree. stop_session/1 is a single DynamicSupervisor.terminate_child/2 call that cleanly shuts down everything.

Why Registry-based pubsub?

  • No external dependencies: built into OTP
  • No serialization: events are plain Erlang terms, delivered via send/2
  • Duplicate keys: multiple subscribers per session ID
  • Process-native: subscribers just use receive, no callback modules
  • Automatic cleanup: when a subscriber process dies, its registrations are removed

Why async_nolink + handle_info for tools?

  • Non-blocking execution: the Agent stays responsive during tool runs
  • Fault isolation: one crashing tool doesn't take down the agent
  • Parallel batches: tool calls in a turn are spawned concurrently
  • Abort support: in-flight tasks can be cancelled without blocking the loop

Why sub-agents share the parent's Task.Supervisor?

  • Simplicity: fewer processes to manage
  • Unified cleanup: one supervisor termination kills everything
  • Resource sharing: sub-agent tool tasks are supervised identically to parent tool tasks
  • No nesting complexity: sub-agents don't need their own SessionServer

Why depth-1 sub-agents only?

  • Predictability: recursive agent spawning can lead to unbounded resource consumption
  • Debuggability: a flat parent→child relationship is easy to observe and reason about
  • Cost control: each sub-agent gets its own LLM conversation, so costs multiply. One level is sufficient for task delegation patterns (e.g., "read these 5 files in parallel") without enabling runaway recursion.