This document describes Opal's OTP supervision architecture, process lifecycle, message passing patterns, and the design rationale behind each decision.
Opal uses a per-session supervision tree so that every active session is a fully isolated unit: its own processes, its own failure domain, its own cleanup.
```mermaid
graph TD
    OpalSup["Opal.Supervisor<br/><i>:rest_for_one</i>"]
    AgentRegistry["Opal.Registry<br/><i>:unique - agent/session lookup</i>"]
    Registry["Opal.Events.Registry<br/><i>:duplicate - pubsub backbone</i>"]
    SessionSup["Opal.SessionSupervisor<br/><i>DynamicSupervisor :one_for_one</i>"]
    Stdio["Opal.RPC.Server<br/><i>optional, gated by :start_rpc</i>"]
    OpalSup --> AgentRegistry
    OpalSup --> Registry
    OpalSup --> SessionSup
    OpalSup -.-> Stdio
    subgraph session_a["Session 'a1b2c3'"]
        SSA["SessionServer<br/><i>Supervisor :rest_for_one</i>"]
        TSA["Task.Supervisor<br/><i>tool execution</i>"]
        DSA["DynamicSupervisor<br/><i>sub-agents</i>"]
        SesA["Opal.Session<br/><i>persistence (optional)</i>"]
        AgentA["Opal.Agent<br/><i>agent loop</i>"]
        SSA --> TSA
        SSA --> DSA
        SSA --> SesA
        SSA --> AgentA
    end
    subgraph session_b["Session 'd4e5f6'"]
        SSB["SessionServer<br/><i>Supervisor :rest_for_one</i>"]
        TSB["Task.Supervisor"]
        DSB["DynamicSupervisor"]
        AgentB["Opal.Agent"]
        SSB --> TSB
        SSB --> DSB
        SSB --> AgentB
    end
    SessionSup --> SSA
    SessionSup --> SSB
```
Sub-agents are regular Opal.Agent processes started under the session's own
DynamicSupervisor. They share the parent's Task.Supervisor for tool
execution but cannot spawn further sub-agents (depth = 1).
```mermaid
graph TD
    SS["SessionServer 'a1b2c3'<br/><i>:rest_for_one</i>"]
    TS["Task.Supervisor<br/><i>shared by parent + sub-agents</i>"]
    DS["DynamicSupervisor<br/><i>owns sub-agent processes</i>"]
    Ses["Opal.Session<br/><i>conversation tree (optional)</i>"]
    PA["Opal.Agent<br/><i>parent agent loop</i>"]
    SubA["Opal.Agent 'sub-x1y2z3'<br/><i>sub-agent A</i>"]
    SubB["Opal.Agent 'sub-p4q5r6'<br/><i>sub-agent B</i>"]
    SS --> TS
    SS --> DS
    SS --> Ses
    SS --> PA
    DS --> SubA
    DS --> SubB
```
A Registry with :unique keys, used to look up agent and session
processes by session ID. Processes register with keys like {:agent, session_id}
and {:session, session_id}, enabling direct lookup without walking supervisor
children.
A Registry with :duplicate keys. Any process can subscribe to a session ID
and receive events. This is the pubsub backbone; everything else
is per-session. The registry never holds state; it simply routes messages.
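The same pattern can be sketched in a few lines with Elixir's built-in Registry (the module name `Demo.Events` is illustrative, not Opal's):

```elixir
# Minimal sketch of duplicate-key pubsub on the stdlib Registry.
{:ok, _} = Registry.start_link(keys: :duplicate, name: Demo.Events)

# Subscribe: register the calling process under the session ID.
{:ok, _} = Registry.register(Demo.Events, "a1b2c3", nil)

# Broadcast: dispatch sends a plain message to every registered process.
Registry.dispatch(Demo.Events, "a1b2c3", fn entries ->
  for {pid, _value} <- entries, do: send(pid, {:opal_event, "a1b2c3", {:agent_start}})
end)

receive do
  {:opal_event, session_id, event} -> IO.inspect({session_id, event})
end
# => {"a1b2c3", {:agent_start}}
```

Because the subscriber is just a registered process, cleanup is automatic: when it dies, the Registry drops its entry.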
The stdio JSON-RPC transport. Started by default, but it can be disabled by
setting `config :opal, start_rpc: false`, which is useful when embedding the
core library as an SDK without the stdio transport. See rpc.md for details.
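Assuming standard Mix configuration, disabling the transport would look like:

```elixir
# config/config.exs - embed Opal as a library without the stdio transport.
import Config

config :opal, start_rpc: false
```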
A DynamicSupervisor that acts as the container for all active sessions. When
Opal.start_session/1 is called, a new SessionServer child is started here.
When Opal.stop_session/1 is called, the entire SessionServer subtree is
terminated: one call cleans up the agent, all running tools, all sub-agents,
and the session store.
A per-session Supervisor using the :rest_for_one strategy. Children are
started in order:
1. `Task.Supervisor` - executes tool calls as supervised tasks
2. `DynamicSupervisor` - manages sub-agent processes
3. `Opal.Session` - conversation persistence (optional, started when `session: true`)
4. `Opal.Agent` - the agent loop
The :rest_for_one strategy means if the Task.Supervisor or
DynamicSupervisor crashes, the Agent (which depends on them) is restarted
too. But a crash in the Agent does not affect the supervisors above it.
Each child is registered via Opal.Registry for discoverability:
| Process | Registry key |
|---|---|
| Task.Supervisor | {:tool_sup, session_id} |
| DynamicSupervisor | {:sub_agent_sup, session_id} |
| Session | {:session, session_id} |
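Based on the child order, strategy, and registry keys above, the supervisor's init might look roughly like this (a sketch; Opal's actual module internals and option handling are not shown in this document):

```elixir
defmodule Opal.SessionServer do
  @moduledoc "Hypothetical sketch of the per-session supervisor."
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    session_id = Keyword.fetch!(opts, :session_id)

    children =
      [
        {Task.Supervisor, name: via(:tool_sup, session_id)},
        {DynamicSupervisor, name: via(:sub_agent_sup, session_id), strategy: :one_for_one},
        # Opal.Session is only started when the session: true option is given.
        opts[:session] && {Opal.Session, opts},
        {Opal.Agent, opts}
      ]
      |> Enum.filter(& &1)

    # :rest_for_one - a crash restarts the crashed child plus everything after it.
    Supervisor.init(children, strategy: :rest_for_one)
  end

  defp via(kind, session_id),
    do: {:via, Registry, {Opal.Registry, {kind, session_id}}}
end
```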
A `:gen_statem` process that implements the core agent loop:

1. Receive a user prompt (`:call`)
2. Stream an LLM response via the configured Provider
3. If the LLM returns tool calls → execute them via supervised tasks → loop to step 2
4. If the LLM returns text only → broadcast `agent_end` → go idle
The Agent holds references to its session-local tool_supervisor and
sub_agent_supervisor in its state; it never touches global process names.
A GenServer backed by an ETS table that stores conversation messages in a
tree structure (each message has a parent_id). Supports branching: rewinding
to any past message and forking the conversation. Persistence is via DETS.
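The parent_id tree can be illustrated with a bare ETS table (the schema and names here are illustrative, not Opal's actual implementation):

```elixir
defmodule TreeDemo do
  # Each row: {id, parent_id, content}; parent_id links form the tree.
  # Walking from any leaf back to the root reconstructs one conversation path.
  def path(_table, nil), do: []

  def path(table, id) do
    [{^id, parent, content}] = :ets.lookup(table, id)
    path(table, parent) ++ [content]
  end
end

table = :ets.new(:messages, [:set, :public])
:ets.insert(table, {1, nil, "hello"})
:ets.insert(table, {2, 1, "hi"})
:ets.insert(table, {3, 1, "hello again"})  # branch: a second child of message 1
:ets.insert(table, {4, 3, "how are you?"})

IO.inspect(TreeDemo.path(table, 4))
# => ["hello", "hello again", "how are you?"]
```

Two leaves (messages 2 and 4) share the same root, which is exactly the branching/forking model described above.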
Opal uses three distinct message passing patterns, each chosen for a specific purpose.
```mermaid
sequenceDiagram
    participant Caller
    participant Agent
    Caller->>Agent: Agent.get_state/1 (:gen_statem.call)
    Agent-->>Caller: %Agent.State{}
```
Used for: Agent.get_state/1, Session.append/2, Session.get_path/1
These are synchronous server calls; the caller blocks until the server replies. Used when the caller needs a consistent snapshot of state.
Key design decision: Tool tasks never call Agent.get_state(agent_pid)
during execution. Instead, the Agent snapshots its state into the tool
execution context before dispatching tasks:
```mermaid
flowchart LR
    Agent["Agent<br/><i>(:executing_tools)</i>"] -- "snapshot state" --> Context["context =<br/>%{agent_state: ...}"]
    Context -- "start tasks" --> Tasks["Tool Tasks<br/><i>read from context</i>"]
```
```mermaid
sequenceDiagram
    participant Caller
    participant Agent
    Caller->>Agent: Agent.prompt/2 (:gen_statem.call)
    Agent-->>Caller: %{queued: false}
    Note right of Agent: begins turn...
```
Used for: Agent.prompt/2 (call), Agent.abort/1 (cast)
Prompts are synchronous calls that return %{queued: boolean}. The caller
observes progress through events (pattern 3). This keeps the UI responsive,
which is critical for interactive CLI and web UIs.
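Putting patterns 2 and 3 together, a caller might look like this (a sketch against the API names used in this document; it assumes a running Opal session and is not runnable standalone):

```elixir
defmodule PromptDemo do
  # Hypothetical usage: fire the prompt, then watch events stream in.
  def run(agent_pid, session_id) do
    Opal.Events.subscribe(session_id)
    %{queued: false} = Opal.Agent.prompt(agent_pid, "Summarize README.md")
    stream_events()
  end

  defp stream_events do
    receive do
      {:opal_event, _sid, {:message_delta, %{delta: text}}} ->
        IO.write(text)
        stream_events()

      {:opal_event, _sid, {:agent_end, _messages, _usage}} ->
        IO.puts("\ndone")

      {:opal_event, _sid, _other} ->
        stream_events()
    end
  end
end
```

The prompt call returns immediately with `%{queued: false}`; everything after that arrives as ordinary mailbox messages.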
```mermaid
flowchart LR
    Agent["Agent broadcasts<br/>{:message_delta, ...}"] --> Registry["Opal.Events.Registry<br/><i>session_id key<br/>duplicate keys</i>"]
    Registry --> SubA["Subscriber A<br/><i>CLI</i>"]
    Registry --> SubB["Subscriber B<br/><i>UI</i>"]
    Registry --> SubC["Subscriber C<br/><i>test</i>"]
```
Used for: all agent lifecycle events
The Agent (and tool tasks) call Opal.Events.broadcast(session_id, event).
Every process that called Opal.Events.subscribe(session_id) receives the
event as a regular Erlang message:
```elixir
{:opal_event, session_id, event}
```

Event types:

| Event | Emitted when |
|---|---|
| `{:agent_start}` | Agent begins processing a prompt |
| `{:message_delta, %{delta: text}}` | Streaming text token from the LLM |
| `{:thinking_delta, %{delta: text}}` | Streaming thinking/reasoning token |
| `{:turn_end, message, _results}` | LLM turn complete |
| `{:tool_execution_start, name, call_id, args, meta}` | Tool begins executing |
| `{:tool_execution_end, name, call_id, result}` | Tool finished executing |
| `{:agent_end, messages, usage}` | Agent is done, returning to idle |
| `{:error, reason}` | Unrecoverable error occurred |
| `{:sub_agent_event, parent_call_id, sub_id, event}` | Forwarded event from a sub-agent |
This is built on OTP's Registry: no external dependencies, no message
broker, no serialization overhead. Events are plain Erlang terms sent via
send/2 under the hood.
Sub-agents broadcast events to their own session ID. The SubAgent tool
subscribes to those events, collects the response, and re-broadcasts each
event to the parent session tagged with the parent tool call ID and sub-agent ID:
```mermaid
sequenceDiagram
    participant SubAgent as Sub-Agent 'sub-x1'
    participant Registry as Events Registry
    participant ToolTask as SubAgent Tool Task
    participant ParentReg as Parent Events Registry
    participant CLI as CLI Subscriber
    SubAgent->>Registry: broadcast(sub_id, event)
    Registry->>ToolTask: {:opal_event, "sub-x1", event}
    Note over ToolTask: re-broadcasts to parent
    ToolTask->>ParentReg: broadcast {:sub_agent_event, "call-42", "sub-x1", event}
    ParentReg->>CLI: {:sub_agent_event, "call-42", "sub-x1", {:message_delta, ...}}
```
This gives the parent session real-time observability into sub-agent activity without any direct process coupling. The CLI renders sub-agent events with a tree-style border to visually distinguish them from the parent.
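The forwarding step inside the SubAgent tool task might look roughly like this (a sketch; only the event shapes come from this document):

```elixir
defmodule SubAgentRelay do
  # Hypothetical re-broadcast loop: wrap each sub-agent event and send it
  # to the parent session, tagged with the parent call ID and sub-agent ID.
  def relay(parent_session_id, parent_call_id, sub_id) do
    receive do
      {:opal_event, ^sub_id, event} ->
        Opal.Events.broadcast(
          parent_session_id,
          {:sub_agent_event, parent_call_id, sub_id, event}
        )

        case event do
          # The sub-agent's turn is over; stop relaying.
          {:agent_end, _messages, _usage} -> :done
          _other -> relay(parent_session_id, parent_call_id, sub_id)
        end
    end
  end
end
```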
Tool calls are executed in parallel using Task.Supervisor.async_nolink: all tools
in a batch are spawned concurrently, with results delivered to the agent's mailbox
as info events, keeping the agent process non-blocking:
```mermaid
flowchart LR
    Agent["Agent<br/><i>:executing_tools</i>"] -- "async_nolink(tool) x N" --> TaskSup["Task.Supervisor<br/><i>per-session</i>"]
    TaskSup --> T1["Task: tool 1"]
    TaskSup --> T2["Task: tool 2"]
    TaskSup --> TN["Task: tool N"]
    T1 -- "handle_info {ref, result}" --> Agent
    T2 -- "handle_info {ref, result}" --> Agent
    TN -- "handle_info {ref, result}" --> Agent
```
Why `async_nolink` + `handle_info`?

- `async_stream_nolink`: blocks the server loop until all tools finish. Prevents abort/steer during execution.
- `async_nolink`: tasks are not linked, and results arrive as messages. The Agent stays responsive to abort, steer, and other messages throughout.
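The dispatch-and-collect shape is plain OTP; a self-contained sketch (names and tool functions are illustrative):

```elixir
# Spawn a batch of unlinked tasks, then collect results from the mailbox.
{:ok, sup} = Task.Supervisor.start_link()

tools = [fn -> {:ok, "ls output"} end, fn -> raise "boom" end]

tasks =
  for tool <- tools, into: %{} do
    task = Task.Supervisor.async_nolink(sup, tool)
    {task.ref, task}
  end

# In Opal these arrive in the Agent's handle_info; here we just receive.
results =
  for _ <- tasks do
    receive do
      {ref, result} when is_map_key(tasks, ref) ->
        # Success: drop the task's monitor so no :DOWN message lingers.
        Process.demonitor(ref, [:flush])
        result

      {:DOWN, ref, :process, _pid, reason} when is_map_key(tasks, ref) ->
        # Crash: convert to an error result, mirroring the doc's pattern.
        {:error, "Tool crashed: #{inspect(reason)}"}
    end
  end

IO.inspect(results)
```

Because the tasks are unlinked, the crash of the second tool reaches the caller only as a `:DOWN` message; the process collecting results keeps running.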
Why a per-session Task.Supervisor?

- Isolation: if session A's tool tasks are misbehaving, session B is unaffected.
- Cleanup: terminating the `SessionServer` automatically kills all running tool tasks for that session.
- Observability: you can inspect `Task.Supervisor.children(sup)` to see what tools are currently running in a specific session.
When a tool task crashes, the Agent receives a {:DOWN, ref, :process, _pid, reason}
message via handle_info. It converts this to an error tool result and continues:
```elixir
def handle_info(
      {:DOWN, ref, :process, _pid, reason},
      %State{status: :executing_tools, pending_tool_tasks: tasks} = state
    )
    when is_map_key(tasks, ref) do
  {_task, tc} = Map.fetch!(tasks, ref)

  Opal.Agent.ToolRunner.collect_result(ref, tc, {:error, "Tool crashed: #{inspect(reason)}"}, state)
  |> next()
end
```

This ensures the LLM always receives a tool_result message with the correct
call_id, even if the tool crashed. Without this, the LLM API rejects the
request with "tool call must have a tool call ID".
Sub-agents are started under the session's DynamicSupervisor:

```elixir
DynamicSupervisor.start_child(
  state.sub_agent_supervisor, # per-session, not global
  {Opal.Agent, opts}
)
```

They inherit the parent's config, provider, and working directory by default. Any of these can be overridden, including the model (e.g., use a cheaper model for simple tasks).
Sub-agents are limited to one level; no recursive spawning. This is
enforced by simply excluding the Opal.Tool.SubAgent module from the
sub-agent's tool list:

```elixir
parent_tools = parent_state.tools -- [Opal.Tool.SubAgent]
```

No runtime depth counter needed. The sub-agent literally does not have the tool available, so the LLM cannot request it. Clean, declarative, zero overhead.
Sub-agents share the parent's Task.Supervisor for tool execution. This means:
- Tool tasks from both the parent and sub-agents run under the same supervisor
- Terminating the session cleans up all tool tasks (parent + sub-agents)
- No need for sub-agents to have their own `Task.Supervisor`
```mermaid
sequenceDiagram
    participant Parent as Parent Agent
    participant SubAgent as Sub-Agent
    Parent->>SubAgent: spawn_from_state(state, %{})
    SubAgent-->>Parent: {:ok, sub_pid}
    Parent->>SubAgent: Events.subscribe(sub_id)
    Parent->>SubAgent: Agent.prompt(sub_pid, text)
    loop Streaming & Tools
        SubAgent-->>Parent: {:opal_event, sub_id, ...}
        Note over Parent: forward to parent session
    end
    SubAgent-->>Parent: {:opal_event, sub_id, {:agent_end, _}}
    Parent->>SubAgent: SubAgent.stop(sub_pid)
    Note over SubAgent: terminated
    Note over Parent: return {:ok, result}
```
Each session is a self-contained subtree. Failures in one session cannot propagate to another:
| Failure | Impact |
|---|---|
| Tool task crashes | Error result to LLM, agent continues |
| Sub-agent crashes | Tool returns error, parent continues |
| Agent state machine crashes | SessionServer restarts it (:rest_for_one) |
| Task.Supervisor crashes | Agent restarts too (:rest_for_one) |
| Entire SessionServer crashes | Only that session is lost |
| Events.Registry crashes | All sessions lose pubsub temporarily |
The SessionServer uses :rest_for_one: if a child crashes, all children
started after it are restarted. The child order is:
```mermaid
flowchart TD
    A["1. Task.Supervisor"] -->|"if this crashes..."| B["2. DynamicSupervisor"]
    B -->|"...this restarts..."| C["3. Session (optional)"]
    C -->|"...this restarts..."| D["4. Agent"]
    style A fill:#f9f,stroke:#333
    style D fill:#bbf,stroke:#333
```
This guarantees the Agent never runs without a working Task.Supervisor. But
if the Agent crashes, the supervisors and session store remain intact.
Although the Agent is no longer blocked during tool execution (tools use
async_nolink + handle_info), tools still receive a state snapshot
in their execution context rather than calling back to the agent process. This
avoids race conditions and keeps tool execution self-contained:
```elixir
context = %{
  working_dir: state.working_dir,
  session_id: state.session_id,
  config: state.config,
  agent_pid: self(),   # for reference only, never call into it
  agent_state: state   # snapshot - tools read from this
}
```

Tools (including SubAgent) use context.agent_state instead of calling back
to the Agent. The SubAgent tool uses spawn_from_state/2 (takes a state
struct) rather than spawn/2 (takes a pid and calls get_state).
Before: A single global Task.Supervisor handled all tool execution across
all sessions. This had several problems:
- No isolation between sessions
- No way to cleanly shut down one session's tasks without affecting others
- No way to inspect what a specific session is doing
- Cleanup required manual tracking
After: Each session owns its entire process tree. stop_session/1 is a
single DynamicSupervisor.terminate_child/2 call that cleanly shuts down
everything.
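Under that design, stop_session/1 could be as small as this sketch (the registry key {:session_server, session_id} is hypothetical; the document only lists keys for the SessionServer's children):

```elixir
# Hypothetical sketch: find the session's SessionServer pid, then
# terminate the whole subtree (agent, tools, sub-agents, store) in one call.
def stop_session(session_id) do
  case Registry.lookup(Opal.Registry, {:session_server, session_id}) do
    [{pid, _value}] -> DynamicSupervisor.terminate_child(Opal.SessionSupervisor, pid)
    [] -> {:error, :not_found}
  end
end
```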
- No external dependencies: built into OTP
- No serialization: events are plain Erlang terms, delivered via `send/2`
- Duplicate keys: multiple subscribers per session ID
- Process-native: subscribers just use `receive`, no callback modules
- Automatic cleanup: when a subscriber process dies, its registrations are removed
- Non-blocking execution: the Agent stays responsive during tool runs
- Fault isolation: one crashing tool doesn't take down the agent
- Parallel batches: tool calls in a turn are spawned concurrently
- Abort support: in-flight tasks can be cancelled without blocking the loop
- Simplicity: fewer processes to manage
- Unified cleanup: one supervisor termination kills everything
- Resource sharing: sub-agent tool tasks are supervised identically to parent tool tasks
- No nesting complexity: sub-agents don't need their own SessionServer
- Predictability: recursive agent spawning can lead to unbounded resource consumption
- Debuggability: a flat parent-child relationship is easy to observe and reason about
- Cost control: each sub-agent gets its own LLM conversation, so costs multiply. One level is sufficient for task delegation patterns (e.g., "read these 5 files in parallel") without enabling runaway recursion