Supervision & Message Passing

This document describes Opal's OTP supervision architecture, process lifecycle, message passing patterns, and the design rationale behind each decision.


Supervision Tree

Opal uses a per-session supervision tree so that every active session is a fully isolated unit: its own processes, its own failure domain, its own cleanup.

graph TD
    OpalSup["Opal.Supervisor<br/><i>:rest_for_one</i>"]
    AgentRegistry["Opal.Registry<br/><i>:unique - agent/session lookup</i>"]
    Registry["Opal.Events.Registry<br/><i>:duplicate - pubsub backbone</i>"]
    SessionSup["Opal.SessionSupervisor<br/><i>DynamicSupervisor :one_for_one</i>"]
    Stdio["Opal.RPC.Server<br/><i>optional, gated by :start_rpc</i>"]

    OpalSup --> AgentRegistry
    OpalSup --> Registry
    OpalSup --> SessionSup
    OpalSup -.-> Stdio

    subgraph session_a["Session &quot;a1b2c3&quot;"]
        SSA["SessionServer<br/><i>Supervisor :rest_for_one</i>"]
        TSA["Task.Supervisor<br/><i>tool execution</i>"]
        DSA["DynamicSupervisor<br/><i>sub-agents</i>"]
        SesA["Opal.Session<br/><i>persistence (optional)</i>"]
        AgentA["Opal.Agent<br/><i>agent loop</i>"]
        SSA --> TSA
        SSA --> DSA
        SSA --> SesA
        SSA --> AgentA
    end

    subgraph session_b["Session &quot;d4e5f6&quot;"]
        SSB["SessionServer<br/><i>Supervisor :rest_for_one</i>"]
        TSB["Task.Supervisor"]
        DSB["DynamicSupervisor"]
        AgentB["Opal.Agent"]
        SSB --> TSB
        SSB --> DSB
        SSB --> AgentB
    end

    SessionSup --> SSA
    SessionSup --> SSB

When sub-agents are active

Sub-agents are regular Opal.Agent processes started under the session's own DynamicSupervisor. They share the parent's Task.Supervisor for tool execution but cannot spawn further sub-agents (depth = 1).

graph TD
    SS["SessionServer &quot;a1b2c3&quot;<br/><i>:rest_for_one</i>"]
    TS["Task.Supervisor<br/><i>shared by parent + sub-agents</i>"]
    DS["DynamicSupervisor<br/><i>owns sub-agent processes</i>"]
    Ses["Opal.Session<br/><i>conversation tree (optional)</i>"]
    PA["Opal.Agent<br/><i>parent agent loop</i>"]
    SubA["Opal.Agent &quot;sub-x1y2z3&quot;<br/><i>sub-agent A</i>"]
    SubB["Opal.Agent &quot;sub-p4q5r6&quot;<br/><i>sub-agent B</i>"]

    SS --> TS
    SS --> DS
    SS --> Ses
    SS --> PA
    DS --> SubA
    DS --> SubB

Process Roles

Opal.Registry

A Registry with :unique keys used for looking up agent and session processes by session ID. Processes register with keys like {:agent, session_id} and {:session, session_id}, enabling direct lookup without walking supervisor children.
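A minimal sketch of this lookup pattern using only OTP's Registry (the registry name and key shapes follow the text above; this is not Opal's actual code):

```elixir
# Start a :unique-key registry under the doc's name.
{:ok, _} = Registry.start_link(keys: :unique, name: Opal.Registry)

# A process registers itself under a composite key...
session_id = "a1b2c3"
{:ok, _} = Registry.register(Opal.Registry, {:agent, session_id}, nil)

# ...and any other process can find it without walking supervisor children:
[{agent_pid, _value}] = Registry.lookup(Opal.Registry, {:agent, session_id})
true = agent_pid == self()
```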

Opal.Events.Registry

A Registry with :duplicate keys. Any process can subscribe to a session ID and receive events. This is the pubsub backbone; everything else is per-session. The registry never holds state; it simply routes messages.
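The subscribe/broadcast mechanics can be sketched with a :duplicate-key Registry and Registry.dispatch/3 (names follow the text above; a stand-in for the real implementation):

```elixir
# :duplicate keys allow many subscribers per session ID.
{:ok, _} = Registry.start_link(keys: :duplicate, name: Opal.Events.Registry)

# subscribe/1 boils down to registering the calling process under the ID...
session_id = "a1b2c3"
{:ok, _} = Registry.register(Opal.Events.Registry, session_id, nil)

# ...and broadcast/2 fans a plain term out to every subscriber via send/2:
Registry.dispatch(Opal.Events.Registry, session_id, fn subscribers ->
  for {pid, _value} <- subscribers do
    send(pid, {:opal_event, session_id, {:agent_start}})
  end
end)

# Subscribers receive the event as an ordinary message.
receive do
  {:opal_event, ^session_id, {:agent_start}} -> :ok
end
```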

Opal.RPC.Server (optional)

The stdio JSON-RPC transport. Started by default but can be disabled by setting config :opal, start_rpc: false, which is useful when embedding the core library as an SDK without stdio transport. See rpc.md for details.

Opal.SessionSupervisor

A DynamicSupervisor that acts as the container for all active sessions. When Opal.start_session/1 is called, a new SessionServer child is started here. When Opal.stop_session/1 is called, the entire SessionServer subtree is terminated: one call cleans up the agent, all running tools, all sub-agents, and the session store.
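The start/stop flow can be sketched against a plain DynamicSupervisor; SessionWorker below is a stand-in for the real SessionServer subtree:

```elixir
defmodule SessionWorker do
  use Agent  # Elixir's Agent (a trivial supervised child), not Opal.Agent

  def start_link(session_id), do: Agent.start_link(fn -> session_id end)
end

{:ok, sup} = DynamicSupervisor.start_link(strategy: :one_for_one)

# start_session/1 boils down to adding one child per session...
{:ok, pid} = DynamicSupervisor.start_child(sup, {SessionWorker, "a1b2c3"})

# ...and stop_session/1 to a single terminate_child/2 that tears the
# whole subtree down, tasks and all.
:ok = DynamicSupervisor.terminate_child(sup, pid)
false = Process.alive?(pid)
```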

Opal.SessionServer

A per-session Supervisor using the :rest_for_one strategy. Children are started in order:

  1. Task.Supervisor - executes tool calls as supervised tasks
  2. DynamicSupervisor - manages sub-agent processes
  3. Opal.Session - conversation persistence (optional, started when session: true)
  4. Opal.Agent - the agent loop

The :rest_for_one strategy means that if the Task.Supervisor or DynamicSupervisor crashes, the Agent (which depends on them) is restarted too, but a crash in the Agent does not take down the siblings started before it.

Each child is registered via Opal.Registry for discoverability:

| Process | Registry key |
| --- | --- |
| Task.Supervisor | {:tool_sup, session_id} |
| DynamicSupervisor | {:sub_agent_sup, session_id} |
| Session | {:session, session_id} |
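The shape of this setup, sketched with stand-in names (MyRegistry here; the real code uses Opal.Registry) and only the first two children:

```elixir
{:ok, _} = Registry.start_link(keys: :unique, name: MyRegistry)
session_id = "a1b2c3"

children = [
  # 1. tool execution tasks
  {Task.Supervisor,
   name: {:via, Registry, {MyRegistry, {:tool_sup, session_id}}}},
  # 2. sub-agent processes
  {DynamicSupervisor,
   strategy: :one_for_one,
   name: {:via, Registry, {MyRegistry, {:sub_agent_sup, session_id}}}}
  # 3. Opal.Session and 4. Opal.Agent would follow here
]

{:ok, _sup} = Supervisor.start_link(children, strategy: :rest_for_one)

# Children are now discoverable by session-scoped key:
[{tool_sup, _}] = Registry.lookup(MyRegistry, {:tool_sup, session_id})
true = is_pid(tool_sup)
```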

Opal.Agent

A :gen_statem process that implements the core agent loop:

  1. Receive a user prompt (:call)
  2. Stream an LLM response via the configured Provider
  3. If the LLM returns tool calls → execute them via supervised tasks → loop to step 2
  4. If the LLM returns text only → broadcast agent_end → go idle

The Agent holds references to its session-local tool_supervisor and sub_agent_supervisor in its state; it never touches global process names.
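A compact, self-contained :gen_statem skeleton of that loop (illustrative only, not Opal's code; the LLM stream is simulated with a message to self):

```elixir
defmodule LoopAgent do
  @behaviour :gen_statem

  def start_link, do: :gen_statem.start_link(__MODULE__, [], [])

  @impl true
  def callback_mode, do: :handle_event_function

  @impl true
  def init([]), do: {:ok, :idle, %{}}

  @impl true
  # 1. a prompt arrives as a synchronous call while idle...
  def handle_event({:call, from}, {:prompt, _text}, :idle, data) do
    # 2. start streaming the LLM response (simulated here)
    send(self(), :stream_done)
    {:next_state, :streaming, data, [{:reply, from, %{queued: false}}]}
  end

  # 3./4. stream finished with text only -> go idle
  def handle_event(:info, :stream_done, :streaming, data) do
    {:next_state, :idle, data}
  end

  def handle_event({:call, from}, :get_state, state, data) do
    {:keep_state_and_data, [{:reply, from, {state, data}}]}
  end
end

{:ok, pid} = LoopAgent.start_link()
%{queued: false} = :gen_statem.call(pid, {:prompt, "hello"})
```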

Opal.Session

A GenServer backed by an ETS table that stores conversation messages in a tree structure (each message has a parent_id). Supports branching โ€” rewinding to any past message and forking the conversation. Persistence is via DETS.
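The tree structure can be sketched with a bare ETS table; rows here are {id, parent_id, content} (a stand-in schema, not Opal's actual layout):

```elixir
table = :ets.new(:messages, [:set, :public])

# Each message points at its parent; a nil parent_id marks the root.
:ets.insert(table, {1, nil, "user: hello"})
:ets.insert(table, {2, 1, "assistant: hi!"})
:ets.insert(table, {3, 1, "assistant: hello there"})  # a fork from message 1

# get_path-style walk: follow parent_id links from a leaf back to the root.
path = fn path, id, recur ->
  case :ets.lookup(table, id) do
    [{^id, nil, content}] -> [content | path]
    [{^id, parent, content}] -> recur.([content | path], parent, recur)
  end
end

# Two leaves, two branches of the same conversation:
["user: hello", "assistant: hello there"] = path.([], 3, path)
```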


Message Passing

Opal uses three distinct message passing patterns, each chosen for a specific purpose.

1. Synchronous Calls - State Access

sequenceDiagram
    participant Caller
    participant Agent

    Caller->>Agent: Agent.get_state/1 (:gen_statem.call)
    Agent-->>Caller: %Agent.State{}

Used for: Agent.get_state/1, Session.append/2, Session.get_path/1

These are synchronous server calls: the caller blocks until the server replies. Used when the caller needs a consistent snapshot of state.

Key design decision: Tool tasks never call Agent.get_state(agent_pid) during execution. Instead, the Agent snapshots its state into the tool execution context before dispatching tasks:

flowchart LR
    Agent["Agent<br/><i>(:executing_tools)</i>"] -- "snapshot state" --> Context["context =<br/>%{agent_state: ...}"]
    Context -- "start tasks" --> Tasks["Tool Tasks<br/><i>read from context</i>"]

2. Command Calls & Casts

sequenceDiagram
    participant Caller
    participant Agent

    Caller->>Agent: Agent.prompt/2 (:gen_statem.call)
    Agent-->>Caller: %{queued: false}
    Note right of Agent: begins turn...

Used for: Agent.prompt/2 (call), Agent.abort/1 (cast)

Prompts are synchronous calls that return %{queued: boolean}. The caller observes progress through events (pattern 3). This keeps the UI responsive, which is critical for interactive CLI and web UIs.

3. Registry PubSub - Event Broadcasting

flowchart LR
    Agent["Agent broadcasts<br/>{:message_delta, ...}"] --> Registry["Opal.Events.Registry<br/><i>session_id key<br/>duplicate keys</i>"]
    Registry --> SubA["Subscriber A<br/><i>CLI</i>"]
    Registry --> SubB["Subscriber B<br/><i>UI</i>"]
    Registry --> SubC["Subscriber C<br/><i>test</i>"]

Used for: all agent lifecycle events

The Agent (and tool tasks) call Opal.Events.broadcast(session_id, event). Every process that called Opal.Events.subscribe(session_id) receives the event as a regular Erlang message:

{:opal_event, session_id, event}

Event types:

| Event | Emitted when |
| --- | --- |
| {:agent_start} | Agent begins processing a prompt |
| {:message_delta, %{delta: text}} | Streaming text token from the LLM |
| {:thinking_delta, %{delta: text}} | Streaming thinking/reasoning token |
| {:turn_end, message, _results} | LLM turn complete |
| {:tool_execution_start, name, call_id, args, meta} | Tool begins executing |
| {:tool_execution_end, name, call_id, result} | Tool finished executing |
| {:agent_end, messages, usage} | Agent is done, returning to idle |
| {:error, reason} | Unrecoverable error occurred |
| {:sub_agent_event, parent_call_id, sub_id, event} | Forwarded event from a sub-agent |

This is built on OTP's Registry: no external dependencies, no message broker, no serialization overhead. Events are plain Erlang terms sent via send/2 under the hood.
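A subscriber is therefore just a process with a receive loop. The sketch below collects streamed deltas into a string, with the events sent by hand to stand in for real broadcasts:

```elixir
session_id = "a1b2c3"

# Simulate a broadcast stream: two text deltas, then agent_end.
for delta <- ["Hel", "lo"] do
  send(self(), {:opal_event, session_id, {:message_delta, %{delta: delta}}})
end
send(self(), {:opal_event, session_id, {:agent_end, [], %{}}})

# Accumulate deltas until the agent goes idle.
collect = fn acc, recur ->
  receive do
    {:opal_event, ^session_id, {:message_delta, %{delta: d}}} ->
      recur.(acc <> d, recur)

    {:opal_event, ^session_id, {:agent_end, _messages, _usage}} ->
      acc
  end
end

"Hello" = collect.("", collect)
```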

4. Sub-Agent Event Forwarding

Sub-agents broadcast events to their own session ID. The SubAgent tool subscribes to those events, collects the response, and re-broadcasts each event to the parent session tagged with the parent tool call ID and sub-agent ID:

sequenceDiagram
    participant SubAgent as Sub-Agent "sub-x1"
    participant Registry as Events Registry
    participant ToolTask as SubAgent Tool Task
    participant ParentReg as Parent Events Registry
    participant CLI as CLI Subscriber

    SubAgent->>Registry: broadcast(sub_id, event)
    Registry->>ToolTask: {:opal_event, "sub-x1", event}
    Note over ToolTask: re-broadcasts to parent
    ToolTask->>ParentReg: broadcast {:sub_agent_event, "call-42", "sub-x1", event}
    ParentReg->>CLI: {:sub_agent_event, "call-42", "sub-x1", {:message_delta, ...}}

This gives the parent session real-time observability into sub-agent activity without any direct process coupling. The CLI renders sub-agent events with a tree border (┌─ / │ / └─) to visually distinguish them from the parent.
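The forwarding step can be sketched with a single :duplicate-key Registry serving as both sessions' event bus (one process plays sub-agent, forwarder, and parent subscriber here; IDs follow the diagram):

```elixir
{:ok, _} = Registry.start_link(keys: :duplicate, name: Events)

parent_id = "a1b2c3"
sub_id = "sub-x1y2z3"
call_id = "call-42"

# Subscribe this process to both the parent and the sub-agent session.
for id <- [parent_id, sub_id], do: Registry.register(Events, id, nil)

broadcast = fn id, event ->
  Registry.dispatch(Events, id, fn subs ->
    for {pid, _} <- subs, do: send(pid, {:opal_event, id, event})
  end)
end

# A sub-agent event arrives on the sub-agent's own session ID...
broadcast.(sub_id, {:message_delta, %{delta: "hi"}})

receive do
  {:opal_event, ^sub_id, event} ->
    # ...and is re-broadcast to the parent session, tagged with both IDs.
    broadcast.(parent_id, {:sub_agent_event, call_id, sub_id, event})
end

receive do
  {:opal_event, ^parent_id, {:sub_agent_event, ^call_id, ^sub_id, e}} -> e
end
```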


Tool Execution

Tool calls are executed in parallel using Task.Supervisor.async_nolink: all tools in a batch are spawned concurrently, with results delivered via mailbox :info events, keeping the agent process non-blocking:

flowchart LR
    Agent["Agent<br/><i>:executing_tools</i>"] -- "async_nolink(tool) × N" --> TaskSup["Task.Supervisor<br/><i>per-session</i>"]
    TaskSup --> T1["Task: tool 1"]
    TaskSup --> T2["Task: tool 2"]
    TaskSup --> TN["Task: tool N"]
    T1 -- "handle_info {ref, result}" --> Agent
    T2 -- "handle_info {ref, result}" --> Agent
    TN -- "handle_info {ref, result}" --> Agent

Why async_nolink + handle_info?

  • async_stream_nolink would block the server loop until all tools finish, preventing abort/steer during execution.
  • async_nolink leaves tasks unlinked and delivers results as mailbox messages, so the Agent stays responsive to abort, steer, and other messages throughout.
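A self-contained sketch of the async_nolink dispatch/collect pattern (dummy tool functions, not Opal's ToolRunner):

```elixir
{:ok, sup} = Task.Supervisor.start_link()

# Spawn a batch of "tool calls" concurrently; none are linked to us.
tasks =
  for tool <- [:read, :grep, :ls] do
    task = Task.Supervisor.async_nolink(sup, fn -> {tool, :ok} end)
    {task.ref, tool}
  end
  |> Map.new()

# Successful tasks send {ref, result}; a crash delivers only the
# monitor's :DOWN message (handled separately in the real Agent).
collect = fn collect, pending, acc ->
  if map_size(pending) == 0 do
    acc
  else
    receive do
      {ref, result} when is_map_key(pending, ref) ->
        # Drop the monitor (and flush its :DOWN) once we have the result.
        Process.demonitor(ref, [:flush])
        collect.(collect, Map.delete(pending, ref), [result | acc])
    end
  end
end

results = collect.(collect, tasks, [])
```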

Why per-session Task.Supervisor?

  • Isolation: If session A's tool tasks are misbehaving, session B is unaffected.
  • Cleanup: Terminating the SessionServer automatically kills all running tool tasks for that session.
  • Observability: You can inspect Task.Supervisor.children(sup) to see what tools are currently running in a specific session.

Crash Recovery

When a tool task crashes, the Agent receives a {:DOWN, ref, :process, _pid, reason} message via handle_info. It converts this to an error tool result and continues:

def handle_info(
      {:DOWN, ref, :process, _pid, reason},
      %State{status: :executing_tools, pending_tool_tasks: tasks} = state
    ) when is_map_key(tasks, ref) do
  {_task, tc} = Map.fetch!(tasks, ref)
  Opal.Agent.ToolRunner.collect_result(ref, tc, {:error, "Tool crashed: #{inspect(reason)}"}, state)
  |> next()
end

This ensures the LLM always receives a tool_result message with the correct call_id โ€” even if the tool crashed. Without this, the LLM API rejects the request with "tool call must have a tool call ID".


Sub-Agent Architecture

Spawning

Sub-agents are started under the session's DynamicSupervisor:

DynamicSupervisor.start_child(
  state.sub_agent_supervisor,   # per-session, not global
  {Opal.Agent, opts}
)

They inherit the parent's config, provider, and working directory by default. Any of these can be overridden โ€” including the model (e.g., use a cheaper model for simple tasks).

Depth Enforcement

Sub-agents are limited to one level; no recursive spawning. This is enforced by simply excluding the Opal.Tool.SubAgent module from the sub-agent's tool list:

parent_tools = parent_state.tools -- [Opal.Tool.SubAgent]

No runtime depth counter needed. The sub-agent literally does not have the tool available, so the LLM cannot request it. Clean, declarative, zero overhead.

Tool Sharing

Sub-agents share the parent's Task.Supervisor for tool execution. This means:

  • Tool tasks from both the parent and sub-agents run under the same supervisor
  • Terminating the session cleans up all tool tasks (parent + sub-agents)
  • No need for sub-agents to have their own Task.Supervisor

Lifecycle

sequenceDiagram
    participant Parent as Parent Agent
    participant SubAgent as Sub-Agent

    Parent->>SubAgent: spawn_from_state(state, %{})
    SubAgent-->>Parent: {:ok, sub_pid}

    Parent->>SubAgent: Events.subscribe(sub_id)
    Parent->>SubAgent: Agent.prompt(sub_pid, text)

    loop Streaming & Tools
        SubAgent-->>Parent: {:opal_event, sub_id, ...}
        Note over Parent: forward to parent session
    end

    SubAgent-->>Parent: {:opal_event, sub_id, {:agent_end, _}}
    Parent->>SubAgent: SubAgent.stop(sub_pid)
    Note over SubAgent: ✗ terminated
    Note over Parent: return {:ok, result}

Failure Domains & Isolation

Session Isolation

Each session is a self-contained subtree. Failures in one session cannot propagate to another:

| Failure | Impact |
| --- | --- |
| Tool task crashes | Error result to LLM, agent continues |
| Sub-agent crashes | Tool returns error, parent continues |
| Agent state machine crashes | SessionServer restarts it (:rest_for_one) |
| Task.Supervisor crashes | Agent restarts too (:rest_for_one) |
| Entire SessionServer crashes | Only that session is lost |
| Events.Registry crashes | All sessions lose pubsub temporarily |

:rest_for_one Strategy

The SessionServer uses :rest_for_one: if a child crashes, all children started after it are restarted. The child order is:

flowchart TD
    A["1. Task.Supervisor"] -->|"if this crashes..."| B["2. DynamicSupervisor"]
    B -->|"...this restarts..."| C["3. Session (optional)"]
    C -->|"...this restarts..."| D["4. Agent"]

    style A fill:#f9f,stroke:#333
    style D fill:#bbf,stroke:#333

This guarantees the Agent never runs without a working Task.Supervisor. But if the Agent crashes, the supervisors and session store remain intact.

State Snapshot for Tools

Although the Agent is no longer blocked during tool execution (tools use async_nolink + handle_info), tools still receive a state snapshot in their execution context rather than calling back to the agent process. This avoids race conditions and keeps tool execution self-contained:

context = %{
  working_dir: state.working_dir,
  session_id: state.session_id,
  config: state.config,
  agent_pid: self(),          # for reference only, never call into it
  agent_state: state          # snapshot; tools read from this
}

Tools (including SubAgent) use context.agent_state instead of calling back to the Agent. The SubAgent tool uses spawn_from_state/2 (takes a state struct) rather than spawn/2 (takes a pid and calls get_state).


Design Rationale

Why per-session supervision trees?

Before: A single global Task.Supervisor handled all tool execution across all sessions. This had several problems:

  • No isolation between sessions
  • No way to cleanly shut down one session's tasks without affecting others
  • No way to inspect what a specific session is doing
  • Cleanup required manual tracking

After: Each session owns its entire process tree. stop_session/1 is a single DynamicSupervisor.terminate_child/2 call that cleanly shuts down everything.

Why Registry-based pubsub?

  • No external dependencies: built into OTP
  • No serialization: events are plain Erlang terms, delivered via send/2
  • Duplicate keys: multiple subscribers per session ID
  • Process-native: subscribers just use receive, no callback modules
  • Automatic cleanup: when a subscriber process dies, its registrations are removed

Why async_nolink + handle_info for tools?

  • Non-blocking execution: the Agent stays responsive during tool runs
  • Fault isolation: one crashing tool doesn't take down the agent
  • Parallel batches: tool calls in a turn are spawned concurrently
  • Abort support: in-flight tasks can be cancelled without blocking the loop

Why sub-agents share the parent's Task.Supervisor?

  • Simplicity: fewer processes to manage
  • Unified cleanup: one supervisor termination kills everything
  • Resource sharing: sub-agent tool tasks are supervised identically to parent tool tasks
  • No nesting complexity: sub-agents don't need their own SessionServer

Why depth-1 sub-agents only?

  • Predictability: recursive agent spawning can lead to unbounded resource consumption
  • Debuggability: a flat parent→child relationship is easy to observe and reason about
  • Cost control: each sub-agent gets its own LLM conversation, so costs multiply. One level is sufficient for task delegation patterns (e.g., "read these 5 files in parallel") without enabling runaway recursion.