This document is a language-agnostic specification for building a coding agent -- an autonomous system that pairs a large language model with developer tools through an agentic loop. It is designed to be implementable from scratch by any developer or coding agent in any programming language.
This spec layers on top of the Unified LLM Client Specification, which handles all LLM communication. The agent loop uses the SDK's low-level Client.complete() and Client.stream() methods directly, implementing its own turn loop to interleave tool execution with truncation, steering, events, and loop detection.
- Overview and Goals
- Agentic Loop
- Provider-Aligned Toolsets
- Tool Execution Environment
- Tool Output and Context Management
- System Prompts and Environment Context
- Subagents
- Out of Scope (Nice-to-Haves)
- Definition of Done
A coding agent is a system that takes a natural language instruction ("fix the login bug", "add dark mode", "write tests for this module"), plans a solution, and executes it by reading files, editing code, running commands, and iterating until the task is done. The core challenge is orchestrating an agentic loop that coordinates LLM calls, tool execution, context management, and provider-specific behavior into a reliable autonomous workflow.
Each LLM provider's models are trained and optimized for specific tool interfaces and system prompts. GPT-5.2 and the GPT-5.2-codex series work best with the same tools and prompts as codex-rs. Gemini models work best with the same tools and prompts as gemini-cli. Anthropic models work best with the same tools and prompts as Claude Code. A good coding agent respects this reality rather than forcing a universal toolset on all models.
Tools like Claude Code, Codex CLI, and Gemini CLI exist as end-user CLIs. You can run them in non-interactive mode and pipe output, but that gives you a black box: text goes in, text comes out. You cannot programmatically inspect the conversation mid-run, inject steering messages between tool calls, swap the execution environment, change reasoning effort on the fly, observe individual tool calls as they happen, or compose agents into larger systems.
This spec defines a library -- a programmable agentic loop that a host application controls at every step. The host can:
- Submit input and observe every event as the agent thinks, calls tools, and produces output -- not after the fact, but in real time as it happens.
- Steer the agent mid-task by injecting messages between tool rounds, redirecting it without restarting.
- Change configuration on the fly -- reasoning effort, model, timeouts -- between any two turns.
- Swap where tools run by providing a different execution environment (local, Docker, Kubernetes, WASM, SSH) without changing any tool logic.
- Compose agents by spawning subagents for parallel work, each with their own history but sharing the same filesystem.
- Build on top of it -- CLIs, IDEs, web UIs, batch systems, CI pipelines, evaluation harnesses, and agent-to-agent coordination systems all consume the same library.
The fidelity of control is the point. Every coding agent CLI is built on an agentic loop internally; this spec makes that loop a first-class, programmable interface.
Programmable-first. The agent is a library, not a CLI. Every aspect of the loop -- tool execution, event delivery, steering, configuration -- is accessible programmatically. A CLI is one possible host application, not the primary interface.
Provider-aligned. Each model family works best with its native agent's tools and system prompts. The spec defines provider-specific tool profiles, not a single universal set. Start from the provider's native toolset and extend it.
Extensible execution. Tool execution is abstracted behind an ExecutionEnvironment interface. The default runs locally. Implementations can target Docker, Kubernetes, WASM, or any remote host. Changing where tools run should not require changing the tools themselves.
Event-driven. Every agent action emits a typed event for UI rendering, logging, and integration. The event stream is the primary interface for host applications.
Hackable. Reasonable defaults with override points everywhere -- timeouts, output sizes, tool sets, execution environments, system prompts, reasoning effort. The spec prescribes defaults, not ceilings.
Language-agnostic. All code is pseudocode. Data structures use neutral notation. No specific programming language is assumed.
+--------------------------------------------------+
| Host Application (CLI, IDE, Web UI) |
+--------------------------------------------------+
| ^
| submit(input) | events
v |
+--------------------------------------------------+
| Coding Agent Loop |
| +--------------------+ +---------------------+ |
| | Session | | Provider Profiles | |
| | - history | | - OpenAI (codex) | |
| | - steering queue | | - Anthropic (cc) | |
| | - event emitter | | - Gemini (cli) | |
| +--------------------+ +---------------------+ |
| +--------------------+ +---------------------+ |
| | Tool Registry | | Execution Env | |
| | - tool dispatch | | - local (default) | |
| | - truncation | | - docker | |
| | - validation | | - k8s / wasm / ssh | |
| +--------------------+ +---------------------+ |
+--------------------------------------------------+
|
v
+--------------------------------------------------+
| Unified LLM SDK (Client.complete / stream) |
+--------------------------------------------------+
|
v
+--------------------------------------------------+
| LLM Provider APIs |
| (OpenAI Responses / Anthropic Messages / Gemini) |
+--------------------------------------------------+
The agent loop does NOT use the Unified LLM SDK's generate() high-level function (which has its own tool loop). It uses the low-level Client.complete() and implements its own loop because it needs to interleave tool execution with output truncation, steering message injection, event emission, timeout enforcement, and loop detection -- concerns that the SDK's generic tool loop does not handle.
The following open-source projects solve related problems and are worth studying for anyone implementing this spec.
-
codex-rs (https://github.com/openai/codex/tree/main/codex-rs) -- Rust. OpenAI's coding agent. Demonstrates async turn-based loop, 15+ tools including
apply_patch(v4a diff format), output truncation with head/tail split (1 MiB cap), 10-second default command timeout, platform-specific sandboxing, sub-agent spawning, and environment variable filtering. -
pi-agent-core (https://github.com/badlogic/pi-mono/tree/main/packages/agent) -- TypeScript. Minimal agent core by @mariozechner. Demonstrates 4-tool minimalism (read, write, edit, bash), explicit
steer()andfollowUp()queues for mid-turn message injection, 15+ event types, configurable thinking levels, context transform hooks, and abort signal support. -
gemini-cli (https://github.com/google-gemini/gemini-cli) -- TypeScript. Google's CLI agent. Demonstrates ReAct loop, 18+ built-in tools including web search and web fetch, GEMINI.md for project-specific instructions, headless/non-interactive mode for automation, and multiple authentication methods.
This spec assumes the companion Unified LLM Client Specification is implemented. The agent loop imports and uses these types directly:
Client,Request,Response-- for LLM communicationMessage,ContentPart,Role-- for conversation historyTool,ToolCall,ToolResult-- for tool definitions and resultsStreamEvent-- for streaming responsesUsage-- for token trackingFinishReason-- for stop condition detection
The agent builds Request objects, calls Client.complete() or Client.stream(), processes the Response, executes any ToolCall objects through the execution environment, constructs ToolResult objects, appends them to the conversation, and loops.
The Session is the central orchestrator. It holds the conversation state, dispatches tool calls, manages the event stream, and enforces limits.
RECORD Session:
id : String -- UUID, assigned at creation
provider_profile : ProviderProfile -- tools + system prompt for the active model
execution_env : ExecutionEnvironment -- where tools run
history : List<Turn> -- ordered conversation turns
event_emitter : EventEmitter -- delivers events to host application
config : SessionConfig -- limits, timeouts, settings
state : SessionState -- current lifecycle state
llm_client : Client -- from the Unified LLM SDK
steering_queue : Queue<String> -- messages to inject between tool rounds
followup_queue : Queue<String> -- messages to process after current input completes
subagents : Map<String, SubAgent> -- active child agents
RECORD SessionConfig:
max_turns : Integer = 0 -- 0 = unlimited
max_tool_rounds_per_input : Integer = 0 -- 0 = unlimited, per user input, not per session
default_command_timeout_ms : Integer = 10000 -- 10 seconds
max_command_timeout_ms : Integer = 600000 -- 10 minutes
reasoning_effort : String | None -- "low", "medium", "high", or null
tool_output_limits : Map<String, Integer> -- per-tool char limits (see Section 5)
enable_loop_detection : Boolean = true
loop_detection_window : Integer = 10 -- consecutive identical calls before warning
max_subagent_depth : Integer = 1 -- max nesting level for subagents
ENUM SessionState:
IDLE -- waiting for user input
PROCESSING -- running the agentic loop
AWAITING_INPUT -- model asked the user a question
CLOSED -- session terminated (normal or error)
State transitions:
IDLE -> PROCESSING -- on submit()
PROCESSING -> PROCESSING -- tool loop continues
PROCESSING -> AWAITING_INPUT -- model asks user a question (no tool calls, open-ended)
PROCESSING -> IDLE -- natural completion or turn limit
PROCESSING -> CLOSED -- unrecoverable error
IDLE -> CLOSED -- explicit close()
any -> CLOSED -- abort signal
AWAITING_INPUT -> PROCESSING -- user provides answer
A Turn is a single entry in the conversation history.
RECORD UserTurn:
content : String
timestamp : Timestamp
RECORD AssistantTurn:
content : String -- text output
tool_calls : List<ToolCall> -- tool invocations requested by the model
reasoning : String | None -- thinking/reasoning text (if available)
usage : Usage -- token counts for this turn
response_id : String | None -- provider response ID
timestamp : Timestamp
RECORD ToolResultsTurn:
results : List<ToolResult> -- one per tool call
timestamp : Timestamp
RECORD SystemTurn:
content : String
timestamp : Timestamp
RECORD SteeringTurn:
content : String -- injected steering message
timestamp : Timestamp
This is the centerpiece of the spec. The loop runs until the model produces a text-only response (no tool calls), a limit is hit, or an abort signal fires.
FUNCTION process_input(session, user_input):
session.state = PROCESSING
session.history.APPEND(UserTurn(content = user_input))
session.emit(USER_INPUT, content = user_input)
-- Drain any pending steering messages before the first LLM call
drain_steering(session)
round_count = 0
LOOP:
-- 1. Check limits
IF session.config.max_tool_rounds_per_input > 0 AND round_count >= session.config.max_tool_rounds_per_input:
session.emit(TURN_LIMIT, round = round_count)
BREAK
IF session.config.max_turns > 0 AND count_turns(session) >= session.config.max_turns:
session.emit(TURN_LIMIT, total_turns = count_turns(session))
BREAK
IF session.abort_signaled:
BREAK
-- 2. Build LLM request using provider profile
system_prompt = session.provider_profile.build_system_prompt(
environment = session.execution_env,
project_docs = discover_project_docs(session.execution_env.working_directory())
)
messages = convert_history_to_messages(session.history)
tool_defs = session.provider_profile.tools()
request = Request(
model = session.provider_profile.model,
messages = [Message.system(system_prompt)] + messages,
tools = tool_defs,
tool_choice = "auto",
reasoning_effort = session.config.reasoning_effort,
provider = session.provider_profile.id,
provider_options = session.provider_profile.provider_options()
)
-- 3. Call LLM via Unified LLM SDK (single-shot, no SDK-level tool loop)
response = session.llm_client.complete(request)
-- 4. Record assistant turn
assistant_turn = AssistantTurn(
content = response.text,
tool_calls = response.tool_calls,
reasoning = response.reasoning,
usage = response.usage,
response_id = response.id
)
session.history.APPEND(assistant_turn)
session.emit(ASSISTANT_TEXT_END, text = response.text, reasoning = response.reasoning)
-- 5. If no tool calls, natural completion
IF response.tool_calls IS EMPTY:
BREAK
-- 6. Execute tool calls through the execution environment
round_count += 1
results = execute_tool_calls(session, response.tool_calls)
session.history.APPEND(ToolResultsTurn(results = results))
-- 7. Drain steering messages injected during tool execution
drain_steering(session)
-- 8. Loop detection
IF session.config.enable_loop_detection:
IF detect_loop(session.history, session.config.loop_detection_window):
warning = "Loop detected: the last " + session.config.loop_detection_window
+ " tool calls follow a repeating pattern. Try a different approach."
session.history.APPEND(SteeringTurn(content = warning))
session.emit(LOOP_DETECTION, message = warning)
END LOOP
-- Process follow-up messages if any are queued
IF session.followup_queue IS NOT EMPTY:
next_input = session.followup_queue.DEQUEUE()
process_input(session, next_input)
RETURN
session.state = IDLE
session.emit(SESSION_END)
FUNCTION drain_steering(session):
WHILE session.steering_queue IS NOT EMPTY:
msg = session.steering_queue.DEQUEUE()
session.history.APPEND(SteeringTurn(content = msg))
session.emit(STEERING_INJECTED, content = msg)
FUNCTION execute_tool_calls(session, tool_calls):
results = []
-- Execute tool calls (concurrently if profile supports parallel execution)
IF session.provider_profile.supports_parallel_tool_calls AND LENGTH(tool_calls) > 1:
results = AWAIT_ALL([
execute_single_tool(session, tc) FOR tc IN tool_calls
])
ELSE:
FOR EACH tc IN tool_calls:
result = execute_single_tool(session, tc)
results.APPEND(result)
RETURN results
FUNCTION execute_single_tool(session, tool_call):
session.emit(TOOL_CALL_START, tool_name = tool_call.name, call_id = tool_call.id)
-- Look up tool in registry
registered = session.provider_profile.tool_registry.get(tool_call.name)
IF registered IS None:
error_msg = "Unknown tool: " + tool_call.name
session.emit(TOOL_CALL_END, call_id = tool_call.id, error = error_msg)
RETURN ToolResult(tool_call_id = tool_call.id, content = error_msg, is_error = true)
-- Execute via execution environment
TRY:
raw_output = registered.execute(tool_call.arguments, session.execution_env)
-- Truncate output before sending to LLM (character-based first, then line-based)
truncated_output = truncate_tool_output(raw_output, tool_call.name, session.config)
-- Emit full output via event stream (not truncated)
session.emit(TOOL_CALL_END, call_id = tool_call.id, output = raw_output)
RETURN ToolResult(
tool_call_id = tool_call.id,
content = truncated_output,
is_error = false
)
CATCH error:
error_msg = "Tool error (" + tool_call.name + "): " + str(error)
session.emit(TOOL_CALL_END, call_id = tool_call.id, error = error_msg)
RETURN ToolResult(tool_call_id = tool_call.id, content = error_msg, is_error = true)
Steering allows the host application to inject messages into the conversation between tool rounds. This is how a user can redirect the agent mid-task without waiting for it to finish.
session.steer(message: String)
-- Queue a message to be injected after the current tool round completes.
-- The message becomes a SteeringTurn in the history, converted to a
-- user message for the LLM on the next call.
-- If the agent is idle, the message is delivered on the next submit().
session.follow_up(message: String)
-- Queue a message to be processed after the current input is fully handled
-- (model has produced a text-only response). Triggers a new processing cycle.
SteeringTurns are converted to user-role messages when building the LLM request. This means the model sees them as additional user instructions.
The reasoning_effort config controls how much reasoning/thinking the model does. It maps directly to the Unified LLM SDK's reasoning_effort field on the Request.
| Value | Effect |
|---|---|
| "low" | Minimal reasoning. Faster, cheaper. Good for simple tasks. |
| "medium" | Balanced reasoning. Default for most tasks. |
| "high" | Deep reasoning. Slower, more expensive. Good for complex tasks. |
| null | Provider default (no override). |
Changing reasoning_effort mid-session takes effect on the next LLM call. For OpenAI reasoning models (GPT-5.2 series), this controls the reasoning token budget. For Anthropic models with extended thinking, this maps to the thinking budget. For Gemini models with thinking, this maps to thinkingConfig.
The loop exits when any of these conditions is met:
- Natural completion. The model responds with text only (no tool calls). The model is done.
- Round limit.
max_tool_rounds_per_inputis reached. The agent stops and returns what it has. - Turn limit.
max_turnsacross the entire session is reached. - Abort signal. The host application signals cancellation. The current LLM stream is closed, running processes are killed, and the session transitions to CLOSED.
- Unrecoverable error. An authentication error, context overflow, or other non-retryable error. The session transitions to CLOSED.
Every agent action emits a typed event. Events are delivered via an async iterator (or language-appropriate equivalent) to the host application.
RECORD SessionEvent:
kind : EventKind
timestamp : Timestamp
session_id : String
data : Map<String, Any>
ENUM EventKind:
SESSION_START -- session created
SESSION_END -- session closed (includes final state)
USER_INPUT -- user submitted input
ASSISTANT_TEXT_START -- model began generating text
ASSISTANT_TEXT_DELTA -- incremental text token
ASSISTANT_TEXT_END -- model finished text (includes full text)
TOOL_CALL_START -- tool execution began (includes tool name, call ID)
TOOL_CALL_OUTPUT_DELTA -- incremental tool output (for streaming tools)
TOOL_CALL_END -- tool execution finished (includes FULL untruncated output)
STEERING_INJECTED -- a steering message was added to history
TURN_LIMIT -- a turn limit was hit
LOOP_DETECTION -- a loop pattern was detected
ERROR -- an error occurred
Key design decision: The TOOL_CALL_END event carries the FULL untruncated tool output. The LLM receives the truncated version. This means the host application (UI, logs) always has access to complete output even though the model sees an abbreviated version.
Track the signature of each tool call (name + arguments hash). If the last N calls (default: 10) contain a repeating pattern (e.g., the same 2-3 calls cycling), inject a warning as a SteeringTurn telling the model to try a different approach.
FUNCTION detect_loop(history, window_size) -> Boolean:
recent_calls = extract_tool_call_signatures(history, last = window_size)
IF LENGTH(recent_calls) < window_size: RETURN false
-- Check for repeating patterns of length 1, 2, or 3
FOR pattern_len IN [1, 2, 3]:
IF window_size % pattern_len != 0: CONTINUE
pattern = recent_calls[0..pattern_len]
all_match = true
FOR i FROM pattern_len TO window_size STEP pattern_len:
IF recent_calls[i..i+pattern_len] != pattern:
all_match = false
BREAK
IF all_match: RETURN true
RETURN false
Models are trained and optimized for specific tool interfaces. OpenAI's models are trained on codex-rs's apply_patch format and tool schemas. Anthropic's models are trained on Claude Code's old_string/new_string editing and tool schemas. Gemini models are trained on gemini-cli's tool set.
Using a provider's native tool format produces better results than forcing a universal format. The initial base for each provider should be a 1:1 copy of the provider's reference agent -- the exact same system prompt, the exact same tool definitions, byte for byte. Not a similar prompt. Not similar tools. The actual prompt and harness that the model was evaluated and optimized against. Then extend it with additional capabilities (like subagents). Do not make all providers conform to a single tool interface.
INTERFACE ProviderProfile:
id : String -- "openai", "anthropic", "gemini"
model : String -- model identifier (e.g., "gpt-5.2-codex")
tool_registry : ToolRegistry -- all tools available to this profile
FUNCTION build_system_prompt(environment, project_docs) -> String
FUNCTION tools() -> List<ToolDefinition>
FUNCTION provider_options() -> Map | None
-- Capability flags
supports_reasoning : Boolean
supports_streaming : Boolean
supports_parallel_tool_calls : Boolean
context_window_size : Integer
All profiles include these base tools. The parameter schemas and output formats may vary between profiles (to match the provider's native conventions), but the functionality is the same.
Reads a file's contents with line numbers.
TOOL read_file:
description: "Read a file from the filesystem. Returns line-numbered content."
parameters:
file_path : String (required) -- absolute path to the file
offset : Integer (optional) -- 1-based line number to start reading from
limit : Integer (optional) -- max lines to read (default: 2000)
returns: Line-numbered text content in "NNN | content" format
errors: File not found, permission denied, binary file
Behavior: Read the file, prepend line numbers, respect offset/limit. For image files, return the image data for multimodal models. For very large files without offset/limit, the tool output will be truncated by the truncation layer (Section 5).
Writes content to a file, creating it if it does not exist.
TOOL write_file:
description: "Write content to a file. Creates the file and parent directories if needed."
parameters:
file_path : String (required) -- absolute path
content : String (required) -- the full file content
returns: Confirmation message with bytes written
errors: Permission denied, disk full
Searches for an exact string in a file and replaces it. This is the native editing format for Anthropic models.
TOOL edit_file:
description: "Replace an exact string occurrence in a file."
parameters:
file_path : String (required)
old_string : String (required) -- exact text to find
new_string : String (required) -- replacement text
replace_all : Boolean (optional) -- replace all occurrences (default: false)
returns: Confirmation with number of replacements made
errors: File not found, old_string not found, old_string not unique (when replace_all=false)
Behavior: Exact string match. If old_string is not found exactly, the implementation may attempt fuzzy matching (whitespace normalization, Unicode equivalence) and report the match. If old_string matches multiple locations and replace_all is false, return an error asking the model to provide more context.
Executes a command in the system shell.
TOOL shell:
description: "Execute a shell command. Returns stdout, stderr, and exit code."
parameters:
command : String (required) -- the command to run
timeout_ms : Integer (optional) -- override default timeout
description : String (optional) -- human-readable description of what this does
returns: Command output (stdout + stderr), exit code, duration
errors: Timeout, permission denied, command not found
Behavior: Run in a new process group. Enforce timeout (default from SessionConfig, overridable per-call). On timeout: SIGTERM, wait 2 seconds, SIGKILL. Return collected output plus timeout message. Environment variable filtering applied (see Section 4).
Searches file contents by pattern.
TOOL grep:
description: "Search file contents using regex patterns."
parameters:
pattern : String (required) -- regex pattern
path : String (optional) -- directory or file to search (default: working dir)
glob_filter : String (optional) -- file pattern filter (e.g., "*.py")
case_insensitive: Boolean (optional) -- default: false
max_results : Integer (optional) -- default: 100
returns: Matching lines with file paths and line numbers
errors: Invalid regex, path not found
Finds files by name pattern.
TOOL glob:
description: "Find files matching a glob pattern."
parameters:
pattern : String (required) -- glob pattern (e.g., "**/*.ts")
path : String (optional) -- base directory (default: working dir)
returns: List of matching file paths, sorted by modification time (newest first)
errors: Invalid pattern, path not found
For GPT-5.2, GPT-5.2-codex, and other OpenAI models. Mirrors the codex-rs toolset.
Key difference: apply_patch replaces edit_file and write_file for file modifications. OpenAI models are specifically trained on this format and produce significantly better edits when using it.
Additional/modified tools beyond the shared core:
TOOL apply_patch:
description: "Apply code changes using the patch format. Supports creating, deleting,
and modifying files in a single operation."
parameters:
patch : String (required) -- the patch content in v4a format
returns: List of affected file paths and operations performed
errors: Parse error, file not found (for updates), verification failure
The patch format is defined in full in Appendix A.
Profile tool list for OpenAI:
read_file(same as shared core, maps to codex-rsread_file)apply_patch(replacesedit_fileandwrite_filefor modifications)write_file(kept for creating new files without patch overhead)shell(maps to codex-rsexec_command, 10s default timeout)grep(maps to codex-rsgrep_files)glob(maps to codex-rslist_dir)spawn_agent,send_input,wait,close_agent(subagent tools, Section 7)
System prompt: Should mirror the codex-rs system prompt structure. Cover identity, tool usage guidelines, the apply_patch format expectations, and coding best practices.
Provider options: The OpenAI profile should set reasoning.effort on the Responses API request when reasoning_effort is configured.
For Claude Opus 4.6, Opus 4.5, Sonnet 4.5, Haiku 4.5, and older Claude models. Mirrors the Claude Code toolset.
Key difference: edit_file with old_string/new_string is the native editing format. Anthropic models are specifically trained on this exact-match search-and-replace pattern. Do NOT use apply_patch with Anthropic models.
Profile tool list for Anthropic:
read_file(line-numbered output, offset/limit support)write_file(full file writes)edit_file(old_string/new_string -- this is the native format)shell(bash execution, 120s default timeout per Claude Code convention)grep(ripgrep-backed with output modes: content, files_with_matches, count)glob(file pattern matching sorted by mtime)- Subagent tools (maps to Claude Code's Task tool pattern, Section 7)
System prompt: Should mirror the Claude Code system prompt structure. Cover identity, tool selection guidance, the edit_file format (explain that old_string must be unique), file operation preferences (edit existing files over creating new ones), and coding best practices.
Provider options: The Anthropic profile should pass beta headers (e.g., for extended thinking, 1M context) via provider_options.anthropic.beta_headers.
For Gemini 3 Flash, Gemini 2.5 Pro/Flash, and other Gemini models. Mirrors the gemini-cli toolset.
Profile tool list for Gemini:
read_file/read_many_files(batch reading support)write_fileedit_file(search-and-replace style, matching gemini-cli conventions)shell(command execution, 10s default timeout)grep(ripgrep semantics)glob(file pattern matching)list_dir(directory listing with depth options)web_search(optional -- Gemini models have native grounding capabilities)web_fetch(optional -- fetch and extract content from URLs)- Subagent tools (Section 7)
System prompt: Should mirror the gemini-cli system prompt structure. Cover identity, tool usage, GEMINI.md conventions, and coding best practices.
Provider options: Gemini profile should configure safety settings and grounding via provider_options.gemini.
After a provider profile is loaded, additional tools can be registered:
profile = create_openai_profile(model = "gpt-5.2-codex")
-- Add a custom tool on top of the profile
profile.tool_registry.register(RegisteredTool(
definition = ToolDefinition(
name = "run_tests",
description = "Run the project's test suite",
parameters = { "type": "object", "properties": { "filter": { "type": "string" } } }
),
executor = run_tests_function
))
Name collisions are resolved by latest-wins: a custom tool with the same name as a profile tool overrides it.
RECORD ToolDefinition:
name : String -- unique identifier
description : String -- for the LLM
parameters : Dict -- JSON Schema (root must be "object")
RECORD RegisteredTool:
definition : ToolDefinition
executor : Function -- (arguments, execution_env) -> String
RECORD ToolRegistry:
_tools : Map<String, RegisteredTool>
register(tool) -- add or replace a tool
unregister(name) -- remove a tool
get(name) -> RegisteredTool | None
definitions() -> List<ToolDefinition>
names() -> List<String>
Tool execution pipeline:
1. LOOKUP -- find the RegisteredTool by name
2. VALIDATE -- parse and validate arguments against JSON Schema
3. EXECUTE -- call executor with (arguments, execution_env)
4. TRUNCATE -- apply output size limits (Section 5)
5. EMIT -- emit TOOL_CALL_END event with full output
6. RETURN -- return truncated output as ToolResult
All tool operations pass through an ExecutionEnvironment interface. This decouples tool logic from where it runs. The default runs locally. Swap in a different implementation to run the same tools in Docker, on a Kubernetes pod, over SSH, or in WASM.
INTERFACE ExecutionEnvironment:
-- File operations
read_file(path: String, offset: Integer | None, limit: Integer | None) -> String
write_file(path: String, content: String) -> void
file_exists(path: String) -> Boolean
list_directory(path: String, depth: Integer) -> List<DirEntry>
-- Command execution
exec_command(
command : String,
timeout_ms : Integer,
working_dir : String | None,
env_vars : Map<String, String> | None
) -> ExecResult
-- Search operations
grep(pattern: String, path: String, options: GrepOptions) -> String
glob(pattern: String, path: String) -> List<String>
-- Lifecycle
initialize() -> void
cleanup() -> void
-- Metadata
working_directory() -> String
platform() -> String -- "darwin", "linux", "windows", "wasm"
os_version() -> String
RECORD ExecResult:
stdout : String
stderr : String
exit_code : Integer
timed_out : Boolean
duration_ms : Integer
RECORD DirEntry:
name : String
is_dir : Boolean
size : Integer | None
The default. Runs everything on the local machine.
File operations: Direct filesystem access. Paths are resolved relative to working_directory().
Command execution:
- Spawn in a new process group for clean killability
- Use the platform's default shell (
/bin/bash -con Linux/macOS,cmd.exe /con Windows) - Enforce timeout: on timeout, send SIGTERM to the process group, wait 2 seconds, then SIGKILL
- Capture stdout and stderr separately, then combine for the result
- Record wall-clock duration
Environment variable filtering:
- By default, exclude variables matching:
*_API_KEY,*_SECRET,*_TOKEN,*_PASSWORD,*_CREDENTIAL(case-insensitive) - Always include:
PATH,HOME,USER,SHELL,LANG,TERM,TMPDIR, language-specific paths (GOPATH,CARGO_HOME,NVM_DIR, etc.) - Customizable via an env var policy: inherit all, inherit none (start clean), or inherit core only
Search operations: Use ripgrep for grep if available, fall back to language-native regex search. Use filesystem globbing for glob.
These are not required implementations. They demonstrate the extensibility of the interface.
DockerExecutionEnvironment:
-- Commands execute inside a Docker container
exec_command(cmd, ...) -> docker exec <container_id> sh -c <cmd>
-- File operations use volume mounts or docker cp
read_file(path) -> docker cp <container_id>:<path> - | read
write_file(path, content) -> pipe content | docker cp - <container_id>:<path>
KubernetesExecutionEnvironment:
-- Commands execute in a Kubernetes pod
exec_command(cmd, ...) -> kubectl exec <pod> -- sh -c <cmd>
-- File operations use kubectl cp
read_file(path) -> kubectl cp <pod>:<path> /dev/stdout
WASMExecutionEnvironment:
-- For in-browser or embedded use
-- File operations use an in-memory filesystem (e.g., memfs)
-- Command execution is limited or emulated via WASI
RemoteSSHExecutionEnvironment:
-- Commands execute over SSH
exec_command(cmd, ...) -> ssh <host> <cmd>
-- File operations use SCP/SFTP
read_file(path) -> sftp get <host>:<path>
Execution environments can be wrapped for cross-cutting concerns:
-- Logging wrapper
LoggingExecutionEnvironment(inner: ExecutionEnvironment):
exec_command(cmd, ...):
LOG("exec: " + cmd)
result = inner.exec_command(cmd, ...)
LOG("exit: " + result.exit_code + " in " + result.duration_ms + "ms")
RETURN result
-- Read-only wrapper (rejects all writes)
ReadOnlyExecutionEnvironment(inner: ExecutionEnvironment):
write_file(path, content):
RAISE "Write operations are disabled in read-only mode"
exec_command(cmd, ...):
-- Could analyze command for write intent, or allow all
RETURN inner.exec_command(cmd, ...)
When tool output exceeds the configured limit, it MUST be truncated before being sent to the LLM. The full output is always available via the event stream (TOOL_CALL_END event).
Truncation algorithm (head/tail split):
FUNCTION truncate_output(output: String, max_chars: Integer, mode: String) -> String:
IF LENGTH(output) <= max_chars:
RETURN output
IF mode == "head_tail":
half = max_chars / 2
removed = LENGTH(output) - max_chars
RETURN output[0..half]
+ "\n\n[WARNING: Tool output was truncated. "
+ removed + " characters were removed from the middle. "
+ "The full output is available in the event stream. "
+ "If you need to see specific parts, re-run the tool with more targeted parameters.]\n\n"
+ output[-half..]
IF mode == "tail":
removed = LENGTH(output) - max_chars
RETURN "[WARNING: Tool output was truncated. First "
+ removed + " characters were removed. "
+ "The full output is available in the event stream.]\n\n"
+ output[-max_chars..]
The truncation message explicitly tells the model that output was truncated, how much was removed, and where the full output lives. This prevents the model from making decisions based on incomplete information without knowing it is incomplete.
| Tool | Default Max (chars) | Truncation Mode | Rationale |
|---|---|---|---|
| read_file | 50,000 | head_tail | Keep beginning (imports/types) and end (recent code) |
| shell | 30,000 | head_tail | Beginning has startup info, end has results |
| grep | 20,000 | tail | Keep the most recent/relevant matches |
| glob | 20,000 | tail | Most recently modified files first |
| edit_file | 10,000 | tail | Confirmation output, usually short |
| apply_patch | 10,000 | tail | Patch results, usually short |
| write_file | 1,000 | tail | Confirmation, always short |
| spawn_agent | 20,000 | head_tail | Subagent results |
These defaults are overridable via SessionConfig.tool_output_limits.
Character-based truncation (Section 5.1) is the primary safeguard and MUST always run first. It handles every case including pathological ones like a 2-line CSV where each line is 10MB. Line-based truncation is a secondary readability pass that runs after character truncation.
The full pipeline for every tool output:
FUNCTION truncate_tool_output(output, tool_name, config) -> String:
max_chars = config.tool_output_limits.get(tool_name, DEFAULT_TOOL_LIMITS[tool_name])
-- Step 1: Character-based truncation (always runs, handles all size concerns)
result = truncate_output(output, max_chars, DEFAULT_TRUNCATION_MODES[tool_name])
-- Step 2: Line-based truncation (secondary, for readability)
max_lines = config.tool_line_limits.get(tool_name, DEFAULT_LINE_LIMITS[tool_name])
IF max_lines IS NOT None:
result = truncate_lines(result, max_lines)
RETURN result
Default line limits (applied after character truncation):
| Tool | Default Max Lines | Rationale |
|---|---|---|
| shell | 256 | Command output with many short lines |
| grep | 200 | Search results, one per line |
| glob | 500 | File listings, one path per line |
| read_file | None | Character limit is sufficient |
| edit_file | None | Character limit is sufficient |
Line-based truncation uses the same head/tail split:
FUNCTION truncate_lines(output: String, max_lines: Integer) -> String:
lines = SPLIT(output, "\n")
IF LENGTH(lines) <= max_lines:
RETURN output
head_count = max_lines / 2
tail_count = max_lines - head_count
omitted = LENGTH(lines) - head_count - tail_count
RETURN JOIN(lines[0..head_count], "\n")
+ "\n[... " + omitted + " lines omitted ...]\n"
+ JOIN(lines[-tail_count..], "\n")
Why character truncation must come first: A file could have 2 lines that are each 10MB. Line-based truncation would see "only 2 lines" and pass it through untouched, blowing up the context window. Character truncation catches this because it operates on raw size, not line count. Always truncate by size first, then by line count.
Every command execution has a default timeout. The model can override the timeout per-call via the shell tool's timeout_ms parameter.
| Setting | Default | Purpose |
|---|---|---|
| default_command_timeout_ms | 10,000 | Applied when timeout_ms is not set |
| max_command_timeout_ms | 600,000 | Upper bound (10 minutes) |
When a timeout fires:
- Send SIGTERM to the process group
- Wait 2 seconds for graceful shutdown
- Send SIGKILL if the process is still running
- Return collected output so far plus a timeout message
The timeout message sent to the LLM:
[ERROR: Command timed out after {X}ms. Partial output is shown above.
You can retry with a longer timeout by setting the timeout_ms parameter.]
The agent should track approximate token usage using the heuristic: 1 token ~ 4 characters. Emit a warning event when usage exceeds 80% of the provider profile's context_window_size.
This is informational only. The agent does NOT perform automatic compaction or summarization (that is out of scope for this spec). The host application can use this signal to implement its own context management strategy.
FUNCTION check_context_usage(session):
approx_tokens = total_chars_in_history(session.history) / 4
threshold = session.provider_profile.context_window_size * 0.8
IF approx_tokens > threshold:
session.emit(WARNING, message = "Context usage at ~"
+ ROUND(approx_tokens / session.provider_profile.context_window_size * 100)
+ "% of context window")
The system prompt is assembled from multiple layers, with later layers taking precedence:
final_system_prompt =
1. Provider-specific base instructions (from ProviderProfile)
+ 2. Environment context (platform, git, working dir, date, model info)
+ 3. Tool descriptions (from the active profile's tool set)
+ 4. Project-specific instructions (AGENTS.md, CLAUDE.md, GEMINI.md, etc.)
+ 5. User instructions override (appended last, highest priority)
Each profile supplies its own base prompt tuned for the model family. The base instructions should closely mirror the system prompts of the provider's native agent:
- OpenAI profile: Mirror codex-rs system prompt. Cover identity, tool usage (especially apply_patch conventions), coding best practices, error handling guidance.
- Anthropic profile: Mirror Claude Code system prompt. Cover identity, tool selection guidance (read before edit, edit over write), the edit_file format (old_string must be unique), file operation preferences.
- Gemini profile: Mirror gemini-cli system prompt. Cover identity, tool usage, GEMINI.md conventions, coding best practices.
The spec does NOT prescribe full system prompt text -- those are implementation details that change frequently. It specifies what topics the prompt must cover.
Include a structured block with runtime information:
<environment>
Working directory: {working_directory}
Is git repository: {true/false}
Git branch: {current_branch}
Platform: {darwin/linux/windows}
OS version: {os_version_string}
Today's date: {YYYY-MM-DD}
Model: {model_display_name}
Knowledge cutoff: {knowledge_cutoff_date}
</environment>
This block is generated at session start and included in every system prompt.
Snapshot at session start. Include:
- Current branch
- Short status (modified/untracked file count, not full diff)
- Recent commit messages (last 5-10)
The model can always run git status, git diff, etc. via the shell tool for current state. The snapshot provides initial orientation.
Walk from the git root (or working directory if not in a git repo) to the current working directory. Recognized instruction files:
| File Name | Convention |
|---|---|
AGENTS.md |
Universal |
CLAUDE.md |
Anthropic-aligned |
GEMINI.md |
Gemini-aligned |
.codex/instructions.md |
OpenAI-aligned |
Loading rules:
- Root-level files are loaded first
- Subdirectory files are appended (deeper = higher precedence)
- Total byte budget: 32KB. If exceeded, truncate with a marker: "[Project instructions truncated at 32KB]"
- Only load files matching the active provider profile (e.g., Anthropic profile loads AGENTS.md and CLAUDE.md, not GEMINI.md)
- AGENTS.md is always loaded regardless of provider
A subagent is a child session spawned by the parent to handle a scoped task. The subagent runs its own agentic loop with its own conversation history but shares the parent's execution environment (same filesystem, same working directory or a subdirectory). This enables parallel work and task decomposition.
TOOL spawn_agent:
description: "Spawn a subagent to handle a scoped task autonomously."
parameters:
task : String (required) -- natural language task description
working_dir : String (optional) -- subdirectory to scope the agent to
model : String (optional) -- model override (default: parent's model)
max_turns : Integer (optional) -- turn limit (default: 0, unlimited)
returns: Agent ID and initial status
TOOL send_input:
description: "Send a message to a running subagent."
parameters:
agent_id : String (required)
message : String (required)
returns: Acknowledgement
TOOL wait:
description: "Wait for a subagent to complete and return its result."
parameters:
agent_id : String (required)
returns: SubAgentResult (output text, success boolean, turns used)
TOOL close_agent:
description: "Terminate a subagent."
parameters:
agent_id : String (required)
returns: Final status
RECORD SubAgentHandle:
id : String
session : Session -- independent session with own history
status : "running" | "completed" | "failed"
RECORD SubAgentResult:
output : String -- final text output from the subagent
success : Boolean
turns_used : Integer
The subagent:
- Gets its own Session with independent conversation history
- Shares the parent's
ExecutionEnvironment(same filesystem) - Uses the parent's
ProviderProfile(or an overridden model) - Has its own turn limits (configurable, default: 0, unlimited)
- Cannot spawn sub-sub-agents (depth limiting, default max depth: 1, configurable via
max_subagent_depth)
- Parallel exploration: Spawn multiple agents to investigate different parts of the codebase simultaneously
- Focused refactoring: Scope an agent to a single module with a specific task
- Test execution: Spawn an agent to run and fix tests while the parent continues other work
- Alternative approaches: Spawn agents to try different solutions and pick the best one
The following features are intentionally excluded from this core spec. They are valuable extensions that can be added on top of the architecture defined here. The spec's design has natural extension points for each.
MCP (Model Context Protocol). An MCP client can extend the agent with tools from external servers (GitHub, databases, Slack, etc.). The tool registry supports registering MCP-discovered tools with namespaced names (e.g., github__create_pr). This is a natural extension but not a core requirement for a functional coding agent.
Skills / Custom Commands. Reusable prompt templates stored as markdown files with YAML frontmatter. Skills standardize common workflows (e.g., /commit, /review-pr) and can be loaded from project directories or user home. The system prompt layer has a natural insertion point for skill descriptions.
Sandbox / Security Policies. OS-level sandboxing (macOS Seatbelt, Linux Landlock/Seccomp, Windows restricted tokens) constrains file and network access. The ExecutionEnvironment abstraction provides a natural hook -- a SandboxedLocalExecutionEnvironment could wrap the default environment. For stronger isolation, use DockerExecutionEnvironment.
Compaction / Context Summarization. Automatic conversation history summarization when approaching context limits. This is a complex feature with significant tradeoffs (information loss, summarization cost, pinned turns). The context window awareness signal (Section 5.5) gives host applications the information they need to implement their own strategy.
Approval / Permission System. User approval gates for sensitive operations (file writes, shell commands, destructive actions). The tool execution pipeline (Section 3.8) has a natural extension point between VALIDATE and EXECUTE where an approval step can be inserted.
Read-Before-Write Guardrail. Tracking which files have been read and blocking writes to unread files. A heuristic safety net that can be implemented as a tool execution middleware wrapping the execution environment.
This section defines how to validate that an implementation of this spec is complete and correct. An implementation is done when every item is checked off.
- Session can be created with a ProviderProfile and ExecutionEnvironment
-
process_input()runs the agentic loop: LLM call -> tool execution -> loop until natural completion - Natural completion: model responds with text only (no tool calls) and the loop exits
- Round limits:
max_tool_rounds_per_inputstops the loop when reached - Session turn limits:
max_turnsstops the loop across all inputs - Abort signal: cancellation stops the loop, kills running processes, transitions to CLOSED
- Loop detection: consecutive identical tool call patterns trigger a warning SteeringTurn
- Multiple sequential inputs work: submit, wait for completion, submit again
- OpenAI profile provides codex-rs-aligned tools including
apply_patch(v4a format) - Anthropic profile provides Claude Code-aligned tools including
edit_file(old_string/new_string) - Gemini profile provides gemini-cli-aligned tools
- Each profile produces a provider-specific system prompt covering identity, tool usage, and coding guidance
- Custom tools can be registered on top of any profile
- Tool name collisions resolved: custom registration overrides profile defaults
- Tool calls are dispatched through the ToolRegistry
- Unknown tool calls return an error result to the LLM (not an exception)
- Tool argument JSON is parsed and validated against the tool's parameter schema
- Tool execution errors are caught and returned as error results (
is_error = true) - Parallel tool execution works when the profile's
supports_parallel_tool_callsis true
-
LocalExecutionEnvironmentimplements all file and command operations - Command timeout default is 10 seconds
- Command timeout is overridable per-call via the shell tool's
timeout_msparameter - Timed-out commands: process group receives SIGTERM, then SIGKILL after 2 seconds
- Environment variable filtering excludes sensitive variables (
*_API_KEY,*_SECRET, etc.) by default - The
ExecutionEnvironmentinterface is implementable by consumers for custom environments (Docker, K8s, WASM, SSH)
- Character-based truncation runs FIRST on all tool outputs (handles pathological cases like 10MB single-line CSVs)
- Line-based truncation runs SECOND where configured (shell: 256, grep: 200, glob: 500)
- Truncation inserts a visible marker:
[WARNING: Tool output was truncated. N characters removed...] - The full untruncated output is available via the
TOOL_CALL_ENDevent - Default character limits match the table in Section 5.2 (read_file: 50k, shell: 30k, grep: 20k, etc.)
- Both character and line limits are overridable via
SessionConfig
-
steer()queues a message that is injected after the current tool round -
follow_up()queues a message that is processed after the current input completes - Steering messages appear as SteeringTurn in the history
- SteeringTurns are converted to user-role messages for the LLM
-
reasoning_effortis passed through to the LLM SDK Request - Changing
reasoning_effortmid-session takes effect on the next LLM call - Valid values: "low", "medium", "high", null (provider default) (certain providers might have other options like
xhigh)
- System prompt includes provider-specific base instructions
- System prompt includes environment context (platform, git, working dir, date, model info)
- System prompt includes tool descriptions from the active profile
- Project documentation files (AGENTS.md + provider-specific files) are discovered and included
- User instruction overrides are appended last (highest priority)
- Only relevant project files are loaded (e.g., Anthropic profile loads CLAUDE.md, not GEMINI.md)
- Subagents can be spawned with a scoped task via the
spawn_agenttool - Subagents share the parent's execution environment (same filesystem)
- Subagents maintain independent conversation history
- Depth limiting prevents recursive spawning (default max depth: 1)
- Subagent results are returned to the parent as tool results
-
send_input,wait, andclose_agenttools work correctly
- All event kinds listed in Section 2.9 are emitted at the correct times
- Events are delivered via async iterator or language-appropriate equivalent
-
TOOL_CALL_ENDevents carry full untruncated tool output - Session lifecycle events (SESSION_START, SESSION_END) bracket the session
- Tool execution errors -> error result sent to LLM (model can recover)
- LLM API transient errors (429, 500-503) -> retry with backoff (handled by Unified LLM SDK layer)
- Authentication errors -> surface immediately, no retry, session transitions to CLOSED
- Context window overflow -> emit warning event (no automatic compaction)
- Graceful shutdown: abort signal -> cancel LLM stream -> kill running processes -> flush events -> emit SESSION_END
Run this validation matrix -- each cell must pass:
| Test Case | OpenAI | Anthropic | Gemini |
|---|---|---|---|
| Simple file creation task | [ ] | [ ] | [ ] |
| Read file, then edit it | [ ] | [ ] | [ ] |
| Multi-file edit in one session | [ ] | [ ] | [ ] |
| Shell command execution | [ ] | [ ] | [ ] |
| Shell command timeout handling | [ ] | [ ] | [ ] |
| Grep + glob to find files | [ ] | [ ] | [ ] |
| Multi-step task (read -> analyze -> edit) | [ ] | [ ] | [ ] |
| Tool output truncation (large file) | [ ] | [ ] | [ ] |
| Parallel tool calls (if supported) | [ ] | [ ] | [ ] |
| Steering mid-task | [ ] | [ ] | [ ] |
| Reasoning effort change | [ ] | [ ] | [ ] |
| Subagent spawn and wait | [ ] | [ ] | [ ] |
| Loop detection triggers warning | [ ] | [ ] | [ ] |
| Error recovery (tool fails, model retries) | [ ] | [ ] | [ ] |
| Provider-specific editing format works | [ ] | [ ] | [ ] |
End-to-end test with real API keys:
FOR EACH profile IN [openai_profile, anthropic_profile, gemini_profile]:
env = LocalExecutionEnvironment(working_dir = temp_directory())
session = Session(profile, env)
-- 1. Simple file creation
session.submit("Create a file called hello.py that prints 'Hello World'")
ASSERT env.file_exists("hello.py")
ASSERT env.read_file("hello.py") CONTAINS "Hello"
-- 2. Read and edit
session.submit("Read hello.py and add a second print statement that says 'Goodbye'")
content = env.read_file("hello.py")
ASSERT content CONTAINS "Hello"
ASSERT content CONTAINS "Goodbye"
-- 3. Shell execution
session.submit("Run hello.py and show the output")
-- Verify the agent executed the command (check event stream for shell tool call)
-- 4. Truncation verification
env.write_file("big.txt", REPEAT("x", 100000))
session.submit("Read big.txt")
-- Verify TOOL_CALL_END event has full 100k chars
-- Verify the ToolResult sent to LLM has truncation marker
-- 5. Steering
session.submit("Create a Flask web application with multiple routes")
session.steer("Actually, just create a single /health endpoint for now")
-- Verify the agent adjusts its approach
-- 6. Subagent
session.submit("Spawn a subagent to write tests for hello.py, then review its output")
-- Verify subagent tool calls appear in the event stream
-- 7. Timeout handling
session.submit("Run 'sleep 30' with the default timeout")
-- Verify the command times out after 10s and the agent handles it gracefully
The apply_patch tool (used by the OpenAI profile) accepts patches in the v4a format. This format supports creating, deleting, updating, and renaming files in a single patch.
patch = "*** Begin Patch\n" operations "*** End Patch\n"
operations = (add_file | delete_file | update_file)*
add_file = "*** Add File: " path "\n" added_lines
delete_file = "*** Delete File: " path "\n"
update_file = "*** Update File: " path "\n" [move_line] hunks
move_line = "*** Move to: " new_path "\n"
added_lines = ("+" line "\n")*
hunks = hunk+
hunk = "@@ " [context_hint] "\n" hunk_lines
hunk_lines = (context_line | delete_line | add_line)+
context_line = " " line "\n" -- space prefix = unchanged line
delete_line = "-" line "\n" -- minus prefix = remove this line
add_line = "+" line "\n" -- plus prefix = add this line
eof_marker = "*** End of File\n" -- optional, marks end of last hunk
Add File: Creates a new file. All lines are prefixed with +.
*** Begin Patch
*** Add File: src/utils/helpers.py
+def greet(name):
+ return f"Hello, {name}!"
*** End Patch
Delete File: Removes a file entirely.
*** Begin Patch
*** Delete File: src/old_module.py
*** End Patch
Update File: Modifies an existing file using context-based hunks.
*** Begin Patch
*** Update File: src/main.py
@@ def main():
print("Hello")
- return 0
+ print("World")
+ return 1
*** End Patch
Update + Rename: Modify and rename in one operation.
*** Begin Patch
*** Update File: old_name.py
*** Move to: new_name.py
@@ import os
import sys
-import old_dep
+import new_dep
*** End Patch
The @@ line provides a context hint (typically a function signature or recognizable line near the change). The implementation uses this hint plus the context lines (space-prefixed) to locate the correct position in the file. Convention: show 3 lines of context above and below each change.
When exact matching fails, the implementation should attempt fuzzy matching (whitespace normalization, Unicode punctuation equivalence) before reporting an error.
A single Update File block can contain multiple @@ hunks:
*** Begin Patch
*** Update File: src/config.py
@@ DEFAULT_TIMEOUT = 30
-DEFAULT_TIMEOUT = 30
+DEFAULT_TIMEOUT = 60
@@ def load_config():
config = {}
- config["debug"] = False
+ config["debug"] = True
*** End Patch
Tool execution errors are caught by the agent and sent to the LLM as error results (is_error = true). This gives the model the opportunity to recover, retry, or try a different approach.
| Error Type | Example | Recovery |
|---|---|---|
| FileNotFound | read_file on nonexistent path | Model can search for correct path |
| EditConflict | old_string not found or not unique | Model can read file and retry |
| ShellExitError | Command returned nonzero exit code | Model can inspect output and fix |
| ShellTimeout | Command exceeded timeout_ms | Model can retry with longer timeout |
| PermissionDenied | Write to protected path | Model can choose different path |
| ValidationError | Invalid JSON arguments for tool | Model can fix the arguments |
| UnknownTool | Model called a tool not in the registry | Error result tells model the tool name is wrong |
These errors affect the session itself, not individual tool calls.
| Error Type | Retryable | Behavior |
|---|---|---|
| ProviderError (429) | Yes | Retry with backoff (handled by Unified LLM SDK) |
| ProviderError (500-503) | Yes | Retry with backoff (handled by Unified LLM SDK) |
| AuthenticationError | No | Surface immediately, session -> CLOSED |
| ContextLengthError | No | Emit warning, session -> CLOSED |
| NetworkError | Yes | Retry with backoff (handled by Unified LLM SDK) |
| TurnLimitExceeded | No | Emit TURN_LIMIT event, session -> IDLE |
When an abort signal fires or an unrecoverable error occurs:
1. Cancel any in-flight LLM stream
2. Send SIGTERM to all running command process groups
3. Wait 2 seconds
4. Send SIGKILL to any remaining processes
5. Flush pending events
6. Emit SESSION_END event with final state
7. Clean up subagents (close_agent on all active subagents)
8. Transition session to CLOSED
Why provider-aligned toolsets instead of a universal tool set? Models may be trained on specific tool formats. GPT-5.2-codex is trained on apply_patch; forcing it to use old_string/new_string editing produces worse results. Claude is trained on old_string/new_string; forcing it to use apply_patch produces worse results. The initial base for each provider profile should be the exact system prompt and tool harness from that provider's reference agent -- not a similar prompt, not similar tools, but a 1:1 byte-for-byte copy of the original prompt and tool definitions as the starting point. Then extend from there. Starting from the native toolset gives the best baseline experience because the model has been evaluated and optimized against exactly that harness.
Why an extensible execution environment instead of a fixed local implementation? A coding agent that can only run on the local machine is limited. By abstracting tool execution behind an interface, the same agent logic works in Docker (for sandboxing), in Kubernetes (for cloud execution), over SSH (for remote development), or in WASM (for browser-based agents). The abstraction costs almost nothing in complexity but opens up major deployment flexibility.
Why head/tail truncation instead of just truncating from the end? The beginning of a file (imports, type definitions, module docstring) or the beginning of command output (startup messages, headers) is often as important as the end (final results). Head/tail split keeps both. The explicit truncation marker tells the model exactly what happened so it can request specific parts if needed.
Why not automatic compaction? Compaction (summarizing conversation history to free context space) is complex, lossy, and implementation-specific. Different host applications have different requirements -- a CLI might want aggressive compaction, an IDE might want to restart the session, a batch system might want to fail. The spec provides context window awareness signals and leaves the strategy to the host application.
Why does the loop use Client.complete() instead of the SDK's generate()? The SDK's generate() function has its own tool execution loop, but it does not handle: output truncation with explicit markers, steering message injection between tool rounds, event emission for UI rendering, per-tool timeout enforcement, loop detection, or execution environment abstraction. The agent loop needs all of these, so it manages its own loop using the lower-level Client.complete().
Why 10-second default command timeout? This matches codex-rs. Most developer commands (compile, lint, test a single file, git operations) complete in under 10 seconds. Long-running commands (full test suites, builds) should be explicitly requested with a longer timeout. The default protects against runaway processes without being so short that normal operations fail.
Why exclude sensitive environment variables by default? API keys, secrets, and tokens in the environment should not be visible to the LLM (which might include them in responses or log them). The default excludes *_API_KEY, *_SECRET, *_TOKEN, *_PASSWORD patterns. This is a safety default, not a security boundary -- the agent can still run commands that access these variables through the shell's own environment if needed.