Unified LLM Client Specification

This document is a consolidated, language-agnostic specification for building a unified client library that provides a single interface across multiple LLM providers (OpenAI, Anthropic, Google Gemini, and others). It is designed to be implementable from scratch by any developer or coding agent in any programming language.

Overview and Goals
Architecture
Data Model
Generation and Streaming
Tool Calling
Error Handling and Retry
Provider Adapter Contract
Definition of Done

1. Overview and Goals

1.1 Problem Statement

Applications that use large language models face a fragmented ecosystem. Each provider -- OpenAI, Anthropic, Google Gemini, and others -- exposes a different HTTP API with different message formats, tool calling conventions, streaming protocols, error shapes, and authentication mechanisms. Switching providers or supporting multiple providers requires rewriting request construction, response parsing, error handling, and streaming logic.

This specification defines a unified client library that solves this problem. Developers write provider-agnostic code and switch models by changing a single string identifier. No rewiring, no adapter-specific imports.

1.2 Design Principles

Provider-agnostic. Application code should not contain provider-specific logic. The unified interface handles all translation. Provider-specific features are available through an explicit escape hatch, not through leaky abstractions.

Minimal surface area. The library exposes a small number of types and functions. A developer can learn the full API in under an hour. Fewer concepts means fewer bugs and easier maintenance.

Streaming-first. Streaming is a first-class operation, not a flag on a blocking call. The two generation modes -- blocking and streaming -- have separate methods with distinct return types. This makes the type system work for the developer.

Composable. Cross-cutting concerns (logging, retries, caching) are handled through middleware, not baked into the core. The core client is a thin routing layer.

Escape hatches over false abstractions. When a provider offers a unique feature that does not map to the unified model, the library provides a pass-through mechanism rather than pretending the feature does not exist or building an unreliable shim.

1.3 Reference Open-Source Projects

The following open-source projects solve related problems and are worth studying for patterns, trade-offs, and lessons learned. They are not dependencies; implementors may take inspiration from any combination of them.

Vercel AI SDK (https://github.com/vercel/ai) -- TypeScript. Multi-provider architecture with a versioned provider specification. Clean separation between provider interfaces and high-level convenience API (generateText/streamText/generateObject). Demonstrates the start/delta/end streaming event pattern and a composable middleware system.
LiteLLM (https://github.com/BerriAI/litellm) -- Python. Supports 100+ providers behind a single completion() interface. Demonstrates the value of a unified calling convention and the model string routing pattern. Shows how to handle the long tail of provider-specific quirks at scale.
pi-ai (https://github.com/badlogic/pi-mono/tree/main/packages/ai) -- TypeScript. A multi-provider AI client from @mariozechner's pi-mono project. Demonstrates cost tracking, usage aggregation, and a clean provider adapter pattern with explicit reasoning token support.

2. Architecture

2.1 Four-Layer Architecture

The library is organized into four layers, each with a clear responsibility boundary.

Layer 4: High-Level API         generate(), stream(), generate_object()
          ---------------------------------------------------------------
Layer 3: Core Client            Client, provider routing, middleware hooks
          ---------------------------------------------------------------
Layer 2: Provider Utilities     Shared helpers for building adapters
          ---------------------------------------------------------------
Layer 1: Provider Specification  ProviderAdapter interface, shared types

Layer 1 -- Provider Specification. Defines the contract that every provider adapter must implement. Contains only interface definitions and shared type definitions. No implementation logic. This layer is the stability contract: it changes rarely and only with explicit versioning. A new provider is added by implementing this interface, not by modifying it.

Layer 2 -- Provider Utilities. Contains shared code for building adapters: HTTP client helpers, Server-Sent Events (SSE) parsing, retry logic, response normalization utilities, JSON schema translation helpers. Provider adapter authors import this layer; application developers generally do not.

Layer 3 -- Core Client. The main orchestration layer. The Client object holds registered provider adapters, routes requests by provider identifier, applies middleware, and manages configuration. This is the primary import for application code that wants direct control over requests.

Layer 4 -- High-Level API. Provides convenience functions (generate(), stream(), generate_object()) that wrap the Client with ergonomic defaults. Most application code uses this layer. These functions handle prompt standardization, tool execution loops, output parsing, structured output validation, and automatic retries.

2.2 Client Configuration

Environment-Based Setup

The recommended setup for most applications reads standard environment variables per provider:

client = Client.from_env()

Environment variable conventions:

Provider	Required Variable	Optional Variables
OpenAI	OPENAI_API_KEY	OPENAI_BASE_URL, OPENAI_ORG_ID, OPENAI_PROJECT_ID
Anthropic	ANTHROPIC_API_KEY	ANTHROPIC_BASE_URL
Gemini	GEMINI_API_KEY	GEMINI_BASE_URL

Alternate key names may be accepted (e.g., GOOGLE_API_KEY as a fallback for GEMINI_API_KEY). Only providers whose keys are present in the environment are registered. The first registered provider becomes the default.

Programmatic Setup

For full control, adapters are constructed explicitly and registered with the Client:

adapter = OpenAIAdapter(
    api_key = "sk-...",
    base_url = "https://custom-endpoint.example.com/v1",
    default_headers = { "X-Custom": "value" },
    timeout = 30.0
)

client = Client(
    providers = { "openai": adapter },
    default_provider = "openai"
)

Provider Resolution

When a request specifies a provider field, the Client routes to that adapter. When the provider field is omitted, the Client uses default_provider. If no default is set and no provider is specified, the Client raises a configuration error. The Client never guesses.

Model String Convention

Model identifiers are the provider's native string (e.g., "gpt-5.2", "claude-opus-4-6", "gemini-3-flash-preview"). The library does not invent its own model namespace. This avoids the maintenance burden of mapping tables and ensures new models work immediately without library updates. If a model string could be ambiguous (multiple providers support it), the provider field on the request disambiguates.

2.3 Middleware / Interceptor Pattern

The Client supports middleware for cross-cutting concerns. Middleware wraps provider calls and can inspect or modify requests, inspect or modify responses, and perform side effects.

FUNCTION logging_middleware(request, next):
    LOG("Request to " + request.provider + "/" + request.model)
    response = next(request)
    LOG("Response: " + response.usage.total_tokens + " tokens")
    RETURN response

client = Client(
    providers = { ... },
    middleware = [logging_middleware]
)

Execution order. Middleware runs in registration order for the request phase (first registered = first to execute) and in reverse order for the response phase. This is the standard onion/chain-of-responsibility pattern.

Streaming middleware. Middleware must also apply to streaming requests. For streaming, middleware wraps the event iterator and can observe or transform individual stream events. The middleware interface should support both modes:

FUNCTION streaming_middleware(request, next):
    event_iterator = next(request)
    FOR EACH event IN event_iterator:
        log_event(event)
        YIELD event

Common middleware use cases:

Logging
Request/response caching
Cost tracking and budgets
Client-side rate limiting
Prompt injection detection
Circuit breaker pattern

2.4 Provider Adapter Interface

Every provider must implement this interface:

INTERFACE ProviderAdapter:
    PROPERTY name : String             -- e.g., "openai", "anthropic", "gemini"

    FUNCTION complete(request: Request) -> Response
        -- Send a request, block until the model finishes, return the full response.

    FUNCTION stream(request: Request) -> AsyncIterator<StreamEvent>
        -- Send a request, return an asynchronous iterator of stream events.

Why two methods, not one. A single method with a stream: boolean flag was rejected because the return types are fundamentally different. A blocking Response and an asynchronous event stream have different consumption patterns, error handling models, and lifetime semantics. Separate methods make the type system work for the developer.

No separate send_tool_outputs method. Tool results are sent by including them in the message history of a new complete() or stream() call. This matches how Anthropic and Gemini work natively. The OpenAI adapter handles any translation internally.

Optional Adapter Methods

These methods are recommended but not required:

FUNCTION close() -> Void
    -- Release resources (HTTP connections, etc.). Called by Client.close().

FUNCTION initialize() -> Void
    -- Validate configuration on startup. Called by Client on registration.

FUNCTION supports_tool_choice(mode: String) -> Boolean
    -- Query whether a particular tool choice mode is supported.

2.5 Module-Level Default Client

High-level functions (generate(), stream(), etc.) use a module-level default client. This client is lazily initialized from environment variables on first use. Applications can override it:

set_default_client(my_client)

-- Or pass explicitly per call:
result = generate(model = "...", prompt = "...", client = my_client)

2.6 Concurrency Model

The library is async-first. All provider calls are non-blocking. The complete() and stream() methods are asynchronous. The high-level API provides both async and sync wrappers for languages that support both paradigms.

Multiple concurrent requests to different providers (or the same provider) are safe. The Client holds no mutable state between requests. Provider adapters manage their own connection pools and must be safe for concurrent use.

2.7 Native API Usage (Critical)

Each provider adapter MUST use the provider's native, preferred API -- not a compatibility layer. This is a fundamental design requirement. Using a lowest-common-denominator compatibility layer (such as only targeting the OpenAI Chat Completions API shape) loses access to provider-specific capabilities like reasoning tokens, extended thinking, prompt caching, and advanced tool features.

Provider	Required API	Why Not Compatibility Layer
OpenAI	Responses API (`/v1/responses`)	The Responses API properly surfaces reasoning tokens, supports built-in tools (web search, file search, code interpreter), and is OpenAI's forward-looking API. The Chat Completions API does not return reasoning tokens for reasoning models (GPT-5.2 series, etc.) and lacks server-side conversation state.
Anthropic	Messages API (`/v1/messages`)	The Messages API supports extended thinking with thinking blocks and signatures, prompt caching with `cache_control`, beta feature headers, and the strict user/assistant alternation model. There is no alternative.
Gemini	Gemini API (`/v1beta/models/*/generateContent`)	The native Gemini API supports grounding with Google Search, code execution, system instructions, and cached content. OpenAI-compatible endpoints for Gemini are limited shims.

The unified SDK abstracts over these different APIs so that callers write provider-agnostic code, but internally each adapter speaks the provider's native protocol. This is the entire value proposition: the complexity of three different APIs is handled once in the adapters so that downstream consumers (like a coding agent) never have to think about it.

2.8 Provider Beta Headers and Feature Flags

Providers frequently gate new features behind beta headers or feature flags. The unified SDK must support passing these through cleanly.

Anthropic beta headers. Anthropic uses the anthropic-beta header to enable features like:

max-tokens-3-5-sonnet-2025-04-14 -- enables 1M token context for certain models
interleaved-thinking-2025-05-14 -- enables interleaved thinking blocks
token-efficient-tools-2025-02-19 -- more efficient tool token usage
prompt-caching-2024-07-31 -- enables prompt caching

These must be passed as HTTP headers on the request. The adapter should accept them via provider_options:

request = Request(
    model = "claude-opus-4-6",
    messages = [ ... ],
    provider_options = {
        "anthropic": {
            "beta_headers": ["interleaved-thinking-2025-05-14"]
        }
    }
)

The Anthropic adapter joins these into a comma-separated anthropic-beta header value.

OpenAI feature flags. The Responses API supports enabling built-in tools and features via the request body (e.g., tools: [{"type": "web_search_preview"}]). These should be supported through provider_options or by extending the tool definitions.

Gemini configuration. Gemini supports safety settings, grounding configuration, and cached content references as part of the request body. These should be passable through provider_options.

The key principle: the unified interface handles the common 90% of cases. The provider_options escape hatch handles the remaining 10% without requiring library changes for every new provider feature.

2.9 Model Catalog

The SDK should ship with a catalog of known models to help consumers (especially AI coding agents) select valid model identifiers without guessing or hallucinating model names. The catalog is advisory, not restrictive -- unknown model strings are still passed through to the provider.

RECORD ModelInfo:
    id              : String            -- the model's API identifier (e.g., "claude-opus-4-6")
    provider        : String            -- which provider serves this model
    display_name    : String            -- human-readable name (e.g., "Claude Opus 4.6")
    context_window  : Integer           -- maximum total tokens (input + output)
    max_output      : Integer | None    -- maximum output tokens
    supports_tools  : Boolean           -- whether the model supports tool calling
    supports_vision : Boolean           -- whether the model accepts image inputs
    supports_reasoning : Boolean        -- whether the model produces reasoning tokens
    input_cost_per_million  : Float | None  -- cost per 1M input tokens (USD)
    output_cost_per_million : Float | None  -- cost per 1M output tokens (USD)
    aliases         : List<String>      -- shorthand names (e.g., ["sonnet", "claude-sonnet"])

At the time of writing (February 2026), the top models available through each provider's API are:

Provider	Top Model(s)
Anthropic	Claude Opus 4.6, Claude Sonnet 4.5
OpenAI	GPT-5.2 series (GPT-5.2, GPT-5.2-codex)
Gemini	Gemini 3 Pro (Preview), Gemini 3 Flash (Preview)

Implementations should default to the latest available models when no model is specified by the caller, and should prefer newer models in any model selection logic. However, the catalog must also include older models that are still served by the APIs, as callers may need them for cost, latency, or compatibility reasons.

Example catalog (keep this updated as new models release):

MODELS = [
    -- ==========================================================
    -- Anthropic -- prefer Claude Opus 4.6 for top quality
    -- ==========================================================

    ModelInfo(id="claude-opus-4-6",               provider="anthropic", display_name="Claude Opus 4.6",   context_window=200000, supports_tools=true, supports_vision=true, supports_reasoning=true),
    ModelInfo(id="claude-sonnet-4-5",             provider="anthropic", display_name="Claude Sonnet 4.5", context_window=200000, supports_tools=true, supports_vision=true, supports_reasoning=true),

    -- ==========================================================
    -- OpenAI -- prefer GPT-5.2 series for top quality
    -- ==========================================================

    ModelInfo(id="gpt-5.2",                       provider="openai",    display_name="GPT-5.2",           context_window=1047576, supports_tools=true, supports_vision=true, supports_reasoning=true),
    ModelInfo(id="gpt-5.2-mini",                  provider="openai",    display_name="GPT-5.2 Mini",      context_window=1047576, supports_tools=true, supports_vision=true, supports_reasoning=true),
    ModelInfo(id="gpt-5.2-codex",                 provider="openai",    display_name="GPT-5.2 Codex",     context_window=1047576, supports_tools=true, supports_vision=true, supports_reasoning=true),

    -- ==========================================================
    -- Gemini -- prefer Gemini 3 Flash Preview for latest
    -- ==========================================================

    ModelInfo(id="gemini-3-pro-preview",          provider="gemini",    display_name="Gemini 3 Pro (Preview)",   context_window=1048576, supports_tools=true, supports_vision=true, supports_reasoning=true),
    ModelInfo(id="gemini-3-flash-preview",        provider="gemini",    display_name="Gemini 3 Flash (Preview)", context_window=1048576, supports_tools=true, supports_vision=true, supports_reasoning=true),
]

Lookup functions:

get_model_info(model_id: String) -> ModelInfo | None
    -- Returns the catalog entry for a model, or None if unknown.

list_models(provider: String | None) -> List<ModelInfo>
    -- Returns all known models, optionally filtered by provider.

get_latest_model(provider: String, capability: String | None) -> ModelInfo | None
    -- Returns the newest/best model for a provider, optionally filtered by capability
    -- (e.g., "reasoning", "vision", "tools"). Useful for coding agents that want
    -- to always use the latest available model.

Why a catalog matters for coding agents: When an AI coding agent builds on top of this SDK, it needs to select models by capability (e.g., "pick a model that supports vision" or "pick the cheapest model that supports tools"). Without a catalog, the agent must hallucinate model identifiers from its training data, which go stale as providers release new models. The catalog gives the agent a reliable, up-to-date source of truth.

The catalog should be shipped as a data file (JSON or similar) that can be updated independently of the library code. Consider auto-generating it from provider documentation or APIs. When in doubt, prefer the latest models -- they are generally more capable, and the SDK should make it easy to stay current.

2.10 Prompt Caching (Critical for Cost)

Prompt caching allows providers to reuse computation from previous requests when the prefix of the conversation is unchanged. For agentic workloads where the system prompt and conversation history are identical across many turns, caching can reduce input token costs by 50-90%. The unified SDK MUST support caching for each provider.

Provider	Caching Behavior	SDK Action Required
OpenAI	Automatic -- the Responses API caches shared prefixes server-side	None. Use the Responses API and report `cache_read_tokens` from usage.
Gemini	Automatic -- prefix caching for repeated content, plus explicit `cachedContent` API for long contexts	None for automatic. Expose explicit caching via `provider_options`.
Anthropic	Not automatic. Requires explicit `cache_control` annotations on content blocks.	The Anthropic adapter must inject `cache_control` breakpoints automatically for agentic workloads.

Anthropic is the only provider where the SDK must do extra work. Without cache_control annotations, every turn re-processes the entire system prompt and conversation history at full price. With proper caching, cached input tokens cost 90% less. This is the single highest-ROI optimization for agentic workloads.

All three providers report cache statistics. The SDK must map these to Usage.cache_read_tokens and Usage.cache_write_tokens so callers can verify caching is working.

3. Data Model

This section defines all types used by the library. The notation uses a language-neutral struct/record style. Field types use these conventions:

String -- text
Integer -- whole number
Float -- decimal number
Boolean -- true/false
Bytes -- raw binary data
Dict -- key-value map
List<T> -- ordered collection of T
T | None -- optional (nullable) value
T | U -- union / either type

3.1 Message

The fundamental unit of conversation. A conversation is an ordered List<Message>.

RECORD Message:
    role          : Role                  -- who produced this message
    content       : List<ContentPart>     -- the message body (multimodal)
    name          : String | None         -- for tool messages and developer attribution
    tool_call_id  : String | None         -- links a tool-result message to its tool call

Convenience Constructors

For common cases, factory methods create properly structured Message objects:

Message.system("You are a helpful assistant.")
Message.user("What is 2 + 2?")
Message.assistant("The answer is 4.")
Message.tool_result(tool_call_id = "call_123", content = "72F and sunny", is_error = false)

Text Accessor

A convenience property on Message that concatenates text from all text content parts:

message.text -> String
    -- Returns the concatenation of all ContentPart entries where kind == TEXT.
    -- Returns empty string if no text parts exist.

3.2 Role

Five roles cover the semantics of all major providers:

ENUM Role:
    SYSTEM       -- High-level instructions shaping model behavior. Typically first.
    USER         -- Human input. Text, images, audio, documents.
    ASSISTANT    -- Model output. Text, tool calls, thinking blocks.
    TOOL         -- Tool execution results, linked by tool_call_id.
    DEVELOPER    -- Privileged instructions from the application (not the end user).

Provider mapping for roles:

SDK Role	OpenAI	Anthropic	Gemini
SYSTEM	`system` role	Extracted to `system` parameter	`systemInstruction`
USER	`user` role	`user` role	`user` role
ASSISTANT	`assistant` role	`assistant` role	`model` role
TOOL	`tool` role	`tool_result` block in user msg	`functionResponse` in user
DEVELOPER	`developer` role	Merged with system	Merged with system

3.3 ContentPart (Tagged Union)

Each message contains a list of ContentPart objects. Using a list rather than a single string enables multimodal messages (text interleaved with images), structured assistant responses (text interleaved with tool calls and thinking blocks), and tool results that include images.

ContentPart uses a tagged-union pattern: the kind field determines which data field is populated.

RECORD ContentPart:
    kind          : ContentKind | String  -- discriminator tag
    text          : String | None         -- populated when kind == TEXT
    image         : ImageData | None      -- populated when kind == IMAGE
    audio         : AudioData | None      -- populated when kind == AUDIO
    document      : DocumentData | None   -- populated when kind == DOCUMENT
    tool_call     : ToolCallData | None   -- populated when kind == TOOL_CALL
    tool_result   : ToolResultData | None -- populated when kind == TOOL_RESULT
    thinking      : ThinkingData | None   -- populated when kind == THINKING or REDACTED_THINKING

Note: The kind field accepts both the enum and arbitrary strings. This allows extension for provider-specific content kinds without modifying the core enum.

3.4 ContentKind

ENUM ContentKind:
    TEXT                -- Plain text. The most common kind.
    IMAGE               -- Image as URL, base64, or file reference.
    AUDIO               -- Audio as URL or raw bytes with media type.
    DOCUMENT            -- Document (PDF, etc.) as URL, base64, or file reference.
    TOOL_CALL           -- A model-initiated tool invocation.
    TOOL_RESULT         -- The result of executing a tool call.
    THINKING            -- Model reasoning/thinking content.
    REDACTED_THINKING   -- Redacted reasoning (Anthropic). Opaque, must round-trip verbatim.

Direction constraints:

Kind	May appear in roles
TEXT	SYSTEM, USER, ASSISTANT, DEVELOPER, TOOL
IMAGE	USER (input), ASSISTANT (generated)
AUDIO	USER (input)
DOCUMENT	USER (input)
TOOL_CALL	ASSISTANT (output)
TOOL_RESULT	TOOL (response)
THINKING	ASSISTANT (output)
REDACTED_THINKING	ASSISTANT (output)

3.5 Content Data Structures

ImageData

RECORD ImageData:
    url         : String | None     -- URL pointing to the image
    data        : Bytes | None      -- raw image bytes
    media_type  : String | None     -- MIME type, e.g. "image/png", "image/jpeg"
    detail      : String | None     -- processing fidelity hint: "auto", "low", "high"

Exactly one of url or data must be provided. The adapter base64-encodes data if the provider requires it. media_type defaults to "image/png" when data is provided and no type is specified.

Image upload is critical for multimodal capabilities. Many models (Claude, GPT-4.1, Gemini) accept image inputs for analysis, code screenshot reading, diagram understanding, and more. The SDK must handle image upload correctly across all providers:

Concern	OpenAI	Anthropic	Gemini
URL images	`image_url.url` field	`source.type = "url"` with `url` field	`fileData.fileUri` field
Base64 images	`image_url.url` as data URI (`data:mime;base64,...`)	`source.type = "base64"` with `data` + `media_type`	`inlineData` with `data` + `mimeType`
File path (local)	Read file, base64-encode, send as data URI	Read file, base64-encode, send as base64 source	Read file, base64-encode, send as inlineData
Supported formats	PNG, JPEG, GIF, WEBP	PNG, JPEG, GIF, WEBP	PNG, JPEG, GIF, WEBP, HEIC, HEIF
Max image size	20MB	~5MB per image (base64 encoded)	Varies by method
Detail/fidelity hint	`detail`: "auto", "low", "high"	Not supported (ignore)	Not supported (ignore)

Convenience: file path support. The SDK should accept a local file path as a convenience. When url looks like a local file path (starts with /, ./, or ~), the adapter reads the file, infers the MIME type from the extension, base64-encodes the contents, and sends it using the provider's inline data format. This makes it easy for coding agents to send screenshots and diagrams without manual encoding.

AudioData

RECORD AudioData:
    url         : String | None
    data        : Bytes | None
    media_type  : String | None     -- e.g. "audio/wav", "audio/mp3"

DocumentData

RECORD DocumentData:
    url         : String | None
    data        : Bytes | None
    media_type  : String | None     -- e.g. "application/pdf"
    file_name   : String | None     -- optional display name

ToolCallData

RECORD ToolCallData:
    id          : String            -- unique identifier for this call (provider-assigned)
    name        : String            -- tool name
    arguments   : Dict | String     -- parsed JSON arguments or raw argument string
    type        : String            -- "function" (default) or "custom"

The id field is assigned by the provider and is required for linking tool results back to calls. For providers that do not assign unique IDs (e.g., Gemini), the adapter must generate synthetic unique IDs (e.g., "call_" + random_uuid()) and maintain a mapping to the function name.

ToolResultData

RECORD ToolResultData:
    tool_call_id    : String            -- the ToolCallData.id this result answers
    content         : String | Dict     -- the tool's output (text or structured)
    is_error        : Boolean           -- whether the tool execution failed
    image_data      : Bytes | None      -- optional image result
    image_media_type: String | None     -- MIME type for the image result

When is_error is true, the model understands the tool failed and can adjust its approach.

ThinkingData

RECORD ThinkingData:
    text        : String            -- the thinking/reasoning content
    signature   : String | None     -- provider-specific signature for round-tripping
    redacted    : Boolean           -- true if this is redacted thinking (opaque content)

Thinking blocks from Anthropic's extended thinking must be preserved exactly as received and included in subsequent messages. The signature field enables this. Redacted thinking blocks contain opaque data that cannot be read but must be passed back verbatim.

Cross-provider portability: Thinking blocks with signatures are only valid when continuing with the same provider and model. When switching providers, the adapter should strip signatures and optionally convert the thinking text to a user-visible context message.

3.6 Request

The single input type for both complete() and stream():

RECORD Request:
    model             : String                      -- required; provider's native model ID
    messages          : List<Message>               -- required; the conversation
    provider          : String | None               -- optional; uses default if omitted
    tools             : List<ToolDefinition> | None -- optional
    tool_choice       : ToolChoice | None           -- optional; defaults to AUTO if tools present
    response_format   : ResponseFormat | None       -- optional; text, json, or json_schema
    temperature       : Float | None
    top_p             : Float | None
    max_tokens        : Integer | None
    stop_sequences    : List<String> | None
    reasoning_effort  : String | None               -- "none", "low", "medium", "high"
    metadata          : Dict<String, String> | None -- arbitrary key-value pairs
    provider_options  : Dict | None                 -- escape hatch for provider-specific params

Provider Options (Escape Hatch)

The provider_options field passes through provider-specific parameters that the unified interface does not model. Each adapter extracts the options it understands and ignores the rest.

request = Request(
    model = "claude-opus-4-6",
    messages = [ ... ],
    provider_options = {
        "anthropic": {
            "thinking": { "type": "enabled", "budget_tokens": 10000 },
            "beta_features": ["interleaved-thinking-2025-05-14"]
        }
    }
)

Code that uses provider_options is explicitly not portable. The library documents this tradeoff.

3.7 Response

RECORD Response:
    id              : String                -- provider-assigned response ID
    model           : String                -- actual model used (may differ from requested)
    provider        : String                -- which provider fulfilled the request
    message         : Message               -- the assistant's response as a Message
    finish_reason   : FinishReason          -- why generation stopped
    usage           : Usage                 -- token counts
    raw             : Dict | None           -- raw provider response JSON (for debugging)
    warnings        : List<Warning>         -- non-fatal issues (optional, may be empty)
    rate_limit      : RateLimitInfo | None  -- rate limit metadata from headers (optional)

Convenience accessors on Response:

response.text        -> String              -- concatenated text from all text parts
response.tool_calls  -> List<ToolCall>      -- extracted tool calls from the message
response.reasoning   -> String | None       -- concatenated reasoning/thinking text

3.8 FinishReason

A dual representation preserving both portable semantics and provider-specific detail:

RECORD FinishReason:
    reason  : String        -- unified: one of the values below
    raw     : String | None -- the provider's native finish reason string

Unified reason values:

Value	Meaning
`stop`	Natural end of generation (model stopped)
`length`	Output reached max_tokens limit
`tool_calls`	Model wants to invoke one or more tools
`content_filter`	Response blocked by safety/content filter
`error`	An error occurred during generation
`other`	Provider-specific reason not mapped above

Provider finish reason mapping:

Provider	Provider Value	Unified Value
OpenAI	stop	stop
OpenAI	length	length
OpenAI	tool_calls	tool_calls
OpenAI	content_filter	content_filter
Anthropic	end_turn	stop
Anthropic	stop_sequence	stop
Anthropic	max_tokens	length
Anthropic	tool_use	tool_calls
Gemini	STOP	stop
Gemini	MAX_TOKENS	length
Gemini	SAFETY	content_filter
Gemini	RECITATION	content_filter
Gemini	(has tool calls)	tool_calls

Note: Gemini does not have a dedicated "tool_calls" finish reason. The adapter infers it from the presence of functionCall parts in the response.

3.9 Usage

RECORD Usage:
    input_tokens        : Integer           -- tokens in the prompt
    output_tokens       : Integer           -- tokens generated by the model
    total_tokens        : Integer           -- input + output
    reasoning_tokens    : Integer | None    -- tokens used for chain-of-thought reasoning
    cache_read_tokens   : Integer | None    -- tokens served from prompt cache
    cache_write_tokens  : Integer | None    -- tokens written to prompt cache
    raw                 : Dict | None       -- raw provider usage data

Usage objects must support addition for aggregating across multi-step operations:

usage_a + usage_b -> Usage
    -- Sums integer fields.
    -- For optional fields: if either side is non-None, sum them (treating None as 0).
    -- If both sides are None for an optional field, the result is None.

Provider usage field mapping:

SDK Field	OpenAI Field	Anthropic Field	Gemini Field
input_tokens	usage.prompt_tokens	usage.input_tokens	usageMetadata.promptTokenCount
output_tokens	usage.completion_tokens	usage.output_tokens	usageMetadata.candidatesTokenCount
reasoning_tokens	usage.completion_tokens_details.reasoning_tokens	(see note below)	usageMetadata.thoughtsTokenCount
cache_read_tokens	usage.prompt_tokens_details.cached_tokens	usage.cache_read_input_tokens	usageMetadata.cachedContentTokenCount
cache_write_tokens	(not provided)	usage.cache_creation_input_tokens	(not provided)

Reasoning Token Handling (Critical)

Reasoning tokens are tokens the model uses for internal chain-of-thought before producing visible output. Properly tracking and surfacing reasoning tokens is essential for cost management and debugging, because reasoning tokens are billed as output tokens but are not visible in the response text.

OpenAI reasoning models (GPT-5.2 series, etc.):

The Responses API (/v1/responses) is REQUIRED for reasoning models. The Chat Completions API does not return reasoning token breakdowns for these models. The Responses API returns usage.output_tokens_details.reasoning_tokens which tells you exactly how many tokens were spent on reasoning vs. visible output.
The reasoning_effort request parameter ("low", "medium", "high") controls how much reasoning the model does. This maps to reasoning.effort in the Responses API request body.
Reasoning content is not visible in the response (OpenAI does not expose the thinking text for GPT-5.2 series models). The adapter should still populate reasoning_tokens in Usage so callers can track costs.

Anthropic extended thinking (Claude with thinking enabled):

Extended thinking is enabled via the thinking parameter (through provider_options) and requires specific beta headers.
Anthropic surfaces thinking as explicit thinking content blocks in the response. These blocks contain the actual reasoning text and count toward output_tokens in the usage.
The adapter should populate reasoning_tokens by summing the token lengths of thinking blocks (Anthropic does not provide a separate reasoning token count, but the thinking block text can be used for estimation).
Thinking blocks carry a signature field that must be round-tripped verbatim in subsequent messages.

Gemini thinking (Gemini 3 models):

Gemini 3 Flash supports "thinking" via the thinkingConfig parameter.
Gemini reports thoughtsTokenCount in usageMetadata, which maps directly to reasoning_tokens.
Thinking content may be returned in the response as a thought part.

Why this matters: When switching between providers, reasoning token usage can vary dramatically. A query that uses 500 reasoning tokens on OpenAI GPT-5.2 might use 2000 thinking tokens on Claude. The unified SDK must track this accurately so callers can make informed cost decisions. Even though reasoning tokens make direct provider switching unfavorable (the thinking styles are different), the SDK should still translate correctly so higher-level tools can compare.

3.10 ResponseFormat

RECORD ResponseFormat:
    type        : String            -- "text", "json", or "json_schema"
    json_schema : Dict | None       -- required when type is "json_schema"
    strict      : Boolean           -- when true, provider enforces schema strictly (default: false)

3.11 Warning

RECORD Warning:
    message : String                -- human-readable description of the non-fatal issue
    code    : String | None         -- machine-readable warning code

3.12 RateLimitInfo

RECORD RateLimitInfo:
    requests_remaining  : Integer | None
    requests_limit      : Integer | None
    tokens_remaining    : Integer | None
    tokens_limit        : Integer | None
    reset_at            : Timestamp | None

Populated from provider response headers (e.g., x-ratelimit-remaining-requests). This data is informational; the library does not use it for proactive throttling.

3.13 StreamEvent

All stream events share a type discriminator field. The library normalizes provider-specific SSE formats into this unified event model.

RECORD StreamEvent:
    type              : StreamEventType | String

    -- text events
    delta             : String | None           -- incremental text
    text_id           : String | None           -- identifies which text segment this belongs to

    -- reasoning events
    reasoning_delta   : String | None           -- incremental reasoning/thinking text

    -- tool call events
    tool_call         : ToolCall | None         -- partial or complete tool call

    -- finish event
    finish_reason     : FinishReason | None
    usage             : Usage | None
    response          : Response | None         -- the full accumulated response

    -- error event
    error             : SDKError | None

    -- passthrough
    raw               : Dict | None             -- raw provider event for passthrough

3.14 StreamEventType

ENUM StreamEventType:
    STREAM_START        -- Stream has begun. May include warnings.
    TEXT_START           -- A new text segment has begun. Includes text_id.
    TEXT_DELTA           -- Incremental text content. Includes delta and text_id.
    TEXT_END             -- Text segment is complete. Includes text_id.
    REASONING_START     -- Model reasoning has begun.
    REASONING_DELTA     -- Incremental reasoning content.
    REASONING_END       -- Reasoning is complete.
    TOOL_CALL_START     -- A tool call has begun. Includes tool name and call ID.
    TOOL_CALL_DELTA     -- Incremental tool call arguments (partial JSON).
    TOOL_CALL_END       -- Tool call is fully formed and ready for execution.
    FINISH              -- Generation complete. Includes finish_reason, usage, response.
    ERROR               -- An error occurred during streaming.
    PROVIDER_EVENT      -- Raw provider event not mapped to the unified model.

The start/delta/end pattern. Text, reasoning, and tool call events follow a consistent start/delta/end lifecycle. This pattern enables:

Multiple concurrent segments -- a response can contain multiple text segments or tool calls in flight simultaneously. IDs correlate deltas to their segment.
Resource lifecycle -- consumers know when a segment begins and ends, enabling proper buffer management and UI updates.
Typed completion -- the end event carries the final accumulated value for its segment.

Consumers that only care about text deltas can filter for TEXT_DELTA events and ignore start/end events.

4. Generation and Streaming

4.1 Low-Level: Client.complete()

The fundamental blocking call. Sends a request, blocks until the model finishes, returns the full response.

response = client.complete(Request(
    model = "claude-opus-4-6",
    messages = [Message.user("Explain photosynthesis in one paragraph")],
    max_tokens = 500,
    temperature = 0.7
))

response.text           -- "Photosynthesis is..."
response.finish_reason  -- FinishReason(reason="stop", raw="end_turn")
response.usage          -- Usage(input_tokens=12, output_tokens=85, ...)

Behavior:

Routes to the resolved provider adapter.
Blocks until the model produces a complete response.
Returns a Response object.
Raises an exception on provider errors.
Does NOT retry automatically. Retries are the responsibility of Layer 4 (high-level API) or application code.

4.2 Low-Level: Client.stream()

The fundamental streaming call. Returns an asynchronous iterator of StreamEvent objects.

event_stream = client.stream(Request(
    model = "claude-opus-4-6",
    messages = [Message.user("Write a short story")]
))

FOR EACH event IN event_stream:
    IF event.type == TEXT_DELTA:
        PRINT(event.delta)
    ELSE IF event.type == FINISH:
        PRINT("Done. Tokens: " + event.usage.total_tokens)

Behavior:

Returns an async iterator immediately.
Yields StreamEvent objects as they arrive from the provider.
The stream terminates with a FINISH event containing the complete accumulated response.
Must be consumed or explicitly closed; abandoning a stream without closing it may leak connections.
Does NOT retry automatically.

4.3 High-Level: generate()

The primary blocking generation function. Wraps Client.complete() with tool execution loops, multi-step orchestration, prompt standardization, and automatic retries.

FUNCTION generate(
    model             : String,
    prompt            : String | None,               -- simple text prompt
    messages          : List<Message> | None,        -- full message history
    system            : String | None,               -- system message
    tools             : List<Tool> | None,           -- tools with optional execute handlers
    tool_choice       : ToolChoice | None,           -- auto/none/required/named
    max_tool_rounds   : Integer = 1,                 -- max tool execution loop iterations
    stop_when         : StopCondition | None,        -- custom stop condition for tool loops
    response_format   : ResponseFormat | None,
    temperature       : Float | None,
    top_p             : Float | None,
    max_tokens        : Integer | None,
    stop_sequences    : List<String> | None,
    reasoning_effort  : String | None,
    provider          : String | None,
    provider_options  : Dict | None,
    max_retries       : Integer = 2,                 -- retry count for transient errors
    timeout           : Float | TimeoutConfig | None,
    abort_signal      : AbortSignal | None,          -- cancellation signal
    client            : Client | None                -- override default client
) -> GenerateResult

Prompt standardization: Either prompt (a simple string, converted to a single user message) or messages (full conversation) is provided, not both. Using both is an error. The system parameter is always separate and prepended as a system message.

Tool execution loop (detailed in Section 5): When tools with execute handlers are provided and the model responds with tool calls, generate() automatically executes the tools, appends their results to the conversation, and calls the model again. This loop continues until the model responds without tool calls, max_tool_rounds is reached, or a stop condition is met.

max_tool_rounds semantics: The value represents the maximum number of times tool calls are executed and results are fed back. A value of 1 means: make the initial call, if the model returns tool calls execute them and make one more call. A value of 0 means no automatic tool execution (tools are returned to the caller). The total number of LLM calls is at most max_tool_rounds + 1.

GenerateResult

RECORD GenerateResult:
    text            : String                    -- text from the final step
    reasoning       : String | None             -- reasoning from the final step
    tool_calls      : List<ToolCall>            -- tool calls from the final step
    tool_results    : List<ToolResult>          -- tool results from the final step
    finish_reason   : FinishReason
    usage           : Usage                     -- usage from the final step
    total_usage     : Usage                     -- aggregated usage across ALL steps
    steps           : List<StepResult>          -- detailed results for each step
    response        : Response                  -- the final Response object
    output          : Any | None                -- parsed structured output (for generate_object)

StepResult

RECORD StepResult:
    text            : String
    reasoning       : String | None
    tool_calls      : List<ToolCall>
    tool_results    : List<ToolResult>
    finish_reason   : FinishReason
    usage           : Usage
    response        : Response
    warnings        : List<Warning>

4.4 High-Level: stream()

The primary streaming generation function. Equivalent to generate() but yields events incrementally.

result = stream(
    model = "claude-opus-4-6",
    prompt = "Write a haiku about coding"
)

FOR EACH event IN result:
    IF event.type == TEXT_DELTA:
        PRINT(event.delta)

-- After iteration, the full response is available:
response = result.response()

Accepts the same parameters as generate(). When tools with execute handlers are provided and the model makes tool calls, the stream pauses while tools execute, emits a step_finish event, then resumes streaming the model's next response.

The returned StreamResult provides:

Async iteration over events.
response() -- returns the accumulated Response after the stream ends.
text_stream -- an async iterable that yields only text deltas (convenience).

StreamResult

RECORD StreamResult:
    ASYNC ITERATOR over StreamEvent
    FUNCTION response() -> Response         -- accumulated response (available after stream ends)
    PROPERTY text_stream -> AsyncIterator<String>  -- yields only text deltas
    PROPERTY partial_response -> Response | None   -- current accumulated state at any point

StreamAccumulator

A utility that collects stream events into a complete Response:

accumulator = StreamAccumulator()

FOR EACH event IN stream:
    accumulator.process(event)

response = accumulator.response()   -- equivalent to what complete() would return

This bridges the two modes: any code that works with a Response can be used with streaming by accumulating first.

4.5 High-Level: generate_object()

Structured output generation with schema validation:

result = generate_object(
    model = "gpt-5.2",
    prompt = "Extract the person's name and age from: 'Alice is 30 years old'",
    schema = {
        "type": "object",
        "properties": {
            "name": { "type": "string" },
            "age": { "type": "integer" }
        },
        "required": ["name", "age"]
    }
)

result.output   -- { "name": "Alice", "age": 30 }  (parsed and validated)
result.text     -- raw text response

Implementation strategy by provider:

Provider	Strategy
OpenAI	Native `response_format: { type: "json_schema", ... }` with strict mode
Gemini	Native `responseMimeType: "application/json"` with `responseSchema`
Anthropic	Fallback: inject schema instructions into the system prompt, parse output. Alternatively, use tool-based extraction (define a tool whose input schema matches the desired output, force the model to call it).

If parsing or validation fails, the function raises NoObjectGeneratedError.

4.6 High-Level: stream_object()

Streaming structured output with partial object updates:

result = stream_object(
    model = "gpt-5.2",
    prompt = "Generate a list of 5 recipes",
    schema = recipes_schema
)

FOR EACH partial IN result:
    -- partial is a partially-parsed object that grows as tokens arrive
    PRINT("Recipes so far: " + LENGTH(partial.recipes))

final = result.object()  -- the complete, validated object

Uses incremental JSON parsing to yield partial objects as tokens arrive. This enables progressive UI rendering.

4.7 Cancellation and Timeouts

Abort Signals

Both generate() and stream() accept an abort signal for cooperative cancellation:

controller = AbortController()

-- In another thread/coroutine:
controller.abort()

-- The generate call raises AbortError if cancelled:
result = generate(model = "...", prompt = "...", abort_signal = controller.signal)

For streaming, cancellation closes the underlying connection and the stream raises AbortError.

Timeouts

Timeouts can be specified as a simple duration (total timeout) or a structured config:

RECORD TimeoutConfig:
    total       : Float | None      -- max time for the entire multi-step operation
    per_step    : Float | None      -- max time per individual LLM call

The library distinguishes three timeout scopes at the adapter level:

RECORD AdapterTimeout:
    connect     : Float             -- time to establish HTTP connection (default: 10s)
    request     : Float             -- time for entire request/response cycle (default: 120s)
    stream_read : Float             -- max time between consecutive stream events (default: 30s)

5. Tool Calling

5.1 Tool Definition

RECORD Tool:
    name        : String                    -- unique identifier; [a-zA-Z][a-zA-Z0-9_]* max 64 chars
    description : String                    -- human-readable description for the model
    parameters  : Dict                      -- JSON Schema defining the input (root must be "object")
    execute     : Function | None           -- handler function (if present, tool is "active")

Tool name constraints: Names must be valid identifiers: alphanumeric characters and underscores, starting with a letter. Maximum 64 characters. This is the strictest common subset across all providers. The library validates names at definition time.

Parameter schema: Parameters must be defined as a JSON Schema object with "type": "object" at the root. This is a universal requirement across all providers. The library passes this schema to the provider, which uses it to constrain argument generation.

Example:

weather_tool = Tool(
    name = "get_weather",
    description = "Get the current weather for a location",
    parameters = {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City name, e.g. 'San Francisco, CA'"
            },
            "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature unit"
            }
        },
        "required": ["location"]
    },
    execute = get_weather_function
)

5.2 Tool Execute Handlers

The execute handler is a callable (sync or async) that receives parsed arguments and returns a result:

FUNCTION get_weather(location: String, unit: String = "celsius") -> String:
    -- Call weather API...
    RETURN "72F and sunny in " + location

Handler contract:

Input: Parsed JSON arguments as keyword arguments, or a single dictionary.
Output: A string, dictionary, list, or any JSON-serializable value.
Errors: Raise an exception to indicate tool failure. The library catches it and sends an error result to the model (with is_error = true), allowing the model to recover.

Tool context injection: Handlers can optionally receive injected context. The library inspects the handler's signature and injects recognized keyword arguments:

FUNCTION my_tool(
    query        : String,          -- tool parameter
    messages     : List<Message>,   -- injected: current conversation
    abort_signal : AbortSignal,     -- injected: cancellation signal
    tool_call_id : String           -- injected: ID of this call
) -> String:
    ...

5.3 ToolChoice

Controls whether and how the model uses tools:

RECORD ToolChoice:
    mode        : String            -- "auto", "none", "required", "named"
    tool_name   : String | None     -- required when mode is "named"

Mode	Behavior
auto	Model decides whether to call tools or respond with text.
none	Model must not call any tools, even if defined.
required	Model must call at least one tool.
named	Model must call the specific tool identified by tool_name.

Provider mapping:

SDK Mode	OpenAI	Anthropic	Gemini
auto	`"auto"`	`{"type": "auto"}`	`"AUTO"`
none	`"none"`	Omit tools from request	`"NONE"`
required	`"required"`	`{"type": "any"}`	`"ANY"`
named	`{"type":"function","function":{"name":"..."}}`	`{"type":"tool","name":"..."}`	`{"mode":"ANY","allowedFunctionNames":["..."]}`

Note on Anthropic none mode: Anthropic does not support tool_choice: {"type": "none"} when tools are present. The adapter must omit the tools array from the request body entirely.

If a provider does not support a particular mode, the adapter raises UnsupportedToolChoiceError. The supports_tool_choice(mode) method allows checking capabilities upfront.

5.4 ToolCall and ToolResult

Extracted from responses and produced by execute handlers:

RECORD ToolCall:
    id              : String            -- unique identifier (provider-assigned)
    name            : String            -- tool name
    arguments       : Dict              -- parsed JSON arguments
    raw_arguments   : String | None     -- raw argument string before parsing

RECORD ToolResult:
    tool_call_id    : String            -- correlates to ToolCall.id
    content         : String | Dict | List  -- the tool's output
    is_error        : Boolean           -- true if the tool execution failed

5.5 Active Tools vs Passive Tools

Active tools have an execute handler. When used with generate() or stream(), the library automatically executes them and loops until the model produces a final text response.

Passive tools have no execute handler. Tool calls are returned to the caller in the response, and the caller manages the execution loop manually using Client.complete().

Passive tools are useful when:

Tool execution requires external coordination (human approval, external orchestration).
The calling code has its own loop and state management.
Tools need to be executed in a specific order or with side effects between them.

5.6 Multi-Step Tool Loop

When generate() is called with active tools, the following loop executes:

FUNCTION tool_loop(request, tools, max_tool_rounds, stop_when):
    conversation = request.messages
    steps = []

    FOR round_num FROM 0 TO max_tool_rounds:
        response = client.complete(request_with(conversation))
        tool_calls = response.tool_calls

        -- Execute tools if the model wants to call them
        IF tool_calls AND response.finish_reason.reason == "tool_calls":
            tool_results = execute_all_tools(tools, tool_calls)  -- concurrent
        ELSE:
            tool_results = []

        step = StepResult(response, tool_calls, tool_results, ...)
        steps.APPEND(step)

        -- Check stop conditions
        IF tool_calls is empty OR response.finish_reason.reason != "tool_calls":
            BREAK   -- model is done (natural completion)
        IF round_num >= max_tool_rounds:
            BREAK   -- budget exhausted
        IF stop_when is not None AND stop_when(steps) == true:
            BREAK   -- custom stop condition met

        -- Continue conversation with tool results
        conversation.APPEND(response.message)            -- assistant message with tool calls
        FOR EACH result IN tool_results:
            conversation.APPEND(Message.tool_result(
                tool_call_id = result.tool_call_id,
                content = result.content,
                is_error = result.is_error
            ))

    RETURN GenerateResult from steps

5.7 Parallel Tool Execution

When the model returns multiple tool calls in a single response, they are logically independent (the model generated them simultaneously without seeing any results). The library MUST handle this correctly:

Execute all tool calls concurrently. Launch all execute handlers simultaneously (using async tasks, threads, or equivalent concurrency primitive).
Wait for ALL results before continuing. Do not send partial results back to the model. The continuation request must include results for every tool call from the previous response.
Send all results in a single continuation request. Bundle all tool results into the message history and make one LLM call, not one call per result.
Preserve ordering. Tool results should appear in the same order as the corresponding tool calls, even though execution may complete out of order.
Handle partial failures gracefully. If some tool executions succeed and others fail, send all results (with is_error = true for failures). Do not abort the entire batch because one tool failed.

FUNCTION execute_all_tools(tools, tool_calls):
    -- Launch all executions concurrently
    futures = []
    FOR EACH call IN tool_calls:
        tool = find_tool(tools, call.name)
        IF tool AND tool.execute:
            futures.APPEND(async_execute(tool.execute, call.arguments, call.id))
        ELSE:
            futures.APPEND(immediate_error(call.id, "Unknown tool: " + call.name))

    -- Wait for ALL to complete
    results = AWAIT_ALL(futures)

    RETURN results   -- List<ToolResult>, one per tool_call, in order

This is critical for downstream consumers like coding agents. When a model asks to read three files simultaneously, the SDK handles the concurrent execution and result batching so the coding agent's agentic loop does not have to manage it.

5.8 Tool Call Validation and Repair

Before passing arguments to the execute handler, the library:

Parses the JSON argument string.
Optionally validates against the tool's parameter schema.
If validation fails and a repair_tool_call function is provided, attempts repair (e.g., ask the model to fix the arguments).
If repair fails or is not configured, sends an error result to the model.

Unknown tool calls: When the model calls a tool not in the definitions, the library sends an error result rather than raising an exception. This gives the model a chance to correct its behavior.

5.9 Streaming with Tools

When streaming with active tools, the stream emits tool call events as they form. Between steps (after tool execution, before the next model call), a step_finish event is emitted. The consumer sees a continuous stream of events spanning multiple steps.

5.10 Tool Result Handling Across Providers

How tool results are translated to each provider's format:

SDK Format	OpenAI	Anthropic	Gemini
TOOL role message with ToolResultData	Separate `tool` messages with `tool_call_id`	`tool_result` content blocks in `user` message	`functionResponse` parts in `user` content

6. Error Handling and Retry

6.1 Error Taxonomy

All library errors inherit from a single base:

RECORD SDKError:
    message : String                -- human-readable description
    cause   : Exception | None      -- underlying exception, if any

Error hierarchy:

SDKError
 +-- ProviderError                      -- errors from the LLM provider
 |    +-- AuthenticationError           -- 401: invalid API key, expired token
 |    +-- AccessDeniedError             -- 403: insufficient permissions
 |    +-- NotFoundError                 -- 404: model not found, endpoint not found
 |    +-- InvalidRequestError           -- 400: malformed request, invalid parameters
 |    +-- RateLimitError                -- 429: rate limit exceeded
 |    +-- ServerError                   -- 500-599: provider internal error
 |    +-- ContentFilterError            -- response blocked by safety filter
 |    +-- ContextLengthError            -- input + output exceeds context window
 |    +-- QuotaExceededError            -- billing/usage quota exhausted
 +-- RequestTimeoutError                -- request or stream timed out
 +-- AbortError                         -- request cancelled via abort signal
 +-- NetworkError                       -- network-level failure
 +-- StreamError                        -- error during stream consumption
 +-- InvalidToolCallError               -- tool call arguments failed validation
 +-- NoObjectGeneratedError             -- structured output parsing/validation failed
 +-- ConfigurationError                 -- SDK misconfiguration (missing provider, etc.)

Note: Error class names are chosen to avoid shadowing common language built-in names (e.g., AccessDeniedError instead of PermissionError, NetworkError instead of ConnectionError, RequestTimeoutError instead of TimeoutError).

6.2 ProviderError Fields

RECORD ProviderError extends SDKError:
    provider    : String                -- which provider returned the error
    status_code : Integer | None        -- HTTP status code, if applicable
    error_code  : String | None         -- provider-specific error code
    retryable   : Boolean               -- whether this error is safe to retry
    retry_after : Float | None          -- seconds to wait before retrying
    raw         : Dict | None           -- raw error response body from the provider

6.3 Retryability Classification

Every error carries a retryable property.

Non-retryable errors (client mistakes -- retrying will not help):

Error	Status Code	Retryable
AuthenticationError	401	false
AccessDeniedError	403	false
NotFoundError	404	false
InvalidRequestError	400, 422	false
ContextLengthError	413	false
QuotaExceededError	(varies)	false
ContentFilterError	(varies)	false
ConfigurationError	(N/A)	false

Retryable errors (transient -- may succeed on retry):

Error	Status Code	Retryable
RateLimitError	429	true
ServerError	500-504	true
RequestTimeoutError	408	true
NetworkError	(N/A)	true
StreamError	(N/A)	true

Unknown errors default to retryable. This is a deliberate conservative choice: transient network issues and novel provider error codes are more common than permanent failures from unexpected codes. A false retry is cheaper than a false abort.

6.4 HTTP Status Code Mapping

Adapters map HTTP status codes to error types using this table:

Status	Error Type	Retryable
400	InvalidRequestError	false
401	AuthenticationError	false
403	AccessDeniedError	false
404	NotFoundError	false
408	RequestTimeoutError	true
413	ContextLengthError	false
422	InvalidRequestError	false
429	RateLimitError	true
500	ServerError	true
502	ServerError	true
503	ServerError	true
504	ServerError	true

For Gemini (which may use gRPC status codes):

gRPC Code	Error Type
NOT_FOUND	NotFoundError
INVALID_ARGUMENT	InvalidRequestError
UNAUTHENTICATED	AuthenticationError
PERMISSION_DENIED	AccessDeniedError
RESOURCE_EXHAUSTED	RateLimitError
UNAVAILABLE	ServerError
DEADLINE_EXCEEDED	RequestTimeoutError
INTERNAL	ServerError

6.5 Error Message Classification

For ambiguous cases where the status code alone is insufficient, the adapter checks the error message body for classification signals:

Messages containing "not found" or "does not exist" -> NotFoundError
Messages containing "unauthorized" or "invalid key" -> AuthenticationError
Messages containing "context length" or "too many tokens" -> ContextLengthError
Messages containing "content filter" or "safety" -> ContentFilterError

6.6 Retry Policy

RECORD RetryPolicy:
    max_retries         : Integer = 2       -- total retry attempts (not counting initial)
    base_delay          : Float = 1.0       -- initial delay in seconds
    max_delay           : Float = 60.0      -- maximum delay between retries
    backoff_multiplier  : Float = 2.0       -- exponential backoff factor
    jitter              : Boolean = true    -- add random jitter to prevent thundering herd
    on_retry            : Callback | None   -- called before each retry with (error, attempt, delay)

Exponential Backoff with Jitter

The delay for attempt n (0-indexed) is calculated as:

delay = MIN(base_delay * (backoff_multiplier ^ n), max_delay)
IF jitter:
    delay = delay * RANDOM(0.5, 1.5)   -- +/- 50% jitter

Example delays with defaults (base=1.0, multiplier=2.0, max=60.0):

Attempt	Base Delay	With Jitter (approx range)
0	1.0s	0.5s -- 1.5s
1	2.0s	1.0s -- 3.0s
2	4.0s	2.0s -- 6.0s
3	8.0s	4.0s -- 12.0s
4	16.0s	8.0s -- 24.0s

Retry-After Header

When the provider returns a Retry-After header (common with 429 responses):

If Retry-After is less than max_delay, use the provider's delay instead of the calculated backoff.
If Retry-After exceeds max_delay, do NOT retry. Raise the error immediately with retry_after set on the exception. This prevents silently waiting minutes for a rate limit to clear.

What Gets Retried

Retries apply to individual LLM calls, not to entire multi-step operations:

generate() with tools: Each step's LLM call is retried independently. A retry on step 3 does not re-execute steps 1 and 2.
stream(): Only the initial connection is retried. Once streaming has begun and partial data has been delivered, the library does not retry. Instead, the stream emits an error event.
generate_object(): The LLM call is retried. Schema validation failures are NOT retried (they indicate a model behavior issue, not a transient error).

Retry at the Adapter Level

Provider adapters do NOT retry by default. Retry logic lives in Layer 2 (provider utilities) and is applied by the high-level functions in Layer 4. Low-level Client.complete() and Client.stream() never retry automatically. Applications using the low-level API can compose retry behavior using a standalone retry() utility:

response = retry(
    FUNCTION: client.complete(request),
    policy = RetryPolicy(max_retries = 3)
)

Disabling Retries

Set max_retries = 0 to disable automatic retries in high-level functions.

6.7 Rate Limit Handling

When a provider returns HTTP 429, the library raises RateLimitError with retry_after extracted from the response header and retryable = true. With automatic retries enabled, rate limits are handled transparently up to the retry budget.

For applications that need proactive rate limiting (staying under limits rather than hitting them), use middleware:

FUNCTION rate_limit_middleware(request, next):
    token_bucket.acquire()   -- block until budget available
    RETURN next(request)

7. Provider Adapter Contract

This section provides detailed guidance for implementing a provider adapter. It is intended as a reference for anyone adding support for a new provider.

7.1 Interface Summary

Each adapter must implement:

INTERFACE ProviderAdapter:
    PROPERTY name : String

    FUNCTION complete(request: Request) -> Response
    FUNCTION stream(request: Request) -> AsyncIterator<StreamEvent>

Recommended optional methods:

    FUNCTION close() -> Void
    FUNCTION initialize() -> Void
    FUNCTION supports_tool_choice(mode: String) -> Boolean

7.2 Request Translation

The adapter must translate a unified Request into the provider's native API format. The general steps are:

Extract system messages. For Anthropic: extract from message list, pass as system parameter. For Gemini: extract and pass as systemInstruction. For OpenAI (Responses API): extract and pass as instructions parameter.
Translate messages. Convert each Message and its ContentParts to the provider's format.
Translate tools. Convert Tool definitions to the provider's tool format.
Translate tool choice. Map the unified ToolChoice to the provider's format.
Set generation parameters. Map temperature, top_p, max_tokens, stop_sequences, etc.
Apply response format. Translate ResponseFormat to the provider's structured output mechanism.
Apply provider options. Merge any provider-specific options from request.provider_options[provider_name] into the request body.

7.3Message Translation Details

OpenAI Message Translation (Responses API)

The Responses API uses a different message format than Chat Completions. Messages are passed in an input array rather than a messages array:

Unified Role    -> Responses API Handling
SYSTEM          -> Extracted to `instructions` parameter
USER            -> input item: { "type": "message", "role": "user", "content": [...] }
ASSISTANT       -> input item: { "type": "message", "role": "assistant", "content": [...] }
TOOL            -> input item: { "type": "function_call_output", "call_id": "...", "output": "..." }
DEVELOPER       -> Extracted to `instructions` parameter (or `developer` role input item)

ContentPart Translations:
  TEXT          -> { "type": "input_text", "text": "..." } (user) or { "type": "output_text", "text": "..." } (assistant)
  IMAGE (url)  -> { "type": "input_image", "image_url": "..." }
  IMAGE (data) -> { "type": "input_image", "image_url": "data:<mime>;base64,<data>" }
  TOOL_CALL    -> input item: { "type": "function_call", "id": "...", "name": "...", "arguments": "..." }
  TOOL_RESULT  -> input item: { "type": "function_call_output", "call_id": "...", "output": "..." }

Special behaviors:

System messages are extracted to the instructions parameter, not included in the input array.
The reasoning.effort parameter controls reasoning for o-series models ("low", "medium", "high").
Tool calls and results are top-level input items, not nested within messages.
For third-party OpenAI-compatible endpoints, use the Chat Completions format instead (see Section 7.10).

Anthropic Message Translation

Unified Role    -> Anthropic Handling
SYSTEM          -> Extracted to `system` parameter (not in messages array)
DEVELOPER       -> Merged with system parameter
USER            -> "user" role
ASSISTANT       -> "assistant" role
TOOL            -> "user" role with tool_result content blocks

ContentPart Translations:
  TEXT          -> { "type": "text", "text": "..." }
  IMAGE (url)  -> { "type": "image", "source": { "type": "url", "url": "..." } }
  IMAGE (data) -> { "type": "image", "source": { "type": "base64", "media_type": "...", "data": "..." } }
  TOOL_CALL    -> { "type": "tool_use", "id": "...", "name": "...", "input": { ... } }
  TOOL_RESULT  -> { "type": "tool_result", "tool_use_id": "...", "content": "...", "is_error": ... }
  THINKING     -> { "type": "thinking", "thinking": "...", "signature": "..." }
  REDACTED_THINKING -> { "type": "redacted_thinking", "data": "..." }

Special behaviors:

Strict alternation: Anthropic requires alternating user/assistant messages. The adapter must merge consecutive same-role messages by combining their content arrays.
Tool results in user messages: Anthropic requires tool results to appear in user-role messages, not a separate "tool" role.
Thinking block round-tripping: Thinking and redacted_thinking blocks from previous responses must be preserved exactly as received and included in subsequent assistant messages.
max_tokens is required: Anthropic always requires max_tokens. Default to 4096 if not specified.

Gemini Message Translation

Unified Role    -> Gemini Handling
SYSTEM          -> Extracted to `systemInstruction` field
DEVELOPER       -> Merged with systemInstruction
USER            -> "user" role
ASSISTANT       -> "model" role
TOOL            -> "user" role with functionResponse parts

ContentPart Translations:
  TEXT          -> { "text": "..." }
  IMAGE (url)  -> { "fileData": { "mimeType": "...", "fileUri": "..." } }
  IMAGE (data) -> { "inlineData": { "mimeType": "...", "data": "<base64>" } }
  TOOL_CALL    -> { "functionCall": { "name": "...", "args": { ... } } }
  TOOL_RESULT  -> { "functionResponse": { "name": "<function_name>", "response": { ... } } }

Special behaviors:

No developer role: Treated the same as system.
Tool call IDs: Gemini does not assign unique IDs to function calls. The adapter must generate synthetic unique IDs (e.g., "call_" + random_uuid()) and maintain a mapping from synthetic IDs to function names for when tool results are sent back.
Function response format: Gemini's functionResponse uses the function name (not the call ID) and expects a dict for the response (wrap strings in {"result": "..."} if needed).
Streaming format: Gemini uses JSON chunks (optionally via SSE with ?alt=sse), not a standard SSE endpoint.

7.4Tool Definition Translation

SDK Format	OpenAI	Anthropic	Gemini
Tool.name	tools[].function.name	tools[].name	tools[].functionDeclarations[].name
Tool.description	tools[].function.description	tools[].description	tools[].functionDeclarations[].description
Tool.parameters	tools[].function.parameters	tools[].input_schema	tools[].functionDeclarations[].parameters
Wrapper structure	`{"type":"function","function":{...}}`	`{"name":...,"description":...,"input_schema":...}`	`{"functionDeclarations":[{...}]}`

7.5 Response Translation

The adapter must parse the provider's response into the unified Response format:

Extract content parts. Parse the provider's content/parts array into List<ContentPart> with appropriate ContentKind tags.
Map finish reason. Translate the provider's finish/stop reason to the unified FinishReason (see mapping table in Section 3.8).
Extract usage. Map the provider's token count fields to Usage (see mapping table in Section 3.9).
Preserve raw response. Store the complete provider response in Response.raw for debugging.
Extract rate limit info. Parse x-ratelimit-* headers into RateLimitInfo if present.

7.6Error Translation

The adapter must translate HTTP errors into the error hierarchy:

Parse the response body for error details (message, error code).
Extract Retry-After header if present.
Map the HTTP status code to the appropriate error type using the table in Section 6.4.
For ambiguous cases, apply message-based classification (Section 6.5).
Preserve the raw error response in the raw field.

FUNCTION raise_error(http_response):
    body = parse_json(http_response.body)
    message = body.error.message OR http_response.text
    error_code = body.error.code OR body.error.type

    retry_after = None
    IF http_response.headers["retry-after"] EXISTS:
        retry_after = parse_float(http_response.headers["retry-after"])

    RAISE error_from_status_code(
        status_code = http_response.status,
        message = message,
        provider = self.name,
        error_code = error_code,
        raw = body,
        retry_after = retry_after
    )

7.7 Streaming Translation

The adapter translates provider-specific streaming formats into the unified StreamEvent model.

SSE Parsing

Most providers use Server-Sent Events (SSE). A proper SSE parser must handle:

event: lines (event type)
data: lines (payload, may span multiple lines)
retry: lines (reconnection interval)
Comment lines (starting with :)
Blank lines (event boundary)

The parser yields (event_type, data) tuples. Many providers include the event type in the JSON payload as well as in the SSE event field; prefer the JSON payload field for reliability.

OpenAI Streaming (Responses API)

The Responses API uses a different streaming format than Chat Completions:

Provider Format (Responses API):
    event: response.created        -- response object created
    event: response.in_progress    -- generation started
    event: response.output_text.delta  -- incremental text
    event: response.function_call_arguments.delta  -- incremental tool call args
    event: response.output_item.done   -- output item complete
    event: response.completed      -- generation complete, includes usage with reasoning_tokens

Translation:
    output_text.delta              -> TEXT_DELTA event (emit TEXT_START on first)
    function_call_arguments.delta  -> TOOL_CALL_DELTA event
    output_item.done (text)        -> TEXT_END event
    output_item.done (function)    -> TOOL_CALL_END event
    response.completed             -> FINISH event with usage (including reasoning_tokens)

The Responses API streaming format provides reasoning token counts in the final response.completed event, which is why it is required for reasoning models.

For the OpenAI-compatible adapter (Chat Completions), the streaming format is:

Provider Format (Chat Completions, for third-party endpoints):
    data: {"choices": [{"delta": {"content": "text"}, "finish_reason": null}]}
    data: {"choices": [{"delta": {"tool_calls": [{"index": 0, ...}]}}]}
    data: {"usage": {...}}
    data: [DONE]

Anthropic Streaming

Provider Format (SSE events):
    event: message_start       -- contains message metadata and input token count
    event: content_block_start -- new content block (text, tool_use, thinking)
    event: content_block_delta -- incremental content within a block
    event: content_block_stop  -- block complete
    event: message_delta       -- finish reason and output usage
    event: message_stop        -- stream complete

Translation:
    content_block_start (type=text)     -> TEXT_START
    content_block_delta (type=text)     -> TEXT_DELTA
    content_block_stop  (type=text)     -> TEXT_END
    content_block_start (type=tool_use) -> TOOL_CALL_START
    content_block_delta (type=tool_use) -> TOOL_CALL_DELTA
    content_block_stop  (type=tool_use) -> TOOL_CALL_END
    content_block_start (type=thinking) -> REASONING_START
    content_block_delta (type=thinking) -> REASONING_DELTA
    content_block_stop  (type=thinking) -> REASONING_END
    message_stop                        -> FINISH with accumulated response

Gemini Streaming

Gemini uses SSE (with ?alt=sse query parameter) or newline-delimited JSON chunks.

Provider Format (SSE):
    data: {"candidates": [{"content": {"parts": [{"text": "..."}]}}], "usageMetadata": {...}}

Translation:
    parts[].text present               -> TEXT_DELTA (emit TEXT_START on first)
    parts[].functionCall present       -> TOOL_CALL_START + TOOL_CALL_END (full call in one chunk)
    candidate.finishReason present     -> TEXT_END
    Final chunk                        -> FINISH with accumulated response

Note: Gemini typically delivers function calls as complete objects in a single chunk, not incrementally. Emit both TOOL_CALL_START and TOOL_CALL_END for each function call.

7.8 Provider Quirks Reference

A summary of provider-specific behaviors that adapters must handle:

Concern	OpenAI	Anthropic	Gemini
Native API	Responses API (`/v1/responses`)	Messages API (`/v1/messages`)	Gemini API (`/v1beta/...generateContent`)
System message handling	`instructions` parameter	Extracted to `system` parameter	Extracted to `systemInstruction`
Developer role	`instructions` or `developer` role	Merged with system	Merged with system
Message alternation	No strict requirement	Strict user/assistant alternation	No strict requirement
Reasoning tokens	Via `output_tokens_details`; requires Responses API	Via thinking blocks (text visible)	Via `thoughtsTokenCount`
Tool call IDs	Provider-assigned unique IDs	Provider-assigned unique IDs	No unique IDs (use function name)
Tool result format	Separate `tool` role messages	`tool_result` blocks in user messages	`functionResponse` in user content
Tool choice "none"	`"none"`	Omit tools from request entirely	`"NONE"`
max_tokens	Optional	Required (default to 4096)	Optional (as `maxOutputTokens`)
Thinking blocks	Not exposed (o-series internal)	`thinking` / `redacted_thinking` blocks	`thought` parts (2.5 models)
Structured output	Native json_schema mode	Prompt engineering or tool extraction	Native responseSchema
Streaming protocol	SSE with `data:` lines	SSE with event type + data lines	SSE (with `?alt=sse`) or JSON
Stream termination	`data: [DONE]`	`message_stop` event	Final chunk (no explicit signal)
Finish reason for tools	`tool_calls`	`tool_use`	No dedicated reason (infer from parts)
Image input	Data URI in `image_url`	`base64` source with `media_type`	`inlineData` with `mimeType`
Prompt caching	Automatic (free, 50% discount)	Requires explicit `cache_control` blocks (90% discount)	Automatic (free prefix caching)
Beta/feature headers	N/A (features in request body)	`anthropic-beta` header (comma-separated)	N/A (features in request body)
Authentication	Bearer token in Authorization	`x-api-key` header	`key` query parameter
API versioning	Via URL path (/v1/)	`anthropic-version` header	Via URL path (/v1beta/)

7.9 Adding a New Provider

To add support for a new provider:

Implement the ProviderAdapter interface. Create a class with name, complete(), and stream().
Write request translation. Map the unified Request to the provider's API format, following the patterns in Section 7.3.
Write response translation. Map the provider's response to the unified Response, following Section 7.5.
Write error translation. Map HTTP errors to the error hierarchy, following Section 7.6.
Write streaming translation. Map the provider's streaming format to StreamEvent objects, following Section 7.7.
Handle provider quirks. Document any provider-specific behaviors (like Anthropic's strict alternation or Gemini's missing tool call IDs) and handle them in the adapter.
Register the adapter. Add it to Client.from_env() with the appropriate environment variable checks, or allow users to register it programmatically.

7.10OpenAI-Compatible Endpoints

Many third-party services (vLLM, Ollama, Together AI, Groq, etc.) expose an OpenAI-compatible Chat Completions API. For these services, provide a separate OpenAICompatibleAdapter that uses the Chat Completions endpoint (/v1/chat/completions) rather than the Responses API:

adapter = OpenAICompatibleAdapter(
    api_key = "...",
    base_url = "https://my-vllm-instance.example.com/v1"
)

This adapter is distinct from the primary OpenAI adapter (which uses the Responses API) because third-party services typically only implement the Chat Completions protocol. The compatible adapter does not support reasoning tokens, built-in tools, or other Responses API features.

Appendix A: Conversation Examples

A.1 Simple Text Conversation

messages = [
    Message(role = SYSTEM, content = [ContentPart(kind = TEXT, text = "You are a helpful assistant.")]),
    Message(role = USER,   content = [ContentPart(kind = TEXT, text = "What is 2 + 2?")])
]

A.2 Multimodal Conversation

messages = [
    Message(role = USER, content = [
        ContentPart(kind = TEXT, text = "What do you see in this image?"),
        ContentPart(kind = IMAGE, image = ImageData(url = "https://example.com/photo.jpg"))
    ])
]

A.3 Tool Use Conversation

messages = [
    Message(role = USER, content = [
        ContentPart(kind = TEXT, text = "What is the weather in San Francisco?")
    ]),
    Message(role = ASSISTANT, content = [
        ContentPart(kind = TOOL_CALL, tool_call = ToolCallData(
            id = "call_123",
            name = "get_weather",
            arguments = { "city": "San Francisco" }
        ))
    ]),
    Message(role = TOOL, content = [
        ContentPart(kind = TOOL_RESULT, tool_result = ToolResultData(
            tool_call_id = "call_123",
            content = "72F, sunny",
            is_error = false
        ))
    ], tool_call_id = "call_123"),
    Message(role = ASSISTANT, content = [
        ContentPart(kind = TEXT, text = "The weather in San Francisco is 72F and sunny.")
    ])
]

A.4 Thinking Blocks (Anthropic Extended Thinking)

messages = [
    Message(role = USER, content = [
        ContentPart(kind = TEXT, text = "Solve this complex math problem...")
    ]),
    Message(role = ASSISTANT, content = [
        ContentPart(kind = THINKING, thinking = ThinkingData(
            text = "Let me work through this step by step...",
            signature = "sig_abc123"
        )),
        ContentPart(kind = TEXT, text = "The answer is 42.")
    ])
]

When continuing a conversation that includes thinking blocks, the thinking content parts must be included in the message history so the provider can verify their integrity.

Appendix B: High-Level API Usage Examples

B.1 Simple Generation

result = generate(model = "claude-opus-4-6", prompt = "Explain quantum computing")
PRINT(result.text)
PRINT(result.usage.total_tokens)

B.2 Generation with Tools

result = generate(
    model = "claude-opus-4-6",
    system = "You are a helpful assistant with access to weather data.",
    prompt = "What is the weather in San Francisco?",
    tools = [weather_tool],
    max_tool_rounds = 5
)

PRINT(result.text)                              -- final text after all tool rounds
PRINT(LENGTH(result.steps))                     -- number of steps taken
PRINT(result.total_usage.total_tokens)          -- aggregated token count

B.3 Streaming

result = stream(model = "claude-opus-4-6", prompt = "Write a poem")

FOR EACH event IN result:
    IF event.type == TEXT_DELTA:
        PRINT(event.delta)

response = result.response()
PRINT(response.usage)

B.4 Structured Output

result = generate_object(
    model = "gpt-5.2",
    prompt = "Extract the person's name and age from: 'Alice is 30 years old'",
    schema = {
        "type": "object",
        "properties": {
            "name": { "type": "string" },
            "age": { "type": "integer" }
        },
        "required": ["name", "age"]
    }
)

PRINT(result.output)    -- { "name": "Alice", "age": 30 }

B.5 Provider Fallback Pattern

TRY:
    result = generate(model = "claude-opus-4-6", prompt = "...")
CATCH ProviderError:
    result = generate(model = "gpt-5.2", provider = "openai", prompt = "...")

B.6 Middleware for Logging

FUNCTION logging_middleware(request, next):
    start_time = NOW()
    LOG_INFO("LLM request: provider=" + request.provider + " model=" + request.model)
    response = next(request)
    elapsed = NOW() - start_time
    LOG_INFO("LLM response: tokens=" + response.usage.total_tokens + " latency=" + elapsed)
    RETURN response

client = Client(
    providers = { "anthropic": AnthropicAdapter(...) },
    middleware = [logging_middleware]
)

Appendix C: Design Decision Rationale

This appendix summarizes key design decisions and the reasoning behind them. These are provided so that implementors understand the "why" and can make informed tradeoffs if their language or context demands different choices.

Why a single Request type instead of per-method parameter lists? A single Request object is easier to construct, pass around, modify, and serialize than many keyword arguments. It enables middleware to inspect and modify requests uniformly. High-level functions like generate(model=..., prompt=...) provide ergonomic shorthand.

Why ship a model catalog if model strings work as-is? Model strings work for developers who know which models exist. But AI coding agents building on top of this SDK often hallucinate model identifiers from stale training data. The catalog gives them a reliable, up-to-date source of valid model IDs and capabilities. Unknown model strings still pass through -- the catalog is advisory, not restrictive.

Why explicit provider on Request instead of model-based routing? Several providers serve models with overlapping names. Explicit routing avoids ambiguity. For the common case, default_provider removes boilerplate.

Why separate generate() and stream()? The return types are fundamentally different: GenerateResult vs StreamResult. A boolean flag loses type safety.

Why start/delta/end events instead of flat deltas? Flat deltas lose structural information when a response contains multiple text segments or interleaved tool calls. The pattern adds minimal overhead but enables correct handling of complex responses.

Why max_tool_rounds instead of unlimited looping? Unbounded loops risk infinite cycles. A default of 1 is safe. Higher values are an explicit opt-in.

Why JSON Schema for tool parameters instead of language-native types? JSON Schema is the universal parameter description format across all providers. Language-native helpers can generate JSON Schema, but JSON Schema is the canonical format.

Why send error results to the model instead of raising exceptions? Raising on tool failure aborts the entire generation. Sending an error result gives the model the opportunity to retry, use a different tool, or explain the failure.

Why default to retrying unknown errors? Transient failures are more common than permanent ones from unexpected codes. A false retry is cheaper than a false abort.

Why not retry timed-out requests by default? Timeouts indicate the operation is inherently slow, not that it failed transiently. Applications can opt in to timeout retries.

Why use each provider's native API instead of just targeting Chat Completions everywhere? The Chat Completions API is an OpenAI-specific protocol that other providers partially mimic as a convenience shim. Using it as the universal transport loses critical capabilities: OpenAI's own Responses API exposes reasoning tokens that Chat Completions hides; Anthropic's Messages API supports thinking blocks, prompt caching, and beta headers; Gemini's native API supports grounding and code execution. The unified SDK's value is precisely in abstracting over these different native APIs so callers don't have to. Using a compatibility layer would defeat the purpose.

Why handle parallel tool execution in the SDK instead of leaving it to the caller? When a model returns 5 parallel tool calls, the correct behavior is to execute all 5 concurrently, wait for all to complete, and send all 5 results back in one continuation. This is fiddly to implement correctly (error handling, ordering, timeout management) and identical for every consumer. Doing it once in the SDK means coding agents and other downstream tools get it for free.

8. Definition of Done

This section defines how to validate that an implementation of this spec is complete and correct. Use this as a checklist during development. An implementation is considered done when every item is checked off.

8.1 Core Infrastructure

Client can be constructed from environment variables (Client.from_env())
Client can be constructed programmatically with explicit adapter instances
Provider routing works: requests are dispatched to the correct adapter based on provider field
Default provider is used when provider is omitted from a request
ConfigurationError is raised when no provider is configured and no default is set
Middleware chain executes in correct order (request: registration order, response: reverse order)
Module-level default client works (set_default_client() and implicit lazy initialization)
Model catalog is populated with current models and get_model_info() / list_models() return correct data

8.2 Provider Adapters

For EACH provider (OpenAI, Anthropic, Gemini), verify:

8.3 Message & Content Model

Messages with text-only content work across all providers
Image input works: images sent as URL, base64 data, and local file path are correctly translated per provider
Audio and document content parts are handled (or gracefully rejected if provider doesn't support them)
Tool call content parts round-trip correctly (assistant message with tool calls -> tool result messages -> next assistant message)
Thinking blocks (Anthropic) are preserved and round-tripped with signatures intact
Redacted thinking blocks are passed through verbatim
Multimodal messages (text + images in the same message) work

8.4 Generation

8.5 Reasoning Tokens

OpenAI reasoning models (GPT-5.2 series, etc.) return reasoning_tokens in Usage via the Responses API
reasoning_effort parameter is passed through correctly to OpenAI reasoning models
Anthropic extended thinking blocks are returned as THINKING content parts when enabled
Thinking block signature field is preserved for round-tripping
Gemini thinking tokens (thoughtsTokenCount) are mapped to reasoning_tokens in Usage
Usage correctly reports reasoning_tokens as distinct from output_tokens

8.6 Prompt Caching

OpenAI: caching works automatically via the Responses API (no client-side configuration needed)
OpenAI: Usage.cache_read_tokens is populated from usage.prompt_tokens_details.cached_tokens
Anthropic: adapter automatically injects cache_control breakpoints on the system prompt, tool definitions, and conversation prefix
Anthropic: prompt-caching-2024-07-31 beta header is included automatically when cache_control is present
Anthropic: Usage.cache_read_tokens and Usage.cache_write_tokens are populated correctly
Anthropic: automatic caching can be disabled via provider_options.anthropic.auto_cache = false
Gemini: automatic prefix caching works (no client-side configuration needed)
Gemini: Usage.cache_read_tokens is populated from usageMetadata.cachedContentTokenCount
Multi-turn agentic session: verify that turn 5+ shows significant cache_read_tokens (>50% of input tokens) for all three providers

8.7 Tool Calling

8.8 Error Handling & Retry

All errors in the hierarchy are raised for the correct HTTP status codes (see Section 6.4 table)
retryable flag is set correctly on each error type
Exponential backoff with jitter works: delays increase correctly per attempt
Retry-After header overrides calculated backoff when present (and within max_delay)
max_retries = 0 disables automatic retries
Rate limit errors (429) are retried transparently
Non-retryable errors (401, 403, 404) are raised immediately without retry
Retries apply per-step, not to the entire multi-step operation
Streaming does not retry after partial data has been delivered

8.9 Cross-Provider Parity

Run this validation matrix -- each cell must pass:

Test Case	OpenAI	Anthropic	Gemini
Simple text generation	[ ]	[ ]	[ ]
Streaming text generation	[ ]	[ ]	[ ]
Image input (base64)	[ ]	[ ]	[ ]
Image input (URL)	[ ]	[ ]	[ ]
Single tool call + execution	[ ]	[ ]	[ ]
Multiple parallel tool calls	[ ]	[ ]	[ ]
Multi-step tool loop (3+ rounds)	[ ]	[ ]	[ ]
Streaming with tool calls	[ ]	[ ]	[ ]
Structured output (generate_object)	[ ]	[ ]	[ ]
Reasoning/thinking token reporting	[ ]	[ ]	[ ]
Error handling (invalid API key -> 401)	[ ]	[ ]	[ ]
Error handling (rate limit -> 429)	[ ]	[ ]	[ ]
Usage token counts are accurate	[ ]	[ ]	[ ]
Prompt caching (cache_read_tokens > 0 on turn 2+)	[ ]	[ ]	[ ]
Provider-specific options pass through	[ ]	[ ]	[ ]

8.10 Integration Smoke Test

The ultimate validation: run this end-to-end test against all three providers with real API keys.

-- 1. Basic generation across all providers
FOR EACH provider IN ["anthropic", "openai", "gemini"]:
    result = generate(
        model = get_latest_model(provider).id,
        prompt = "Say hello in one sentence.",
        max_tokens = 100,
        provider = provider
    )
    ASSERT result.text is not empty
    ASSERT result.usage.input_tokens > 0
    ASSERT result.usage.output_tokens > 0
    ASSERT result.finish_reason.reason == "stop"

-- 2. Streaming
stream_result = stream(model = "claude-opus-4-6", prompt = "Write a haiku.")
text_chunks = []
FOR EACH event IN stream_result:
    IF event.type == TEXT_DELTA:
        text_chunks.APPEND(event.delta)
ASSERT JOIN(text_chunks) == stream_result.response().text

-- 3. Tool calling with parallel execution
result = generate(
    model = "claude-opus-4-6",
    prompt = "What is the weather in San Francisco and New York?",
    tools = [weather_tool],    -- tool that returns mock weather data
    max_tool_rounds = 3
)
ASSERT LENGTH(result.steps) >= 2               -- at least: initial call + after tool results
ASSERT result.text contains "San Francisco"
ASSERT result.text contains "New York"

-- 4. Image input
result = generate(
    model = "claude-opus-4-6",
    messages = [Message(role=USER, content=[
        ContentPart(kind=TEXT, text="What do you see?"),
        ContentPart(kind=IMAGE, image=ImageData(data=<png_bytes>, media_type="image/png"))
    ])]
)
ASSERT result.text is not empty

-- 5. Structured output
result = generate_object(
    model = "gpt-5.2",
    prompt = "Extract: Alice is 30 years old",
    schema = {"type":"object", "properties":{"name":{"type":"string"},"age":{"type":"integer"}}, "required":["name","age"]}
)
ASSERT result.output.name == "Alice"
ASSERT result.output.age == 30

-- 6. Error handling
TRY:
    generate(model = "nonexistent-model-xyz", prompt = "test", provider = "openai")
    FAIL("Should have raised an error")
CATCH NotFoundError:
    PASS   -- correct error type

If all items in this section are checked off, the unified LLM library is complete and ready for use as the foundation for a coding agent or any other LLM-powered application.

FilesExpand file tree

unified-llm-spec.md

Latest commit

History

unified-llm-spec.md

File metadata and controls

Unified LLM Client Specification

Table of Contents

1. Overview and Goals

1.1 Problem Statement

1.2 Design Principles

1.3 Reference Open-Source Projects

2. Architecture

2.1 Four-Layer Architecture

2.2 Client Configuration

Environment-Based Setup

Programmatic Setup

Provider Resolution

Model String Convention

2.3 Middleware / Interceptor Pattern

2.4 Provider Adapter Interface

Optional Adapter Methods

2.5 Module-Level Default Client

2.6 Concurrency Model

2.7 Native API Usage (Critical)

2.8 Provider Beta Headers and Feature Flags

2.9 Model Catalog

2.10 Prompt Caching (Critical for Cost)

3. Data Model

3.1 Message

Convenience Constructors

Text Accessor

3.2 Role

3.3 ContentPart (Tagged Union)

3.4 ContentKind

3.5 Content Data Structures

ImageData

AudioData

DocumentData

ToolCallData

ToolResultData

ThinkingData

3.6 Request

Provider Options (Escape Hatch)

3.7 Response

3.8 FinishReason

3.9 Usage

Reasoning Token Handling (Critical)

3.10 ResponseFormat

3.11 Warning

3.12 RateLimitInfo

3.13 StreamEvent

3.14 StreamEventType

4. Generation and Streaming

4.1 Low-Level: Client.complete()

4.2 Low-Level: Client.stream()

4.3 High-Level: generate()

GenerateResult

StepResult

4.4 High-Level: stream()

StreamResult

StreamAccumulator

4.5 High-Level: generate_object()

4.6 High-Level: stream_object()

4.7 Cancellation and Timeouts

Abort Signals

Timeouts

5. Tool Calling

5.1 Tool Definition

5.2 Tool Execute Handlers

5.3 ToolChoice

5.4 ToolCall and ToolResult

5.5 Active Tools vs Passive Tools

5.6 Multi-Step Tool Loop

5.7 Parallel Tool Execution

5.8 Tool Call Validation and Repair

5.9 Streaming with Tools

5.10 Tool Result Handling Across Providers

6. Error Handling and Retry

6.1 Error Taxonomy