This document is a consolidated, language-agnostic specification for building a unified client library that provides a single interface across multiple LLM providers (OpenAI, Anthropic, Google Gemini, and others). It is designed to be implementable from scratch by any developer or coding agent in any programming language.
- Overview and Goals
- Architecture
- Data Model
- Generation and Streaming
- Tool Calling
- Error Handling and Retry
- Provider Adapter Contract
- Definition of Done
Applications that use large language models face a fragmented ecosystem. Each provider -- OpenAI, Anthropic, Google Gemini, and others -- exposes a different HTTP API with different message formats, tool calling conventions, streaming protocols, error shapes, and authentication mechanisms. Switching providers or supporting multiple providers requires rewriting request construction, response parsing, error handling, and streaming logic.
This specification defines a unified client library that solves this problem. Developers write provider-agnostic code and switch models by changing a single string identifier. No rewiring, no adapter-specific imports.
Provider-agnostic. Application code should not contain provider-specific logic. The unified interface handles all translation. Provider-specific features are available through an explicit escape hatch, not through leaky abstractions.
Minimal surface area. The library exposes a small number of types and functions. A developer can learn the full API in under an hour. Fewer concepts means fewer bugs and easier maintenance.
Streaming-first. Streaming is a first-class operation, not a flag on a blocking call. The two generation modes -- blocking and streaming -- have separate methods with distinct return types. This makes the type system work for the developer.
Composable. Cross-cutting concerns (logging, retries, caching) are handled through middleware, not baked into the core. The core client is a thin routing layer.
Escape hatches over false abstractions. When a provider offers a unique feature that does not map to the unified model, the library provides a pass-through mechanism rather than pretending the feature does not exist or building an unreliable shim.
The following open-source projects solve related problems and are worth studying for patterns, trade-offs, and lessons learned. They are not dependencies; implementors may take inspiration from any combination of them.
- Vercel AI SDK (https://github.com/vercel/ai) -- TypeScript. Multi-provider architecture with a versioned provider specification. Clean separation between provider interfaces and high-level convenience API (generateText/streamText/generateObject). Demonstrates the start/delta/end streaming event pattern and a composable middleware system.
- LiteLLM (https://github.com/BerriAI/litellm) -- Python. Supports 100+ providers behind a single completion() interface. Demonstrates the value of a unified calling convention and the model string routing pattern. Shows how to handle the long tail of provider-specific quirks at scale.
- pi-ai (https://github.com/badlogic/pi-mono/tree/main/packages/ai) -- TypeScript. A multi-provider AI client from @mariozechner's pi-mono project. Demonstrates cost tracking, usage aggregation, and a clean provider adapter pattern with explicit reasoning token support.
The library is organized into four layers, each with a clear responsibility boundary.
Layer 4: High-Level API generate(), stream(), generate_object()
---------------------------------------------------------------
Layer 3: Core Client Client, provider routing, middleware hooks
---------------------------------------------------------------
Layer 2: Provider Utilities Shared helpers for building adapters
---------------------------------------------------------------
Layer 1: Provider Specification ProviderAdapter interface, shared types
Layer 1 -- Provider Specification. Defines the contract that every provider adapter must implement. Contains only interface definitions and shared type definitions. No implementation logic. This layer is the stability contract: it changes rarely and only with explicit versioning. A new provider is added by implementing this interface, not by modifying it.
Layer 2 -- Provider Utilities. Contains shared code for building adapters: HTTP client helpers, Server-Sent Events (SSE) parsing, retry logic, response normalization utilities, JSON schema translation helpers. Provider adapter authors import this layer; application developers generally do not.
Layer 3 -- Core Client. The main orchestration layer. The Client object holds registered provider adapters, routes requests by provider identifier, applies middleware, and manages configuration. This is the primary import for application code that wants direct control over requests.
Layer 4 -- High-Level API. Provides convenience functions (generate(), stream(), generate_object()) that wrap the Client with ergonomic defaults. Most application code uses this layer. These functions handle prompt standardization, tool execution loops, output parsing, structured output validation, and automatic retries.
The recommended setup for most applications reads standard environment variables per provider:
client = Client.from_env()
Environment variable conventions:
| Provider | Required Variable | Optional Variables |
|---|---|---|
| OpenAI | OPENAI_API_KEY | OPENAI_BASE_URL, OPENAI_ORG_ID, OPENAI_PROJECT_ID |
| Anthropic | ANTHROPIC_API_KEY | ANTHROPIC_BASE_URL |
| Gemini | GEMINI_API_KEY | GEMINI_BASE_URL |
Alternate key names may be accepted (e.g., GOOGLE_API_KEY as a fallback for GEMINI_API_KEY). Only providers whose keys are present in the environment are registered. The first registered provider becomes the default.
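The registration rule above can be sketched as follows. This is a minimal illustration, not the library's implementation: `KEY_VARS` and `discover_providers` are hypothetical names, and the registration order (and therefore the default) is assumed to follow the order of the table.

```python
import os
from typing import Dict, Mapping, Optional, Tuple

# Hypothetical lookup table: provider name -> accepted key variables,
# in priority order. GOOGLE_API_KEY is the fallback mentioned above.
KEY_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
    "gemini": ["GEMINI_API_KEY", "GOOGLE_API_KEY"],
}


def discover_providers(
    env: Optional[Mapping[str, str]] = None,
) -> Tuple[Dict[str, str], Optional[str]]:
    """Return ({provider: api_key}, default_provider) for keys present in env."""
    env = os.environ if env is None else env
    found: Dict[str, str] = {}
    for provider, names in KEY_VARS.items():
        for name in names:
            if env.get(name):  # skip missing and empty values
                found[provider] = env[name]
                break
    # The first registered provider becomes the default (None if none found).
    default = next(iter(found), None)
    return found, default
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the discovery logic easy to test.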
For full control, adapters are constructed explicitly and registered with the Client:
adapter = OpenAIAdapter(
    api_key = "sk-...",
    base_url = "https://custom-endpoint.example.com/v1",
    default_headers = { "X-Custom": "value" },
    timeout = 30.0
)
client = Client(
    providers = { "openai": adapter },
    default_provider = "openai"
)
When a request specifies a provider field, the Client routes to that adapter. When the provider field is omitted, the Client uses default_provider. If no default is set and no provider is specified, the Client raises a configuration error. The Client never guesses.
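A minimal sketch of this routing rule, assuming adapters are held in a plain dictionary. `ConfigurationError` and `resolve_adapter` are illustrative names, and treating an unregistered provider name as a configuration error is an assumption the text does not spell out.

```python
from typing import Any, Dict, Optional


class ConfigurationError(Exception):
    """Raised when a request cannot be routed to any adapter."""


def resolve_adapter(
    providers: Dict[str, Any],
    default_provider: Optional[str],
    requested: Optional[str],
) -> Any:
    """Pick the adapter for a request without ever guessing."""
    name = requested if requested is not None else default_provider
    if name is None:
        raise ConfigurationError(
            "no provider specified and no default_provider set"
        )
    if name not in providers:
        # Assumption: an unknown provider name is also a configuration error.
        raise ConfigurationError(f"provider {name!r} is not registered")
    return providers[name]
```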
Model identifiers are the provider's native string (e.g., "gpt-5.2", "claude-opus-4-6", "gemini-3-flash-preview"). The library does not invent its own model namespace. This avoids the maintenance burden of mapping tables and ensures new models work immediately without library updates. If a model string could be ambiguous (multiple providers support it), the provider field on the request disambiguates.
The Client supports middleware for cross-cutting concerns. Middleware wraps provider calls and can inspect or modify requests, inspect or modify responses, and perform side effects.
FUNCTION logging_middleware(request, next):
    LOG("Request to " + request.provider + "/" + request.model)
    response = next(request)
    LOG("Response: " + response.usage.total_tokens + " tokens")
    RETURN response

client = Client(
    providers = { ... },
    middleware = [logging_middleware]
)
Execution order. Middleware runs in registration order for the request phase (first registered = first to execute) and in reverse order for the response phase. This is the standard onion/chain-of-responsibility pattern.
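The ordering can be made concrete with a small sketch. It composes middleware around a terminal handler by folding the list in reverse, so the first registered middleware ends up outermost. `build_chain` and `tracer` are hypothetical helpers, and requests and responses are plain dictionaries here for brevity.

```python
from typing import Callable, List

Handler = Callable[[dict], dict]
Middleware = Callable[[dict, Handler], dict]


def build_chain(middleware: List[Middleware], terminal: Handler) -> Handler:
    """Wrap `terminal` so that middleware[0] is the outermost layer."""
    handler = terminal
    for mw in reversed(middleware):
        # Bind mw and the current handler now to avoid late-binding bugs.
        handler = (lambda m, nxt: lambda request: m(request, nxt))(mw, handler)
    return handler


def tracer(label: str, trace: List[str]) -> Middleware:
    """Middleware that records when its request and response phases run."""
    def mw(request: dict, next_handler: Handler) -> dict:
        trace.append(f"{label}:request")
        response = next_handler(request)
        trace.append(f"{label}:response")
        return response
    return mw


trace: List[str] = []
chain = build_chain(
    [tracer("a", trace), tracer("b", trace)],
    lambda request: {"echo": request["model"]},  # stands in for the adapter
)
result = chain({"model": "m"})
# trace is now: a:request, b:request, b:response, a:response
```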
Streaming middleware. Middleware must also apply to streaming requests. For streaming, middleware wraps the event iterator and can observe or transform individual stream events. The middleware interface should support both modes:
FUNCTION streaming_middleware(request, next):
    event_iterator = next(request)
    FOR EACH event IN event_iterator:
        log_event(event)
        YIELD event
Common middleware use cases:
- Logging
- Request/response caching
- Cost tracking and budgets
- Client-side rate limiting
- Prompt injection detection
- Circuit breaker pattern
Every provider must implement this interface:
INTERFACE ProviderAdapter:
    PROPERTY name : String -- e.g., "openai", "anthropic", "gemini"
    FUNCTION complete(request: Request) -> Response
        -- Send a request, block until the model finishes, return the full response.
    FUNCTION stream(request: Request) -> AsyncIterator<StreamEvent>
        -- Send a request, return an asynchronous iterator of stream events.
Why two methods, not one. A single method with a stream: boolean flag was rejected because the return types are fundamentally different. A blocking Response and an asynchronous event stream have different consumption patterns, error handling models, and lifetime semantics. Separate methods make the type system work for the developer.
No separate send_tool_outputs method. Tool results are sent by including them in the message history of a new complete() or stream() call. This matches how Anthropic and Gemini work natively. The OpenAI adapter handles any translation internally.
These methods are recommended but not required:
FUNCTION close() -> Void
-- Release resources (HTTP connections, etc.). Called by Client.close().
FUNCTION initialize() -> Void
-- Validate configuration on startup. Called by Client on registration.
FUNCTION supports_tool_choice(mode: String) -> Boolean
-- Query whether a particular tool choice mode is supported.
High-level functions (generate(), stream(), etc.) use a module-level default client. This client is lazily initialized from environment variables on first use. Applications can override it:
set_default_client(my_client)
-- Or pass explicitly per call:
result = generate(model = "...", prompt = "...", client = my_client)
The library is async-first. All provider calls are non-blocking. The complete() and stream() methods are asynchronous. The high-level API provides both async and sync wrappers for languages that support both paradigms.
Multiple concurrent requests to different providers (or the same provider) are safe. The Client holds no mutable state between requests. Provider adapters manage their own connection pools and must be safe for concurrent use.
Each provider adapter MUST use the provider's native, preferred API -- not a compatibility layer. This is a fundamental design requirement. Using a lowest-common-denominator compatibility layer (such as only targeting the OpenAI Chat Completions API shape) loses access to provider-specific capabilities like reasoning tokens, extended thinking, prompt caching, and advanced tool features.
| Provider | Required API | Why Not Compatibility Layer |
|---|---|---|
| OpenAI | Responses API (/v1/responses) | The Responses API properly surfaces reasoning tokens, supports built-in tools (web search, file search, code interpreter), and is OpenAI's forward-looking API. The Chat Completions API does not return reasoning tokens for reasoning models (GPT-5.2 series, etc.) and lacks server-side conversation state. |
| Anthropic | Messages API (/v1/messages) | The Messages API supports extended thinking with thinking blocks and signatures, prompt caching with cache_control, beta feature headers, and the strict user/assistant alternation model. There is no alternative. |
| Gemini | Gemini API (/v1beta/models/*/generateContent) | The native Gemini API supports grounding with Google Search, code execution, system instructions, and cached content. OpenAI-compatible endpoints for Gemini are limited shims. |
The unified SDK abstracts over these different APIs so that callers write provider-agnostic code, but internally each adapter speaks the provider's native protocol. This is the entire value proposition: the complexity of three different APIs is handled once in the adapters so that downstream consumers (like a coding agent) never have to think about it.
Providers frequently gate new features behind beta headers or feature flags. The unified SDK must support passing these through cleanly.
Anthropic beta headers. Anthropic uses the anthropic-beta header to enable features like:
- max-tokens-3-5-sonnet-2025-04-14 -- enables 1M token context for certain models
- interleaved-thinking-2025-05-14 -- enables interleaved thinking blocks
- token-efficient-tools-2025-02-19 -- more efficient tool token usage
- prompt-caching-2024-07-31 -- enables prompt caching
These must be passed as HTTP headers on the request. The adapter should accept them via provider_options:
request = Request(
model = "claude-opus-4-6",
messages = [ ... ],
provider_options = {
"anthropic": {
"beta_headers": ["interleaved-thinking-2025-05-14"]
}
}
)
The Anthropic adapter joins these into a comma-separated anthropic-beta header value.
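As a sketch of that join step, assuming the `provider_options` layout shown above: `beta_header` is a hypothetical helper, and de-duplicating the flags is an added assumption rather than something the specification requires.

```python
from typing import Dict, Optional


def beta_header(provider_options: Optional[dict]) -> Dict[str, str]:
    """Build the anthropic-beta header from provider_options, if any flags are set."""
    options = (provider_options or {}).get("anthropic", {})
    flags = options.get("beta_headers", [])
    # Assumption: drop duplicates while preserving the caller's order.
    unique = list(dict.fromkeys(flags))
    # Omit the header entirely when no beta features were requested.
    return {"anthropic-beta": ",".join(unique)} if unique else {}
```

Returning an empty dictionary when no flags are present lets the adapter merge the result into its headers unconditionally.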
OpenAI feature flags. The Responses API supports enabling built-in tools and features via the request body (e.g., tools: [{"type": "web_search_preview"}]). These should be supported through provider_options or by extending the tool definitions.
Gemini configuration. Gemini supports safety settings, grounding configuration, and cached content references as part of the request body. These should be passable through provider_options.
The key principle: the unified interface handles the common 90% of cases. The provider_options escape hatch handles the remaining 10% without requiring library changes for every new provider feature.
The SDK should ship with a catalog of known models to help consumers (especially AI coding agents) select valid model identifiers without guessing or hallucinating model names. The catalog is advisory, not restrictive -- unknown model strings are still passed through to the provider.
RECORD ModelInfo:
id : String -- the model's API identifier (e.g., "claude-opus-4-6")
provider : String -- which provider serves this model
display_name : String -- human-readable name (e.g., "Claude Opus 4.6")
context_window : Integer -- maximum total tokens (input + output)
max_output : Integer | None -- maximum output tokens
supports_tools : Boolean -- whether the model supports tool calling
supports_vision : Boolean -- whether the model accepts image inputs
supports_reasoning : Boolean -- whether the model produces reasoning tokens
input_cost_per_million : Float | None -- cost per 1M input tokens (USD)
output_cost_per_million : Float | None -- cost per 1M output tokens (USD)
aliases : List<String> -- shorthand names (e.g., ["sonnet", "claude-sonnet"])
At the time of writing (February 2026), the top models available through each provider's API are:
| Provider | Top Model(s) |
|---|---|
| Anthropic | Claude Opus 4.6, Claude Sonnet 4.5 |
| OpenAI | GPT-5.2 series (GPT-5.2, GPT-5.2-codex) |
| Gemini | Gemini 3 Pro (Preview), Gemini 3 Flash (Preview) |
Implementations should default to the latest available models when no model is specified by the caller, and should prefer newer models in any model selection logic. However, the catalog must also include older models that are still served by the APIs, as callers may need them for cost, latency, or compatibility reasons.
Example catalog (keep this updated as new models release):
MODELS = [
-- ==========================================================
-- Anthropic -- prefer Claude Opus 4.6 for top quality
-- ==========================================================
ModelInfo(id="claude-opus-4-6", provider="anthropic", display_name="Claude Opus 4.6", context_window=200000, supports_tools=true, supports_vision=true, supports_reasoning=true),
ModelInfo(id="claude-sonnet-4-5", provider="anthropic", display_name="Claude Sonnet 4.5", context_window=200000, supports_tools=true, supports_vision=true, supports_reasoning=true),
-- ==========================================================
-- OpenAI -- prefer GPT-5.2 series for top quality
-- ==========================================================
ModelInfo(id="gpt-5.2", provider="openai", display_name="GPT-5.2", context_window=1047576, supports_tools=true, supports_vision=true, supports_reasoning=true),
ModelInfo(id="gpt-5.2-mini", provider="openai", display_name="GPT-5.2 Mini", context_window=1047576, supports_tools=true, supports_vision=true, supports_reasoning=true),
ModelInfo(id="gpt-5.2-codex", provider="openai", display_name="GPT-5.2 Codex", context_window=1047576, supports_tools=true, supports_vision=true, supports_reasoning=true),
-- ==========================================================
-- Gemini -- prefer Gemini 3 Flash Preview for latest
-- ==========================================================
ModelInfo(id="gemini-3-pro-preview", provider="gemini", display_name="Gemini 3 Pro (Preview)", context_window=1048576, supports_tools=true, supports_vision=true, supports_reasoning=true),
ModelInfo(id="gemini-3-flash-preview", provider="gemini", display_name="Gemini 3 Flash (Preview)", context_window=1048576, supports_tools=true, supports_vision=true, supports_reasoning=true),
]
Lookup functions:
get_model_info(model_id: String) -> ModelInfo | None
-- Returns the catalog entry for a model, or None if unknown.
list_models(provider: String | None) -> List<ModelInfo>
-- Returns all known models, optionally filtered by provider.
get_latest_model(provider: String, capability: String | None) -> ModelInfo | None
-- Returns the newest/best model for a provider, optionally filtered by capability
-- (e.g., "reasoning", "vision", "tools"). Useful for coding agents that want
-- to always use the latest available model.
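The lookup functions might be implemented along these lines. This sketch uses a trimmed `ModelInfo` and a three-entry catalog for brevity, and it makes two assumptions the text does not state: that `get_model_info` also resolves aliases (the "sonnet" alias here is illustrative), and that the catalog lists each provider's models newest first so that "latest" means "first match".

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class ModelInfo:
    # Trimmed version of the record above; only fields used in this sketch.
    id: str
    provider: str
    supports_tools: bool = True
    supports_vision: bool = False
    supports_reasoning: bool = False
    aliases: Tuple[str, ...] = ()


# Assumption: entries are ordered newest-first within each provider.
MODELS = [
    ModelInfo("claude-opus-4-6", "anthropic", supports_vision=True, supports_reasoning=True),
    ModelInfo("claude-sonnet-4-5", "anthropic", supports_vision=True, aliases=("sonnet",)),
    ModelInfo("gpt-5.2", "openai", supports_vision=True, supports_reasoning=True),
]


def get_model_info(model_id: str) -> Optional[ModelInfo]:
    """Return the catalog entry matching an ID or alias, or None if unknown."""
    for model in MODELS:
        if model_id == model.id or model_id in model.aliases:
            return model
    return None


def list_models(provider: Optional[str] = None) -> List[ModelInfo]:
    """Return all known models, optionally filtered by provider."""
    return [m for m in MODELS if provider is None or m.provider == provider]


def get_latest_model(provider: str, capability: Optional[str] = None) -> Optional[ModelInfo]:
    """Return the first (newest) model for a provider that has the capability."""
    for model in list_models(provider):
        if capability is None or getattr(model, f"supports_{capability}", False):
            return model
    return None
```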
Why a catalog matters for coding agents: When an AI coding agent builds on top of this SDK, it needs to select models by capability (e.g., "pick a model that supports vision" or "pick the cheapest model that supports tools"). Without a catalog, the agent must hallucinate model identifiers from its training data, which go stale as providers release new models. The catalog gives the agent a reliable, up-to-date source of truth.
The catalog should be shipped as a data file (JSON or similar) that can be updated independently of the library code. Consider auto-generating it from provider documentation or APIs. When in doubt, prefer the latest models -- they are generally more capable, and the SDK should make it easy to stay current.
Prompt caching allows providers to reuse computation from previous requests when the prefix of the conversation is unchanged. For agentic workloads where the system prompt and conversation history are identical across many turns, caching can reduce input token costs by 50-90%. The unified SDK MUST support caching for each provider.
| Provider | Caching Behavior | SDK Action Required |
|---|---|---|
| OpenAI | Automatic -- the Responses API caches shared prefixes server-side | None. Use the Responses API and report cache_read_tokens from usage. |
| Gemini | Automatic -- prefix caching for repeated content, plus explicit cachedContent API for long contexts | None for automatic. Expose explicit caching via provider_options. |
| Anthropic | Not automatic. Requires explicit cache_control annotations on content blocks. | The Anthropic adapter must inject cache_control breakpoints automatically for agentic workloads. |
Anthropic is the only provider where the SDK must do extra work. Without cache_control annotations, every turn re-processes the entire system prompt and conversation history at full price. With proper caching, cached input tokens cost 90% less. This is the single highest-ROI optimization for agentic workloads.
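One way the Anthropic adapter could inject breakpoints is sketched below. It operates on an already-translated Messages API request body and marks the last system block and the last content block of the final message. That placement is an assumption chosen for illustration; the specification does not prescribe where breakpoints go, and `inject_cache_breakpoints` is a hypothetical name.

```python
import copy


def inject_cache_breakpoints(body: dict) -> dict:
    """Return a copy of an Anthropic request body with cache_control markers."""
    body = copy.deepcopy(body)  # never mutate the caller's request

    system = body.get("system")
    if isinstance(system, str):
        # Promote a plain string to block form so it can carry an annotation.
        system = [{"type": "text", "text": system}]
        body["system"] = system
    if system:
        system[-1]["cache_control"] = {"type": "ephemeral"}

    messages = body.get("messages") or []
    if messages:
        content = messages[-1].get("content")
        if isinstance(content, list) and content:
            # Assumed placement: cache everything up to the latest turn.
            content[-1]["cache_control"] = {"type": "ephemeral"}
    return body
```

Working on a deep copy keeps the unified `Request` free of provider-specific annotations, so the same request can still be routed to another provider.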
All three providers report cache statistics. The SDK must map these to Usage.cache_read_tokens and Usage.cache_write_tokens so callers can verify caching is working.
This section defines all types used by the library. The notation uses a language-neutral struct/record style. Field types use these conventions:
- String -- text
- Integer -- whole number
- Float -- decimal number
- Boolean -- true/false
- Bytes -- raw binary data
- Dict -- key-value map
- List<T> -- ordered collection of T
- T | None -- optional (nullable) value
- T | U -- union / either type
The fundamental unit of conversation. A conversation is an ordered List<Message>.
RECORD Message:
role : Role -- who produced this message
content : List<ContentPart> -- the message body (multimodal)
name : String | None -- for tool messages and developer attribution
tool_call_id : String | None -- links a tool-result message to its tool call
For common cases, factory methods create properly structured Message objects:
Message.system("You are a helpful assistant.")
Message.user("What is 2 + 2?")
Message.assistant("The answer is 4.")
Message.tool_result(tool_call_id = "call_123", content = "72F and sunny", is_error = false)
A convenience property on Message that concatenates text from all text content parts:
message.text -> String
-- Returns the concatenation of all ContentPart entries where kind == TEXT.
-- Returns empty string if no text parts exist.
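A minimal sketch of a factory method and the `text` accessor, using plain strings for `role` and `kind` where the specification uses enums, and omitting every field not needed for the illustration:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ContentPart:
    kind: str                      # "text", "image", ... (enum in the spec)
    text: Optional[str] = None     # populated when kind == "text"


@dataclass
class Message:
    role: str
    content: List[ContentPart]

    @classmethod
    def user(cls, text: str) -> "Message":
        """Factory for the common single-text-part user message."""
        return cls(role="user", content=[ContentPart(kind="text", text=text)])

    @property
    def text(self) -> str:
        """Concatenate all text parts; empty string if there are none."""
        return "".join(p.text or "" for p in self.content if p.kind == "text")
```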
Five roles cover the semantics of all major providers:
ENUM Role:
SYSTEM -- High-level instructions shaping model behavior. Typically first.
USER -- Human input. Text, images, audio, documents.
ASSISTANT -- Model output. Text, tool calls, thinking blocks.
TOOL -- Tool execution results, linked by tool_call_id.
DEVELOPER -- Privileged instructions from the application (not the end user).
Provider mapping for roles:
| SDK Role | OpenAI | Anthropic | Gemini |
|---|---|---|---|
| SYSTEM | system role | Extracted to system parameter | systemInstruction |
| USER | user role | user role | user role |
| ASSISTANT | assistant role | assistant role | model role |
| TOOL | tool role | tool_result block in user msg | functionResponse in user |
| DEVELOPER | developer role | Merged with system | Merged with system |
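To illustrate the Anthropic column of the role mapping, the sketch below translates a simplified message list (dictionaries with `role`, `text`, and an optional `tool_call_id`) into a Messages API body. `to_anthropic` is a hypothetical helper; joining system and developer text with a blank line is an assumption, and the sketch does not merge consecutive user messages, which a real adapter would need to do to satisfy strict alternation.

```python
from typing import Dict, List


def to_anthropic(messages: List[Dict[str, str]]) -> dict:
    """Translate simplified unified messages into an Anthropic request body."""
    system_parts: List[str] = []
    converted: List[dict] = []
    for message in messages:
        role = message["role"]
        if role in ("system", "developer"):
            # SYSTEM is extracted; DEVELOPER is merged with it.
            system_parts.append(message["text"])
        elif role == "tool":
            # Tool results travel as tool_result blocks inside a user message.
            converted.append({
                "role": "user",
                "content": [{
                    "type": "tool_result",
                    "tool_use_id": message["tool_call_id"],
                    "content": message["text"],
                }],
            })
        else:
            converted.append({
                "role": role,
                "content": [{"type": "text", "text": message["text"]}],
            })
    body: dict = {"messages": converted}
    if system_parts:
        body["system"] = "\n\n".join(system_parts)  # assumed separator
    return body
```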
Each message contains a list of ContentPart objects. Using a list rather than a single string enables multimodal messages (text interleaved with images), structured assistant responses (text interleaved with tool calls and thinking blocks), and tool results that include images.
ContentPart uses a tagged-union pattern: the kind field determines which data field is populated.
RECORD ContentPart:
kind : ContentKind | String -- discriminator tag
text : String | None -- populated when kind == TEXT
image : ImageData | None -- populated when kind == IMAGE
audio : AudioData | None -- populated when kind == AUDIO
document : DocumentData | None -- populated when kind == DOCUMENT
tool_call : ToolCallData | None -- populated when kind == TOOL_CALL
tool_result : ToolResultData | None -- populated when kind == TOOL_RESULT
thinking : ThinkingData | None -- populated when kind == THINKING or REDACTED_THINKING
Note: The kind field accepts both the enum and arbitrary strings. This allows extension for provider-specific content kinds without modifying the core enum.
ENUM ContentKind:
TEXT -- Plain text. The most common kind.
IMAGE -- Image as URL, base64, or file reference.
AUDIO -- Audio as URL or raw bytes with media type.
DOCUMENT -- Document (PDF, etc.) as URL, base64, or file reference.
TOOL_CALL -- A model-initiated tool invocation.
TOOL_RESULT -- The result of executing a tool call.
THINKING -- Model reasoning/thinking content.
REDACTED_THINKING -- Redacted reasoning (Anthropic). Opaque, must round-trip verbatim.
Direction constraints:
| Kind | May appear in roles |
|---|---|
| TEXT | SYSTEM, USER, ASSISTANT, DEVELOPER, TOOL |
| IMAGE | USER (input), ASSISTANT (generated) |
| AUDIO | USER (input) |
| DOCUMENT | USER (input) |
| TOOL_CALL | ASSISTANT (output) |
| TOOL_RESULT | TOOL (response) |
| THINKING | ASSISTANT (output) |
| REDACTED_THINKING | ASSISTANT (output) |
RECORD ImageData:
url : String | None -- URL pointing to the image
data : Bytes | None -- raw image bytes
media_type : String | None -- MIME type, e.g. "image/png", "image/jpeg"
detail : String | None -- processing fidelity hint: "auto", "low", "high"
Exactly one of url or data must be provided. The adapter base64-encodes data if the provider requires it. media_type defaults to "image/png" when data is provided and no type is specified.
Image upload is critical for multimodal capabilities. Many models (Claude, GPT-4.1, Gemini) accept image inputs for analysis, code screenshot reading, diagram understanding, and more. The SDK must handle image upload correctly across all providers:
| Concern | OpenAI | Anthropic | Gemini |
|---|---|---|---|
| URL images | image_url.url field | source.type = "url" with url field | fileData.fileUri field |
| Base64 images | image_url.url as data URI (data:mime;base64,...) | source.type = "base64" with data + media_type | inlineData with data + mimeType |
| File path (local) | Read file, base64-encode, send as data URI | Read file, base64-encode, send as base64 source | Read file, base64-encode, send as inlineData |
| Supported formats | PNG, JPEG, GIF, WEBP | PNG, JPEG, GIF, WEBP | PNG, JPEG, GIF, WEBP, HEIC, HEIF |
| Max image size | 20MB | ~5MB per image (base64 encoded) | Varies by method |
| Detail/fidelity hint | detail: "auto", "low", "high" | Not supported (ignore) | Not supported (ignore) |
Convenience: file path support. The SDK should accept a local file path as a convenience. When url looks like a local file path (starts with /, ./, or ~), the adapter reads the file, infers the MIME type from the extension, base64-encodes the contents, and sends it using the provider's inline data format. This makes it easy for coding agents to send screenshots and diagrams without manual encoding.
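The file path convenience can be sketched with the standard library alone. `looks_like_local_path` and `load_inline_image` are hypothetical helper names, the returned dictionary is a neutral intermediate form rather than any provider's wire format, and falling back to "image/png" for unknown extensions mirrors the ImageData default above.

```python
import base64
import mimetypes
import os


def looks_like_local_path(url: str) -> bool:
    """Apply the prefix rule from the text: /, ./, or ~ means a local file."""
    return url.startswith(("/", "./", "~"))


def load_inline_image(path: str) -> dict:
    """Read a local image and return its MIME type plus base64-encoded bytes."""
    expanded = os.path.expanduser(path)  # resolve a leading ~
    media_type, _ = mimetypes.guess_type(expanded)
    with open(expanded, "rb") as handle:
        raw = handle.read()
    return {
        "media_type": media_type or "image/png",
        "data": base64.b64encode(raw).decode("ascii"),
    }
```

Each adapter would then place `media_type` and `data` into its own inline format (data URI, base64 source, or inlineData) as listed in the table.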
RECORD AudioData:
url : String | None
data : Bytes | None
media_type : String | None -- e.g. "audio/wav", "audio/mp3"
RECORD DocumentData:
url : String | None
data : Bytes | None
media_type : String | None -- e.g. "application/pdf"
file_name : String | None -- optional display name
RECORD ToolCallData:
id : String -- unique identifier for this call (provider-assigned)
name : String -- tool name
arguments : Dict | String -- parsed JSON arguments or raw argument string
type : String -- "function" (default) or "custom"
The id field is assigned by the provider and is required for linking tool results back to calls. For providers that do not assign unique IDs (e.g., Gemini), the adapter must generate synthetic unique IDs (e.g., "call_" + random_uuid()) and maintain a mapping to the function name.
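A small sketch of synthetic ID generation with the mapping back to the function name. `ToolCallIdMap` is a hypothetical helper; where an adapter actually keeps this state (per request, per conversation) is left open here.

```python
import uuid
from typing import Dict


class ToolCallIdMap:
    """Assigns synthetic tool call IDs and remembers the function behind each."""

    def __init__(self) -> None:
        self._names: Dict[str, str] = {}

    def assign(self, function_name: str) -> str:
        """Create a unique ID for one function call returned by the provider."""
        call_id = "call_" + uuid.uuid4().hex
        self._names[call_id] = function_name
        return call_id

    def function_name(self, call_id: str) -> str:
        """Recover the function name when sending the tool result back."""
        return self._names[call_id]
```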
RECORD ToolResultData:
tool_call_id : String -- the ToolCallData.id this result answers
content : String | Dict -- the tool's output (text or structured)
is_error : Boolean -- whether the tool execution failed
image_data : Bytes | None -- optional image result
image_media_type: String | None -- MIME type for the image result
When is_error is true, the model understands the tool failed and can adjust its approach.
RECORD ThinkingData:
text : String -- the thinking/reasoning content
signature : String | None -- provider-specific signature for round-tripping
redacted : Boolean -- true if this is redacted thinking (opaque content)
Thinking blocks from Anthropic's extended thinking must be preserved exactly as received and included in subsequent messages. The signature field enables this. Redacted thinking blocks contain opaque data that cannot be read but must be passed back verbatim.
Cross-provider portability: Thinking blocks with signatures are only valid when continuing with the same provider and model. When switching providers, the adapter should strip signatures and optionally convert the thinking text to a user-visible context message.
The single input type for both complete() and stream():
RECORD Request:
model : String -- required; provider's native model ID
messages : List<Message> -- required; the conversation
provider : String | None -- optional; uses default if omitted
tools : List<ToolDefinition> | None -- optional
tool_choice : ToolChoice | None -- optional; defaults to AUTO if tools present
response_format : ResponseFormat | None -- optional; text, json, or json_schema
temperature : Float | None
top_p : Float | None
max_tokens : Integer | None
stop_sequences : List<String> | None
reasoning_effort : String | None -- "none", "low", "medium", "high"
metadata : Dict<String, String> | None -- arbitrary key-value pairs
provider_options : Dict | None -- escape hatch for provider-specific params
The provider_options field passes through provider-specific parameters that the unified interface does not model. Each adapter extracts the options it understands and ignores the rest.
request = Request(
model = "claude-opus-4-6",
messages = [ ... ],
provider_options = {
"anthropic": {
"thinking": { "type": "enabled", "budget_tokens": 10000 },
"beta_headers": ["interleaved-thinking-2025-05-14"]
}
}
)
Code that uses provider_options is explicitly not portable. The library documents this tradeoff.
RECORD Response:
id : String -- provider-assigned response ID
model : String -- actual model used (may differ from requested)
provider : String -- which provider fulfilled the request
message : Message -- the assistant's response as a Message
finish_reason : FinishReason -- why generation stopped
usage : Usage -- token counts
raw : Dict | None -- raw provider response JSON (for debugging)
warnings : List<Warning> -- non-fatal issues (optional, may be empty)
rate_limit : RateLimitInfo | None -- rate limit metadata from headers (optional)
Convenience accessors on Response:
response.text -> String -- concatenated text from all text parts
response.tool_calls -> List<ToolCall> -- extracted tool calls from the message
response.reasoning -> String | None -- concatenated reasoning/thinking text
A dual representation preserving both portable semantics and provider-specific detail:
RECORD FinishReason:
reason : String -- unified: one of the values below
raw : String | None -- the provider's native finish reason string
Unified reason values:
| Value | Meaning |
|---|---|
| stop | Natural end of generation (model stopped) |
| length | Output reached max_tokens limit |
| tool_calls | Model wants to invoke one or more tools |
| content_filter | Response blocked by safety/content filter |
| error | An error occurred during generation |
| other | Provider-specific reason not mapped above |
Provider finish reason mapping:
| Provider | Provider Value | Unified Value |
|---|---|---|
| OpenAI | stop | stop |
| OpenAI | length | length |
| OpenAI | tool_calls | tool_calls |
| OpenAI | content_filter | content_filter |
| Anthropic | end_turn | stop |
| Anthropic | stop_sequence | stop |
| Anthropic | max_tokens | length |
| Anthropic | tool_use | tool_calls |
| Gemini | STOP | stop |
| Gemini | MAX_TOKENS | length |
| Gemini | SAFETY | content_filter |
| Gemini | RECITATION | content_filter |
| Gemini | (has tool calls) | tool_calls |
Note: Gemini does not have a dedicated "tool_calls" finish reason. The adapter infers it from the presence of functionCall parts in the response.
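The mapping table and the Gemini inference rule translate directly into a lookup. In this sketch `unify_finish_reason` is a hypothetical helper that returns a plain dictionary in place of the FinishReason record, and any unmapped provider value falls through to "other".

```python
from typing import Optional

FINISH_REASON_MAP = {
    "openai": {
        "stop": "stop",
        "length": "length",
        "tool_calls": "tool_calls",
        "content_filter": "content_filter",
    },
    "anthropic": {
        "end_turn": "stop",
        "stop_sequence": "stop",
        "max_tokens": "length",
        "tool_use": "tool_calls",
    },
    "gemini": {
        "STOP": "stop",
        "MAX_TOKENS": "length",
        "SAFETY": "content_filter",
        "RECITATION": "content_filter",
    },
}


def unify_finish_reason(
    provider: str, raw: Optional[str], has_tool_calls: bool = False
) -> dict:
    """Map a provider finish reason to the unified value, keeping the raw one."""
    if provider == "gemini" and has_tool_calls:
        # Gemini has no dedicated value; infer from functionCall parts.
        return {"reason": "tool_calls", "raw": raw}
    reason = FINISH_REASON_MAP.get(provider, {}).get(raw, "other")
    return {"reason": reason, "raw": raw}
```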
RECORD Usage:
input_tokens : Integer -- tokens in the prompt
output_tokens : Integer -- tokens generated by the model
total_tokens : Integer -- input + output
reasoning_tokens : Integer | None -- tokens used for chain-of-thought reasoning
cache_read_tokens : Integer | None -- tokens served from prompt cache
cache_write_tokens : Integer | None -- tokens written to prompt cache
raw : Dict | None -- raw provider usage data
Usage objects must support addition for aggregating across multi-step operations:
usage_a + usage_b -> Usage
-- Sums integer fields.
-- For optional fields: if either side is non-None, sum them (treating None as 0).
-- If both sides are None for an optional field, the result is None.
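The addition rules can be sketched as an operator overload. The `raw` field is omitted here because the text does not say how raw provider data should be combined.

```python
from dataclasses import dataclass
from typing import Optional


def _add_optional(a: Optional[int], b: Optional[int]) -> Optional[int]:
    """Sum two optional counts; None only when both sides are None."""
    if a is None and b is None:
        return None
    return (a or 0) + (b or 0)


@dataclass
class Usage:
    input_tokens: int
    output_tokens: int
    total_tokens: int
    reasoning_tokens: Optional[int] = None
    cache_read_tokens: Optional[int] = None
    cache_write_tokens: Optional[int] = None

    def __add__(self, other: "Usage") -> "Usage":
        return Usage(
            input_tokens=self.input_tokens + other.input_tokens,
            output_tokens=self.output_tokens + other.output_tokens,
            total_tokens=self.total_tokens + other.total_tokens,
            reasoning_tokens=_add_optional(self.reasoning_tokens, other.reasoning_tokens),
            cache_read_tokens=_add_optional(self.cache_read_tokens, other.cache_read_tokens),
            cache_write_tokens=_add_optional(self.cache_write_tokens, other.cache_write_tokens),
        )
```

Keeping None distinct from 0 lets callers tell "the provider reported no reasoning tokens" apart from "the provider did not report this field at all".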
Provider usage field mapping:
| SDK Field | OpenAI Field | Anthropic Field | Gemini Field |
|---|---|---|---|
| input_tokens | usage.input_tokens | usage.input_tokens | usageMetadata.promptTokenCount |
| output_tokens | usage.output_tokens | usage.output_tokens | usageMetadata.candidatesTokenCount |
| reasoning_tokens | usage.output_tokens_details.reasoning_tokens | (see note below) | usageMetadata.thoughtsTokenCount |
| cache_read_tokens | usage.input_tokens_details.cached_tokens | usage.cache_read_input_tokens | usageMetadata.cachedContentTokenCount |
| cache_write_tokens | (not provided) | usage.cache_creation_input_tokens | (not provided) |
Reasoning tokens are tokens the model uses for internal chain-of-thought before producing visible output. Properly tracking and surfacing reasoning tokens is essential for cost management and debugging, because reasoning tokens are billed as output tokens but are not visible in the response text.
OpenAI reasoning models (GPT-5.2 series, etc.):
- The Responses API (`/v1/responses`) is REQUIRED for reasoning models. The Chat Completions API does not return reasoning token breakdowns for these models. The Responses API returns `usage.output_tokens_details.reasoning_tokens`, which tells you exactly how many tokens were spent on reasoning vs. visible output.
- The `reasoning_effort` request parameter ("low", "medium", "high") controls how much reasoning the model does. This maps to `reasoning.effort` in the Responses API request body.
- Reasoning content is not visible in the response (OpenAI does not expose the thinking text for GPT-5.2 series models). The adapter should still populate `reasoning_tokens` in Usage so callers can track costs.
Anthropic extended thinking (Claude with thinking enabled):
- Extended thinking is enabled via the `thinking` parameter (through `provider_options`) and requires specific beta headers.
- Anthropic surfaces thinking as explicit `thinking` content blocks in the response. These blocks contain the actual reasoning text and count toward `output_tokens` in the usage.
- The adapter should populate `reasoning_tokens` by summing the token lengths of thinking blocks (Anthropic does not provide a separate reasoning token count, but the thinking block text can be used for estimation).
- Thinking blocks carry a `signature` field that must be round-tripped verbatim in subsequent messages.
Gemini thinking (Gemini 3 models):
- Gemini 3 Flash supports "thinking" via the `thinkingConfig` parameter.
- Gemini reports `thoughtsTokenCount` in `usageMetadata`, which maps directly to `reasoning_tokens`.
- Thinking content may be returned in the response as a `thought` part.
Why this matters: When switching between providers, reasoning token usage can vary dramatically. A query that uses 500 reasoning tokens on OpenAI GPT-5.2 might use 2000 thinking tokens on Claude. The unified SDK must track this accurately so callers can make informed cost decisions. Because thinking styles differ between providers, switching models is rarely a like-for-like swap, but the SDK should still translate reasoning token counts correctly so higher-level tools can compare them.
RECORD ResponseFormat:
type : String -- "text", "json", or "json_schema"
json_schema : Dict | None -- required when type is "json_schema"
strict : Boolean -- when true, provider enforces schema strictly (default: false)
RECORD Warning:
message : String -- human-readable description of the non-fatal issue
code : String | None -- machine-readable warning code
RECORD RateLimitInfo:
requests_remaining : Integer | None
requests_limit : Integer | None
tokens_remaining : Integer | None
tokens_limit : Integer | None
reset_at : Timestamp | None
Populated from provider response headers (e.g., x-ratelimit-remaining-requests). This data is informational; the library does not use it for proactive throttling.
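A hedged sketch of populating this record from headers. Only `x-ratelimit-remaining-requests` is named in this specification; the other three header names follow the same OpenAI-style pattern and are assumptions here, as is the helper name `parse_rate_limit_headers`. Reset timestamps are left unset because their format varies by provider.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Mapping, Optional


@dataclass
class RateLimitInfo:
    requests_remaining: Optional[int] = None
    requests_limit: Optional[int] = None
    tokens_remaining: Optional[int] = None
    tokens_limit: Optional[int] = None
    reset_at: Optional[datetime] = None  # not parsed in this sketch


# Assumed header names (OpenAI-style); other providers use different ones.
_HEADER_FIELDS = {
    "x-ratelimit-remaining-requests": "requests_remaining",
    "x-ratelimit-limit-requests": "requests_limit",
    "x-ratelimit-remaining-tokens": "tokens_remaining",
    "x-ratelimit-limit-tokens": "tokens_limit",
}


def parse_rate_limit_headers(headers: Mapping[str, str]) -> RateLimitInfo:
    """Build RateLimitInfo from headers; missing or malformed values stay None."""
    lowered = {key.lower(): value for key, value in headers.items()}
    info = RateLimitInfo()
    for header, field_name in _HEADER_FIELDS.items():
        value = lowered.get(header)
        if value is None:
            continue
        try:
            setattr(info, field_name, int(value))
        except ValueError:
            pass  # informational data only, so a bad value is ignored, not raised
    return info
```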
All stream events share a type discriminator field. The library normalizes provider-specific SSE formats into this unified event model.
RECORD StreamEvent:
type : StreamEventType | String
-- text events
delta : String | None -- incremental text
text_id : String | None -- identifies which text segment this belongs to
-- reasoning events
reasoning_delta : String | None -- incremental reasoning/thinking text
-- tool call events
tool_call : ToolCall | None -- partial or complete tool call
-- finish event
finish_reason : FinishReason | None
usage : Usage | None
response : Response | None -- the full accumulated response
-- error event
error : SDKError | None
-- passthrough
raw : Dict | None -- raw provider event for passthrough
ENUM StreamEventType:
STREAM_START -- Stream has begun. May include warnings.
TEXT_START -- A new text segment has begun. Includes text_id.
TEXT_DELTA -- Incremental text content. Includes delta and text_id.
TEXT_END -- Text segment is complete. Includes text_id.
REASONING_START -- Model reasoning has begun.
REASONING_DELTA -- Incremental reasoning content.
REASONING_END -- Reasoning is complete.
TOOL_CALL_START -- A tool call has begun. Includes tool name and call ID.
TOOL_CALL_DELTA -- Incremental tool call arguments (partial JSON).
TOOL_CALL_END -- Tool call is fully formed and ready for execution.
FINISH -- Generation complete. Includes finish_reason, usage, response.
ERROR -- An error occurred during streaming.
PROVIDER_EVENT -- Raw provider event not mapped to the unified model.
The start/delta/end pattern. Text, reasoning, and tool call events follow a consistent start/delta/end lifecycle. This pattern enables:
- Multiple concurrent segments -- a response can contain multiple text segments or tool calls in flight simultaneously. IDs correlate deltas to their segment.
- Resource lifecycle -- consumers know when a segment begins and ends, enabling proper buffer management and UI updates.
- Typed completion -- the end event carries the final accumulated value for its segment.
Consumers that only care about text deltas can filter for TEXT_DELTA events and ignore start/end events.
The fundamental blocking call. Sends a request, blocks until the model finishes, returns the full response.
response = client.complete(Request(
model = "claude-opus-4-6",
messages = [Message.user("Explain photosynthesis in one paragraph")],
max_tokens = 500,
temperature = 0.7
))
response.text -- "Photosynthesis is..."
response.finish_reason -- FinishReason(reason="stop", raw="end_turn")
response.usage -- Usage(input_tokens=12, output_tokens=85, ...)
Behavior:
- Routes to the resolved provider adapter.
- Blocks until the model produces a complete response.
- Returns a Response object.
- Raises an exception on provider errors.
- Does NOT retry automatically. Retries are the responsibility of Layer 4 (high-level API) or application code.
The fundamental streaming call. Returns an asynchronous iterator of StreamEvent objects.
event_stream = client.stream(Request(
model = "claude-opus-4-6",
messages = [Message.user("Write a short story")]
))
FOR EACH event IN event_stream:
IF event.type == TEXT_DELTA:
PRINT(event.delta)
ELSE IF event.type == FINISH:
PRINT("Done. Tokens: " + event.usage.total_tokens)
Behavior:
- Returns an async iterator immediately.
- Yields StreamEvent objects as they arrive from the provider.
- The stream terminates with a FINISH event containing the complete accumulated response.
- Must be consumed or explicitly closed; abandoning a stream without closing it may leak connections.
- Does NOT retry automatically.
The primary blocking generation function. Wraps Client.complete() with tool execution loops, multi-step orchestration, prompt standardization, and automatic retries.
FUNCTION generate(
model : String,
prompt : String | None, -- simple text prompt
messages : List<Message> | None, -- full message history
system : String | None, -- system message
tools : List<Tool> | None, -- tools with optional execute handlers
tool_choice : ToolChoice | None, -- auto/none/required/named
max_tool_rounds : Integer = 1, -- max tool execution loop iterations
stop_when : StopCondition | None, -- custom stop condition for tool loops
response_format : ResponseFormat | None,
temperature : Float | None,
top_p : Float | None,
max_tokens : Integer | None,
stop_sequences : List<String> | None,
reasoning_effort : String | None,
provider : String | None,
provider_options : Dict | None,
max_retries : Integer = 2, -- retry count for transient errors
timeout : Float | TimeoutConfig | None,
abort_signal : AbortSignal | None, -- cancellation signal
client : Client | None -- override default client
) -> GenerateResult
Prompt standardization: Either prompt (a simple string, converted to a single user message) or messages (full conversation) is provided, not both. Using both is an error. The system parameter is always separate and prepended as a system message.
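The standardization rule might look like this in Python. The `Message` class is a simplified stand-in for the library's real message type, the helper name `standardize_prompt` is hypothetical, and rejecting a call with neither `prompt` nor `messages` is an assumption (the text above only rules out passing both).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Message:  # simplified stand-in for the library's Message type
    role: str
    content: str


def standardize_prompt(
    prompt: Optional[str] = None,
    messages: Optional[list[Message]] = None,
    system: Optional[str] = None,
) -> list[Message]:
    """Normalize (prompt | messages, system) into one message list."""
    if prompt is not None and messages is not None:
        raise ValueError("Provide either 'prompt' or 'messages', not both")
    if prompt is None and messages is None:
        # Assumption: an empty conversation is treated as a caller error.
        raise ValueError("Provide one of 'prompt' or 'messages'")
    conversation = [Message("user", prompt)] if prompt is not None else list(messages)
    if system is not None:
        conversation.insert(0, Message("system", system))  # system is always prepended
    return conversation
```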
Tool execution loop (detailed in Section 5): When tools with execute handlers are provided and the model responds with tool calls, generate() automatically executes the tools, appends their results to the conversation, and calls the model again. This loop continues until the model responds without tool calls, max_tool_rounds is reached, or a stop condition is met.
max_tool_rounds semantics: The value represents the maximum number of times tool calls are executed and results are fed back. A value of 1 means: make the initial call, if the model returns tool calls execute them and make one more call. A value of 0 means no automatic tool execution (tools are returned to the caller). The total number of LLM calls is at most max_tool_rounds + 1.
RECORD GenerateResult:
text : String -- text from the final step
reasoning : String | None -- reasoning from the final step
tool_calls : List<ToolCall> -- tool calls from the final step
tool_results : List<ToolResult> -- tool results from the final step
finish_reason : FinishReason
usage : Usage -- usage from the final step
total_usage : Usage -- aggregated usage across ALL steps
steps : List<StepResult> -- detailed results for each step
response : Response -- the final Response object
output : Any | None -- parsed structured output (for generate_object)
RECORD StepResult:
text : String
reasoning : String | None
tool_calls : List<ToolCall>
tool_results : List<ToolResult>
finish_reason : FinishReason
usage : Usage
response : Response
warnings : List<Warning>
The primary streaming generation function. Equivalent to generate() but yields events incrementally.
result = stream(
model = "claude-opus-4-6",
prompt = "Write a haiku about coding"
)
FOR EACH event IN result:
IF event.type == TEXT_DELTA:
PRINT(event.delta)
-- After iteration, the full response is available:
response = result.response()
Accepts the same parameters as generate(). When tools with execute handlers are provided and the model makes tool calls, the stream pauses while tools execute, emits a step_finish event, then resumes streaming the model's next response.
The returned StreamResult provides:
- Async iteration over events.
- `response()` -- returns the accumulated Response after the stream ends.
- `text_stream` -- an async iterable that yields only text deltas (convenience).
RECORD StreamResult:
ASYNC ITERATOR over StreamEvent
FUNCTION response() -> Response -- accumulated response (available after stream ends)
PROPERTY text_stream -> AsyncIterator<String> -- yields only text deltas
PROPERTY partial_response -> Response | None -- current accumulated state at any point
A utility that collects stream events into a complete Response:
accumulator = StreamAccumulator()
FOR EACH event IN stream:
accumulator.process(event)
response = accumulator.response() -- equivalent to what complete() would return
This bridges the two modes: any code that works with a Response can be used with streaming by accumulating first.
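A minimal accumulator sketch, assuming heavily simplified `StreamEvent` and `Response` shapes (string event types, no tool calls, no warnings). It only illustrates how deltas correlate to segments by `text_id`; a real implementation would also assemble tool calls.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class StreamEvent:  # simplified stand-in for the record defined earlier
    type: str
    delta: Optional[str] = None
    text_id: Optional[str] = None
    reasoning_delta: Optional[str] = None
    finish_reason: Optional[str] = None
    usage: Optional[Any] = None


@dataclass
class Response:  # simplified stand-in
    text: str
    reasoning: Optional[str]
    finish_reason: Optional[str]
    usage: Optional[Any]


class StreamAccumulator:
    def __init__(self) -> None:
        # Insertion-ordered: segments appear in the order they first emitted text.
        self._segments: dict[Optional[str], list[str]] = {}
        self._reasoning: list[str] = []
        self._finish_reason: Optional[str] = None
        self._usage: Optional[Any] = None

    def process(self, event: StreamEvent) -> None:
        if event.type == "TEXT_DELTA" and event.delta is not None:
            self._segments.setdefault(event.text_id, []).append(event.delta)
        elif event.type == "REASONING_DELTA" and event.reasoning_delta is not None:
            self._reasoning.append(event.reasoning_delta)
        elif event.type == "FINISH":
            self._finish_reason = event.finish_reason
            self._usage = event.usage
        # Start/end and provider events carry no text, so they are ignored here.

    def response(self) -> Response:
        text = "".join("".join(chunks) for chunks in self._segments.values())
        reasoning = "".join(self._reasoning) if self._reasoning else None
        return Response(text, reasoning, self._finish_reason, self._usage)
```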
Structured output generation with schema validation:
result = generate_object(
model = "gpt-5.2",
prompt = "Extract the person's name and age from: 'Alice is 30 years old'",
schema = {
"type": "object",
"properties": {
"name": { "type": "string" },
"age": { "type": "integer" }
},
"required": ["name", "age"]
}
)
result.output -- { "name": "Alice", "age": 30 } (parsed and validated)
result.text -- raw text response
Implementation strategy by provider:
| Provider | Strategy |
|---|---|
| OpenAI | Native response_format: { type: "json_schema", ... } with strict mode |
| Gemini | Native responseMimeType: "application/json" with responseSchema |
| Anthropic | Fallback: inject schema instructions into the system prompt, parse output. Alternatively, use tool-based extraction (define a tool whose input schema matches the desired output, force the model to call it). |
If parsing or validation fails, the function raises NoObjectGeneratedError.
Streaming structured output with partial object updates:
result = stream_object(
model = "gpt-5.2",
prompt = "Generate a list of 5 recipes",
schema = recipes_schema
)
FOR EACH partial IN result:
-- partial is a partially-parsed object that grows as tokens arrive
PRINT("Recipes so far: " + LENGTH(partial.recipes))
final = result.object() -- the complete, validated object
Uses incremental JSON parsing to yield partial objects as tokens arrive. This enables progressive UI rendering.
Both generate() and stream() accept an abort signal for cooperative cancellation:
controller = AbortController()
-- In another thread/coroutine:
controller.abort()
-- The generate call raises AbortError if cancelled:
result = generate(model = "...", prompt = "...", abort_signal = controller.signal)
For streaming, cancellation closes the underlying connection and the stream raises AbortError.
Timeouts can be specified as a simple duration (total timeout) or a structured config:
RECORD TimeoutConfig:
total : Float | None -- max time for the entire multi-step operation
per_step : Float | None -- max time per individual LLM call
The library distinguishes three timeout scopes at the adapter level:
RECORD AdapterTimeout:
connect : Float -- time to establish HTTP connection (default: 10s)
request : Float -- time for entire request/response cycle (default: 120s)
stream_read : Float -- max time between consecutive stream events (default: 30s)
RECORD Tool:
name : String -- unique identifier; [a-zA-Z][a-zA-Z0-9_]* max 64 chars
description : String -- human-readable description for the model
parameters : Dict -- JSON Schema defining the input (root must be "object")
execute : Function | None -- handler function (if present, tool is "active")
Tool name constraints: Names must be valid identifiers: alphanumeric characters and underscores, starting with a letter. Maximum 64 characters. This is the strictest common subset across all providers. The library validates names at definition time.
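Definition-time validation of these constraints can be sketched as below. The function name `validate_tool_name` and the use of `ValueError` are illustrative choices, not part of the specification.

```python
import re

# Pattern and length limit taken from the Tool record above.
_TOOL_NAME_RE = re.compile(r"[a-zA-Z][a-zA-Z0-9_]*")
MAX_TOOL_NAME_LENGTH = 64


def validate_tool_name(name: str) -> None:
    """Raise ValueError if the name is not accepted by the common provider subset."""
    if len(name) > MAX_TOOL_NAME_LENGTH:
        raise ValueError(f"Tool name exceeds {MAX_TOOL_NAME_LENGTH} characters: {name!r}")
    # fullmatch anchors both ends, so trailing characters cannot slip through.
    if not _TOOL_NAME_RE.fullmatch(name):
        raise ValueError(f"Invalid tool name: {name!r}")
```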
Parameter schema: Parameters must be defined as a JSON Schema object with "type": "object" at the root. This is a universal requirement across all providers. The library passes this schema to the provider, which uses it to constrain argument generation.
Example:
weather_tool = Tool(
name = "get_weather",
description = "Get the current weather for a location",
parameters = {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "City name, e.g. 'San Francisco, CA'"
},
"unit": {
"type": "string",
"enum": ["celsius", "fahrenheit"],
"description": "Temperature unit"
}
},
"required": ["location"]
},
execute = get_weather_function
)
The execute handler is a callable (sync or async) that receives parsed arguments and returns a result:
FUNCTION get_weather(location: String, unit: String = "celsius") -> String:
-- Call weather API...
RETURN "72F and sunny in " + location
Handler contract:
- Input: Parsed JSON arguments as keyword arguments, or a single dictionary.
- Output: A string, dictionary, list, or any JSON-serializable value.
- Errors: Raise an exception to indicate tool failure. The library catches it and sends an error result to the model (with `is_error = true`), allowing the model to recover.
Tool context injection: Handlers can optionally receive injected context. The library inspects the handler's signature and injects recognized keyword arguments:
FUNCTION my_tool(
query : String, -- tool parameter
messages : List<Message>, -- injected: current conversation
abort_signal : AbortSignal, -- injected: cancellation signal
tool_call_id : String -- injected: ID of this call
) -> String:
...
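Signature inspection of this kind could be done with Python's `inspect` module. This is a sketch for synchronous handlers only; the helper name `call_with_context` is hypothetical, and letting injected values override same-named model arguments is an assumption.

```python
import inspect
from typing import Any, Callable

# Keyword names the library recognizes for injection (from the example above).
INJECTABLE = ("messages", "abort_signal", "tool_call_id")


def call_with_context(
    handler: Callable[..., Any],
    arguments: dict[str, Any],
    context: dict[str, Any],
) -> Any:
    """Call a handler, injecting only the context values its signature asks for."""
    params = inspect.signature(handler).parameters
    injected = {
        name: context[name]
        for name in INJECTABLE
        if name in params and name in context
    }
    # Assumption: injected values win if the model supplied an argument
    # with the same name as an injectable keyword.
    return handler(**{**arguments, **injected})
```

Handlers that do not declare any of the recognized names are called with the model's arguments alone, so existing tools keep working unchanged.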
Controls whether and how the model uses tools:
RECORD ToolChoice:
mode : String -- "auto", "none", "required", "named"
tool_name : String | None -- required when mode is "named"
| Mode | Behavior |
|---|---|
| auto | Model decides whether to call tools or respond with text. |
| none | Model must not call any tools, even if defined. |
| required | Model must call at least one tool. |
| named | Model must call the specific tool identified by tool_name. |
Provider mapping:
| SDK Mode | OpenAI | Anthropic | Gemini |
|---|---|---|---|
| auto | "auto" | {"type": "auto"} | "AUTO" |
| none | "none" | Omit tools from request | "NONE" |
| required | "required" | {"type": "any"} | "ANY" |
| named | {"type":"function","function":{"name":"..."}} | {"type":"tool","name":"..."} | {"mode":"ANY","allowedFunctionNames":["..."]} |
Note on Anthropic none mode: Anthropic does not support tool_choice: {"type": "none"} when tools are present. The adapter must omit the tools array from the request body entirely.
If a provider does not support a particular mode, the adapter raises UnsupportedToolChoiceError. The supports_tool_choice(mode) method allows checking capabilities upfront.
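The provider mapping can be sketched as a single translation function. The function name, the `OMIT_TOOLS` sentinel, and returning the Gemini modes as a dict with a `mode` key (matching the shape of the named row) are assumptions made for illustration.

```python
from typing import Any, Optional

# Sentinel telling the Anthropic adapter to drop the tools array entirely.
OMIT_TOOLS = object()

_VALID_MODES = {"auto", "none", "required", "named"}
_GEMINI_MODES = {"auto": "AUTO", "none": "NONE", "required": "ANY"}


def translate_tool_choice(provider: str, mode: str, tool_name: Optional[str] = None) -> Any:
    """Translate a unified tool choice into the provider's wire value."""
    if mode not in _VALID_MODES:
        raise ValueError(f"Unknown tool choice mode: {mode!r}")
    if mode == "named" and not tool_name:
        raise ValueError("tool_name is required when mode is 'named'")

    if provider == "openai":
        if mode == "named":
            return {"type": "function", "function": {"name": tool_name}}
        return mode  # "auto", "none", "required" pass through unchanged
    if provider == "anthropic":
        if mode == "none":
            return OMIT_TOOLS
        if mode == "named":
            return {"type": "tool", "name": tool_name}
        return {"type": "any" if mode == "required" else "auto"}
    if provider == "gemini":
        if mode == "named":
            return {"mode": "ANY", "allowedFunctionNames": [tool_name]}
        return {"mode": _GEMINI_MODES[mode]}
    raise ValueError(f"Unknown provider: {provider!r}")
```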
Extracted from responses and produced by execute handlers:
RECORD ToolCall:
id : String -- unique identifier (provider-assigned)
name : String -- tool name
arguments : Dict -- parsed JSON arguments
raw_arguments : String | None -- raw argument string before parsing
RECORD ToolResult:
tool_call_id : String -- correlates to ToolCall.id
content : String | Dict | List -- the tool's output
is_error : Boolean -- true if the tool execution failed
Active tools have an execute handler. When used with generate() or stream(), the library automatically executes them and loops until the model produces a final text response.
Passive tools have no execute handler. Tool calls are returned to the caller in the response, and the caller manages the execution loop manually using Client.complete().
Passive tools are useful when:
- Tool execution requires external coordination (human approval, external orchestration).
- The calling code has its own loop and state management.
- Tools need to be executed in a specific order or with side effects between them.
When generate() is called with active tools, the following loop executes:
FUNCTION tool_loop(request, tools, max_tool_rounds, stop_when):
conversation = request.messages
steps = []
FOR round_num FROM 0 TO max_tool_rounds:
response = client.complete(request_with(conversation))
tool_calls = response.tool_calls
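-- Execute tools if the model wants to call them and the round budget allows it
-- (with max_tool_rounds = 0, tool calls are returned to the caller unexecuted)
IF tool_calls AND response.finish_reason.reason == "tool_calls" AND round_num < max_tool_rounds: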
tool_results = execute_all_tools(tools, tool_calls) -- concurrent
ELSE:
tool_results = []
step = StepResult(response, tool_calls, tool_results, ...)
steps.APPEND(step)
-- Check stop conditions
IF tool_calls is empty OR response.finish_reason.reason != "tool_calls":
BREAK -- model is done (natural completion)
IF round_num >= max_tool_rounds:
BREAK -- budget exhausted
IF stop_when is not None AND stop_when(steps) == true:
BREAK -- custom stop condition met
-- Continue conversation with tool results
conversation.APPEND(response.message) -- assistant message with tool calls
FOR EACH result IN tool_results:
conversation.APPEND(Message.tool_result(
tool_call_id = result.tool_call_id,
content = result.content,
is_error = result.is_error
))
RETURN GenerateResult from steps
When the model returns multiple tool calls in a single response, they are logically independent (the model generated them simultaneously without seeing any results). The library MUST handle this correctly:
- Execute all tool calls concurrently. Launch all execute handlers simultaneously (using async tasks, threads, or equivalent concurrency primitive).
- Wait for ALL results before continuing. Do not send partial results back to the model. The continuation request must include results for every tool call from the previous response.
- Send all results in a single continuation request. Bundle all tool results into the message history and make one LLM call, not one call per result.
- Preserve ordering. Tool results should appear in the same order as the corresponding tool calls, even though execution may complete out of order.
- Handle partial failures gracefully. If some tool executions succeed and others fail, send all results (with `is_error = true` for failures). Do not abort the entire batch because one tool failed.
FUNCTION execute_all_tools(tools, tool_calls):
-- Launch all executions concurrently
futures = []
FOR EACH call IN tool_calls:
tool = find_tool(tools, call.name)
IF tool AND tool.execute:
futures.APPEND(async_execute(tool.execute, call.arguments, call.id))
ELSE:
futures.APPEND(immediate_error(call.id, "Unknown tool: " + call.name))
-- Wait for ALL to complete
results = AWAIT_ALL(futures)
RETURN results -- List<ToolResult>, one per tool_call, in order
This is critical for downstream consumers like coding agents. When a model asks to read three files simultaneously, the SDK handles the concurrent execution and result batching so the coding agent's agentic loop does not have to manage it.
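In Python, the concurrent batch above maps naturally onto `asyncio.gather`, which returns results in argument order regardless of completion order. The sketch below uses simplified `ToolCall` and `ToolResult` stand-ins and takes a plain name-to-handler dict instead of `Tool` objects; both are assumptions for brevity.

```python
import asyncio
import inspect
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolCall:  # simplified stand-in
    id: str
    name: str
    arguments: dict[str, Any]


@dataclass
class ToolResult:  # simplified stand-in
    tool_call_id: str
    content: Any
    is_error: bool = False


async def _run_one(handlers: dict[str, Callable[..., Any]], call: ToolCall) -> ToolResult:
    handler = handlers.get(call.name)
    if handler is None:
        return ToolResult(call.id, f"Unknown tool: {call.name}", is_error=True)
    try:
        result = handler(**call.arguments)
        if inspect.isawaitable(result):  # supports both sync and async handlers
            result = await result
        return ToolResult(call.id, result)
    except Exception as exc:
        # A failing tool becomes an error result instead of aborting the batch.
        return ToolResult(call.id, str(exc), is_error=True)


async def execute_all_tools(
    handlers: dict[str, Callable[..., Any]], tool_calls: list[ToolCall]
) -> list[ToolResult]:
    """Run every call concurrently; results keep the order of tool_calls."""
    return list(await asyncio.gather(*(_run_one(handlers, c) for c in tool_calls)))
```

Note that synchronous handlers run inline on the event loop in this sketch; a production version would likely offload them to a thread pool so one slow handler does not block the others.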
Before passing arguments to the execute handler, the library:
- Parses the JSON argument string.
- Optionally validates against the tool's parameter schema.
- If validation fails and a `repair_tool_call` function is provided, attempts repair (e.g., ask the model to fix the arguments).
- If repair fails or is not configured, sends an error result to the model.
Unknown tool calls: When the model calls a tool not in the definitions, the library sends an error result rather than raising an exception. This gives the model a chance to correct its behavior.
When streaming with active tools, the stream emits tool call events as they form. Between steps (after tool execution, before the next model call), a step_finish event is emitted. The consumer sees a continuous stream of events spanning multiple steps.
How tool results are translated to each provider's format:
| SDK Format | OpenAI | Anthropic | Gemini |
|---|---|---|---|
| TOOL role message with ToolResultData | Separate tool messages with tool_call_id | tool_result content blocks in user message | functionResponse parts in user content |
All library errors inherit from a single base:
RECORD SDKError:
message : String -- human-readable description
cause : Exception | None -- underlying exception, if any
Error hierarchy:
SDKError
+-- ProviderError -- errors from the LLM provider
| +-- AuthenticationError -- 401: invalid API key, expired token
| +-- AccessDeniedError -- 403: insufficient permissions
| +-- NotFoundError -- 404: model not found, endpoint not found
| +-- InvalidRequestError -- 400: malformed request, invalid parameters
| +-- RateLimitError -- 429: rate limit exceeded
| +-- ServerError -- 500-599: provider internal error
| +-- ContentFilterError -- response blocked by safety filter
| +-- ContextLengthError -- input + output exceeds context window
| +-- QuotaExceededError -- billing/usage quota exhausted
+-- RequestTimeoutError -- request or stream timed out
+-- AbortError -- request cancelled via abort signal
+-- NetworkError -- network-level failure
+-- StreamError -- error during stream consumption
+-- InvalidToolCallError -- tool call arguments failed validation
+-- NoObjectGeneratedError -- structured output parsing/validation failed
+-- ConfigurationError -- SDK misconfiguration (missing provider, etc.)
Note: Error class names are chosen to avoid shadowing common language built-in names (e.g., AccessDeniedError instead of PermissionError, NetworkError instead of ConnectionError, RequestTimeoutError instead of TimeoutError).
RECORD ProviderError extends SDKError:
provider : String -- which provider returned the error
status_code : Integer | None -- HTTP status code, if applicable
error_code : String | None -- provider-specific error code
retryable : Boolean -- whether this error is safe to retry
retry_after : Float | None -- seconds to wait before retrying
raw : Dict | None -- raw error response body from the provider
Every error carries a retryable property.
Non-retryable errors (client mistakes -- retrying will not help):
| Error | Status Code | Retryable |
|---|---|---|
| AuthenticationError | 401 | false |
| AccessDeniedError | 403 | false |
| NotFoundError | 404 | false |
| InvalidRequestError | 400, 422 | false |
| ContextLengthError | 413 | false |
| QuotaExceededError | (varies) | false |
| ContentFilterError | (varies) | false |
| ConfigurationError | (N/A) | false |
Retryable errors (transient -- may succeed on retry):
| Error | Status Code | Retryable |
|---|---|---|
| RateLimitError | 429 | true |
| ServerError | 500-504 | true |
| RequestTimeoutError | 408 | true |
| NetworkError | (N/A) | true |
| StreamError | (N/A) | true |
Unknown errors default to retryable. This is a deliberate conservative choice: transient network issues and novel provider error codes are more common than permanent failures from unexpected codes. A false retry is cheaper than a false abort.
Adapters map HTTP status codes to error types using this table:
| Status | Error Type | Retryable |
|---|---|---|
| 400 | InvalidRequestError | false |
| 401 | AuthenticationError | false |
| 403 | AccessDeniedError | false |
| 404 | NotFoundError | false |
| 408 | RequestTimeoutError | true |
| 413 | ContextLengthError | false |
| 422 | InvalidRequestError | false |
| 429 | RateLimitError | true |
| 500 | ServerError | true |
| 502 | ServerError | true |
| 503 | ServerError | true |
| 504 | ServerError | true |
For Gemini (which may use gRPC status codes):
| gRPC Code | Error Type |
|---|---|
| NOT_FOUND | NotFoundError |
| INVALID_ARGUMENT | InvalidRequestError |
| UNAUTHENTICATED | AuthenticationError |
| PERMISSION_DENIED | AccessDeniedError |
| RESOURCE_EXHAUSTED | RateLimitError |
| UNAVAILABLE | ServerError |
| DEADLINE_EXCEEDED | RequestTimeoutError |
| INTERNAL | ServerError |
For ambiguous cases where the status code alone is insufficient, the adapter checks the error message body for classification signals:
- Messages containing "not found" or "does not exist" -> NotFoundError
- Messages containing "unauthorized" or "invalid key" -> AuthenticationError
- Messages containing "context length" or "too many tokens" -> ContextLengthError
- Messages containing "content filter" or "safety" -> ContentFilterError
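The status table and message signals can be combined into one classifier, sketched below. Error types are returned as plain strings to keep the example short. Which statuses count as "ambiguous" is not defined above, so treating 400 and 422 (plus any unmapped status) as ambiguous is an assumption, as is the generic `"ProviderError"` fallback name.

```python
from typing import Optional

# (error type, retryable) per HTTP status, copied from the table above.
STATUS_TO_ERROR = {
    400: ("InvalidRequestError", False),
    401: ("AuthenticationError", False),
    403: ("AccessDeniedError", False),
    404: ("NotFoundError", False),
    408: ("RequestTimeoutError", True),
    413: ("ContextLengthError", False),
    422: ("InvalidRequestError", False),
    429: ("RateLimitError", True),
    500: ("ServerError", True),
    502: ("ServerError", True),
    503: ("ServerError", True),
    504: ("ServerError", True),
}

# Message signals from the list above; all four target types are non-retryable.
MESSAGE_SIGNALS = [
    (("not found", "does not exist"), "NotFoundError"),
    (("unauthorized", "invalid key"), "AuthenticationError"),
    (("context length", "too many tokens"), "ContextLengthError"),
    (("content filter", "safety"), "ContentFilterError"),
]

# Assumption: generic client-error statuses are the ones worth a message check.
AMBIGUOUS_STATUSES = {400, 422}


def classify_error(status_code: Optional[int], message: str = "") -> tuple[str, bool]:
    """Return (error_type_name, retryable) for a provider error."""
    mapped = STATUS_TO_ERROR.get(status_code)
    if mapped is None or status_code in AMBIGUOUS_STATUSES:
        text = message.lower()
        for needles, error_type in MESSAGE_SIGNALS:
            if any(needle in text for needle in needles):
                return error_type, False
    if mapped is not None:
        return mapped
    return "ProviderError", True  # unknown errors default to retryable
```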
RECORD RetryPolicy:
max_retries : Integer = 2 -- total retry attempts (not counting initial)
base_delay : Float = 1.0 -- initial delay in seconds
max_delay : Float = 60.0 -- maximum delay between retries
backoff_multiplier : Float = 2.0 -- exponential backoff factor
jitter : Boolean = true -- add random jitter to prevent thundering herd
on_retry : Callback | None -- called before each retry with (error, attempt, delay)
The delay for attempt n (0-indexed) is calculated as:
delay = MIN(base_delay * (backoff_multiplier ^ n), max_delay)
IF jitter:
delay = delay * RANDOM(0.5, 1.5) -- +/- 50% jitter
Example delays with defaults (base=1.0, multiplier=2.0, max=60.0):
| Attempt | Base Delay | With Jitter (approx range) |
|---|---|---|
| 0 | 1.0s | 0.5s -- 1.5s |
| 1 | 2.0s | 1.0s -- 3.0s |
| 2 | 4.0s | 2.0s -- 6.0s |
| 3 | 8.0s | 4.0s -- 12.0s |
| 4 | 16.0s | 8.0s -- 24.0s |
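The delay formula translates directly to Python. In this sketch the `on_retry` callback is omitted and the `rng` parameter is an addition that makes the jitter testable; neither choice comes from the specification.

```python
import random
from dataclasses import dataclass


@dataclass
class RetryPolicy:
    max_retries: int = 2
    base_delay: float = 1.0
    max_delay: float = 60.0
    backoff_multiplier: float = 2.0
    jitter: bool = True


def compute_delay(policy: RetryPolicy, attempt: int, rng=random) -> float:
    """Delay in seconds before retry number `attempt` (0-indexed)."""
    delay = min(policy.base_delay * policy.backoff_multiplier ** attempt, policy.max_delay)
    if policy.jitter:
        delay *= rng.uniform(0.5, 1.5)  # +/- 50% jitter, applied after the cap
    return delay
```

Because jitter is applied after the cap, a jittered delay can exceed `max_delay` by up to 50%, which matches the formula as written above.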
When the provider returns a Retry-After header (common with 429 responses):
- If `Retry-After` is less than `max_delay`, use the provider's delay instead of the calculated backoff.
- If `Retry-After` exceeds `max_delay`, do NOT retry. Raise the error immediately with `retry_after` set on the exception. This prevents silently waiting minutes for a rate limit to clear.
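The Retry-After rule can be sketched as a small decision helper. Both function names are hypothetical. Two assumptions are made: a `Retry-After` value exactly equal to `max_delay` is treated as usable, and only the delta-seconds form of the header is parsed (the HTTP-date form is ignored and falls back to calculated backoff).

```python
import math
from typing import Optional


def parse_retry_after_seconds(value: Optional[str]) -> Optional[float]:
    """Parse the delta-seconds form of Retry-After; return None if unusable."""
    if value is None:
        return None
    try:
        seconds = float(value.strip())
    except ValueError:
        return None  # HTTP-date form is not handled in this sketch
    if not math.isfinite(seconds):
        return None
    return max(seconds, 0.0)


def resolve_retry_delay(
    retry_after: Optional[float], backoff_delay: float, max_delay: float
) -> Optional[float]:
    """Return the seconds to wait, or None when the caller should not retry."""
    if retry_after is None:
        return backoff_delay  # no provider hint: use the calculated backoff
    if retry_after > max_delay:
        return None  # too long to wait silently: surface the error instead
    return retry_after  # the provider's delay overrides the calculated backoff
```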
Retries apply to individual LLM calls, not to entire multi-step operations:
- `generate()` with tools: Each step's LLM call is retried independently. A retry on step 3 does not re-execute steps 1 and 2.
- `stream()`: Only the initial connection is retried. Once streaming has begun and partial data has been delivered, the library does not retry. Instead, the stream emits an error event.
- `generate_object()`: The LLM call is retried. Schema validation failures are NOT retried (they indicate a model behavior issue, not a transient error).
Provider adapters do NOT retry by default. Retry logic lives in Layer 2 (provider utilities) and is applied by the high-level functions in Layer 4. Low-level Client.complete() and Client.stream() never retry automatically. Applications using the low-level API can compose retry behavior using a standalone retry() utility:
response = retry(
FUNCTION: client.complete(request),
policy = RetryPolicy(max_retries = 3)
)
Set max_retries = 0 to disable automatic retries in high-level functions.
When a provider returns HTTP 429, the library raises RateLimitError with retry_after extracted from the response header and retryable = true. With automatic retries enabled, rate limits are handled transparently up to the retry budget.
For applications that need proactive rate limiting (staying under limits rather than hitting them), use middleware:
FUNCTION rate_limit_middleware(request, next):
token_bucket.acquire() -- block until budget available
RETURN next(request)
This section provides detailed guidance for implementing a provider adapter. It is intended as a reference for anyone adding support for a new provider.
Each adapter must implement:
INTERFACE ProviderAdapter:
PROPERTY name : String
FUNCTION complete(request: Request) -> Response
FUNCTION stream(request: Request) -> AsyncIterator<StreamEvent>
Recommended optional methods:
FUNCTION close() -> Void
FUNCTION initialize() -> Void
FUNCTION supports_tool_choice(mode: String) -> Boolean
The adapter must translate a unified Request into the provider's native API format. The general steps are:
- Extract system messages. For Anthropic: extract from message list, pass as `system` parameter. For Gemini: extract and pass as `systemInstruction`. For OpenAI (Responses API): extract and pass as `instructions` parameter.
- Translate messages. Convert each Message and its ContentParts to the provider's format.
- Translate tools. Convert Tool definitions to the provider's tool format.
- Translate tool choice. Map the unified ToolChoice to the provider's format.
- Set generation parameters. Map temperature, top_p, max_tokens, stop_sequences, etc.
- Apply response format. Translate ResponseFormat to the provider's structured output mechanism.
- Apply provider options. Merge any provider-specific options from `request.provider_options[provider_name]` into the request body.
The Responses API uses a different message format than Chat Completions. Messages are passed in an input array rather than a messages array:
Unified Role -> Responses API Handling
SYSTEM -> Extracted to `instructions` parameter
USER -> input item: { "type": "message", "role": "user", "content": [...] }
ASSISTANT -> input item: { "type": "message", "role": "assistant", "content": [...] }
TOOL -> input item: { "type": "function_call_output", "call_id": "...", "output": "..." }
DEVELOPER -> Extracted to `instructions` parameter (or `developer` role input item)
ContentPart Translations:
TEXT -> { "type": "input_text", "text": "..." } (user) or { "type": "output_text", "text": "..." } (assistant)
IMAGE (url) -> { "type": "input_image", "image_url": "..." }
IMAGE (data) -> { "type": "input_image", "image_url": "data:<mime>;base64,<data>" }
TOOL_CALL -> input item: { "type": "function_call", "call_id": "...", "name": "...", "arguments": "..." }
TOOL_RESULT -> input item: { "type": "function_call_output", "call_id": "...", "output": "..." }
Special behaviors:
- System messages are extracted to the `instructions` parameter, not included in the `input` array.
- The `reasoning.effort` parameter controls reasoning for o-series models ("low", "medium", "high").
- Tool calls and results are top-level input items, not nested within messages.
- For third-party OpenAI-compatible endpoints, use the Chat Completions format instead (see Section 7.10).
Unified Role -> Anthropic Handling
SYSTEM -> Extracted to `system` parameter (not in messages array)
DEVELOPER -> Merged with system parameter
USER -> "user" role
ASSISTANT -> "assistant" role
TOOL -> "user" role with tool_result content blocks
ContentPart Translations:
TEXT -> { "type": "text", "text": "..." }
IMAGE (url) -> { "type": "image", "source": { "type": "url", "url": "..." } }
IMAGE (data) -> { "type": "image", "source": { "type": "base64", "media_type": "...", "data": "..." } }
TOOL_CALL -> { "type": "tool_use", "id": "...", "name": "...", "input": { ... } }
TOOL_RESULT -> { "type": "tool_result", "tool_use_id": "...", "content": "...", "is_error": ... }
THINKING -> { "type": "thinking", "thinking": "...", "signature": "..." }
REDACTED_THINKING -> { "type": "redacted_thinking", "data": "..." }
Special behaviors:
- Strict alternation: Anthropic requires alternating user/assistant messages. The adapter must merge consecutive same-role messages by combining their content arrays.
- Tool results in user messages: Anthropic requires tool results to appear in user-role messages, not a separate "tool" role.
- Thinking block round-tripping: Thinking and redacted_thinking blocks from previous responses must be preserved exactly as received and included in subsequent assistant messages.
- max_tokens is required: Anthropic always requires `max_tokens`. Default to 4096 if not specified.
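The strict-alternation rule above can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not a definitive implementation: `merge_consecutive_roles` is a hypothetical helper name, and the input is assumed to be messages already translated into Anthropic-style dicts with a `role` string and a `content` list.

```python
from typing import Any


def merge_consecutive_roles(messages: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Merge adjacent messages that share a role by concatenating their content lists."""
    merged: list[dict[str, Any]] = []
    for msg in messages:
        if merged and merged[-1]["role"] == msg["role"]:
            # Same role as the previous message: append its content blocks.
            merged[-1]["content"].extend(msg["content"])
        else:
            # Copy the content list so the caller's messages are not mutated.
            merged.append({"role": msg["role"], "content": list(msg["content"])})
    return merged


# Hypothetical history: the unified TOOL message has already become a "user" message.
history = [
    {"role": "user", "content": [{"type": "text", "text": "Check the weather."}]},
    {"role": "assistant", "content": [{"type": "tool_use", "id": "call_1", "name": "get_weather", "input": {}}]},
    {"role": "user", "content": [{"type": "tool_result", "tool_use_id": "call_1", "content": "72F"}]},
    {"role": "user", "content": [{"type": "text", "text": "Thanks, and tomorrow?"}]},
]
merged_history = merge_consecutive_roles(history)
```

In this example the tool result and the follow-up user text end up in one `user` message, so the role sequence alternates as required.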
Unified Role -> Gemini Handling
SYSTEM -> Extracted to `systemInstruction` field
DEVELOPER -> Merged with systemInstruction
USER -> "user" role
ASSISTANT -> "model" role
TOOL -> "user" role with functionResponse parts
ContentPart Translations:
TEXT -> { "text": "..." }
IMAGE (url) -> { "fileData": { "mimeType": "...", "fileUri": "..." } }
IMAGE (data) -> { "inlineData": { "mimeType": "...", "data": "<base64>" } }
TOOL_CALL -> { "functionCall": { "name": "...", "args": { ... } } }
TOOL_RESULT -> { "functionResponse": { "name": "<function_name>", "response": { ... } } }
Special behaviors:
- No developer role: Treated the same as system.
- Tool call IDs: Gemini does not assign unique IDs to function calls. The adapter must generate synthetic unique IDs (e.g., `"call_" + random_uuid()`) and maintain a mapping from synthetic IDs to function names for when tool results are sent back.
- Function response format: Gemini's `functionResponse` uses the function name (not the call ID) and expects a dict for the response (wrap strings in `{"result": "..."}` if needed).
- Streaming format: Gemini uses JSON chunks (optionally via SSE with `?alt=sse`), not a standard SSE endpoint.
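One possible way to keep the synthetic-ID bookkeeping described above is sketched below. The class name `GeminiCallIdMap` and its methods are assumptions made for illustration; only the `"call_"` prefix, the name-based `functionResponse` shape, and the `{"result": ...}` wrapping come from the text.

```python
import uuid
from typing import Any


class GeminiCallIdMap:
    """Tracks synthetic tool call IDs for a provider that does not assign its own."""

    def __init__(self) -> None:
        self._names_by_id: dict[str, str] = {}

    def register(self, function_name: str) -> str:
        """Create a synthetic ID for a function call and remember its function name."""
        call_id = "call_" + uuid.uuid4().hex
        self._names_by_id[call_id] = function_name
        return call_id

    def to_function_response(self, call_id: str, result: Any) -> dict[str, Any]:
        """Build a functionResponse part, looking the function name up by synthetic ID."""
        name = self._names_by_id[call_id]
        # The response must be a dict; wrap any other value.
        response = result if isinstance(result, dict) else {"result": result}
        return {"functionResponse": {"name": name, "response": response}}


id_map = GeminiCallIdMap()
call_id = id_map.register("get_weather")
part = id_map.to_function_response(call_id, "72F, sunny")
```

A real adapter would probably scope this map to a single conversation so that IDs do not accumulate indefinitely.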
| SDK Format | OpenAI | Anthropic | Gemini |
|---|---|---|---|
| Tool.name | tools[].function.name | tools[].name | tools[].functionDeclarations[].name |
| Tool.description | tools[].function.description | tools[].description | tools[].functionDeclarations[].description |
| Tool.parameters | tools[].function.parameters | tools[].input_schema | tools[].functionDeclarations[].parameters |
| Wrapper structure | `{"type":"function","function":{...}}` | `{"name":...,"description":...,"input_schema":...}` | `{"functionDeclarations":[{...}]}` |
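The table above could be turned into code along the following lines. This is a sketch only: the `Tool` dataclass and the three function names are hypothetical, and the output shapes simply mirror the wrapper structures listed in the table.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Tool:
    """Hypothetical stand-in for the unified tool definition."""
    name: str
    description: str
    parameters: dict[str, Any] = field(default_factory=dict)  # JSON Schema


def to_openai_tool(tool: Tool) -> dict[str, Any]:
    return {
        "type": "function",
        "function": {
            "name": tool.name,
            "description": tool.description,
            "parameters": tool.parameters,
        },
    }


def to_anthropic_tool(tool: Tool) -> dict[str, Any]:
    return {
        "name": tool.name,
        "description": tool.description,
        "input_schema": tool.parameters,
    }


def to_gemini_tools(tools: list[Tool]) -> dict[str, Any]:
    # All declarations share one wrapper object.
    return {
        "functionDeclarations": [
            {"name": t.name, "description": t.description, "parameters": t.parameters}
            for t in tools
        ]
    }


weather = Tool(
    name="get_weather",
    description="Look up current weather for a city.",
    parameters={"type": "object", "properties": {"city": {"type": "string"}}, "required": ["city"]},
)
```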
The adapter must parse the provider's response into the unified Response format:
- Extract content parts. Parse the provider's content/parts array into `List<ContentPart>` with appropriate `ContentKind` tags.
- Map finish reason. Translate the provider's finish/stop reason to the unified `FinishReason` (see mapping table in Section 3.8).
- Extract usage. Map the provider's token count fields to `Usage` (see mapping table in Section 3.9).
- Preserve raw response. Store the complete provider response in `Response.raw` for debugging.
- Extract rate limit info. Parse `x-ratelimit-*` headers into `RateLimitInfo` if present.
The adapter must translate HTTP errors into the error hierarchy:
- Parse the response body for error details (message, error code).
- Extract `Retry-After` header if present.
- Map the HTTP status code to the appropriate error type using the table in Section 6.4.
- For ambiguous cases, apply message-based classification (Section 6.5).
- Preserve the raw error response in the `raw` field.
FUNCTION raise_error(http_response):
    body = parse_json(http_response.body)
    message = body.error.message OR http_response.text
    error_code = body.error.code OR body.error.type
    retry_after = None
    IF http_response.headers["retry-after"] EXISTS:
        retry_after = parse_float(http_response.headers["retry-after"])
    RAISE error_from_status_code(
        status_code = http_response.status,
        message = message,
        provider = self.name,
        error_code = error_code,
        raw = body,
        retry_after = retry_after
    )
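The pseudocode above parses `Retry-After` as a number. HTTP also allows the header to carry an HTTP-date, so a more tolerant parser might look like the sketch below. The function name `parse_retry_after` and its `now` parameter (there to make the example testable) are assumptions, not part of the specification.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_retry_after(value: Optional[str], now: Optional[datetime] = None) -> Optional[float]:
    """Return the delay in seconds encoded by a Retry-After header, or None if unusable."""
    if value is None:
        return None
    value = value.strip()
    try:
        # Delay-seconds form, e.g. "30". Fractional values are accepted leniently.
        return max(0.0, float(value))
    except ValueError:
        pass
    try:
        # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a date without zone information as UTC.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return max(0.0, (when - now).total_seconds())
```

Returning `None` for an unparseable header lets the caller fall back to its own backoff calculation instead of failing while already handling an error.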
The adapter translates provider-specific streaming formats into the unified StreamEvent model.
Most providers use Server-Sent Events (SSE). A proper SSE parser must handle:
- `event:` lines (event type)
- `data:` lines (payload, may span multiple lines)
- `retry:` lines (reconnection interval)
- Comment lines (starting with `:`)
- Blank lines (event boundary)
The parser yields (event_type, data) tuples. Many providers include the event type in the JSON payload as well as in the SSE event field; prefer the JSON payload field for reliability.
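A line-based parser covering the cases listed above could look roughly like this. It is a minimal sketch: `parse_sse` is a hypothetical name, input is assumed to arrive as already-decoded text lines, and `retry:` and `id:` fields are recognized but ignored. Flushing a trailing event that has no closing blank line is a deliberate leniency in this sketch, not something the text requires.

```python
from typing import Iterable, Iterator, Optional


def parse_sse(lines: Iterable[str]) -> Iterator[tuple[Optional[str], str]]:
    """Yield (event_type, data) tuples from an iterable of SSE text lines."""
    event_type: Optional[str] = None
    data_lines: list[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Blank line marks an event boundary; dispatch if any data was buffered.
            if data_lines:
                yield event_type, "\n".join(data_lines)
            event_type, data_lines = None, []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space after the colon is dropped
        if field == "event":
            event_type = value
        elif field == "data":
            data_lines.append(value)  # multiple data lines are joined with "\n"
        # Other fields ("retry", "id") are ignored in this sketch.
    if data_lines:
        # Lenient flush for streams that end without a final blank line.
        yield event_type, "\n".join(data_lines)


stream_lines = [
    ": keep-alive",
    "event: message_start",
    'data: {"a":',
    "data: 1}",
    "",
    "retry: 3000",
    "data: [DONE]",
    "",
]
events = list(parse_sse(stream_lines))
```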
The Responses API uses a different streaming format than Chat Completions:
Provider Format (Responses API):
event: response.created -- response object created
event: response.in_progress -- generation started
event: response.output_text.delta -- incremental text
event: response.function_call_arguments.delta -- incremental tool call args
event: response.output_item.done -- output item complete
event: response.completed -- generation complete, includes usage with reasoning_tokens
Translation:
output_text.delta -> TEXT_DELTA event (emit TEXT_START on first)
function_call_arguments.delta -> TOOL_CALL_DELTA event
output_item.done (text) -> TEXT_END event
output_item.done (function) -> TOOL_CALL_END event
response.completed -> FINISH event with usage (including reasoning_tokens)
Because the Responses API streaming format reports reasoning token counts in the final `response.completed` event, the Responses API is required for reasoning models.
For the OpenAI-compatible adapter (Chat Completions), the streaming format is:
Provider Format (Chat Completions, for third-party endpoints):
data: {"choices": [{"delta": {"content": "text"}, "finish_reason": null}]}
data: {"choices": [{"delta": {"tool_calls": [{"index": 0, ...}]}}]}
data: {"usage": {...}}
data: [DONE]
Provider Format (SSE events):
event: message_start -- contains message metadata and input token count
event: content_block_start -- new content block (text, tool_use, thinking)
event: content_block_delta -- incremental content within a block
event: content_block_stop -- block complete
event: message_delta -- finish reason and output usage
event: message_stop -- stream complete
Translation:
content_block_start (type=text) -> TEXT_START
content_block_delta (type=text) -> TEXT_DELTA
content_block_stop (type=text) -> TEXT_END
content_block_start (type=tool_use) -> TOOL_CALL_START
content_block_delta (type=tool_use) -> TOOL_CALL_DELTA
content_block_stop (type=tool_use) -> TOOL_CALL_END
content_block_start (type=thinking) -> REASONING_START
content_block_delta (type=thinking) -> REASONING_DELTA
content_block_stop (type=thinking) -> REASONING_END
message_stop -> FINISH with accumulated response
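The translation table above might be implemented along these lines. This sketch assumes each SSE payload has already been JSON-decoded into a dict, and that blocks are tracked by their `index` because stop events do not repeat the block type. The function name and the unified event dict shape are hypothetical, and accumulation of the final response for `FINISH` is omitted.

```python
from typing import Any, Iterable, Iterator

# Provider content block type -> unified event name prefix.
_PREFIX = {"text": "TEXT", "tool_use": "TOOL_CALL", "thinking": "REASONING"}


def translate_anthropic(events: Iterable[dict[str, Any]]) -> Iterator[dict[str, Any]]:
    block_kinds: dict[int, str] = {}  # content block index -> unified prefix
    for ev in events:
        kind = ev["type"]
        if kind == "content_block_start":
            prefix = _PREFIX.get(ev["content_block"]["type"])
            if prefix:
                block_kinds[ev["index"]] = prefix
                yield {"type": prefix + "_START", "index": ev["index"]}
        elif kind == "content_block_delta":
            prefix = block_kinds.get(ev["index"])
            delta = ev["delta"]
            # Text, tool-argument, and thinking deltas use different payload keys.
            # Deltas with none of these keys (or an empty payload) are skipped here.
            payload = delta.get("text") or delta.get("partial_json") or delta.get("thinking")
            if prefix and payload is not None:
                yield {"type": prefix + "_DELTA", "index": ev["index"], "delta": payload}
        elif kind == "content_block_stop":
            prefix = block_kinds.pop(ev["index"], None)
            if prefix:
                yield {"type": prefix + "_END", "index": ev["index"]}
        elif kind == "message_stop":
            yield {"type": "FINISH"}


sample = [
    {"type": "message_start", "message": {}},
    {"type": "content_block_start", "index": 0, "content_block": {"type": "text", "text": ""}},
    {"type": "content_block_delta", "index": 0, "delta": {"type": "text_delta", "text": "Hi"}},
    {"type": "content_block_stop", "index": 0},
    {"type": "content_block_start", "index": 1, "content_block": {"type": "tool_use", "id": "t1", "name": "f", "input": {}}},
    {"type": "content_block_delta", "index": 1, "delta": {"type": "input_json_delta", "partial_json": "{\"a\": 1}"}},
    {"type": "content_block_stop", "index": 1},
    {"type": "message_delta", "delta": {"stop_reason": "tool_use"}},
    {"type": "message_stop"},
]
unified_types = [e["type"] for e in translate_anthropic(sample)]
```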
Gemini uses SSE (with ?alt=sse query parameter) or newline-delimited JSON chunks.
Provider Format (SSE):
data: {"candidates": [{"content": {"parts": [{"text": "..."}]}}], "usageMetadata": {...}}
Translation:
parts[].text present -> TEXT_DELTA (emit TEXT_START on first)
parts[].functionCall present -> TOOL_CALL_START + TOOL_CALL_END (full call in one chunk)
candidate.finishReason present -> TEXT_END
Final chunk -> FINISH with accumulated response
Note: Gemini typically delivers function calls as complete objects in a single chunk, not incrementally. Emit both TOOL_CALL_START and TOOL_CALL_END for each function call.
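As a rough sketch of the Gemini translation above, the function below walks decoded JSON chunks and emits unified events. The name `translate_gemini`, the event dict shape, and emitting `FINISH` once the chunk iterator is exhausted are assumptions of this example. Synthetic IDs follow the `"call_"` convention used elsewhere in this document.

```python
import uuid
from typing import Any, Iterable, Iterator


def translate_gemini(chunks: Iterable[dict[str, Any]]) -> Iterator[dict[str, Any]]:
    text_open = False  # whether TEXT_START has been emitted without a matching TEXT_END
    for chunk in chunks:
        for cand in chunk.get("candidates", []):
            for part in cand.get("content", {}).get("parts", []):
                if "text" in part:
                    if not text_open:
                        text_open = True
                        yield {"type": "TEXT_START"}
                    yield {"type": "TEXT_DELTA", "delta": part["text"]}
                elif "functionCall" in part:
                    call = part["functionCall"]
                    # The whole call arrives in one chunk, so START and END are emitted together.
                    call_id = "call_" + uuid.uuid4().hex
                    yield {"type": "TOOL_CALL_START", "id": call_id, "name": call["name"]}
                    yield {
                        "type": "TOOL_CALL_END",
                        "id": call_id,
                        "name": call["name"],
                        "arguments": call.get("args", {}),
                    }
            if cand.get("finishReason") and text_open:
                text_open = False
                yield {"type": "TEXT_END"}
    # No explicit end-of-stream marker: finish when the chunks run out.
    yield {"type": "FINISH"}


text_chunks = [
    {"candidates": [{"content": {"parts": [{"text": "Hel"}]}}]},
    {"candidates": [{"content": {"parts": [{"text": "lo"}]}, "finishReason": "STOP"}], "usageMetadata": {}},
]
text_events = [e["type"] for e in translate_gemini(text_chunks)]
```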
A summary of provider-specific behaviors that adapters must handle:
| Concern | OpenAI | Anthropic | Gemini |
|---|---|---|---|
| Native API | Responses API (`/v1/responses`) | Messages API (`/v1/messages`) | Gemini API (`/v1beta/...generateContent`) |
| System message handling | `instructions` parameter | Extracted to `system` parameter | Extracted to `systemInstruction` |
| Developer role | `instructions` or `developer` role | Merged with system | Merged with system |
| Message alternation | No strict requirement | Strict user/assistant alternation | No strict requirement |
| Reasoning tokens | Via `output_tokens_details`; requires Responses API | Via thinking blocks (text visible) | Via `thoughtsTokenCount` |
| Tool call IDs | Provider-assigned unique IDs | Provider-assigned unique IDs | No unique IDs (use function name) |
| Tool result format | Separate `tool` role messages | `tool_result` blocks in user messages | `functionResponse` in user content |
| Tool choice "none" | `"none"` | Omit tools from request entirely | `"NONE"` |
| max_tokens | Optional | Required (default to 4096) | Optional (as `maxOutputTokens`) |
| Thinking blocks | Not exposed (o-series internal) | `thinking` / `redacted_thinking` blocks | `thought` parts (2.5 models) |
| Structured output | Native `json_schema` mode | Prompt engineering or tool extraction | Native `responseSchema` |
| Streaming protocol | SSE with `data:` lines | SSE with event type + data lines | SSE (with `?alt=sse`) or JSON |
| Stream termination | `data: [DONE]` | `message_stop` event | Final chunk (no explicit signal) |
| Finish reason for tools | `tool_calls` | `tool_use` | No dedicated reason (infer from parts) |
| Image input | Data URI in `image_url` | base64 `source` with `media_type` | `inlineData` with `mimeType` |
| Prompt caching | Automatic (free, 50% discount) | Requires explicit `cache_control` blocks (90% discount) | Automatic (free prefix caching) |
| Beta/feature headers | N/A (features in request body) | `anthropic-beta` header (comma-separated) | N/A (features in request body) |
| Authentication | Bearer token in `Authorization` | `x-api-key` header | `key` query parameter |
| API versioning | Via URL path (`/v1/`) | `anthropic-version` header | Via URL path (`/v1beta/`) |
To add support for a new provider:
- Implement the ProviderAdapter interface. Create a class with `name`, `complete()`, and `stream()`.
- Write request translation. Map the unified Request to the provider's API format, following the patterns in Section 7.3.
- Write response translation. Map the provider's response to the unified Response, following Section 7.5.
- Write error translation. Map HTTP errors to the error hierarchy, following Section 7.6.
- Write streaming translation. Map the provider's streaming format to StreamEvent objects, following Section 7.7.
- Handle provider quirks. Document any provider-specific behaviors (like Anthropic's strict alternation or Gemini's missing tool call IDs) and handle them in the adapter.
- Register the adapter. Add it to `Client.from_env()` with the appropriate environment variable checks, or allow users to register it programmatically.
Many third-party services (vLLM, Ollama, Together AI, Groq, etc.) expose an OpenAI-compatible Chat Completions API. For these services, provide a separate OpenAICompatibleAdapter that uses the Chat Completions endpoint (/v1/chat/completions) rather than the Responses API:
adapter = OpenAICompatibleAdapter(
    api_key = "...",
    base_url = "https://my-vllm-instance.example.com/v1"
)
This adapter is distinct from the primary OpenAI adapter (which uses the Responses API) because third-party services typically only implement the Chat Completions protocol. The compatible adapter does not support reasoning tokens, built-in tools, or other Responses API features.
messages = [
    Message(role = SYSTEM, content = [ContentPart(kind = TEXT, text = "You are a helpful assistant.")]),
    Message(role = USER, content = [ContentPart(kind = TEXT, text = "What is 2 + 2?")])
]

messages = [
    Message(role = USER, content = [
        ContentPart(kind = TEXT, text = "What do you see in this image?"),
        ContentPart(kind = IMAGE, image = ImageData(url = "https://example.com/photo.jpg"))
    ])
]

messages = [
    Message(role = USER, content = [
        ContentPart(kind = TEXT, text = "What is the weather in San Francisco?")
    ]),
    Message(role = ASSISTANT, content = [
        ContentPart(kind = TOOL_CALL, tool_call = ToolCallData(
            id = "call_123",
            name = "get_weather",
            arguments = { "city": "San Francisco" }
        ))
    ]),
    Message(role = TOOL, content = [
        ContentPart(kind = TOOL_RESULT, tool_result = ToolResultData(
            tool_call_id = "call_123",
            content = "72F, sunny",
            is_error = false
        ))
    ], tool_call_id = "call_123"),
    Message(role = ASSISTANT, content = [
        ContentPart(kind = TEXT, text = "The weather in San Francisco is 72F and sunny.")
    ])
]

messages = [
    Message(role = USER, content = [
        ContentPart(kind = TEXT, text = "Solve this complex math problem...")
    ]),
    Message(role = ASSISTANT, content = [
        ContentPart(kind = THINKING, thinking = ThinkingData(
            text = "Let me work through this step by step...",
            signature = "sig_abc123"
        )),
        ContentPart(kind = TEXT, text = "The answer is 42.")
    ])
]
When continuing a conversation that includes thinking blocks, the thinking content parts must be included in the message history so the provider can verify their integrity.
result = generate(model = "claude-opus-4-6", prompt = "Explain quantum computing")
PRINT(result.text)
PRINT(result.usage.total_tokens)

result = generate(
    model = "claude-opus-4-6",
    system = "You are a helpful assistant with access to weather data.",
    prompt = "What is the weather in San Francisco?",
    tools = [weather_tool],
    max_tool_rounds = 5
)
PRINT(result.text)                       -- final text after all tool rounds
PRINT(LENGTH(result.steps))              -- number of steps taken
PRINT(result.total_usage.total_tokens)   -- aggregated token count

result = stream(model = "claude-opus-4-6", prompt = "Write a poem")
FOR EACH event IN result:
    IF event.type == TEXT_DELTA:
        PRINT(event.delta)
response = result.response()
PRINT(response.usage)

result = generate_object(
    model = "gpt-5.2",
    prompt = "Extract the person's name and age from: 'Alice is 30 years old'",
    schema = {
        "type": "object",
        "properties": {
            "name": { "type": "string" },
            "age": { "type": "integer" }
        },
        "required": ["name", "age"]
    }
)
PRINT(result.output)   -- { "name": "Alice", "age": 30 }

TRY:
    result = generate(model = "claude-opus-4-6", prompt = "...")
CATCH ProviderError:
    result = generate(model = "gpt-5.2", provider = "openai", prompt = "...")

FUNCTION logging_middleware(request, next):
    start_time = NOW()
    LOG_INFO("LLM request: provider=" + request.provider + " model=" + request.model)
    response = next(request)
    elapsed = NOW() - start_time
    LOG_INFO("LLM response: tokens=" + response.usage.total_tokens + " latency=" + elapsed)
    RETURN response

client = Client(
    providers = { "anthropic": AnthropicAdapter(...) },
    middleware = [logging_middleware]
)
This appendix summarizes key design decisions and the reasoning behind them. These are provided so that implementors understand the "why" and can make informed tradeoffs if their language or context demands different choices.
Why a single Request type instead of per-method parameter lists? A single Request object is easier to construct, pass around, modify, and serialize than many keyword arguments. It enables middleware to inspect and modify requests uniformly. High-level functions like generate(model=..., prompt=...) provide ergonomic shorthand.
Why ship a model catalog if model strings work as-is? Model strings work for developers who know which models exist. But AI coding agents building on top of this SDK often hallucinate model identifiers from stale training data. The catalog gives them a reliable, up-to-date source of valid model IDs and capabilities. Unknown model strings still pass through -- the catalog is advisory, not restrictive.
Why explicit provider on Request instead of model-based routing? Several providers serve models with overlapping names. Explicit routing avoids ambiguity. For the common case, default_provider removes boilerplate.
Why separate generate() and stream()? The return types are fundamentally different: GenerateResult vs StreamResult. A boolean flag loses type safety.
Why start/delta/end events instead of flat deltas? Flat deltas lose structural information when a response contains multiple text segments or interleaved tool calls. The pattern adds minimal overhead but enables correct handling of complex responses.
Why max_tool_rounds instead of unlimited looping? Unbounded loops risk infinite cycles. A default of 1 is safe. Higher values are an explicit opt-in.
Why JSON Schema for tool parameters instead of language-native types? JSON Schema is the universal parameter description format across all providers. Language-native helpers can generate JSON Schema, but JSON Schema is the canonical format.
Why send error results to the model instead of raising exceptions? Raising on tool failure aborts the entire generation. Sending an error result gives the model the opportunity to retry, use a different tool, or explain the failure.
Why default to retrying unknown errors? When an error code is unexpected, it is more likely to signal a transient failure than a permanent one. A false retry is cheaper than a false abort.
Why not retry timed-out requests by default? Timeouts indicate the operation is inherently slow, not that it failed transiently. Applications can opt in to timeout retries.
Why use each provider's native API instead of just targeting Chat Completions everywhere? The Chat Completions API is an OpenAI-specific protocol that other providers partially mimic as a convenience shim. Using it as the universal transport loses critical capabilities: OpenAI's own Responses API exposes reasoning tokens that Chat Completions hides; Anthropic's Messages API supports thinking blocks, prompt caching, and beta headers; Gemini's native API supports grounding and code execution. The unified SDK's value is precisely in abstracting over these different native APIs so callers don't have to. Using a compatibility layer would defeat the purpose.
Why handle parallel tool execution in the SDK instead of leaving it to the caller? When a model returns 5 parallel tool calls, the correct behavior is to execute all 5 concurrently, wait for all to complete, and send all 5 results back in one continuation. This is fiddly to implement correctly (error handling, ordering, timeout management) and identical for every consumer. Doing it once in the SDK means coding agents and other downstream tools get it for free.
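The concurrent execution described above could be sketched with `asyncio` as follows. The dict shapes for calls and results and the name `execute_tool_calls` are assumptions for illustration. Failures, including calls to unknown tools, are converted to error results instead of being raised, in line with the rationale given earlier in this appendix.

```python
import asyncio
from typing import Any, Awaitable, Callable

ToolHandler = Callable[[dict[str, Any]], Awaitable[Any]]


async def execute_tool_calls(
    calls: list[dict[str, Any]],
    handlers: dict[str, ToolHandler],
) -> list[dict[str, Any]]:
    """Run all tool calls concurrently and return results in the same order as the calls."""

    async def run_one(call: dict[str, Any]) -> dict[str, Any]:
        handler = handlers.get(call["name"])
        if handler is None:
            return {"tool_call_id": call["id"], "content": f"Unknown tool: {call['name']}", "is_error": True}
        try:
            content = await handler(call["arguments"])
            return {"tool_call_id": call["id"], "content": content, "is_error": False}
        except Exception as exc:
            # A failing tool becomes an error result so the model can react to it.
            return {"tool_call_id": call["id"], "content": str(exc), "is_error": True}

    # asyncio.gather preserves the order of its arguments in the result list.
    return await asyncio.gather(*(run_one(c) for c in calls))


async def get_weather(args: dict[str, Any]) -> str:
    """Mock tool used only for this example."""
    await asyncio.sleep(0.01)
    if args["city"] == "Atlantis":
        raise ValueError("city not found")
    return f"72F in {args['city']}"


calls = [
    {"id": "c1", "name": "get_weather", "arguments": {"city": "Paris"}},
    {"id": "c2", "name": "get_weather", "arguments": {"city": "Atlantis"}},
    {"id": "c3", "name": "get_stock", "arguments": {}},
]
results = asyncio.run(execute_tool_calls(calls, {"get_weather": get_weather}))
```

All three results would then be sent back in a single continuation request. Timeout handling per tool is left out of this sketch.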
This section defines how to validate that an implementation of this spec is complete and correct. Use this as a checklist during development. An implementation is considered done when every item is checked off.
- `Client` can be constructed from environment variables (`Client.from_env()`)
- `Client` can be constructed programmatically with explicit adapter instances
- Provider routing works: requests are dispatched to the correct adapter based on `provider` field
- Default provider is used when `provider` is omitted from a request
- `ConfigurationError` is raised when no provider is configured and no default is set
- Middleware chain executes in correct order (request: registration order, response: reverse order)
- Module-level default client works (`set_default_client()` and implicit lazy initialization)
- Model catalog is populated with current models and `get_model_info()` / `list_models()` return correct data
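The middleware ordering item in the checklist above (requests in registration order, responses in reverse) can be checked with a small harness like the one below. `build_chain` and `make_mw` are hypothetical names. The middleware signature `(request, next_handler)` follows the logging middleware example earlier in this document.

```python
from typing import Any, Callable

Handler = Callable[[Any], Any]
Middleware = Callable[[Any, Handler], Any]


def build_chain(middleware: list[Middleware], terminal: Handler) -> Handler:
    """Compose middleware around a terminal handler; the first registered runs outermost."""

    def wrap(mw: Middleware, nxt: Handler) -> Handler:
        return lambda request: mw(request, nxt)

    handler = terminal
    # Wrap from the last middleware inward so the first one sees the request first.
    for mw in reversed(middleware):
        handler = wrap(mw, handler)
    return handler


trace: list[str] = []


def make_mw(label: str) -> Middleware:
    def mw(request: Any, next_handler: Handler) -> Any:
        trace.append(f"{label}:request")
        response = next_handler(request)
        trace.append(f"{label}:response")
        return response
    return mw


call = build_chain([make_mw("a"), make_mw("b")], lambda request: {"echo": request})
result = call("hello")
```

After the call, `trace` records `a` then `b` on the request path and `b` then `a` on the response path.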
For EACH provider (OpenAI, Anthropic, Gemini), verify:
- Adapter uses the provider's native API (OpenAI: Responses API, Anthropic: Messages API, Gemini: Gemini API) -- NOT a compatibility shim
- Authentication works (API key from env var or explicit config)
- `complete()` sends a request and returns a correctly populated `Response`
- `stream()` returns an async iterator of correctly typed `StreamEvent` objects
- System messages are extracted/handled per provider convention
- All 5 roles (SYSTEM, USER, ASSISTANT, TOOL, DEVELOPER) are translated correctly
- `provider_options` escape hatch passes through provider-specific parameters
- Beta headers are supported (especially Anthropic's `anthropic-beta` header)
- HTTP errors are translated to the correct error hierarchy types
- `Retry-After` headers are parsed and set on the error object
- Messages with text-only content work across all providers
- Image input works: images sent as URL, base64 data, and local file path are correctly translated per provider
- Audio and document content parts are handled (or gracefully rejected if provider doesn't support them)
- Tool call content parts round-trip correctly (assistant message with tool calls -> tool result messages -> next assistant message)
- Thinking blocks (Anthropic) are preserved and round-tripped with signatures intact
- Redacted thinking blocks are passed through verbatim
- Multimodal messages (text + images in the same message) work
- `generate()` works with a simple text `prompt`
- `generate()` works with a full `messages` list
- `generate()` rejects when both `prompt` and `messages` are provided
- `stream()` yields `TEXT_DELTA` events that concatenate to the full response text
- `stream()` yields `STREAM_START` and `FINISH` events with correct metadata
- Streaming follows the start/delta/end pattern for text segments
- `generate_object()` returns parsed, validated structured output
- `generate_object()` raises `NoObjectGeneratedError` on parse/validation failure
- Cancellation via abort signal works for both `generate()` and `stream()`
- Timeouts work (total timeout and per-step timeout)
- OpenAI reasoning models (GPT-5.2 series, etc.) return `reasoning_tokens` in `Usage` via the Responses API
- `reasoning_effort` parameter is passed through correctly to OpenAI reasoning models
- Anthropic extended thinking blocks are returned as `THINKING` content parts when enabled
- Thinking block `signature` field is preserved for round-tripping
- Gemini thinking tokens (`thoughtsTokenCount`) are mapped to `reasoning_tokens` in `Usage`
- `Usage` correctly reports `reasoning_tokens` as distinct from `output_tokens`
- OpenAI: caching works automatically via the Responses API (no client-side configuration needed)
- OpenAI: `Usage.cache_read_tokens` is populated from `usage.prompt_tokens_details.cached_tokens`
- Anthropic: adapter automatically injects `cache_control` breakpoints on the system prompt, tool definitions, and conversation prefix
- Anthropic: `prompt-caching-2024-07-31` beta header is included automatically when cache_control is present
- Anthropic: `Usage.cache_read_tokens` and `Usage.cache_write_tokens` are populated correctly
- Anthropic: automatic caching can be disabled via `provider_options.anthropic.auto_cache = false`
- Gemini: automatic prefix caching works (no client-side configuration needed)
- Gemini: `Usage.cache_read_tokens` is populated from `usageMetadata.cachedContentTokenCount`
- Multi-turn agentic session: verify that turn 5+ shows significant cache_read_tokens (>50% of input tokens) for all three providers
- Tools with `execute` handlers (active tools) trigger automatic tool execution loops
- Tools without `execute` handlers (passive tools) return tool calls to the caller without looping
- `max_tool_rounds` is respected: loop stops after the configured number of rounds
- `max_tool_rounds = 0` disables automatic execution entirely
- Parallel tool calls: when the model returns N tool calls in one response, all N are executed concurrently
- Parallel tool results: all N results are sent back in a single continuation request (not one at a time)
- Tool execution errors are sent to the model as error results (`is_error = true`), not raised as exceptions
- Unknown tool calls (model calls a tool not in definitions) send an error result, not an exception
- `ToolChoice` modes (auto, none, required, named) are translated correctly per provider
- Tool call argument JSON is parsed and validated before passing to execute handlers
- `StepResult` objects track each step's tool calls, results, and usage
- All errors in the hierarchy are raised for the correct HTTP status codes (see Section 6.4 table)
- `retryable` flag is set correctly on each error type
- Exponential backoff with jitter works: delays increase correctly per attempt
- `Retry-After` header overrides calculated backoff when present (and within `max_delay`)
- `max_retries = 0` disables automatic retries
- Rate limit errors (429) are retried transparently
- Non-retryable errors (401, 403, 404) are raised immediately without retry
- Retries apply per-step, not to the entire multi-step operation
- Streaming does not retry after partial data has been delivered
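The backoff items in the checklist above could be exercised against a delay function like the sketch below. The default values, the jitter range of 0.5x to 1.5x, and the fallback to computed backoff when `Retry-After` exceeds `max_delay` are all assumptions of this example. The checklist only states that `Retry-After` overrides the calculated delay when it is within `max_delay`.

```python
import random
from typing import Callable, Optional


def retry_delay(
    attempt: int,
    *,
    base_delay: float = 1.0,    # assumed default, in seconds
    multiplier: float = 2.0,    # assumed default
    max_delay: float = 60.0,    # assumed default, in seconds
    retry_after: Optional[float] = None,
    rng: Callable[[], float] = random.random,
) -> float:
    """Return the number of seconds to wait before retry number `attempt` (0-based)."""
    if retry_after is not None and 0 <= retry_after <= max_delay:
        # A server-provided delay within bounds overrides the calculated backoff.
        return retry_after
    capped = min(base_delay * multiplier ** attempt, max_delay)
    # Jitter scales the delay by a factor in [0.5, 1.5), then the cap is re-applied.
    return min(capped * (0.5 + rng()), max_delay)
```

Injecting `rng` keeps the function deterministic under test: passing `lambda: 0.5` yields a jitter factor of exactly 1.0.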
Run this validation matrix -- each cell must pass:
| Test Case | OpenAI | Anthropic | Gemini |
|---|---|---|---|
| Simple text generation | [ ] | [ ] | [ ] |
| Streaming text generation | [ ] | [ ] | [ ] |
| Image input (base64) | [ ] | [ ] | [ ] |
| Image input (URL) | [ ] | [ ] | [ ] |
| Single tool call + execution | [ ] | [ ] | [ ] |
| Multiple parallel tool calls | [ ] | [ ] | [ ] |
| Multi-step tool loop (3+ rounds) | [ ] | [ ] | [ ] |
| Streaming with tool calls | [ ] | [ ] | [ ] |
| Structured output (generate_object) | [ ] | [ ] | [ ] |
| Reasoning/thinking token reporting | [ ] | [ ] | [ ] |
| Error handling (invalid API key -> 401) | [ ] | [ ] | [ ] |
| Error handling (rate limit -> 429) | [ ] | [ ] | [ ] |
| Usage token counts are accurate | [ ] | [ ] | [ ] |
| Prompt caching (cache_read_tokens > 0 on turn 2+) | [ ] | [ ] | [ ] |
| Provider-specific options pass through | [ ] | [ ] | [ ] |
The ultimate validation: run this end-to-end test against all three providers with real API keys.
-- 1. Basic generation across all providers
FOR EACH provider IN ["anthropic", "openai", "gemini"]:
    result = generate(
        model = get_latest_model(provider).id,
        prompt = "Say hello in one sentence.",
        max_tokens = 100,
        provider = provider
    )
    ASSERT result.text is not empty
    ASSERT result.usage.input_tokens > 0
    ASSERT result.usage.output_tokens > 0
    ASSERT result.finish_reason.reason == "stop"

-- 2. Streaming
stream_result = stream(model = "claude-opus-4-6", prompt = "Write a haiku.")
text_chunks = []
FOR EACH event IN stream_result:
    IF event.type == TEXT_DELTA:
        text_chunks.APPEND(event.delta)
ASSERT JOIN(text_chunks) == stream_result.response().text

-- 3. Tool calling with parallel execution
result = generate(
    model = "claude-opus-4-6",
    prompt = "What is the weather in San Francisco and New York?",
    tools = [weather_tool],   -- tool that returns mock weather data
    max_tool_rounds = 3
)
ASSERT LENGTH(result.steps) >= 2   -- at least: initial call + after tool results
ASSERT result.text contains "San Francisco"
ASSERT result.text contains "New York"

-- 4. Image input
result = generate(
    model = "claude-opus-4-6",
    messages = [Message(role=USER, content=[
        ContentPart(kind=TEXT, text="What do you see?"),
        ContentPart(kind=IMAGE, image=ImageData(data=<png_bytes>, media_type="image/png"))
    ])]
)
ASSERT result.text is not empty

-- 5. Structured output
result = generate_object(
    model = "gpt-5.2",
    prompt = "Extract: Alice is 30 years old",
    schema = {"type":"object", "properties":{"name":{"type":"string"},"age":{"type":"integer"}}, "required":["name","age"]}
)
ASSERT result.output.name == "Alice"
ASSERT result.output.age == 30

-- 6. Error handling
TRY:
    generate(model = "nonexistent-model-xyz", prompt = "test", provider = "openai")
    FAIL("Should have raised an error")
CATCH NotFoundError:
    PASS   -- correct error type
If all items in this section are checked off, the unified LLM library is complete and ready for use as the foundation for a coding agent or any other LLM-powered application.