Module: cube_harness.llm
Thin wrapper over LiteLLM that standardizes prompt construction, retry behavior, and usage accounting. All LLM calls in the harness flow through this module — per the constitution, direct SDK use (OpenAI SDK, Anthropic SDK) is forbidden (PS-002).
class LLMConfig(TypedBaseModel):
model_name: str
temperature: float = 1.0
max_tokens: int = 128000
max_completion_tokens: int = 8192
reasoning_effort: Literal["minimal", "low", "medium", "high"] | None = None
tool_choice: Literal["auto", "none", "required"] | None = "auto" # None opts out
parallel_tool_calls: bool = False
num_retries: int = 5
retry_strategy: Literal["exponential_backoff_retry", "constant_retry"] = "exponential_backoff_retry"
timeout: float | None = 120.0 # seconds per attempt; None disables
set_cache_control: Literal["auto"] | None = None # Anthropic prompt caching, see "Caching"
def make(self) -> LLM
def make_counter(self) -> Callable[..., int] # partial(token_counter, model=model_name)class Prompt(TypedBaseModel):
messages: list[dict] # litellm.Message inputs are coerced via a
# field_validator (model_dump(exclude_none=True))
tools: list[dict] = []Callers may pass a mix of dict and litellm.Message objects — the validator
normalises to plain dicts at construction. This keeps serialisation noise-free
(Message's dynamic provider-specific fields like thinking_blocks,
reasoning_content would otherwise trip PydanticSerializationUnexpectedValue
on every model_dump) and gives downstream readers a single homogenous type
to work with.
class LLMResponse(TypedBaseModel):
message: Message # litellm.Message
usage: Usage
@property
def reasoning_text(self) -> str # provider-agnostic; empty when no reasoning
class Usage(TypedBaseModel):
prompt_tokens: int = 0
completion_tokens: int = 0
total_tokens: int = 0
cached_tokens: int = 0
cache_creation_tokens: int = 0 # Anthropic prompt caching
reasoning_tokens: int = 0 # LiteLLM-normalized across providers; subset of completion_tokens
cost: float = 0.0 # USD from LiteLLM pricingModule-level helper. Provider-agnostic reasoning extractor:
msg.reasoning_contentif non-empty (OpenAI / streaming).- Concatenation of
msg.thinking_blocks[*].thinking(Anthropic extended thinking). - Fallback to
msg.content, else empty string.
Works on any litellm.Message, including those reconstructed from persisted
LLMCall.output records — so it's the canonical reasoning extractor for both
live runs (LLMResponse.reasoning_text) and offline trajectory analysis.
class LLM:
def __init__(self, config: LLMConfig)
def __call__(self, prompt: Prompt) -> LLMResponse
# Uses litellm.completion_with_retries under the hood with config.retry_strategy.class LLMCall(TypedBaseModel):
id: str = field(default_factory=lambda: str(uuid4()))
tag: str | None = None # e.g. "act", "summary", "criticise"
timestamp: datetime
config: LLMConfig
prompt: Prompt
output: Message
usage: Usage | None = NoneCaptured in AgentOutput.llm_calls. Agents MUST set tag to distinguish multi-call
steps in traces and training data.
- All LLM calls route through
LLM.__call__— no direct use oflitellm.completionin the harness code. - Retry strategy is determined by
LLMConfig, not the call site. LLMCall.tagis the primary way to correlate multiple LLM calls in one agent step.- Module-level
litellm.callbacksis intentionally NOT set. OTel callbacks are attached only after a properTracerProvideris configured (see metrics spec) — otherwise litellm's default console exporter floods stdout. - Reasoning round-trip.
Prompt._coerce_messagesMUST preserve provider reasoning fields on assistant messages so they can be re-sent in subsequent calls. Specifically:thinking_blocks(including each block'ssignature),reasoning_content, andtool_callssurvive coercion fromlitellm.Messageto dict. Anthropic extended thinking with tool use requires the prior turn'sthinking_blocksto be echoed back; stripping them breaks the tool-use loop. - Anthropic thinking + temperature.
LLMConfigrejects construction whenreasoning_effortis set on an Anthropic model withtemperature != 1.0. Anthropic forbids non-unit temperature under extended thinking; the validator surfaces this at config time rather than at API time.
When LLMConfig.set_cache_control == "auto" and the configured model routes to
Anthropic (direct, Bedrock, or Vertex — detected via litellm.get_llm_provider
with a substring fallback for model names LiteLLM's registry hasn't catalogued
yet), LLM.__call__ places ephemeral cache_control breakpoints at:
- Message index 1 — the goal / first large user observation. Stable anchor that lifts the cached prefix above Anthropic's 1024-token minimum (the system message alone is typically under that floor).
- Last assistant message — rolling per-step boundary. Each new step extends the cached prefix by one (obs, asst) pair via Anthropic's longest-prefix match.
- Last tool definition — caches the entire tools array prefix.
Breakpoint injection is done via LiteLLM's cache_control_injection_points
hook (canonical public API; LiteLLM handles the wire-format reshape into
Anthropic's content-block-with-cache_control structure). For non-Anthropic
models the flag is a no-op — the payload is never emitted.
Usage.cached_tokens / Usage.cache_creation_tokens are populated from the
Anthropic response so trace consumers can see cache-hit rates per step.
- Agent implementations build a
Promptand callself.llm(prompt). Record the call:call = LLMCall(tag="act", config=self.config.llm_config, prompt=prompt, output=resp.message, usage=resp.usage) output.llm_calls.append(call)
- For multi-model agents, use one
LLMper model — the class holds a single config. - Pass a token counter from
config.make_counter()for prompt-size budgeting.
completion_with_retriesreturns on first success, but retries count toward the per-attempt timeout. Total call time is bounded bynum_retries * timeoutin the worst case.Prompt.messagesaccepts both dicts andlitellm.Messageobjects; thefield_validatorcoerces Messages to dicts at construction so the stored type is alwayslist[dict]. Downstream readers don't need to handle the union.- Reasoning extraction. Set
reasoning_effortto activate native reasoning on supported models (OpenAI o-series / gpt-5; Anthropic Claude 3.7+/4.x; Gemini 2.5; Grok 3/4; DeepSeek R1/R2; Qwen3-thinking; Magistral). Useresponse.reasoning_text(orget_reasoning(msg)for offline analysis) to obtain the thinking string forAgentOutput.thoughts. The structured form is preserved onresponse.message.thinking_blocks/reasoning_contentfor round-trip. - OpenAI hides the reasoning text. OpenAI o-series and gpt-5 (including
Azure-OpenAI deployments) return
reasoning_tokens > 0to confirm the model thought, but do not return the thinking text —reasoning_contentandthinking_blocksare empty even withreasoning_effortset. This is an OpenAI design choice. Consequence:AgentOutput.thoughtswill beNoneon OpenAI episodes even when the model reasoned. The agent still benefits from the reasoning; only the human-readable trace is unavailable. - Anthropic thinking constraints. Two API-level restrictions surface when
reasoning_effortis set on an Anthropic model:temperaturemust be1.0.LLMConfigvalidates this at construction time.tool_choicemust NOT be"required". Anthropic returns 400 with "Thinking may not be enabled when tool_choice forces tool use." Stay on"auto"(the default) and shape the prompt to elicit the tool call.max_completion_tokensmust exceed the thinkingbudget_tokensLiteLLM mapsreasoning_effortonto (≈1024 for "low", more for higher). Setmax_completion_tokensto at least 2048 when reasoning is active.
- Tool-use loops with Anthropic thinking. Each assistant turn's
thinking_blocks(includingsignature) MUST be echoed back in subsequent calls.Prompt._coerce_messagespreserves them automatically viaMessage.model_dump(exclude_none=True). Do not strip these fields. reasoning_tokensaccounting. LiteLLM normalizesreasoning_tokensfor both OpenAI (nativecompletion_tokens_details.reasoning_tokens) and Anthropic (computed fromthinking_blocks) into the sameUsagefield. These tokens are already counted insidecompletion_tokens— do not add them to a budget tally, or you will double-count. The field exists for telemetry only.- Cost is USD from LiteLLM's built-in pricing — may lag behind provider price changes.