feat(agent-platform): model selection via spec.models (auto/manual) + runtime fallback#64825
feat(agent-platform): model selection via spec.models (auto/manual) + runtime fallback#64825benjackwhite wants to merge 38 commits into
Conversation
Move `encrypted_env` from AgentApplication to AgentRevision so each revision
runs against its own secret values: a draft can be edited/previewed with its
own secrets without touching the live revision's, a promote carries the
secrets it was tested with, and forking a draft copies the parent's forward.
Backend:
- models: drop AgentApplication.encrypted_env, add AgentRevision.encrypted_env
- migrations (3-step safe pattern): 0004 add revision column, 0005 copy the
application ciphertext onto its non-archived revisions, 0006 state-only
remove from application (DB column dropped in a follow-up after a deploy
cycle, per safe-django-migrations)
- env-keys endpoints (set_env + per-key get/put/delete + list) move from the
AgentApplicationViewSet to the AgentRevisionViewSet; the promote + validate
gates read the revision's own env block; new_draft copies the parent
revision's encrypted_env forward
Runtime (node services):
- the encrypted-env resolver reads off the session's revision, not the app
- SecretResolver / JwtSecretResolver resolve against a { encrypted_env }
source; authorize() + the auth verifiers + the slack trigger + the failure
notifier thread the resolved revision through
- NewRevision carries encrypted_env; test-reset schema + harness stamp
secrets on the revision
Generated: `hogli build:openapi` regenerated the agent_platform frontend types,
MCP orval schemas + tool definitions (env-keys operations now target the
revision), and the tool-schema snapshots.
Extracted from #64453 (dylan/agent-platform) as the standalone secrets-move
half; the preview-mode work stays on that PR, rebased on top of this.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ision Addresses a Greptile review note: the failure handler resolves the session's revision (the notifier reads its outbound secret off the revision's encrypted_env now) and guards `application && revision`. If the revision can't be loaded, the notification was dropped silently. Add an error-level log on the application-without-revision path so the skip is observable. Re-fetch (not cache from the main loop) stays intentional — this catch is reachable before the revision ever loaded (e.g. a revision_missing failure). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Auth app The `allow_agent_approver: False` decide gate required an OAuth bearer to carry the dedicated `agent_approvals:write` scope. Gate on the app's staff-set `is_first_party` flag instead: first-party PostHog clients (e.g. PostHog Code, where a human approves in-app) can decide, while third-party apps and personal API keys still can't — regardless of scope. PostHog Code no longer needs to request a dedicated scope (reverted client-side). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
🕸️ Eager graphHow much code each root forces the browser to download and decode through static imports — the regression class total bundle size can't see.
✅ Largest files eagerly reachable from
|
| Size | File |
|---|---|
| 899.9 KiB | src/styles/global.scss |
| 609.0 KiB | public/hedgehog/burning-money-hog.png |
| 541.9 KiB | public/hedgehog/waving-hog.png |
| 448.2 KiB | public/hedgehog/stop-sign-hog.png |
| 362.0 KiB | public/hedgehog/phone-pair-hogs.png |
| 355.9 KiB | ../node_modules/.pnpm/@posthog+icons@0.37.3_react-dom@18.3.1_react@18.3.1__react@18.3.1/node_modules/@posthog/icons/dist/posthog-icons.es.js |
| 343.0 KiB | src/taxonomy/core-filter-definitions-by-group.json |
| 335.6 KiB | public/hedgehog/desk-hog.png |
| 323.2 KiB | public/hedgehog/3-bears-hogs.png |
| 298.6 KiB | src/lib/api.ts |
Largest files eagerly reachable from src/scenes/AuthenticatedShell.tsx
| Size | File |
|---|---|
| 899.9 KiB | src/styles/global.scss |
| 764.5 KiB | src/queries/validators.js |
| 609.0 KiB | public/hedgehog/burning-money-hog.png |
| 541.9 KiB | public/hedgehog/waving-hog.png |
| 448.2 KiB | public/hedgehog/stop-sign-hog.png |
| 362.0 KiB | public/hedgehog/phone-pair-hogs.png |
| 355.9 KiB | ../node_modules/.pnpm/@posthog+icons@0.37.3_react-dom@18.3.1_react@18.3.1__react@18.3.1/node_modules/@posthog/icons/dist/posthog-icons.es.js |
| 343.0 KiB | src/taxonomy/core-filter-definitions-by-group.json |
| 335.6 KiB | public/hedgehog/desk-hog.png |
| 323.2 KiB | public/hedgehog/3-bears-hogs.png |
Posted automatically by check-eager-graph · sizes are input-source bytes from the esbuild metafile · part of #32479
Prompt To Fix All With AIFix the following 2 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 2
products/agent_platform/services/agent-runner/src/loop/driver.ts:894-897
The `reason` parameter (third argument to `onFallback`) is silently dropped in the warning log. When a fallback fires in production the log entry will show which model was skipped and at which index, but not the error that actually caused the skip — making it much harder to distinguish a 429 storm from a 503 outage after the fact.
```suggestion
onFallback: (fromIndex, fromModel, reason) => {
fellBackFrom = fromModel.id
runLog.warn({ from: fromModel.id, attempt: fromIndex, reason }, 'model.fallback')
},
```
### Issue 2 of 2
products/agent_platform/services/agent-shared/src/runtime/analytics-sink.ts:216-218
Guarding a numeric property with a bare truthiness check treats `0` as absent. The driver currently never sends `0` (it sends `undefined` when `modelAttempt` is zero), so this works today. However the `AnalyticsGenerationEvent` type allows any `number` value, so a future call-site that passes `0` would silently drop the property. `!== undefined` matches the intent and is consistent with how every other optional numeric field on this event is guarded above (e.g. `cache_read_tokens`, `cost_usd`).
```suggestion
if (event.model_attempt !== undefined) {
base.$agent_model_attempt = event.model_attempt
}
```
Reviews (1): Last reviewed commit: "chore: update OpenAPI generated types" | Re-trigger Greptile |
|
Size Change: +383 kB (+0.6%) Total Size: 64.4 MB 📦 View Changed
ℹ️ View Unchanged
|
… skill-read masking The builder had no machine-readable view of the agent `spec`, so authoring a revision meant guessing the shape and hitting validation errors one field at a time (chat-trigger auth, reasoning enum, secrets union, allowed_hosts), then reverse-engineering other agents' specs to recover the format. - Add native `@posthog/agent-applications-spec-schema` returning `z.toJSONSchema(AgentSpecSchema)` — generated from the canonical Zod source, so it can't drift from what the API validates; computed in-process with no round-trip. The sibling of `agent-applications-native-tools-list`. - Wire it into the agent-builder spec + the agent.md tool surface, and trim the authoring / editing / mental-model skills that just restated the spec shape, pointing them at the tool instead (keeps the judgment, drops the duplication). - Fix readBundleFile/load-skill collapsing every read failure into a permanent-sounding "not found in the bundle": distinguish a genuine 404 (null) from an operational error (surfaced as retryable), so a transient store blip no longer makes the builder give up and run without its skills. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
⏭️ Skipped snapshot commit because branch advanced to The new commit will trigger its own snapshot update workflow. If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:
|
…allback
Replace the single `spec.model` with a `model_policy`:
- `auto` (default): pick a level (low/medium/high); the platform resolves it
to a priority-ordered, cross-provider model list (MODEL_POLICY_LEVELS) at
session start, kept current centrally so auto agents get new models without
a revision.
- `manual`: an explicit priority-ordered model list, with per-model reasoning.
Legacy `spec.model` still works (normalized to a 1-element manual policy);
`modelPolicyToList(spec)` is the single resolution point.
Runner: resolve the full ordered list once per session and wrap the stream with
`fallbackStreamFn` — on a transient/provider-side failure (5xx/conn/timeout/429)
it retries the next model; permanent 4xx and aborts pass through, and it can't
fall over once the stream has committed (first content event). Single-model
sessions skip the wrapper (identical to today). `$ai_generation` tags the model
that actually answered plus `$agent_model_attempt`/`$ai_fallback_from` markers.
Also: mirror the schema in the Django write-schema (+ tests), and rewrite the
meta-agent `choosing-the-model` skill + example spec.json for the new shape.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… spec.model)
BREAKING (pre-release): `spec.model` is gone. Agents select models solely via
`spec.model_policy` (auto level — the default — or a manual priority list).
`modelPolicyToList(spec)` resolves it; `spec.model_policy` defaults to
`{ mode: 'auto', level: 'medium' }` when omitted.
- zod: drop `model`, `model_policy` carries the auto/medium default, drop the
legacy `resolveModelPolicy` shim.
- Django write-schema: drop `model`, add `model_policy` to required (relaxed via
its default); update schema tests.
- Migrate every example spec.json (manual single-model preserves their pinned
model) and all spec-constructing tests/harness to `model_policy`.
- Update the meta-agent skills that referenced `spec.model`.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
fb519fb to
dfd66fd
Compare
MCP UI Apps size report
|
# Conflicts: # products/agent_platform/backend/migrations/max_migration.txt # products/agent_platform/backend/presentation/views.py # products/agent_platform/backend/tests/test_approvals_api.py # products/agent_platform/services/agent-ingress/src/triggers/slack.ts # products/agent_platform/services/agent-shared/src/persistence/test-reset.ts # products/agent_platform/services/agent-shared/src/runtime/slack-failure-notifier.ts # products/agent_platform/services/agent-tests/src/examples/agent-builder/agent.md # products/agent_platform/services/agent-tests/src/examples/agent-builder/skills/authoring-new-agents/SKILL.md # products/agent_platform/services/agent-tests/src/examples/agent-builder/skills/editing-agents-safely/SKILL.md # products/agent_platform/services/agent-tests/src/examples/agent-builder/skills/platform-mental-model/SKILL.md # products/agent_platform/services/agent-tests/src/examples/agent-builder/spec.json # products/agent_platform/services/agent-tools/src/tools/posthog-spec-schema.v1.test.ts # products/agent_platform/services/agent-tools/src/tools/posthog-spec-schema.v1.ts
…aster code The merge pulled in master code/tests still using the removed top-level spec.model: the posthog-ai example + its test, the agent-builder inspect mock fixture, and the spec-schema tool test (asserted model in required). Converted each to model_policy so the merged tree typechecks and the example specs stay valid against the branch's schema. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Query snapshots: Backend query snapshots updatedChanges: 2 snapshots (2 modified, 0 added, 0 deleted) What this means:
Next steps:
|
…el availability
The gateway's served-model list becomes the source of truth for what a
model_policy can run; a curated grouping (MODEL_POLICY_LEVELS) is reconciled
against it instead of drifting. Closes the gap where a model string only failed
as a runtime 400 in a user's session (the original `claude-haiku-4-5` /
`gpt-5-thinking` mismatches).
- GatewayCatalog (agent-shared): typed, TTL-cached read of `{gateway}/v1/models`
via DirectHttpClient; fails open (serves last-good / [] on a blip). Pure
helpers: acceptedModelIds, isModelServable, validateModelPolicy,
validateModelLevels (CI guard for tier drift), filterServableEntries.
- Author-time gate: the janitor's validate (+ freeze) rejects a model_policy
referencing unserved models with `invalid_model`; Django surfaces the
structured report errors so the author sees which model. model_policy is
immutable post-freeze, so promote is covered.
- Runtime: the worker filters the resolved model list against the live catalog
before dispatch (never empties a non-empty list).
- Discovery: `@posthog/agent-applications-models` returns served models +
per-Mtok pricing + context window via ctx.gatewayCatalog; the
choosing-the-model skill now points at it. Fixed the drifted `gpt-5-thinking`
high-tier entry.
- Fixed a pre-existing post-merge regression: TypedSpecSchema (strict) still
required `model`, rejecting model_policy specs on PUT /bundle.
Frontend model picker deferred (no existing model UI). typecheck clean;
agent-shared 509, agent-tools 114, agent-runner 184, agent-janitor 147.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts: # products/agent_platform/services/agent-runner/src/loop/driver.test.ts # products/agent_platform/services/agent-runner/src/loop/driver.ts
Two gateway-path bugs in the runner driver: - Multi-model turns failed every tool call with "Tool posthog_meta-end-turn not found". fallbackStreamFn wrapped sanitizingStreamFn, so the fallback's re-emitted `done` event resolved result() with the provider-safe tool name before the sanitizing layer could translate it back. Reorder so sanitizing is outermost (gateway-metadata inner, fallback middle). - Settled cost was always $0. The runner looked up GET /v1/usage by an id it generated itself, but the gateway mints its own settlement reference and returns it in the X-Request-ID response header. Capture that via pi-ai's onResponse and key the usage lookup by it; keep Idempotency-Key for retry dedup, drop the ignored outbound X-Request-Id. Adds a multi-model regression test (safe tool name echoed through the fallback wrapper) and updates the gateway-metadata cost tests + a fail-open case. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…a native tool
Expose the served-model catalog as one source feeding both the config UI (REST)
and the agent builder (PostHog MCP), instead of a bespoke native tool.
- janitor: GET /models → { models (id, provider, context, USD/Mtok pricing),
levels } from the gateway catalog + MODEL_POLICY_LEVELS (single source).
- Django: GET /agent_applications/models/ (operation_id agent_applications_models)
proxies the janitor; PAT read-scoped. The PostHog MCP surfaces it as a tool.
- agent-builder: drop the native @posthog/agent-applications-models tool; add
agent-applications-models to its MCP allowlist (next to get-llm-total-costs).
- remove the native tool (tool + test + registry entry).
- MODEL_POLICY_LEVELS.high → [opus-4.7, gpt-5-pro] (drop the sonnet fallback).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- driver.ts: include the fallback `reason` in the `model.fallback` warn log so a 429 storm is distinguishable from a 503 outage after the fact. - analytics-sink.ts: guard `model_attempt` with `!== undefined` (a future call-site passing `0` would otherwise silently drop the property). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Reviews (2): Last reviewed commit: "fix(agent-platform): address review — lo..." | Re-trigger Greptile |
# Conflicts: # products/agent_platform/services/agent-runner/src/loop/driver.test.ts # products/agent_platform/services/agent-runner/src/loop/driver.ts
|
⏭️ Skipped snapshot commit because branch advanced to The new commit will trigger its own snapshot update workflow. If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:
|
1 similar comment
|
⏭️ Skipped snapshot commit because branch advanced to The new commit will trigger its own snapshot update workflow. If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:
|
…lity, throw path) - spec.test.ts: unit-test modelPolicyToList — auto-level expansion + order, auto/medium default, reasoning precedence (per-entry → policy → spec), manual passthrough. It's the spec.models → priority-list contract feeding fallback. - driver.test.ts: assert the direct-path $ai_generation carries model_attempt and fallback_from after a fallover (the AI-observability signal for the feature). - fallback-stream.test.ts: cover the defensive catch branch — base THROWS (vs emits an error event): eligible throw → fallover, non-eligible → surfaced. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Query snapshots: Backend query snapshots updatedChanges: 3 snapshots (2 modified, 1 added, 0 deleted) What this means:
Next steps:
|
|
⏭️ Skipped snapshot commit because branch advanced to The new commit will trigger its own snapshot update workflow. If you expected this workflow to succeed: This can happen due to concurrent commits. To get a fresh workflow run, either:
|
…ize_for Multi-turn agent sessions are dominated by prompt-cache locality: each turn replays a large stable prefix, and cache reads are ~0.1-0.5x of full input. Switching providers mid-session re-reads the whole context cold (full price + a cache-write premium), so the right unit of stability is the session. Add `spec.models.optimize_for` (default `cost`): - cost: the first turn walks the priority list until a model answers, then PINS that model for the rest of the session — every later turn uses only it, no cross-model failover. Keeps one provider's cache warm; if the pinned model is down the turn fails (after pi-ai's same-provider retries) rather than paying a cold re-read on a different provider. - availability: leads with the last-served model (sticky, no thrash) but DOES fail over on failure, trading the cold re-read for keeping the session alive. fallbackStreamFn now tracks the served model across turns and orders attempts accordingly; the pin/lead is seeded from the conversation's last assistant turn so it survives suspend→resume. Hook indices stay in original policy order, so $ai_generation's model_attempt/fallback_from keep their meaning. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…st happy-path, stale pin) - cost: pins the PRIMARY when it serves first and won't fall over on a later failure (the common-case guarantee; prior cost tests all started with the primary failing). - availability: a stuck MIDDLE model leads, then the rest follow in original priority order ([b, a, c]) — exercises the reorder beyond the 2-model case. - a stale initialServedId not in the list is ignored → walk from the primary. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…chema The model-policy fields carried only enum values, so the agent builder editing a spec saw `optimize_for: [cost, availability]` / `level: [low, medium, high]` with no rationale — it had to rely on the choosing-the-model skill (which predates optimize_for entirely). Add JSON-schema descriptions for level, reasoning, optimize_for, mode, and the manual entries, so the spec is self-documenting: the descriptions flow spec_schema.py → OpenAPI → the MCP spec-edit tool input schema → the builder's context. Regenerated MCP types + tool-schema snapshots. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Resolve conflicts from two master efforts that overlapped this branch's spec.models work: - model-pattern PR: master added a `<provider>/<model-id>` regex to ModelIdSchema and the Django spec schema, plus model-id tests/fixtures. This branch replaces the singular `model` field with the `models` policy (auto/manual) and intentionally relaxes ModelIdSchema to `min(1)` (validation moves to the gateway catalog). Took this branch's `models` shape in spec_schema.py, spec.ts, tests, and the generated frontend/MCP types + tool-schema snapshots; reverted ModelIdSchema to the loose `min(1)` so manual lists (e.g. `mock-static:hello`) parse. - remove-is_preview PR (#65978): master removed the `is_preview` session marker and `isPreviewSideEffect`. This branch predated it. Applied the removal to the two files where the merge had retained it (ToolContext.isPreview in tool.ts, build-agent-tools.ts) while keeping this branch's new `gatewayCatalog` field and master's `webSearchProviders`. Also corrected two pre-existing latent test assertions surfaced by the merge (real-PG suites that omit the defaulted `optimize_for: 'cost'` on a round-tripped manual model policy) in pg-impls.test.ts and the janitor server.test.ts. Verified: typecheck across all agent_platform node services; agent-shared (526), agent-runner (203), agent-janitor (152), MCP (1947) and Django spec-schema (45) tests pass; ruff clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wrap the search_text/models round-trip assertion added during the master merge — it exceeded oxfmt's line width and broke the agent-shared and frontend format checks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Pre-existing test debt unrelated to the master merge: a few test fixtures still used the old top-level `model` field (removed in favor of `models`) or omitted the `optimize_for: 'cost'` default that ModelPolicySchema adds on parse. - test_revision_create.py: `_SPEC` POSTs through the validating create endpoint, which now 400s on the unknown `model` key. Translated to the equivalent manual `models` policy. - typed-bundle-authoring.test.ts / example-posthog-ai.test.ts: assert the `models` shape (not the old `model`) and include the defaulted `optimize_for`. Backend suite passes (149); example-posthog-ai parse test passes; agent-tests typecheck clean. The typed-bundle cases need the full e2e stack (Postgres+Redis+Kafka+SeaweedFS+gateway) to run, so those assertion fixes are verified against the CI failure output, not a local run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts: # products/agent_platform/frontend/generated/api.schemas.ts # products/agent_platform/services/agent-tests/src/examples/agent-builder/skills/editing-agents-safely/SKILL.md # services/mcp/src/generated/agent_platform/api.ts
The PATCH-spec case sent `{"model": "..."}` at the top level, which the
post-merge schema rejects with `additional properties not allowed` — the
field is now `spec.models` (nested mode/list). Updated the PATCH body
and assertion to the manual-mode shape.
Query snapshots: Backend query snapshots updatedChanges: 1 snapshots (1 modified, 0 added, 0 deleted) What this means:
Next steps:
|
- fallback-stream: guard the detached IIFE so no path leaves the stream
hanging. Post-commit re-throw and any unexpected throw from the
dispatch loop now surface as a stop=error result via the outer
`run().catch(...)` (callers awaiting `.result()` previously hung).
- fallback-stream: new `FallbackHooks.onPinLost` fired when the
session's sticky model is no longer in the policy list. Previously a
delisted model silently unpinned and lost prompt-cache warmth.
- spec: restore `<provider>/<model-id>` regex on `ModelIdSchema`
(zod + Python JSON schema). Bare ids freeze fine but the gateway 400s
on the first session — catch at authoring time, not at runtime.
- gateway-catalog `validateModelPolicy`: format check is now
unconditional (independent of catalog availability). The fail-open
on servability still applies, but a malformed id is always wrong.
- Tests: bare id / uppercase provider / missing model id rejection
cases on both Python and TS schemas; pin-lost callback hookup;
IIFE guard pinned by post-commit-rejection + sync-throw cases.
- Fix `mock-static:hello` → `mock-static/hello` typo in pg-impls test.
Followups for the PR author (not in this commit):
- No Django data migration for old rows with top-level `spec.model`.
zod strips unknown keys, so an authored opus pick silently becomes
the auto/medium default. Worth a one-shot backfill or a fail-loud
shim before the breaking-change release.
- gateway-catalog `list()` inflight dedup serves stale cache on fetch
failure; consider surfacing `{value, fresh, lastError}` so the
caller can log.
|
Items I flagged in review but deliberately left for you — they're judgment calls or larger surface changes you should drive. 1. Gateway catalog inflight dedup serves stale cache on fetch failureIn Fix changes the interface CatalogResult {
models: CatalogModel[]
fresh: boolean
lastError?: Error
}Callers can then log on 2. Driver fail-open path can ship delisted models to the gatewayRelated to the first point. Suggestion: distinguish "catalog unreachable" (use cached list, log loudly) from "catalog returned empty" (terminal — refuse to start). Today both look identical downstream. |
Query snapshots: Backend query snapshots updatedChanges: 2 snapshots (2 modified, 0 added, 0 deleted) What this means:
Next steps:
|
The tightened ModelIdSchema regex (`<provider>/<model-id>`) rejects bare ids like `x`/`y`, breaking 17 server.test.ts cases on AgentSpecSchema.parse. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
spec.modelandspec.entrypointare removed. Model selection is solely viaspec.models. Pre-release, so no back-compat shim — all example bundles + tests are migrated in this PR.What & why
Agents ran on a single
spec.modelwith no fallback when a provider was down. This replaces it withspec.models(anauto/manualpolicy → a priority-ordered, cross-provider list) and adds session-aware model selection + runtime fallback.spec.modelsauto(default): pick a level (low/medium/high); the platform resolves it to a priority-ordered, cross-provider list (MODEL_POLICY_LEVELS) at session start, kept current centrally so auto agents get new models without a new revision.manual: an explicit priority-orderedmodels[]list, with optional per-modelreasoning.optimize_for(both, defaultcost): session model stability vs. resilience — see below.{ mode: 'auto', level: 'medium', optimize_for: 'cost' }.modelPolicyToList(spec)is the single resolution point.Session-sticky selection (
optimize_for) — the cost storyMulti-turn agent sessions are dominated by prompt-cache locality: each turn replays a large stable prefix, and cache reads are ~0.1–0.5× of full input. Switching providers mid-session re-reads the whole context cold (full price + a cache-write premium), so the right unit of stability is the session, not the request.
cost(default): the first turn walks the list until a model answers, then pins it for the rest of the session — every later turn uses only it, no cross-model failover. One provider, warm cache. If the pinned model is down the turn fails (after pi-ai's same-provider retries) rather than paying a cold re-read on a different provider.availability: leads with the last-served model (sticky, no thrash) but fails over on failure, trading the cold re-read for keeping the session alive.The pin/lead is derived from the conversation's last-served assistant turn, so it survives suspend→resume.
Runtime fallback
fallbackStreamFnwraps the stream over the resolved list:optimize_for:costpins after the first success (no further failover);availabilitykeeps failover enabled across turns.$ai_generationtags the model that actually answered plus$agent_model_attempt/$ai_fallback_from.Model catalog (single source)
GET …/agent_applications/models/(→ janitor/models→ gateway catalog) feeds both the config UI (REST) and the agent builder (PostHog MCP toolagent-applications-models, now generated + served). Replaces the bespoke native models tool.Changes
agent-shared/.../spec.ts): removemodel+entrypoint;modelsunion (auto/manual) +optimize_for;MODEL_POLICY_LEVELS;modelPolicyToList.agent-runner): list resolution inworker.ts;loop/fallback-stream.ts(sticky lead + cost/availability, + tests);driver.tsseeds the sticky model from the conversation + per-attempt analytics;analytics-sink.tsfallback props.backend/logic/spec_schema.py+ tests):models+optimize_for; relaxed-required via field defaults.services/mcp):agent-applications-modelsenabled + regenerated; tool-schema snapshots regenerated foroptimize_for.spec.json(manual single-model preserves their pinned model; agent-builder uses auto/high) and every spec-constructing test/harness →models.choosing-the-modelskill (+platform-mental-model,cost-and-quota-analysis) updated for the new shape; live-catalog pricing.Verification
fallback-stream27,driver30 (incl. fallback integration + sticky wiring),agent-sharedspec + gateway-catalog 111, Pythontest_spec_schema46.master.Follow-ups
optimize_forv2: cost-aware escalation — for big/async sessions, suspend-and-retry the pinned provider before failing over (theavailabilitycold re-read is bounded to one full-context read, but real); optional per-trigger policy. Documented in the design note.MODEL_POLICY_LEVELSis the v1 backend-owned tiering; can later move to a DB/feature-flag/gateway-owned source.frontend/generated/*(orval output, currently unimported) still references the old names — regenerate from the OpenAPI when convenient.🤖 Generated with Claude Code