Pinned scope decisions for the EXPAND tier. What we built, what we deliberately did NOT, and what's gated on Phase B evidence.
Wire-format substrate (shipped):
- `@tangle-network/agent-eval/hosted`: public client + types for shipping eval-run events + trace spans to any orchestrator that speaks the wire format.
- `docs/hosted-ingest-spec.md`: semver-committed wire spec (`HostedWireVersion = "2026-05-26.v1"`).
- `examples/hosted-ingest-server/`: minimal hono-based reference receiver (~200 LOC). Executable spec. Stays as the reference even after the production orchestrator ships.
- `selfImprove({ hostedTenant })` opt-in: when set, the substrate POSTs the final eval-run event to the configured endpoint. Failures are logged but never fail the loop (LAND tier never blocks on EXPAND-tier infra).
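The "logged but never fail the loop" behavior can be sketched roughly as follows. This is a minimal illustration, not the shipped client: `shipFinalEvent`, the `EvalRunEvent` and `HostedTenant` shapes, and the `x-tenant-id` header name are all hypothetical, and the fetch function is injected so the sketch runs without a network.

```typescript
// Hypothetical shapes: field names are illustrative, not taken from the wire spec.
type EvalRunEvent = { runId: string; wireVersion: string; summary: unknown };
type HostedTenant = { endpoint: string; tenantId: string; token: string };
type FetchLike = (
  url: string,
  init: { method: string; headers: Record<string, string>; body: string },
) => Promise<{ ok: boolean; status: number }>;

// POST the final eval-run event. Every failure mode is logged and reported
// through the boolean return value; nothing is thrown to the caller, so the
// improvement loop keeps running when hosted infra is down.
async function shipFinalEvent(
  tenant: HostedTenant,
  event: EvalRunEvent,
  fetchImpl: FetchLike,
  log: (msg: string) => void = console.warn,
): Promise<boolean> {
  try {
    const res = await fetchImpl(tenant.endpoint, {
      method: "POST",
      headers: {
        "content-type": "application/json",
        authorization: `Bearer ${tenant.token}`,
        "x-tenant-id": tenant.tenantId, // assumed header name
      },
      body: JSON.stringify(event),
    });
    if (!res.ok) {
      log(`hosted ingest rejected event: HTTP ${res.status}`);
      return false;
    }
    return true;
  } catch (err) {
    log(`hosted ingest unreachable: ${String(err)}`);
    return false;
  }
}
```

A caller would await the result and carry on regardless of its value; the return value exists only so the outcome can be recorded.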
Production orchestrator (started):
- HTTP ingest service in `@tangle-network/monorepo` accepting the wire format. Lives under the orchestrator app. Tenant auth + isolation + persistent storage + read endpoints. Started this session; see the `@tangle-network/agent-dev-container` PR. Not feature-complete: tenant CRUD + adversarial isolation tests pending.
The wedge doc gates these on Phase B evidence — partner-validated signal about what the hosted product actually needs to do. Shipping them without that signal risks building the wrong thing.
| Deferred until Phase B passes | Why |
|---|---|
| Metered billing wire-up (Stripe + cost-ledger) | The billable units (per-eval-run, per-ingested-MB, per-seat) depend on actual partner consumption patterns. Picking dimensions in a vacuum locks us into wrong pricing. |
| Multi-tenant dashboard UX | Partners' first dashboard request defines the right default views. We have a stub list-runs page; the rest is post-signal. |
| Webhook callbacks per tenant | The events partners want pushed (gate-decided, cost-threshold, regression-alert) are partner-shaped. Add them when a partner asks. |
| Cross-tenant aggregation / benchmarking | This is the "Datadog for agents" tier — explicit roadmap, requires user volume we don't have. |
| Sandbox-cost roll-up into hosted billing | Cross-product billing integration requires PLATFORM-tier partners. Out of scope until at least one. |
| Trace UI | OTel-shape spans store fine. Visualization comes after partners ask. Phoenix / Jaeger / any OTLP-compatible viewer covers it in the interim. |
| SOC 2 / compliance audit work | Required for enterprise; not required for design partners. |
These are committed and won't change without a major-version wire bump or a documented migration:
- Wire format is JSON over HTTP, not gRPC. Reasons: works in browsers + edge + node + curl; OTel-compatible at the trace stream level; lowest possible barrier to a self-hosted orchestrator.
- Tenant auth is bearer-token + tenant-id header, not OIDC / service-account / mutual-TLS. Reasons: simplest thing that's actually secure with proper key handling; defers complex IAM until enterprise demand.
- Idempotency via header, not transactional API. Servers MUST dedupe by `(tenantId, Idempotency-Key)` for 24h. Simpler than making clients commit transactions.
- Eval-runs and traces are SEPARATE streams with pivot keys (`tangle.runId` etc.) on spans. Reasons: traces can be best-effort (lossy) without corrupting eval-run semantics; orchestrators can prioritize eval-run durability without forcing trace durability.
- Wire version is a date.v-N string, not semver. Reasons: dates communicate "when was this contract frozen"; v-N captures incremental breaking changes between dates.
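The requirement to dedupe by `(tenantId, Idempotency-Key)` for 24h could look something like the sketch below on the receiver side. It is an in-memory illustration only: `IdempotencyStore`, `claim`, and `sweep` are hypothetical names, and a production receiver would presumably back this with persistent storage.

```typescript
const DEDUPE_WINDOW_MS = 24 * 60 * 60 * 1000; // 24h dedupe window

class IdempotencyStore {
  // Maps a serialized [tenantId, idempotencyKey] pair to its expiry time in ms.
  private readonly seen = new Map<string, number>();

  // Returns true when the request is new and should be processed,
  // false when it is a replay inside the dedupe window.
  claim(tenantId: string, idempotencyKey: string, nowMs: number = Date.now()): boolean {
    // JSON.stringify of the pair avoids collisions such as ("a:b", "c") vs ("a", "b:c").
    const id = JSON.stringify([tenantId, idempotencyKey]);
    const expiresAt = this.seen.get(id);
    if (expiresAt !== undefined && expiresAt > nowMs) return false;
    this.seen.set(id, nowMs + DEDUPE_WINDOW_MS);
    return true;
  }

  // Drops expired entries so the map does not grow without bound.
  // Returns how many entries were removed.
  sweep(nowMs: number = Date.now()): number {
    let removed = 0;
    this.seen.forEach((expiresAt, id) => {
      if (expiresAt <= nowMs) {
        this.seen.delete(id);
        removed++;
      }
    });
    return removed;
  }
}
```

Scoping the key by tenant means two tenants can reuse the same `Idempotency-Key` value without suppressing each other's writes, which keeps dedupe consistent with tenant isolation.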
When the design-partner pairing happens, capture answers to these explicitly:
- Surface confidentiality: do partners want the verbatim surface (system prompt) shipped, or just the hash? Today the wire format has `surface?` as optional; partner default is what we ship.
- Trace sampling: at what cells-per-second do trace spans become noise? What's the right default sampling rate?
- Cost attribution granularity: per cell? per generation? per run? Per judge dimension? Partner needs determine what we surface in billing reports.
- Replay: do partners want to re-run an old eval-run from the stored data? That would require us to store more than the summary — actual artifacts + prompts. Storage cost implication.
- PII / sensitive scenarios: how do partners want to handle scenarios containing user data? Encryption-at-rest is table stakes; redaction-at-ingest may be required for some.
The partner pairing kit (docs/phase-b-pairing-kit.md) has discovery
questions that probe these.
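To make the surface-confidentiality question concrete, the two options (verbatim surface vs. hash only) might be sketched as below. The `surfaceHash` field name, the choice of SHA-256, and the `buildSurfacePayload` helper are assumptions for illustration; only the optional `surface` field comes from the text above.

```typescript
import { createHash } from "node:crypto";

// Hypothetical payload shape: `surface` is optional as described above;
// `surfaceHash` is an assumed field name.
type SurfacePayload = { surfaceHash: string; surface?: string };

// Always include a stable hex digest of the system prompt; include the
// verbatim text only when the partner has opted in to shipping it.
function buildSurfacePayload(systemPrompt: string, shipVerbatim: boolean): SurfacePayload {
  const surfaceHash = createHash("sha256").update(systemPrompt, "utf8").digest("hex");
  return shipVerbatim ? { surfaceHash, surface: systemPrompt } : { surfaceHash };
}

// Hash-only mode still lets two runs be compared for "same prompt or not"
// without the prompt text ever leaving the partner's environment.
const hashOnly = buildSurfacePayload("You are a helpful agent.", false);
const verbatim = buildSurfacePayload("You are a helpful agent.", true);
console.log(hashOnly.surfaceHash === verbatim.surfaceHash); // prints true
```

Whichever mode a partner prefers, the hash stays present in both, so the default can change later without altering how runs are correlated.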
This RFC does NOT plan for:
- Replacing Langfuse / Phoenix / Arize. We INGEST OTel; we don't build a generic trace viewer. The dashboard is eval-run-shaped, not trace-shaped.
- Becoming a model gateway. Tangle Router exists; the hosted orchestrator routes to Tangle Router by default but doesn't duplicate its function.
- Becoming an LLM-call CDN. Caching is the consumer's job (their agent code, their HTTP client). We don't intercept LLM calls.
- Building an "agents IDE." Substrate, not surface.
When Phase B passes the gate, the production orchestrator finishes:
- Replace in-memory store with Postgres (tenant data) + S3 (large artifacts) OR Cloudflare D1 + R2 (Workers-native).
- Wire metered events to Stripe + the cost-ledger.
- Tenant CRUD UI + onboarding flow.
- Multi-tenant dashboard MVP (list runs, drill into one, diff generations, view shipped prompt).
- Adversarial tenant-isolation test battery in CI.
- Webhooks + observability for the orchestrator itself.
Estimated effort post-Phase-B: ~1 week of focused work for one engineer. This is fast precisely BECAUSE the wire format is locked and the reference receiver exists: the production server is a different implementation of the same contract.