This is the end state every Tangle product agent (gtm, legal, tax, creative, agent-builder, blueprint, physim) converges to. It is the target the consumer migrations build toward — not a 1:1 port of whatever eval/improvement code a product has today.
Thesis. A product agent is one closed, automated self-improvement loop
that makes the agent measurably better over time while humans only approve
PRs. A product should NOT have a "production loop" and a pile of eval/*
CLIs and bespoke optimization orchestration. It has one loop, composed
from the substrate. Everything else is deleted.
Primitives reference: primitives-integration-spec.md.
Engine internals: self-improvement-engine.md.
1. SAMPLE the eval matrix
scenarios = cartesian(
profileVariants, // the surface(s) under test: baseline + candidates
productScenarios, // the hard product tasks (gtm: attribution honesty, …)
personas, // simulated users / drivers
) ∪ productionFailures // real failures pulled from the LabeledScenarioStore
// (the flywheel: prod traces become eval scenarios)
2. MEASURE — runCampaign
dispatch(scenario) = runMultishot({ // the multi-turn challenging flow
persona: scenario.persona, // driver = simulated user
profile: productAgentProfile(surface), // worker = the agent under test
shape: scenario.flow, // the real task, many turns
tools: productTools, // real tools, real side-effects
}) → transcript artifact
judges = product ensemble (domain dimensions) → scorecard + bootstrap CIs
labeledStore: capture EVERY cell (scenario, artifact, score, source) → the dataset
3. ANALYZE — trace analysts (runAnalystLoop / AnalystRegistry)
read the campaign traces → a research report (failure modes, why, where).
This REPLACES bespoke "failure clustering": the analyst is the richer,
LLM-driven version of "what should we improve and why".
4. IMPROVE — runImprovementLoop( improvementDriver + agenticGenerator )
driver.propose({ report, dataset, … }) → candidate surfaces.
The agentic generator runs a coding harness in a worktree, reading the
report + the codebase, making REAL product changes — prompt, tools, AND
code — not just an addendum string. Each candidate is measured on a
HELD-OUT slice of the matrix.
5. GATE — defaultProductionGate (+ domain gates, composed)
heldout-delta + budget + red-team + reward-hacking + canary, plus any
product-specific gate (e.g. anti-fabrication) and an overfit-gap check.
Verdict ∈ ship | hold | need_more_work | model_ceiling | arch_ceiling.
6. PROMOTE — openAutoPr
the winning worktree → a PR against the product repo. Human approves → ships.
(autoOnPromote: 'pr'. Live self-mutation is deferred behind the full safety
stack.)
7. LOOP
the shipped, improved agent runs in production → emits traces → the dataset
grows → back to (1). The loop is scheduled (cron) and/or triggered when the
analyst report crosses a severity threshold. Autonomous between PR approvals.
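Step 1 can be sketched as follows. This is a minimal illustration, assuming a flat `Scenario` record and a `sampleMatrix` helper, both of which are hypothetical names; the substrate's real types and sampling API may differ. It shows the cartesian product of the three axes being unioned with production failures, with provenance recorded on each cell.

```typescript
// Hypothetical shapes for illustration only; the real substrate types may differ.
type Source = "matrix" | "production";

interface Scenario {
  profile: string; // surface under test (baseline or a candidate)
  task: string;    // a hard product task
  persona: string; // simulated user driving the flow
  source: Source;  // provenance, so the dataset can tell the two apart
}

// Cartesian product of the three axes, followed by the production failures.
// No de-duplication is attempted here; that is left as an assumption.
function sampleMatrix(
  profileVariants: string[],
  productScenarios: string[],
  personas: string[],
  productionFailures: Scenario[],
): Scenario[] {
  const cells: Scenario[] = [];
  for (const profile of profileVariants) {
    for (const task of productScenarios) {
      for (const persona of personas) {
        cells.push({ profile, task, persona, source: "matrix" });
      }
    }
  }
  return [...cells, ...productionFailures];
}

// Illustrative inputs; none of these values come from a real product.
const matrix = sampleMatrix(
  ["baseline", "candidate-a"],
  ["attribution-honesty", "proposal-grounding"],
  ["skeptical-cfo"],
  [{ profile: "baseline", task: "prod-failure-1", persona: "real-user", source: "production" }],
);
console.log(matrix.length); // 2 * 2 * 1 matrix cells + 1 production failure = 5
```

Keeping `source` on every cell is what lets the labeled store later distinguish synthetic matrix cells from scenarios that originated as production failures.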
One entry point, no new abstractions. A product exposes a single
run<Product>ImprovementCycle() that composes the substrate primitives
above. It does NOT define runFooPromptEvolution, FooOptimizer,
FooProductionLoop, etc. The substrate carries every name; the product only
wires its domain pieces into the seams.
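A rough sketch of what "one entry point" might look like. Only the names `runCampaign`, `runImprovementLoop`, and `openAutoPr` come from the text above; every signature, the `Substrate` interface, and the `analyze` and `gate` members are assumptions made so the example is self-contained. The point it illustrates is the shape: one function, a straight composition, and no PR unless the gate says ship.

```typescript
// All signatures here are assumptions for illustration, not the substrate's API.
type Verdict = "ship" | "hold" | "need_more_work" | "model_ceiling" | "arch_ceiling";

interface Substrate {
  runCampaign(scenarios: string[]): Promise<{ traces: string[] }>;
  analyze(traces: string[]): Promise<{ report: string }>;
  runImprovementLoop(report: string): Promise<{ candidate: string }>;
  gate(candidate: string): Promise<Verdict>;
  openAutoPr(candidate: string): Promise<{ url: string }>;
}

interface CycleResult {
  verdict: Verdict;
  prUrl?: string; // set only when the gate verdict is "ship"
}

// One entry point: measure -> analyze -> improve -> gate -> promote.
async function runProductImprovementCycle(
  substrate: Substrate,
  scenarios: string[],
): Promise<CycleResult> {
  const { traces } = await substrate.runCampaign(scenarios);
  const { report } = await substrate.analyze(traces);
  const { candidate } = await substrate.runImprovementLoop(report);
  const verdict = await substrate.gate(candidate);
  if (verdict !== "ship") return { verdict }; // no PR unless the gate ships
  const { url } = await substrate.openAutoPr(candidate);
  return { verdict, prUrl: url };
}

// Stub substrate so the sketch runs without any real primitives.
const stubSubstrate: Substrate = {
  runCampaign: async (scenarios) => ({ traces: scenarios.map((s) => `trace:${s}`) }),
  analyze: async (traces) => ({ report: `${traces.length} traces analyzed` }),
  runImprovementLoop: async (report) => ({ candidate: `worktree-for(${report})` }),
  gate: async (): Promise<Verdict> => "ship",
  openAutoPr: async (candidate) => ({ url: `pr://${candidate}` }),
};

runProductImprovementCycle(stubSubstrate, ["s1", "s2"]).then((result) => {
  console.log(result.verdict, result.prUrl);
});
```

Passing the primitives in as a dependency is a choice made for this sketch so it can run against stubs; a real product would more likely import them from the substrate directly.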
OWNS (domain — stays, this is the product's value):
- productScenarios: the hard tasks the agent must handle.
- personas: the simulated users that drive the multi-shot flows.
- judges / rubrics / dimension weights: how "good" is defined.
- productTools: the real tools the agent uses.
- deterministic checks (anti-slop, format, forbidden-claim): fast pre-judges.
- domain gates (e.g. anti-fabrication): composed into the gate.
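As an illustration of a deterministic pre-judge, here is a minimal forbidden-claim check. The function name, result shape, and phrase list are all hypothetical; a real product would supply its own list and may match more carefully than a case-insensitive substring test.

```typescript
// Hypothetical check shape; the phrase list below is illustrative only.
interface CheckResult {
  pass: boolean;
  violations: string[]; // the forbidden phrases found in the output
}

// Case-insensitive substring match: cheap enough to run before any LLM judge.
function forbiddenClaimCheck(output: string, forbidden: string[]): CheckResult {
  const text = output.toLowerCase();
  const violations = forbidden.filter((phrase) => text.includes(phrase.toLowerCase()));
  return { pass: violations.length === 0, violations };
}

const forbiddenPhrases = ["guaranteed roi", "100% attribution"];
const clean = forbiddenClaimCheck("Attribution is estimated from UTM data.", forbiddenPhrases);
const flagged = forbiddenClaimCheck("This campaign has Guaranteed ROI.", forbiddenPhrases);
console.log(clean.pass, flagged.pass, flagged.violations);
```

Because the check is a pure function of the transcript text, it costs no tokens and gives the same answer on every run, which is what makes it suitable as a fast filter ahead of the judge ensemble.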
DELETES (orchestration the substrate now owns):
- every for (gen of generations) mutate→score→select loop.
- bespoke prompt-evolution / production-loop / analyst-loop wrappers.
- trial-matrix construction, frontier tracking, seed plumbing, manifest hashing, cell caching, scorecard aggregation, CI math.
- PR-opening scaffolding, worktree git plumbing.
- parallel eval/* CLIs that each re-implement a slice of the above.
COMPOSES (the substrate, in the one cycle):
runCampaign (matrix measurement) · runMultishot (the dispatch flow) · FsLabeledScenarioStore (dataset) · analysts (report) · runImprovementLoop + improvementDriver + agenticGenerator (improve) · defaultProductionGate + composeGate (gate) · openAutoPr (promote).
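Gate composition can be pictured with the sketch below. It is not the substrate's `composeGate`: the name `composeGateSketch`, the synchronous `Gate` type, the `Candidate` fields, and the rule that the first non-ship verdict wins are all assumptions. The document only states that gates compose and that the result fails closed, which the sketch models by returning "hold" when a gate throws or when no gates are supplied.

```typescript
type Verdict = "ship" | "hold" | "need_more_work" | "model_ceiling" | "arch_ceiling";
type Gate<C> = (candidate: C) => Verdict;

// Assumed semantics: every gate must say "ship"; the first other verdict wins.
function composeGateSketch<C>(...gates: Gate<C>[]): Gate<C> {
  return (candidate: C): Verdict => {
    if (gates.length === 0) return "hold"; // nothing vouched for the candidate
    for (const gate of gates) {
      let verdict: Verdict;
      try {
        verdict = gate(candidate);
      } catch {
        return "hold"; // fail closed: a crashing gate never ships
      }
      if (verdict !== "ship") return verdict;
    }
    return "ship";
  };
}

// Hypothetical candidate metrics and two toy gates.
interface Candidate {
  heldoutDelta: number;     // improvement over baseline on the held-out slice
  fabricatedClaims: number; // count reported by a domain check
}

const heldoutDeltaGate: Gate<Candidate> = (c) =>
  c.heldoutDelta > 0 ? "ship" : "need_more_work";
const antiFabricationSketch: Gate<Candidate> = (c) =>
  c.fabricatedClaims === 0 ? "ship" : "hold";

const composedGate = composeGateSketch(heldoutDeltaGate, antiFabricationSketch);
console.log(composedGate({ heldoutDelta: 0.04, fabricatedClaims: 0 })); // "ship"
console.log(composedGate({ heldoutDelta: 0.04, fabricatedClaims: 2 })); // "hold"
```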
- One cycle, one entry. A single run<Product>ImprovementCycle() composes the substrate; the old eval/improvement systems are deleted, not coexisting.
- Matrix eval is real. dispatch runs genuine multi-shot persona↔agent flows with real tools, not single-turn projections and not stubbed workers (non-zero token usage is asserted).
- The dataset is fed. Every cell captures to LabeledScenarioStore with correct provenance; production failures flow back in as scenarios.
- Improvement is code-real. The agentic generator produces worktree changes (prompt/tools/code), measured on holdout, not just addendum-string mutation.
- The gate is honest. Composed defaultProductionGate + domain gates + overfit-gap; fails closed; holdout never overlaps train.
- Promotion is a PR. openAutoPr opens it; a human approves; nothing auto-deploys.
- It's scheduled + triggered. Runs on cadence and/or when the analyst report crosses severity; autonomous between approvals.
- Tests + a real proof run. Contract tests assert the wiring; one real end-to-end cycle produces a scorecard and (on a shipping gate) a PR.
Anything short of this is mid-migration, not done.
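Two of the criteria above lend themselves to small contract checks: holdout must not overlap train, and no measured cell may come from a stubbed worker. The sketch below shows one possible form, assuming a hypothetical `Cell` record with a `tokensUsed` field; the helper names are illustrative and not part of the substrate.

```typescript
// Hypothetical cell record; the real scorecard cell may carry more fields.
interface Cell {
  scenarioId: string;
  tokensUsed: number; // total tokens the worker consumed for this cell
}

// Scenario ids that appear in both splits. Empty means the splits are disjoint.
function splitOverlap(trainIds: string[], holdoutIds: string[]): string[] {
  const train = new Set(trainIds);
  return holdoutIds.filter((id) => train.has(id));
}

// Cells with no token usage, which would indicate a stubbed or skipped worker.
function stubbedCells(cells: Cell[]): Cell[] {
  return cells.filter((cell) => cell.tokensUsed <= 0);
}

const leaked = splitOverlap(["s1", "s2", "s3"], ["s3", "s4"]);
const suspicious = stubbedCells([
  { scenarioId: "s1", tokensUsed: 1830 },
  { scenarioId: "s2", tokensUsed: 0 },
]);
console.log(leaked, suspicious.map((c) => c.scenarioId)); // [ 's3' ] [ 's2' ]
```

A contract test would assert that both results are empty for a real campaign and fail the cycle otherwise, consistent with the gate failing closed.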
| Loop step | gtm wiring |
|---|---|
| SAMPLE | profile variants of OPERATOR_CEO_SYSTEM_PROMPT + addendum; GTM_LOOP_HOLDOUT_SCENARIOS + eval/business-owner/personas.json; production failures from the trace store |
| MEASURE | dispatch = runMultishot(persona ↔ gtm-agent via runChatThroughRuntime, real tools); judges = the 3-model ensemble (attribution_honesty, proposal_grounding) + canonical 12-dim |
| ANALYZE | trace analysts over the campaign traces → report (supersedes FailureClusterConfig clustering) |
| IMPROVE | improvementDriver + agenticGenerator (claude harness) edits prompt/tools/code in a worktree, fed the report |
| GATE | composeGate(defaultProductionGate, antiFabricationGate, overfitGapGate) |
| PROMOTE | openAutoPr → PR against tangle-network/gtm-agent |
Deleted: eval/run-prompt-evolution.ts, eval/analyst-loop.ts,
eval/optimization-campaign.ts, scripts/evals/*, the orchestration body of
production-loop/index.ts and eval/canonical.ts.
Kept: scenarios, personas, judges, tools, deterministic checks, the
composeProductionLoopSystemPrompt wiring.
Result: one runGtmImprovementCycle(); ~3–4k LOC of scattered orchestration
gone, replaced by a substrate composition.
This gtm build is the reference the other six products copy.