If you're an agent picking this up: read this page, then run pnpm help + pnpm gate —
do NOT re-derive the harness from source. This map is SHORT on purpose; if it disagrees
with the code, the code wins — fix this page in the same turn (the anti-rediscovery law).
Verified against source 2026-06-10 · agent-eval pinned ^0.83.0. The CANONICAL surface is now
the published optimization suite (@tangle-network/agent-runtime/loops): Environment +
Strategy/defineStrategy + runBenchmark — see the section below FIRST; the older
runExperiment/corpus-replay paths remain for the legacy gates.
The success criterion is Gate B (docs/learning-flywheel.md, docs/architecture.md §2): across repeated runs on a persistent, checkable task family, the deployed policy's verifier-graded multi-objective score (correct · fast · secure · cheap, each its own deployable checker) improves run-over-run at matched per-run compute, surviving a frozen-policy control, significant at adequate n. That across-run slope is RSI. The harness has NOT yet run Gate B — see the durable gap below.
What the harness measures today is Gate A (docs/roadmap-rsi.md — the inner GO/NO-GO for the
within-run adaptive-driver layer): does any non-blind topology beat blind compute at EQUAL COMPUTE
(Σ rollouts × turns — k counts rollouts, each may be multi-turn/stateful), under a DEPLOYABLE
(non-oracle) selector, at significant n? Gate A is a narrow diagnostic — the cost-justification
for parallel/adaptive topology, NOT the product verdict. A failed Gate A deletes within-run
steering only; it never touches the corpus+policy product (Gate B). The invariant is equal-COMPUTE,
not equal-k-on-stateless-samples.
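The equal-COMPUTE invariant above can be illustrated with a minimal sketch. It assumes compute is counted as one unit per model call inside a rollout; the `RolloutCost` shape and both function names are hypothetical and are not part of the harness.

```typescript
// Hypothetical record: one entry per rollout, counting the model calls it made.
interface RolloutCost {
  modelCalls: number;
}

/** Compute spent by one arm: the sum of model calls across all of its rollouts. */
function armCompute(rollouts: RolloutCost[]): number {
  return rollouts.reduce((sum, r) => sum + r.modelCalls, 0);
}

/** True when two arms spent the same compute, however many rollouts each used. */
function isEqualCompute(a: RolloutCost[], b: RolloutCost[]): boolean {
  return armCompute(a) === armCompute(b);
}

// Blind arm: six single-call rollouts. Adaptive arm: two rollouts of three calls each.
const blindArm: RolloutCost[] = Array.from({ length: 6 }, () => ({ modelCalls: 1 }));
const adaptiveArm: RolloutCost[] = [{ modelCalls: 3 }, { modelCalls: 3 }];

const sameCompute = isEqualCompute(blindArm, adaptiveArm); // true: 6 calls on each side
const sameK = blindArm.length === adaptiveArm.length; // false: k is 6 vs 2
```

The two arms are a fair comparison under this accounting even though their k differs, which is the distinction between equal-compute and equal-k.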
Terminology (one word, used consistently). A rollout (≡ a "shot") is ONE agent running an
AgentProfile to completion — a full, possibly multi-turn / stateful trajectory. k counts
rollouts; turns live inside a rollout, never as separate shots. A single stateless
completion (maxTurns=0, harness: null, one model call, no persistent workspace) is the
degenerate rollout — fine as a selector lower bound, never the canonical unit. The HumanEval
probe (bench/src/humaneval-gate.mts) uses exactly that degenerate shape — it calls the router
directly and does not route through AgentProfile / the sandbox / the keystone — so its numbers
are the no-self-correction lower bound on the selector, distinct from the rollout-based keystone
gate above. Bridge it to the product by running the same arms with real rollouts (an AgentProfile
through runLoop), dialing maxTurns.
Two things to keep straight: today's judges grade a single correctness scalar (the multi-objective vector is the open contract, architecture.md §6), and every number below is single-objective + within-run — read them as Gate-A diagnostics, not Gate-B results.
- Within-run STEER (verify-and-revise family) LOSES (rung-0, n=40: blind 37.5% → random@3 60.0% → refineGepa@3 45.0%; the earlier +20pp was confounded compute).
- On the COMMITTED finsearch corpus, the self-consistency selector also loses: selector@k − random@k = −8.2pp (n=51). So "pick the consensus among k identical-ish attempts" does not beat a random draw here.
- VERIFIER-GROUNDED selection is the one selector that wins, but ONLY where a domain has WITHIN-TASK graded variance. Proven POSITIVE on HumanEval (deployable-checker, binary, n=50: verifier−sc = +12.0pp CI[+4,+22]). The continuous-reward generalization (`selector.ts` `summarizeVerifierSelector`, `corpus-replay --selector=verifier`) ranks k attempts by their stored deployable-checker `score` and reports selector(=best-of-k) vs random(=mean-of-k) with a paired bootstrap CI. aec-bench is structurally DEAD for it (n=12 gpt-4.1, 0/12 random + 1/12 diverse tasks have any within-task score spread): closed-form engineering calcs score deterministically w.r.t. sampling, so there is across-task difficulty (33% resolve) but ~0 within-task selection headroom. The gpt-5 null (oracle 2.5%) was a worker artifact (no JSON emission); gpt-4.1 fixes the band (mean 36.1%) but the selector gate stays flat. commit0 is the right Layer-1 domain (different impls pass different test subsets → real within-task spread).
- UNTESTED (still Gate A): parallel DIVERSE strategies (different reasoning paths, `directives.ts` → `DIVERSE_STRATEGY_LENSES`/`composeStrategies`) @k vs blind sample(n=k). A distinct family from what rung-0 falsified: the open within-run question, and what `runProgram`'s `parallel` is built to deploy.
- UNBUILT (Gate B): the across-run policy-improvement curve on a multi-objective task stream. No harness runs it yet; it is the durable next step, not a corpus-replay over the existing single-objective records.
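As a rough illustration of the verifier-selector comparison described in the list above (best-of-k vs mean-of-k with a paired bootstrap CI), here is a minimal sketch. It is not the code in `selector.ts`; the function names, the seeded generator, and the sample scores are all assumptions made for the example.

```typescript
// k deployable-checker scores for one task (hypothetical shape).
type TaskScores = number[];

/** Small seeded PRNG (mulberry32 style) so the bootstrap is reproducible. */
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

/** Per-task paired difference: selector (best-of-k) minus random (mean-of-k). */
function perTaskDelta(scores: TaskScores): number {
  const best = Math.max(...scores);
  const mean = scores.reduce((s, x) => s + x, 0) / scores.length;
  return best - mean;
}

/** Percentile bootstrap over tasks: resample the per-task deltas with replacement. */
function pairedBootstrapCI(deltas: number[], resamples = 2000, seed = 1, alpha = 0.05) {
  const rand = seededRandom(seed);
  const n = deltas.length;
  const means: number[] = [];
  for (let b = 0; b < resamples; b++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += deltas[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const lo = means[Math.floor((alpha / 2) * resamples)];
  const hi = means[Math.min(resamples - 1, Math.floor((1 - alpha / 2) * resamples))];
  const mean = deltas.reduce((s, x) => s + x, 0) / n;
  return { mean, lo, hi };
}

// A flat domain: every attempt on a task scores the same, so selection has no headroom.
const flatCorpus: TaskScores[] = [[1, 1, 1], [0, 0, 0], [0.5, 0.5, 0.5]];
// A domain with within-task spread: different attempts earn different scores.
const spreadCorpus: TaskScores[] = [[0, 1, 0], [1, 1, 0], [0, 0, 1]];

const flatCI = pairedBootstrapCI(flatCorpus.map(perTaskDelta));
const spreadCI = pairedBootstrapCI(spreadCorpus.map(perTaskDelta));
```

In the flat corpus every per-task delta is exactly zero, so the interval collapses to zero; only the corpus with within-task spread gives the selector anything to pick between.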
Every run's full artifact is committed under agent-lab/runs/<date>/ — self-describing JSON
(models + config + per-task cells + gate verdicts with CIs), portable to any repo without
this codebase. agent-lab/runs/RUNS.md is the index mapping artifacts → verdicts → the
findings gist. Set OUT=runs/<date>/<name>.json (in agent-lab) on every run — never /tmp
(a reboot erases ramdisk; ~20 runs nearly died there once).
rollout (worker → answer) → adapter.judge (valid?) → CORPUS RunRecord (k attempts, output+valid each) → corpus-replay --selector (pick WITHOUT the judge) → corpus-report CI → gate verdict
The expensive part (rollouts) produces a reusable corpus; selection + stats are free
and offline (zero new rollouts, zero judge calls).
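The offline half of that pipeline can be sketched as follows, assuming a record holds k attempts with an `output` and a stored `valid` flag. The interfaces, the majority-vote selector, and the sample records are illustrative assumptions, not the actual `corpus-replay.mts` code.

```typescript
// Hypothetical corpus shapes: the judge verdict is stored once, at rollout time.
interface Attempt {
  output: string;
  valid: boolean;
}
interface RunRecord {
  taskId: string;
  attempts: Attempt[]; // assumed non-empty
}

const normalize = (s: string): string => s.trim().toLowerCase();

/** Self-consistency pick: the attempt whose normalized output has the most votes.
 *  It reads only `output`, never `valid`, so it stays a non-oracle selector. */
function selfConsistencyPick(attempts: Attempt[]): Attempt {
  const votes = new Map<string, number>();
  for (const a of attempts) {
    const key = normalize(a.output);
    votes.set(key, (votes.get(key) ?? 0) + 1);
  }
  let best = attempts[0];
  let bestVotes = 0;
  for (const a of attempts) {
    const v = votes.get(normalize(a.output)) ?? 0;
    if (v > bestVotes) {
      best = a;
      bestVotes = v;
    }
  }
  return best;
}

/** Replays a corpus with no new rollouts and no judge calls. */
function replay(records: RunRecord[]) {
  let selector = 0;
  let random = 0;
  let oracle = 0;
  for (const r of records) {
    const validCount = r.attempts.filter((a) => a.valid).length;
    selector += selfConsistencyPick(r.attempts).valid ? 1 : 0; // selector@k
    random += validCount / r.attempts.length; // random@k = expected value of a random draw
    oracle += validCount > 0 ? 1 : 0; // oracle@k = any attempt valid
  }
  const n = records.length;
  return { selector: selector / n, random: random / n, oracle: oracle / n };
}

const sampleCorpus: RunRecord[] = [
  { taskId: "t1", attempts: [{ output: "42", valid: true }, { output: "42", valid: true }, { output: "41", valid: false }] },
  { taskId: "t2", attempts: [{ output: "a", valid: true }, { output: "b", valid: false }, { output: "b", valid: false }] },
];
const summary = replay(sampleCorpus);
```

On the second sample task the consensus answer is the wrong one, which is the failure mode behind a selector that does not beat a random draw.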
The optimization layer ships from the package; bench scripts compose it. A domain = an
Environment (5 hooks); a strategy = how budget is spent to beat its check; runBenchmark
returns per-strategy means + the per-task LOSSES table + the (score,$) Pareto frontier.
Promotion is the package gate (promotionGate — seeded paired bootstrap, evidence floor,
two modes: superiority and non-inferiority = score CI low > −tolerance AND cost
savings CI low > 0, the "same quality, cheaper" gate; verdicts carry paired Δlatency).
Authoring is authorStrategy (named fallbackModel retry). Funnel-alignment law: the
search-side champion tie-band must be no stricter than the gate's tolerance (under
OBJECTIVE=cost it defaults to it). Endurance envs on the evolve runner: CHECKPOINT=path
(phase ledger + resume — a killed run re-pays ONE phase), GYM_RECREATE='docker …'
(recreate the container at phase boundaries — the wedge killer). Observability:
createWaterfallCollector (every spawn billed+timed) + anytimeReport (TTT / shots-to-
target / COCO ERT / hill-climb AUC per satisficing target). Models policy: cheap router
models only (defaults deepseek-v4-pro/deepseek-v4-flash; compressor = flash with
gpt-4o-mini fallback) — never CC models; every verdict banner + artifact is
self-describing (models + config).
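The non-inferiority rule stated above (score CI low > −tolerance AND cost savings CI low > 0) can be written down as a small decision function. This is a sketch, not the package's `promotionGate`: the types, the `PROMOTE` label, the `minPairs` evidence floor parameter, and the superiority rule (score CI low > 0) are assumptions for illustration.

```typescript
interface Interval {
  lo: number;
  hi: number;
}
type Verdict = "PROMOTE" | "HOLD";

// Hypothetical evidence bundle produced by a paired bootstrap over tasks.
interface PairedEvidence {
  pairs: number; // number of paired tasks behind the intervals
  scoreDelta: Interval; // CI of candidate score minus incumbent score
  costSavings: Interval; // CI of incumbent cost minus candidate cost (positive = cheaper)
}

/** Assumed superiority rule: the whole score interval sits above zero. */
function superiority(e: PairedEvidence, minPairs: number): Verdict {
  if (e.pairs < minPairs) return "HOLD"; // evidence floor not met
  return e.scoreDelta.lo > 0 ? "PROMOTE" : "HOLD";
}

/** Non-inferiority as stated in the text: quality held within tolerance AND provably cheaper. */
function nonInferiority(e: PairedEvidence, tolerance: number, minPairs: number): Verdict {
  if (e.pairs < minPairs) return "HOLD"; // evidence floor not met
  const qualityHeld = e.scoreDelta.lo > -tolerance;
  const cheaper = e.costSavings.lo > 0;
  return qualityHeld && cheaper ? "PROMOTE" : "HOLD";
}

// A candidate that is not clearly better, but is no worse within tolerance and is cheaper.
const candidate: PairedEvidence = {
  pairs: 24,
  scoreDelta: { lo: -0.01, hi: 0.03 },
  costSavings: { lo: 0.02, hi: 0.08 },
};

const bySuperiority = superiority(candidate, 20); // "HOLD"
const byNonInferiority = nonInferiority(candidate, 0.02, 20); // "PROMOTE"
```

The same candidate holds under one mode and promotes under the other, which is why the search-side tie-band has to be no stricter than the gate's tolerance.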
| entry point | what it answers | one-liner |
|---|---|---|
| the research lines | the flywheel/evolution runs, σ×κ factor grid, steering hypercube, model matrix, E3 certified memory, depth-vs-breadth, corpus A/Bs — moved to tangle-network/agent-lab (private) with the EOPS/math domains and the run archive | ~/code/agent-lab — map in its README |
| `src/commit0-env-run.mts` | the HARD domain (implement whole libraries vs their test suites) through runBenchmark | `IDS=commit-0/wcwidth BUDGET=3 INNER_TURNS=10 tsx src/commit0-env-run.mts` |
| `src/examples/strategy-demo.mts` | the 3-layer API demo (gym-free) | `WORKER_MODEL=gpt-4o-mini tsx src/examples/strategy-demo.mts` |
| `src/examples/math-demo.mts` | any-domain proof: math via createVerifierEnvironment (the tax/legal/gtm answer-shape) | `BUDGET=3 tsx src/examples/math-demo.mts` |
EOPS standup (one container): docker run -d --rm --name eops -p 8006:8005 shivakrishnareddyma225/enterpriseops-gym-mcp-itsm:latest + EOPS_GYM_DBS_DIR=<unzipped gym_dbs.zip from github.com/ServiceNow/EnterpriseOps-Gym>; restart it FRESH per big run
(it wedges under load); EOPS_SPLIT=csm|hr|… selects other domains (their gym containers
not yet sourced). Parallel lanes: tasks carry the dataset's literal gym URL
(http://localhost:8006); EOPS_GYM_URL=http://localhost:8007 rebases every server URL,
so N concurrent runs use N containers (-p 8007:8005, -p 8008:8005, …) instead of
serializing on one wedge-prone gym. Bring-up check: agent-lab/domains/lane-probe.mts.

Cross-cutting laws baked into the suite: keep-best checkpoint scoring
(final-state scoring is biased −6–8pp), equal compute via the conserved pool, the analyst
is firewalled (trace-only), costs are real (router usage → {usd, ms, tokens}).
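The parallel-lane rebasing described for `EOPS_GYM_URL` above amounts to swapping the origin of each task's literal gym URL while keeping its path. A minimal sketch, assuming the override is passed in as a plain string (in practice it would be read from the environment); `rebaseGymUrl` and the `/mcp` path are hypothetical.

```typescript
/** Rebase a task's literal gym URL onto another lane's origin, keeping path and query.
 *  `override` stands in for the EOPS_GYM_URL value; when it is unset the URL is untouched. */
function rebaseGymUrl(taskUrl: string, override?: string): string {
  if (!override) return taskUrl;
  const rebased = new URL(taskUrl);
  const lane = new URL(override);
  rebased.protocol = lane.protocol;
  rebased.hostname = lane.hostname;
  rebased.port = lane.port;
  return rebased.toString();
}

// The dataset's literal URL points at the first container.
const literalUrl = "http://localhost:8006/mcp";

// Two concurrent runs, each pinned to its own container.
const laneA = rebaseGymUrl(literalUrl, "http://localhost:8007");
const laneB = rebaseGymUrl(literalUrl, "http://localhost:8008");
const unchanged = rebaseGymUrl(literalUrl);
```

Each run then talks only to its own container, so a wedge in one lane does not stall the others.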
- Relevance-primed corpus A/B: `PRIME_MODE=relevance K_FACTS=2 N=16 HOLDOUT=4` (the read-side design that survived the naive-priming negative).
- Strategy tournament at power: RAN (n=24, budget 4, ×3 configs): HOLD verdicts; the cost-frontier finding ×3 + the funnel-alignment law came out of these. Live ledger: `.evolve/current.json` + the findings gist.
- Commit0 at real budget: `BUDGET=3 INNER_TURNS=12 N=3` sample-vs-refine on the hard domain.
- Cross-domain replication: blocked on sourcing the csm/hr gym containers (`EOPS_SPLIT` is wired).
run.ts: help · preflight · verify-judge · solve-one · solve-one-local · solve-cad · solve-browser · ui-review · batch-blind · batch-oracle · batch-compare

Standalone tools (NOT in run.ts; the gate lives here):
- corpus-replay.mts --selector: selector@k vs random@k vs oracle@k over a corpus (THE offline gate)
- corpus-report.mts: paired-bootstrap CI + Benjamini-Hochberg over corpora
- improve-prompt.ts: GEPA-optimize a directive vs a held-out gate + paired CI (selfImprove)
- finsearch-loop.ts: the real runLoop+createDriver closed loop on FinSearchComp
- terminal-compare.ts: Terminal-Bench compare (own main, not in run.ts)

Unit tests (the only fully-green, cred-free runnable surface besides offline replay): node --test --import tsx src/{selector,compare-decomp,steering-experiment,refine-loop}.test.mts
cd bench
pnpm gate # = corpus-replay.mts corpus/finsearch.jsonl --selector
tsx src/corpus-replay.mts corpus/finsearch.jsonl --selector --condition=refine # other arms
tsx src/corpus-replay.mts <corpus.jsonl> --selector=verifier # GRADED domains: rank k by deployable-checker score
pnpm gate-report # paired-bootstrap CI + BH-FDR
--selector=verifier is for corpora whose attempts carry a continuous score (commit0
pytest pass-rate / aec verify.py partial credit) and where text doesn't cluster: it ranks by
the deployable checker (argmax score) and reports selector vs random with a paired bootstrap CI.
It needs WITHIN-TASK score spread to move — flat on aec (closed-form), live on commit0 (code).
The committed corpus/finsearch.jsonl (152 records: random@3 / refineHand@3 / refineGepa@3)
makes the gate replayable with no rollouts. To gate the DIVERSE arm you must first generate
a diverse-strategy corpus (k different composeStrategies prefixes per instance) — that
generator is the in-progress work; the identical-directive control corpus is batch-oracle.
cd bench
export TANGLE_API_KEY=… # router + the deployable judge
BENCH=enterpriseops-gym EOPS_FIXTURES=1 N=20 K=4 pnpm keystone-gate
keystone-gate-cli.mts → runKeystoneGate (src/keystone-gate.ts): a Persona + the generic
fanout combinator over the budget-conserving Supervisor. Blind = K identical children, diverse
= K distinct strategy directives — equal-k by construction (conserved pool), proven by
equalKOnCost. The DEPLOYABLE selector is the benchmark's OWN adapter.judge (each child solves
via the router, is graded by the runnable checker, and that BenchScore is the child's verdict
defaultSelectWinner ranks on — selector ≠ oracle/LLM-judge). Pick a deployable-checker bench
(enterpriseops-gym / swe-bench / terminal-bench), NOT finsearchcomp (LLM-judge → not deployable).
Offline plumbing test (no creds): tsx src/keystone-gate.test.mts. This is the two-runtime
reconciliation — the gate now runs through the SAME recursive atom every personified loop uses.
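The blind-vs-diverse arm construction described above can be sketched as plain prompt assembly: the same K on both sides, differing only in the directive prefix. This is not `runKeystoneGate`; `buildArms`, the `Arm` shape, and the lens strings are hypothetical stand-ins for the real directives.

```typescript
interface Arm {
  name: "blind" | "diverse";
  prompts: string[]; // one prompt per child rollout
}

/** Build both arms at the same k: blind repeats one prompt, diverse varies the prefix. */
function buildArms(task: string, baseDirective: string, lenses: string[], k: number): { blind: Arm; diverse: Arm } {
  if (lenses.length < k) {
    throw new Error(`need ${k} distinct lenses, got ${lenses.length}`);
  }
  const blindPrompt = `${baseDirective}\n\n${task}`;
  const blind: Arm = {
    name: "blind",
    prompts: Array.from({ length: k }, () => blindPrompt), // K identical children
  };
  const diverse: Arm = {
    name: "diverse",
    // K distinct strategy directives layered on the shared base directive.
    prompts: lenses.slice(0, k).map((lens) => `${baseDirective}\n${lens}\n\n${task}`),
  };
  return { blind, diverse };
}

// Placeholder lenses for illustration only; the real ones live in directives.ts.
const exampleLenses = ["Work backwards from the goal.", "Enumerate edge cases first.", "Solve a simpler version first.", "Check each step against the tool output."];
const arms = buildArms("Close the stale incident tickets.", "You are a careful operator.", exampleLenses, 4);
```

Because the task text and the judge are identical across arms, any difference in outcome is attributable to the prefix alone.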
BENCH=hotpotqa HOTPOTQA_FIXTURES=1 RESEARCH=1 CORPUS=/tmp/identical.jsonl K=4 tsx src/run.ts batch-oracle 30
tsx src/corpus-replay.mts /tmp/identical.jsonl --selector
(hotpotqa is cheap + deterministic-judge but near-ceiling/weak-signal; simpleqa similar; finsearchcomp is the strong-signal domain but needs the sandbox/local-web worker.)
BENCH=hotpotqa RESEARCH=1 ROUTER_KEY=… tsx src/improve-prompt.ts # POP/GENS/TRAIN_N/HOLDOUT_N envs
GEPA optimizes the shared base directive; the diverse lenses (directives.ts) layer on top.
- RESEARCH=1 → local opencode, model-knowledge QA (cheap; works today, conc≤2)
- SANDBOX=1 → prod-sandbox web-search worker (FinSearchComp real path; historically infra-flaky)
- default → local code-patch worker (SWE-bench; judge needs bench/.venv + Docker)

The steer text lives in directives.ts, NOT in the worker (the worker is substrate). A strategy is a prompt PREFIX; the judge is unchanged.
The code-benches share benchmarks/_harness.ts (stage artifact → run the bench's OWN evaluator
in a .venv/Docker subprocess → parse its JSON report → {resolved,score}). No per-adapter
copy of the process/venv/Docker/temp/report plumbing; commit0+appworld also share its
stdin-piping runner (runVenvScriptStdin).
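The last step of that shared flow (parse the evaluator's JSON report into `{resolved,score}`) might look roughly like the sketch below. It is not the code in `benchmarks/_harness.ts`: the report field names (`passed`, `xfail`, `total`) and the rule that `resolved` means a perfect score are assumptions, chosen to match a (passed+xfail)/total grading.

```typescript
interface BenchScore {
  resolved: boolean;
  score: number;
}

/** Turn an evaluator's JSON report into a score, throwing rather than inventing a zero. */
function scoreFromReport(rawJson: string): BenchScore {
  const report: unknown = JSON.parse(rawJson);
  if (typeof report !== "object" || report === null) {
    throw new Error("evaluator report is not a JSON object");
  }
  // Hypothetical report fields; xfail defaults to 0 when the evaluator omits it.
  const { passed, xfail = 0, total } = report as { passed?: unknown; xfail?: unknown; total?: unknown };
  if (typeof passed !== "number" || typeof xfail !== "number" || typeof total !== "number") {
    throw new Error("evaluator report is missing numeric passed/total fields");
  }
  if (total <= 0) {
    throw new Error("evaluator report has no tests; refusing to fabricate a score");
  }
  const score = (passed + xfail) / total;
  return { resolved: score === 1, score }; // assumption: resolved means every test passed
}

const partial = scoreFromReport('{"passed": 3, "xfail": 1, "total": 8}');
const full = scoreFromReport('{"passed": 5, "total": 5}');
```

A malformed or empty report raises instead of returning a zero, in keeping with the fail-loud rule for adapters.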
- Real, runnable with ZERO extra deps: finsearchcomp (GitHub dataset + fixtures + LLM judge; the gate bench), hotpotqa + simpleqa + frames (HF/web QA + F1/LLM judge; `*_FIXTURES=1` offline), aec-bench (real GitHub task tree + fixtures; judge = the task's own `tests/verify.py` over python3 stdlib: deterministic, graded per-field partial credit, no Docker, no LLM → the candidate non-oracle correctable-middle-band bench for the open gate).
- Real code, needs an external harness/tools to run (fail loud with the exact install/Docker fix; never a fabricated score): swe-bench + terminal-bench (`bench/.venv` + Docker), commit0 (ISOLATED `bench/.venv-commit0` via `python3 -m venv bench/.venv-commit0 && bench/.venv-commit0/bin/pip install commit0 datasets`, because its deps conflict with the shared `.venv`; override dir with `COMMIT0_VENV`; plus Docker; judge = official pytest harness, graded (passed+xfail)/total; the rollout prompt stages in-box (clones `commit-0/<repo>@base_commit`, emits `git diff`); `COMMIT0_FIXTURES=1` for offline listing), programbench (`pip install programbench` + Docker on linux/amd64 + HF blobs; judge = official cleanroom eval, graded passed/total; `PROGRAMBENCH_FIXTURES=1` offline), appworld (`pip install appworld` + `appworld install` + `appworld download data`; judge = AppWorld's own `world.evaluate()`, graded passes/num_tests; NO committed fixture: task data exists only after `download data`, so loadTasks fails loud rather than fabricate a task), mind2web, cad-design + cadbench + cadgenbench (openscad/blender/build123d).
- goldArtifact: aec-bench returns the task's real `golden_pass.md` (verify-judge works fully offline). commit0 / programbench / appworld return `undefined`: the oracle is a git ref / stripped source / engine-bundled solution, not a portable string; judge correctness is proven by a real solve through the harness, not a synthetic gold (documented + fail-loud, not a fake).
- Absent (not built): swe-gym, swe-bench-multimodal, and the rest of the survey set.

Every unbuilt/scaffold adapter fails LOUD (throws with the integration step) rather than faking a score, so there are no silent zeros in any corpus. Offline fixture tests: `benchmarks/{aec-bench,commit0,programbench,appworld}.test.mts` (`tsx --test`).
tsx src/run.ts help # the real command list (source of truth)
tsx src/run.ts preflight # harness/worker reachable for BENCH?
Creds: the router/sandbox paths read ROUTER_KEY/SANDBOX_KEY (+ ROUTER_BASE/SANDBOX_BASE_URL)
from the environment. Source them from the operator's private secret store (documented in the
global agent config, NOT here — this repo is public) into the run process; never print them.
Creds are NOT needed for the offline selector gate, the hotpotqa/swe-bench deterministic judges, or
RESEARCH=1 local-opencode rollouts. If they are unset, only the router/sandbox paths are blocked, and they are cred-blocked, not code-blocked.
run.ts help is now real (the command map). Next: lift the standalone tools into a single
command registry + a test asserting every cmd === 'X' and every package.json script
appears in help. Then help IS the map and this page is just the narrative.
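The coverage check proposed above could be as small as the sketch below: given the help text and a list of command names, report which names the help omits. `missingFromHelp` is a hypothetical helper, and the command list shown is a shortened sample, not the real registry.

```typescript
/** Return the commands that do not appear as a whole token in the help text. */
function missingFromHelp(helpText: string, commands: string[]): string[] {
  const tokens = helpText.split(/\s+/);
  return commands.filter((cmd) => tokens.indexOf(cmd) === -1);
}

// Sample help output in the same "a · b · c" style as the command list.
const sampleHelp = "help · preflight · verify-judge · solve-one · batch-oracle";

// Names gathered from the dispatcher and from package.json scripts (sample values).
const registered = ["help", "preflight", "solve-one", "gate"];

const undocumented = missingFromHelp(sampleHelp, registered);
```

A test would assert that the returned list is empty. Matching whole tokens rather than substrings keeps `solve-one` from being satisfied by a longer name such as `solve-one-local`.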