RL bridge primitives + worked examples + downstream integrations.
What's in 0.23
RL bridge — @tangle-network/agent-eval/rl (new subpath)
9 stable primitives:
- run-record-adapters: convert legacy optimization output (TrialResult, VerificationReport, VariantAggregate) → canonical RunRecord[]
- verifiable-reward: extract clean reward signals; distinguishes 'deterministic' (compile/test/schema/sandbox) from 'probabilistic' (judge)
- preferences: DPO/PPO/KTO (chosen, rejected) triples with three documented strategies
- off-policy: IPS, SNIPS, doubly-robust estimators (Dudík–Langford–Li 2011; Owen 2013 SE)
- process-reward: step-level credit assignment; PRM training data shape (Lightman et al. 2023)
- contamination: held-out perturbation probe via paired Wilcoxon
- tournament: Hunter's MM Bradley-Terry + online Elo
- adversarial: hill-climb scenario search
- compute-curves: runComputeCurve, bestOfN, selfConsistency, Pareto frontier (Snell et al. 2024)
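To make the off-policy entry concrete, here is a minimal sketch of the IPS and SNIPS estimators. The `LoggedSample` shape and the function names are assumptions made for illustration; they are not the package's actual exports, and the doubly-robust estimator is omitted.

```typescript
// Hypothetical sample shape for illustration only (not the package's types).
interface LoggedSample {
  reward: number;       // observed reward for the logged action
  behaviorProb: number; // probability the logging policy assigned to that action
  targetProb: number;   // probability the target policy assigns to the same action
}

// Importance weight w = pi_target(a | x) / pi_behavior(a | x).
function importanceWeight(s: LoggedSample): number {
  if (s.behaviorProb <= 0) {
    throw new Error("behaviorProb must be positive");
  }
  return s.targetProb / s.behaviorProb;
}

// IPS: mean of w_i * r_i. Unbiased when the logged propensities are correct,
// but the variance grows with the size of the weights.
function ips(samples: LoggedSample[]): number {
  if (samples.length === 0) throw new Error("no samples");
  let total = 0;
  for (const s of samples) total += importanceWeight(s) * s.reward;
  return total / samples.length;
}

// SNIPS: sum(w_i * r_i) / sum(w_i). Self-normalizing keeps the estimate inside
// the observed reward range at the cost of a small bias.
function snips(samples: LoggedSample[]): number {
  if (samples.length === 0) throw new Error("no samples");
  let num = 0;
  let den = 0;
  for (const s of samples) {
    const w = importanceWeight(s);
    num += w * s.reward;
    den += w;
  }
  if (den === 0) throw new Error("all importance weights are zero");
  return num / den;
}
```

With two samples that both have reward 1 and weight 2, `ips` returns 2 while `snips` returns 1, which illustrates why the self-normalized form is often preferred for bounded rewards.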
7 experimental primitives (interfaces marked experimental in barrel):
- active-curriculum: Neyman optimal allocation + Thompson sampling
- reward-hacking: 4-signal Goodhart watchdog (Krakovna/Skalse/Kim)
- adaptation-eval: k-shot adaptation curves
- exporters: DPO/GRPO/SFT/PRM/step-rewards JSONL
- rl-campaign: top-level orchestrator wrapping runEvalCampaign + RL bridge
- auto-research: analyzeOptimizationResult, the unification primitive
- predictive-validity-researcher: concrete Researcher interface implementation
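The Neyman allocation named under active-curriculum can be sketched as follows: each stratum (for example, a scenario group) receives a share of the evaluation budget proportional to its weight times its score standard deviation. The `Stratum` type and `neymanAllocation` function are hypothetical names for this sketch, not the package's API, and the Thompson sampling half is not shown.

```typescript
// Hypothetical stratum description for illustration only.
interface Stratum {
  id: string;
  weight: number; // population size or share of this stratum
  stdDev: number; // estimated standard deviation of scores in this stratum
}

// Neyman allocation: n_h = budget * (N_h * sigma_h) / sum_k (N_k * sigma_k).
// Returns fractional allocations; rounding to whole runs is left to the caller.
function neymanAllocation(strata: Stratum[], budget: number): Map<string, number> {
  const allocation = new Map<string, number>();
  if (strata.length === 0) return allocation;

  let total = 0;
  for (const s of strata) total += s.weight * s.stdDev;

  for (const s of strata) {
    // If every stratum has zero variance, fall back to an even split.
    const share = total > 0 ? (s.weight * s.stdDev) / total : 1 / strata.length;
    allocation.set(s.id, budget * share);
  }
  return allocation;
}

// Two equally sized strata; the noisier one gets three times the budget.
const plan = neymanAllocation(
  [
    { id: "easy", weight: 1, stdDev: 1 },
    { id: "hard", weight: 1, stdDev: 3 },
  ],
  100,
);
// plan: easy → 25, hard → 75
```

The design point is that budget flows to strata where additional samples reduce estimator variance the most, instead of being spread evenly.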
RunRecord.scenarioId — canonical optional field
Populated automatically by runEvalCampaign and the optimization adapters. Closes the fragility flagged in the 0.23 audit.
Worked examples
- examples/auto-research-with-agent-builder/: runnable demo of the closed loop. Synthetic agent-builder driver iterates 4 generations; score climbs 0.739 → 0.973.
- examples/fine-tune-with-prime-rl/: concrete prime-rl SFT integration. Filter RunRecord[] to high-quality, project via toSftRows, write 15-line TOML, run uv run sft @ .... ~150 LoC of glue.
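The filter-then-project step of the prime-rl example can be pictured roughly as below. This is a sketch under stated assumptions: `SketchRunRecord`, `SketchSftRow`, `selectHighQuality`, and `projectToSftRows` are hypothetical stand-ins with made-up fields, since the real RunRecord shape and the toSftRows signature are not shown in these notes.

```typescript
// Hypothetical, simplified record shape; the package's RunRecord differs.
interface SketchRunRecord {
  scenarioId?: string;
  prompt: string;
  completion: string;
  score: number; // assumed to lie in [0, 1]
}

// Hypothetical SFT row shape for this sketch.
interface SketchSftRow {
  prompt: string;
  completion: string;
}

// Keep only records at or above a quality threshold.
function selectHighQuality(records: SketchRunRecord[], minScore: number): SketchRunRecord[] {
  return records.filter((r) => r.score >= minScore);
}

// Stand-in for the projection step: drop everything except the training pair.
function projectToSftRows(records: SketchRunRecord[]): SketchSftRow[] {
  return records.map((r) => ({ prompt: r.prompt, completion: r.completion }));
}

// Serialize rows as JSONL: one JSON object per line.
function toJsonl(rows: SketchSftRow[]): string {
  return rows.map((row) => JSON.stringify(row)).join("\n");
}

const records: SketchRunRecord[] = [
  { scenarioId: "s1", prompt: "p1", completion: "c1", score: 0.9 },
  { scenarioId: "s2", prompt: "p2", completion: "c2", score: 0.4 },
];

// Only the 0.9-scored record survives the 0.8 threshold.
const jsonl = toJsonl(projectToSftRows(selectHighQuality(records, 0.8)));
```

The resulting JSONL string would then be written to disk and referenced from the training config; the threshold of 0.8 is an arbitrary choice for the sketch.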
Architecture docs
- docs/three-package-architecture.md: agent-eval × agent-knowledge × agent-runtime contracts
- docs/auto-research-loop-end-to-end.md: composition pattern with explicit invariants
Downstream integrations (separate repos, all PRs open)
- agent-knowledge tangle-network/agent-knowledge#5: clean bump, 12/12 tests pass
- agent-runtime tangle-network/agent-runtime#3: clean bump + scenarioId backfill, 16/16 tests pass
- agent-builder tangle-network/agent-builder#130: bump + RL bridge wired into runAutoResearchCycle. Every auto-research cycle now produces canonical RunRecord[], preference triples, reward-hacking verdict, and sequential interim verdict on the events stream. 826/826 tests pass.
Numbers
- 1017 / 1017 tests passing on agent-eval main (+150 cumulative since 0.21)
- typecheck + build clean
- dist/rl.{js,d.ts} entry emits
Version lockstep
- npm: @tangle-network/agent-eval@0.23.0
- PyPI: agent-eval-rpc==0.23.0
References
Dudík/Langford/Li 2011 (DR), Owen 2013 (SNIPS), Hunter 2004 (BT MM), Lightman 2023 (PRM), Snell 2024 (test-time compute), plus the 0.21/0.22 foundational citations.