v0.23.0 — RL primitives + auto-research worked example

drewstone released this 08 May 23:04
· 49 commits to main since this release
6c124e7

RL bridge primitives + worked examples + downstream integrations.

What's in 0.23

RL bridge — @tangle-network/agent-eval/rl (new subpath)

9 stable primitives:

  • run-record-adapters — convert legacy optimization output (TrialResult, VerificationReport, VariantAggregate) → canonical RunRecord[]
  • verifiable-reward — extract clean reward signals; distinguishes 'deterministic' (compile/test/schema/sandbox) from 'probabilistic' (judge)
  • preferences — DPO/PPO/KTO (prompt, chosen, rejected) triples with three documented strategies
  • off-policy — IPS, SNIPS, doubly-robust estimators (Dudík–Langford–Li 2011; Owen 2013 SE)
  • process-reward — step-level credit assignment; PRM training data shape (Lightman et al. 2023)
  • contamination — held-out perturbation probe via paired Wilcoxon
  • tournament — Hunter's MM Bradley-Terry + online Elo
  • adversarial — hill-climb scenario search
  • compute-curves — runComputeCurve, bestOfN, selfConsistency, Pareto frontier (Snell et al. 2024)
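
To make the off-policy entry above concrete, here is a minimal sketch of the textbook IPS and SNIPS estimators. It does not use the package's API: the LoggedSample shape and the function names ipsEstimate and snipsEstimate are hypothetical, chosen only for illustration.

```typescript
// Hypothetical logged-sample shape for illustration; not the package's actual type.
interface LoggedSample {
  reward: number;       // observed reward for the logged action
  behaviorProb: number; // probability the logging policy assigned to that action
  targetProb: number;   // probability the target policy assigns to the same action
}

// Importance weight w = targetProb / behaviorProb for one sample.
function weight(s: LoggedSample): number {
  if (s.behaviorProb <= 0) {
    throw new Error("behaviorProb must be positive");
  }
  return s.targetProb / s.behaviorProb;
}

// IPS: (1/n) * sum(w_i * r_i).
function ipsEstimate(samples: LoggedSample[]): number {
  if (samples.length === 0) throw new Error("need at least one sample");
  let weightedReward = 0;
  for (const s of samples) weightedReward += weight(s) * s.reward;
  return weightedReward / samples.length;
}

// SNIPS: sum(w_i * r_i) / sum(w_i), i.e. IPS normalized by the total weight.
function snipsEstimate(samples: LoggedSample[]): number {
  let weightedReward = 0;
  let weightSum = 0;
  for (const s of samples) {
    const w = weight(s);
    weightedReward += w * s.reward;
    weightSum += w;
  }
  if (weightSum === 0) throw new Error("all importance weights are zero");
  return weightedReward / weightSum;
}

// Illustrative data only.
const logged: LoggedSample[] = [
  { reward: 1, behaviorProb: 0.25, targetProb: 0.5 }, // weight 2
  { reward: 0, behaviorProb: 0.5, targetProb: 0.5 },  // weight 1
];
console.log(ipsEstimate(logged), snipsEstimate(logged));
```

On this toy data the weights sum to 3 rather than 2, so the two estimators disagree: SNIPS divides by the weight total instead of the sample count, which typically lowers variance at the cost of a small bias.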

7 experimental primitives (interfaces marked experimental in barrel):

  • active-curriculum — Neyman optimal allocation + Thompson sampling
  • reward-hacking — 4-signal Goodhart watchdog (Krakovna/Skalse/Kim)
  • adaptation-eval — k-shot adaptation curves
  • exporters — DPO/GRPO/SFT/PRM/step-rewards JSONL
  • rl-campaign — top-level orchestrator wrapping runEvalCampaign + RL bridge
  • auto-research — analyzeOptimizationResult, the unification primitive
  • predictive-validity-researcher — concrete Researcher interface implementation
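
As a rough illustration of the Neyman allocation named in the active-curriculum entry, the sketch below splits an evaluation budget across strata in proportion to size times standard deviation. The Stratum shape, the neymanAllocation name, and the equal-split fallback are assumptions made for this example, not the package's interface.

```typescript
// Hypothetical stratum description; not the package's actual type.
interface Stratum {
  name: string;
  size: number;   // number of scenarios in the stratum (N_h)
  stdDev: number; // estimated score standard deviation in the stratum (sigma_h)
}

// Neyman allocation: n_h = budget * (N_h * sigma_h) / sum_k(N_k * sigma_k).
// Returns fractional allocations; rounding is left to the caller.
function neymanAllocation(strata: Stratum[], budget: number): Record<string, number> {
  let total = 0;
  for (const s of strata) total += s.size * s.stdDev;

  const allocation: Record<string, number> = {};
  for (const s of strata) {
    // Assumption: with no variance information at all, split the budget equally.
    const share = total > 0 ? (s.size * s.stdDev) / total : 1 / strata.length;
    allocation[s.name] = budget * share;
  }
  return allocation;
}

// Illustrative data only: the noisier stratum receives more of the budget.
const plan = neymanAllocation(
  [
    { name: "easy", size: 100, stdDev: 1 },
    { name: "hard", size: 100, stdDev: 3 },
  ],
  40,
);
console.log(plan);
```

With equal stratum sizes the split follows the standard deviations alone, so the "hard" stratum gets three times the trials of "easy".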

RunRecord.scenarioId — canonical optional field

Populated automatically by runEvalCampaign and the optimization adapters. Closes the fragility flagged in the 0.23 audit.
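
Because the field is optional, downstream code may still see records without it. The sketch below shows one way a consumer might group records by scenario while tolerating a missing scenarioId. RunRecordSketch is a hypothetical, reduced stand-in for the real RunRecord type, and the runId and score fields are assumptions.

```typescript
// Hypothetical, reduced record shape; only scenarioId being optional comes from the notes above.
interface RunRecordSketch {
  runId: string;
  score: number;
  scenarioId?: string;
}

// Bucket label for records that carry no scenarioId (an arbitrary choice for this sketch).
const UNKNOWN_SCENARIO = "(no scenarioId)";

// Group records by scenarioId, collecting untagged records under one fallback key.
function groupByScenario(records: RunRecordSketch[]): Record<string, RunRecordSketch[]> {
  const groups: Record<string, RunRecordSketch[]> = {};
  for (const r of records) {
    const key = r.scenarioId ?? UNKNOWN_SCENARIO;
    const bucket = groups[key] ?? [];
    bucket.push(r);
    groups[key] = bucket;
  }
  return groups;
}

// Illustrative data only.
const grouped = groupByScenario([
  { runId: "r1", score: 0.8, scenarioId: "scenario-a" },
  { runId: "r2", score: 0.6, scenarioId: "scenario-a" },
  { runId: "r3", score: 0.9 },
]);
console.log(Object.keys(grouped));
```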

Worked examples

  • examples/auto-research-with-agent-builder/ — runnable demo of the closed loop. Synthetic agent-builder driver iterates 4 generations; score climbs 0.739 → 0.973.
  • examples/fine-tune-with-prime-rl/ — concrete prime-rl SFT integration. Filter RunRecord[] down to high-quality records, project them via toSftRows, write a 15-line TOML config, then run uv run sft @ .... About 150 LoC of glue in total.
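
The filter-then-project step in the prime-rl example can be pictured with the following sketch. It is not the example's code: the record and row shapes, the score threshold, and the helper names (selectHighQuality, toSftRowsSketch, toJsonl) are all hypothetical, and toSftRowsSketch only stands in for the real toSftRows, whose signature is not shown in these notes.

```typescript
// Hypothetical shapes for illustration; the real RunRecord and SFT row types may differ.
interface ScoredRun {
  prompt: string;
  completion: string;
  score: number;
}

interface SftRowSketch {
  prompt: string;
  completion: string;
}

// Keep only records at or above a quality threshold (the threshold is an assumption).
function selectHighQuality(records: ScoredRun[], minScore: number): ScoredRun[] {
  return records.filter((r) => r.score >= minScore);
}

// Stand-in for the projection step: drop the score, keep the training pair.
function toSftRowsSketch(records: ScoredRun[]): SftRowSketch[] {
  return records.map((r) => ({ prompt: r.prompt, completion: r.completion }));
}

// Serialize rows as JSONL: one JSON object per line.
function toJsonl(rows: SftRowSketch[]): string {
  return rows.map((row) => JSON.stringify(row)).join("\n");
}

// Illustrative data only.
const runs: ScoredRun[] = [
  { prompt: "Add two numbers", completion: "return a + b;", score: 0.95 },
  { prompt: "Reverse a string", completion: "return s;", score: 0.2 },
];
const jsonl = toJsonl(toSftRowsSketch(selectHighQuality(runs, 0.9)));
console.log(jsonl);
```

Keeping the filter, the projection, and the serialization as separate functions makes it easy to swap the quality criterion without touching the output format.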

Architecture docs

  • docs/three-package-architecture.md — agent-eval × agent-knowledge × agent-runtime contracts
  • docs/auto-research-loop-end-to-end.md — composition pattern with explicit invariants

Downstream integrations (separate repos, all PRs open)

Numbers

  • 1017 / 1017 tests passing on agent-eval main (+150 cumulative since 0.21)
  • typecheck + build clean
  • dist/rl.{js,d.ts} entry emits

Version lockstep

  • npm @tangle-network/agent-eval@0.23.0
  • PyPI agent-eval-rpc==0.23.0

References

Dudík/Langford/Li 2011 (DR), Owen 2013 (SNIPS), Hunter 2004 (BT MM), Lightman 2023 (PRM), Snell 2024 (test-time compute), plus the 0.21/0.22 foundational citations.