
One scaffold, two modes: harness in core + rLLM AgentTrainer adapter #352

Merged
larstalian merged 7 commits into main from pedantic-bouman-8eed1d on Jun 25, 2026

@larstalian

Collaborator

The same framework-free agent loop now drives both evaluation and training — they differ only in the injected Sampler. This PR puts that loop in core and adds the first training-side adapter (rLLM's AgentTrainer).

Layers

  • Core harness (1e87387): run_agent / arun_rollouts — brief → sample an action → run it against the live world → observe → grade. AgentSandbox moves into openrange.core as the shell primitive; the task briefing folds out of examples/ into core. Domain-agnostic (boundary-clean).
  • rLLM adapter (03c1646): packages/openrange-rllm maps one rollout onto rLLM's Episode/Trajectory/Step, wraps the harness as an @rllm.rollout flow + an @rllm.evaluator, and ships a GatewaySampler that routes the policy through rLLM's gateway (reusing OpenRange's OpenAI-compatible backend) so rLLM captures token ids + logprobs. build_rllm_dataset_rows / snapshot_resolver are the tested dataset seam. examples/rllm_grpo_cyber.py wires a cyber world pool into AgentTrainer.
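To make the "differ only in the injected Sampler" point concrete, here is a minimal sketch of such a loop. It is not the code from this PR: `Sampler`, `Rollout`, `run_agent_sketch`, and `ScriptedSampler` are hypothetical names, and the world and grader are reduced to plain callables.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


class Sampler(Protocol):
    """Hypothetical interface: maps the transcript so far to the next action."""

    def sample(self, transcript: list[str]) -> str: ...


@dataclass
class Rollout:
    transcript: list[str] = field(default_factory=list)
    reward: float = 0.0


def run_agent_sketch(
    briefing: str,
    sampler: Sampler,
    execute: Callable[[str], str],
    grade: Callable[[list[str]], float],
    max_turns: int = 4,
) -> Rollout:
    rollout = Rollout(transcript=[briefing])          # brief
    for _ in range(max_turns):
        action = sampler.sample(rollout.transcript)   # sample an action
        if action == "submit":                        # assumed stop signal
            break
        observation = execute(action)                 # run it against the world
        rollout.transcript += [action, observation]   # observe
    rollout.reward = grade(rollout.transcript)        # grade
    return rollout


class ScriptedSampler:
    """Stand-in policy that replays a fixed action list, then submits."""

    def __init__(self, actions: list[str]) -> None:
        self._actions = iter(actions)

    def sample(self, transcript: list[str]) -> str:
        return next(self._actions, "submit")


demo = run_agent_sketch(
    "find the flag",
    ScriptedSampler(["curl /", "curl /flag"]),
    execute=lambda action: f"ran {action}",
    grade=lambda transcript: 1.0 if "ran curl /flag" in transcript else 0.0,
)
```

Swapping `ScriptedSampler` for a sampler that calls a training gateway or an evaluation endpoint leaves the loop untouched, which is the property the description relies on.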

Reward

The pack's existing dense subgoal ladder (reached_endpoint → extracted_anything → matched_flag) flows straight through to rLLM's EvalOutput; no reward logic is added, and the submission-gated success / observed-but-not-rewarded-leak stance is unchanged.
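As an illustration of what a dense subgoal ladder can look like, the sketch below scores an episode by how many consecutive rungs it climbed. The rung names come from the description above; the equal weighting and the "stop at the first missing rung" rule are assumptions, not the pack's actual scoring.

```python
# Rung names are taken from the PR description; the scoring rule is assumed.
LADDER = ("reached_endpoint", "extracted_anything", "matched_flag")


def ladder_reward(achieved: set[str]) -> float:
    """Return the fraction of consecutive rungs reached, counted from the bottom."""
    climbed = 0
    for rung in LADDER:
        if rung not in achieved:
            break  # a higher rung does not count without the ones below it
        climbed += 1
    return climbed / len(LADDER)
```

Under these assumptions, partial progress earns partial reward, so the trainer sees a learning signal before the policy ever matches a flag.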

Verification

  • 918 passed / 0 failed; coverage 86% (threshold: ≥80%). Harness: 100% branch coverage; rLLM adapter: 100% (measured locally, where rLLM is installed).
  • No mocks anywhere: real PROCESS-backed webapps reached over HTTP, a real reference-exploit policy, the real consequence verifier, and rLLM's real pydantic types.
  • Every rLLM contract was checked against the real source (rllm-org/rllm v0.3.0rc0) plus an adversarial review pass — all 8 integration points confirmed.
  • ruff, mypy (adapter added to CI scope), and the core/pack boundary check all pass.

Gated / out of scope

  • rLLM is not on PyPI → installed from source; the live AgentTrainer.train() needs a CUDA GPU (verl backend), so the example's training step is gated and not exercised in CI.
  • Trajectory / SFT-data synthesis is owned by a separate effort — deliberately not included here.

🤖 Generated with Claude Code

A gym owns worlds and the grade; add the thin agent loop that turns a
realized world into a graded trajectory. `run_agent` drives one episode
(brief -> sample an action -> run it against the live world -> observe
-> grade); training and evaluation share the loop and differ only in the
injected Sampler. `arun_rollouts` overlaps many episodes on one shared
EpisodeService.
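One way to overlap many episodes on a single shared service is `asyncio.gather` behind a concurrency limit. The sketch below assumes that shape; `SharedService`, `run_episode`, `arun_rollouts_sketch`, and `max_concurrency` are hypothetical names, not the signatures added in this commit.

```python
import asyncio


class SharedService:
    """Hypothetical stand-in for one episode service shared by all rollouts."""

    def __init__(self) -> None:
        self.started = 0

    async def run_episode(self, task_id: str) -> dict:
        self.started += 1
        await asyncio.sleep(0.01)  # stands in for I/O against a live world
        return {"task_id": task_id, "reward": 1.0}


async def arun_rollouts_sketch(
    service: SharedService, task_ids: list[str], max_concurrency: int = 4
) -> list[dict]:
    gate = asyncio.Semaphore(max_concurrency)  # cap in-flight episodes

    async def one(task_id: str) -> dict:
        async with gate:
            return await service.run_episode(task_id)

    # gather returns results in the order the tasks were submitted
    return await asyncio.gather(*(one(t) for t in task_ids))


service = SharedService()
results = asyncio.run(arun_rollouts_sketch(service, ["t1", "t2", "t3"]))
```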

- Move AgentSandbox into the core package (`openrange.core.sandbox`) as
  the shell primitive the agent acts through; reword its docs to stay
  domain-agnostic.
- Fold the task briefing (task + the live interface contract) out of
  examples into core so any harness can build it.

No mocks: the harness tests drive a real PROCESS-backed webapp over HTTP
with a real reference-exploit policy and the real consequence verifier.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@larstalian larstalian marked this pull request as draft June 24, 2026 15:22
Add `openrange-rllm`, a thin, import-light adapter onto rLLM's RL
trainer (OpenRange owns the world and grade; rLLM owns the loop):

- `agent_rollout_to_episode` maps one OpenRange rollout onto rLLM's
  Episode/Trajectory/Step (one step per turn, in call order).
- `make_rollout` wraps the harness as an `@rllm.rollout` flow;
  `make_evaluator` surfaces the verifier's grade as an `@rllm.evaluator`.
- `GatewaySampler` routes the policy through rLLM's gateway (reusing
  OpenRange's OpenAI-compatible backend), so rLLM captures token ids and
  logprobs and the train/eval split is just which endpoint it points at.
- `build_rllm_dataset_rows` / `snapshot_resolver` are the tested dataset
  seam (snapshot_id/task_id ride in Task.metadata).
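The "one step per turn, in call order" mapping can be pictured with stand-in dataclasses. These are deliberately not rLLM's real `Episode`/`Trajectory`/`Step` types, whose fields are not shown here; every name below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SketchStep:
    action: str
    observation: str


@dataclass
class SketchTrajectory:
    steps: list[SketchStep] = field(default_factory=list)
    reward: float = 0.0


@dataclass
class SketchEpisode:
    trajectories: list[SketchTrajectory] = field(default_factory=list)


def rollout_to_episode_sketch(
    turns: list[tuple[str, str]], reward: float
) -> SketchEpisode:
    """Map (action, observation) turns onto one trajectory, preserving order."""
    steps = [SketchStep(action=a, observation=o) for a, o in turns]
    return SketchEpisode(trajectories=[SketchTrajectory(steps=steps, reward=reward)])


episode = rollout_to_episode_sketch([("a1", "o1"), ("a2", "o2")], reward=0.5)
```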

The pack's dense subgoal reward flows straight through; no reward logic
is added. `examples/rllm_grpo_cyber.py` wires a cyber world pool into
`AgentTrainer` (CPU to build; `.train()` is the CUDA boundary).

Every rLLM contract was checked against the real source (rllm-org/rllm
v0.3.0rc0); rLLM core types are CPU-importable, so the tests use the
real types with no mocks. rLLM is not on PyPI, so it is installed from
source and the live trainer is gated; CI mypy covers the adapter.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@larstalian larstalian force-pushed the pedantic-bouman-8eed1d branch from 03c1646 to 36ac320 Compare June 24, 2026 16:15
larstalian and others added 4 commits June 24, 2026 19:57
End-to-end GPU validation (a 1.5B GRPO step completed over real cyber
worlds on an A100 via our make_rollout) surfaced two things the Mac
tests and the adversarial review both missed:

- agent_rollout_to_episode set Episode.termination_reason to a raw
  STRING, but rLLM's verl transform does `reason.value` (it expects the
  TerminationReason enum) and crashes on a str. Leave it unset — rLLM
  fills its own enum — and keep the value in artifacts. Regression test
  updated (it had asserted the buggy string).

- The example's documented run command was naive and would fail.
  Replace it with the validated single-GPU recipe and record the two
  gotchas that cost real debugging: LoRA needs the FLAT `lora_rank` key
  (the nested `lora.rank` is silently ignored -> full fine-tuning ->
  OOM), and OpenRange's Python 3.14 requirement (PEP 758 `except`
  syntax) conflicts with the verl GPU stack's Python 3.12-only wheels.
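The first failure mode is easy to reproduce in isolation: code that reads `.value` works on an enum member and raises on a plain string. `SketchTerminationReason`, its member, and `reason_label` are hypothetical stand-ins, not rLLM's actual enum or transform.

```python
from enum import Enum


class SketchTerminationReason(Enum):
    """Hypothetical stand-in for the enum the downstream transform expects."""

    ENV_DONE = "env_done"


def reason_label(reason) -> str:
    # Mirrors a consumer that assumes an enum: plain strings have no .value
    return reason.value


ok = reason_label(SketchTerminationReason.ENV_DONE)

try:
    reason_label("env_done")  # a raw string, as the adapter originally set
    crashed = False
except AttributeError:
    crashed = True
```

CPU-only tests that never call the transform cannot observe this crash, which is consistent with the bug surfacing only in the end-to-end GPU run.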

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The rLLM/verl GPU training stack (torch/vLLM/flash-attn) ships only
Python 3.12 wheels, but OpenRange pinned 3.14 and used the 3.14-only
PEP 758 unparenthesized `except A, B:` syntax in 8 spots — so OpenRange
and the trainer can't share a process and the training path can't run.

- Parenthesize those 8 excepts (behaviorally identical).
- Lower requires-python to >=3.12 across the workspace.
- Retarget ruff (py312) and mypy (3.12) so 3.14-only syntax is caught
  going forward.
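For reference, the parenthesized form is accepted by both interpreters and catches the same exceptions. The function below is a made-up example, not one of the 8 call sites.

```python
def parse_port(raw):
    """Return the port as an int, or None if raw is not a valid integer."""
    try:
        return int(raw)
    except (ValueError, TypeError):  # parenthesized: valid on 3.12 and 3.14
        # `except ValueError, TypeError:` is a SyntaxError before Python 3.14
        return None
```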

Verified: compiles on a real Python 3.12 interpreter, and the suite
still passes on 3.14 (957 passed, 88% coverage, boundary clean).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
# Conflicts:
#	packages/openrange-pack-sdk/src/openrange_pack_sdk/_runtime.py
one-scaffold-two-modes.md was a multi-PR north-star, not this PR's design —
and the PR diverged from it: it shipped the rLLM shim first, not the doc's
TRL-first sequence (sec 10), and the doc's line-refs (sec 6/9) were already
drifting (e.g. AgentSandbox is already in core). The shim's durable,
user-facing essence already lives in the openrange-rllm README (the thin
seam + the gateway logprob capture); add a pointer to the runnable example
and drop the broader two-modes framing that doesn't serve this shim. The
remaining roadmap (TRL re-point, async G1-G5, verl/skyrl shims) stays
trackable as issues.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@larstalian larstalian marked this pull request as ready for review June 25, 2026 03:27
Removed three comments in test_rllm_shim.py that just restated their
assertions (.rules: don't explain WHAT). Kept the shim's termination_reason
comment — it is a genuine why, and I verified the root fix is the wrong
trade: mapping our terminal_reason to rLLM's TerminationReason enum would
couple the import-light shim to rLLM's heavy engine import (numpy + the
rollout engine chain) and break the CPU-only path the adapter and its tests
rely on. The comment now records that constraint too.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@larstalian larstalian merged commit ec54646 into main Jun 25, 2026
2 checks passed
@larstalian larstalian deleted the pedantic-bouman-8eed1d branch June 25, 2026 03:42
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 25, 2026