One scaffold, two modes: harness in core + rLLM AgentTrainer adapter by larstalian · Pull Request #352 · vecna-labs/open-range

larstalian · 2026-06-24T15:18:27Z

The same framework-free agent loop now drives both evaluation and training — they differ only in the injected Sampler. This PR puts that loop in core and adds the first training-side adapter (rLLM's AgentTrainer).

Layers

Core harness (1e87387): run_agent / arun_rollouts — brief → sample an action → run it against the live world → observe → grade. AgentSandbox moves into openrange.core as the shell primitive; the task briefing folds out of examples/ into core. Domain-agnostic (boundary-clean).
rLLM adapter (03c1646): packages/openrange-rllm maps one rollout onto rLLM's Episode/Trajectory/Step, wraps the harness as an @rllm.rollout flow + an @rllm.evaluator, and ships a GatewaySampler that routes the policy through rLLM's gateway (reusing OpenRange's OpenAI-compatible backend) so rLLM captures token ids + logprobs. build_rllm_dataset_rows / snapshot_resolver are the tested dataset seam. examples/rllm_grpo_cyber.py wires a cyber world pool into AgentTrainer.
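
The loop described above can be sketched as follows. This is a minimal illustration of "one loop, injected Sampler", not the code in `openrange.core`: `Sampler`, `Rollout`, `run_episode`, `ScriptedSampler`, and the `"submit"` stop action are all hypothetical names chosen for the example, and the real `run_agent` signature may differ.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


class Sampler(Protocol):
    """Hypothetical seam: turns the transcript so far into the next action."""

    def sample(self, transcript: list[str]) -> str: ...


@dataclass
class Rollout:
    transcript: list[str] = field(default_factory=list)
    grade: float = 0.0


def run_episode(
    brief: str,
    sampler: Sampler,
    act: Callable[[str], str],            # runs one action against the world
    grade: Callable[[list[str]], float],  # scores the finished transcript
    max_turns: int = 4,
) -> Rollout:
    """brief -> sample an action -> run it -> observe -> grade."""
    rollout = Rollout(transcript=[brief])
    for _ in range(max_turns):
        action = sampler.sample(rollout.transcript)
        if action == "submit":  # assumed stop signal, for illustration only
            break
        observation = act(action)
        rollout.transcript += [action, observation]
    rollout.grade = grade(rollout.transcript)
    return rollout


class ScriptedSampler:
    """Replays a fixed list of actions; a policy model would go here instead."""

    def __init__(self, actions: list[str]) -> None:
        self._actions = iter(actions)

    def sample(self, transcript: list[str]) -> str:
        return next(self._actions, "submit")
```

Swapping `ScriptedSampler` for a sampler that calls a model endpoint changes nothing in `run_episode`, which is the property the PR relies on to share one loop between evaluation and training.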

Reward

The pack's existing dense subgoal ladder (reached_endpoint → extracted_anything → matched_flag) flows straight through to rLLM's EvalOutput; no reward logic is added, and the submission-gated success / observed-but-not-rewarded-leak stance is unchanged.
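
To make the idea of a dense subgoal ladder concrete, here is a small sketch. Only the three rung names come from this PR; the scoring rule (equal weight per rung, stopping at the first rung not reached) is an assumption for illustration, since the pack's actual weights and ordering rules are not stated here.

```python
# Rung names are taken from the PR description; everything else is assumed.
LADDER = ("reached_endpoint", "extracted_anything", "matched_flag")


def ladder_reward(signals: dict[str, bool]) -> float:
    """Return the fraction of consecutive rungs reached, from the bottom up.

    Assumption: a higher rung only counts if every rung below it was reached.
    """
    reached = 0
    for rung in LADDER:
        if not signals.get(rung, False):
            break
        reached += 1
    return reached / len(LADDER)
```

Under this assumed rule, partial progress still earns a nonzero reward, which is what makes the signal dense rather than a single pass/fail bit.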

Verification

918 passed / 0 failed; coverage 86% (≥80). Harness 100% branch cov; rLLM adapter 100% (locally, where rLLM is installed).
No mocks anywhere: real PROCESS-backed webapps reached over HTTP, a real reference-exploit policy, the real consequence verifier, and rLLM's real pydantic types.
Every rLLM contract was checked against the real source (rllm-org/rllm v0.3.0rc0) plus an adversarial review pass — all 8 integration points confirmed.
ruff, mypy (adapter added to CI scope), and the core/pack boundary check all pass.

Gated / out of scope

rLLM is not on PyPI → installed from source; the live AgentTrainer.train() needs a CUDA GPU (verl backend), so the example's training step is gated and not exercised in CI.
Trajectory / SFT-data synthesis is owned by a separate effort — deliberately not included here.

🤖 Generated with Claude Code

A gym owns worlds and the grade; add the thin agent loop that turns a realized world into a graded trajectory. `run_agent` drives one episode (brief -> sample an action -> run it against the live world -> observe -> grade); training and evaluation share the loop and differ only in the injected Sampler. `arun_rollouts` overlaps many episodes on one shared EpisodeService.

- Move AgentSandbox into the core package (`openrange.core.sandbox`) as the shell primitive the agent acts through; reword its docs to stay domain-agnostic.
- Fold the task briefing (task + the live interface contract) out of examples into core so any harness can build it.

No mocks: the harness tests drive a real PROCESS-backed webapp over HTTP with a real reference-exploit policy and the real consequence verifier.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
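
Overlapping many episodes, as `arun_rollouts` is described as doing, can be sketched with plain `asyncio`. The helper below is hypothetical (`run_many`, `episode`, and `max_concurrency` are not OpenRange names) and does not model the shared EpisodeService; it only shows bounded concurrent execution with results returned in input order.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


async def run_many(
    episode: Callable[[T], Awaitable[R]],
    inputs: Sequence[T],
    max_concurrency: int = 4,
) -> list[R]:
    """Run one coroutine per input, at most `max_concurrency` at a time."""
    gate = asyncio.Semaphore(max_concurrency)

    async def guarded(item: T) -> R:
        async with gate:  # blocks while the concurrency budget is used up
            return await episode(item)

    # gather preserves the order of its arguments in the result list
    return list(await asyncio.gather(*(guarded(i) for i in inputs)))
```

A caller would pass an async function that runs a single episode and a list of task descriptors, for example `asyncio.run(run_many(one_episode, tasks))`, where `one_episode` and `tasks` are placeholders.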

Add `openrange-rllm`, a thin, import-light adapter onto rLLM's RL trainer (OpenRange owns the world and grade; rLLM owns the loop):

- `agent_rollout_to_episode` maps one OpenRange rollout onto rLLM's Episode/Trajectory/Step (one step per turn, in call order).
- `make_rollout` wraps the harness as an `@rllm.rollout` flow; `make_evaluator` surfaces the verifier's grade as an `@rllm.evaluator`.
- `GatewaySampler` routes the policy through rLLM's gateway (reusing OpenRange's OpenAI-compatible backend), so rLLM captures token ids and logprobs and the train/eval split is just which endpoint it points at.
- `build_rllm_dataset_rows` / `snapshot_resolver` are the tested dataset seam (snapshot_id/task_id ride in Task.metadata).

The pack's dense subgoal reward flows straight through; no reward logic is added. `examples/rllm_grpo_cyber.py` wires a cyber world pool into `AgentTrainer` (CPU to build; `.train()` is the CUDA boundary).

Every rLLM contract was checked against the real source (rllm-org/rllm v0.3.0rc0); rLLM core types are CPU-importable, so the tests use the real types with no mocks. rLLM is not on PyPI, so it is installed from source and the live trainer is gated; CI mypy covers the adapter.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
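
The "one step per turn, in call order" mapping can be illustrated with stand-in types. The dataclasses below are deliberately NOT rLLM's pydantic `Episode`/`Trajectory`/`Step` (their real fields are not reproduced here), and `Turn` and `to_episode` are hypothetical names; the sketch only shows the shape of the conversion.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """Hypothetical record of one harness turn."""
    action: str
    observation: str


@dataclass
class Step:  # stand-in, not rllm's Step
    action: str
    observation: str


@dataclass
class Trajectory:  # stand-in, not rllm's Trajectory
    steps: list[Step] = field(default_factory=list)
    reward: float = 0.0


@dataclass
class Episode:  # stand-in, not rllm's Episode
    trajectories: list[Trajectory] = field(default_factory=list)


def to_episode(turns: list[Turn], reward: float) -> Episode:
    """Emit exactly one Step per Turn, preserving the order of the calls."""
    steps = [Step(t.action, t.observation) for t in turns]
    return Episode(trajectories=[Trajectory(steps=steps, reward=reward)])
```

Keeping the conversion a pure function of the finished rollout is one way to keep an adapter thin: the grade is computed upstream and simply carried across.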

End-to-end GPU validation (a 1.5B GRPO step completed over real cyber worlds on an A100 via our make_rollout) surfaced two things the Mac tests and the adversarial review both missed:

- agent_rollout_to_episode set Episode.termination_reason to a raw STRING, but rLLM's verl transform does `reason.value` (it expects the TerminationReason enum) and crashes on a str. Leave it unset (rLLM fills its own enum) and keep the value in artifacts. Regression test updated (it had asserted the buggy string).
- The example's documented run command was naive and would fail. Replace it with the validated single-GPU recipe and record the two gotchas that cost real debugging: LoRA needs the FLAT `lora_rank` key (the nested `lora.rank` is silently ignored -> full fine-tuning -> OOM), and OpenRange's Python 3.14 requirement (PEP 758 `except` syntax) conflicts with the verl GPU stack's Python 3.12-only wheels.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
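
The termination_reason failure is easy to reproduce in isolation. In the sketch below, `TerminationReason` is a stand-in enum with made-up members, and `read_reason` is a hypothetical consumer that mimics the `reason.value` access described above; neither is rLLM code.

```python
from enum import Enum


class TerminationReason(Enum):  # stand-in with assumed members, not rLLM's enum
    ENV_DONE = "env_done"
    MAX_TURNS = "max_turns"


def read_reason(reason) -> str:
    """Mimics a downstream consumer that assumes an Enum and reads `.value`."""
    return reason.value


# An enum member works as the consumer expects.
ok = read_reason(TerminationReason.ENV_DONE)

# A raw string has no `.value` attribute, so the same call raises.
try:
    read_reason("env_done")
    crashed = False
except AttributeError:
    crashed = True
```

Because a plain `str` passes most type-agnostic tests and only fails at the point of attribute access, this kind of mismatch tends to show up late, in the consumer, rather than where the value was set.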

The rLLM/verl GPU training stack (torch/vLLM/flash-attn) ships only Python 3.12 wheels, but OpenRange pinned 3.14 and used the 3.14-only PEP 758 unparenthesized `except A, B:` syntax in 8 spots, so OpenRange and the trainer can't share a process and the training path can't run.

- Parenthesize those 8 excepts (behaviorally identical).
- Lower requires-python to >=3.12 across the workspace.
- Retarget ruff (py312) and mypy (3.12) so 3.14-only syntax is caught going forward.

Verified: compiles on a real Python 3.12 interpreter, and the suite still passes on 3.14 (957 passed, 88% coverage, boundary clean).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
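
For readers unfamiliar with PEP 758, the portable form looks like this. `parse_port` is a hypothetical function used only to show the syntax; it is not one of the 8 call sites changed in this PR.

```python
def parse_port(raw):
    """Return `raw` as an int, or None if it cannot be converted."""
    try:
        return int(raw)
    # Parenthesized tuple: valid on Python 3.12 and on 3.14.
    # The PEP 758 spelling `except TypeError, ValueError:` catches the same
    # exceptions but is a SyntaxError on interpreters older than 3.14.
    except (TypeError, ValueError):
        return None
```

Since both spellings catch the same exception types, switching to the parenthesized form changes nothing at runtime, which matches the "behaviorally identical" note above.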

# Conflicts: # packages/openrange-pack-sdk/src/openrange_pack_sdk/_runtime.py

one-scaffold-two-modes.md was a multi-PR north-star, not this PR's design — and the PR diverged from it: it shipped the rLLM shim first, not the doc's TRL-first sequence (sec 10), and the doc's line-refs (sec 6/9) were already drifting (e.g. AgentSandbox is already in core). The shim's durable, user-facing essence already lives in the openrange-rllm README (the thin seam + the gateway logprob capture); add a pointer to the runnable example and drop the broader two-modes framing that doesn't serve this shim. The remaining roadmap (TRL re-point, async G1-G5, verl/skyrl shims) stays trackable as issues. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Removed three comments in test_rllm_shim.py that just restated their assertions (.rules: don't explain WHAT). Kept the shim's termination_reason comment — it is a genuine why, and I verified the root fix is the wrong trade: mapping our terminal_reason to rLLM's TerminationReason enum would couple the import-light shim to rLLM's heavy engine import (numpy + the rollout engine chain) and break the CPU-only path the adapter and its tests rely on. The comment now records that constraint too. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

larstalian marked this pull request as draft June 24, 2026 15:22

larstalian force-pushed the pedantic-bouman-8eed1d branch from 03c1646 to 36ac320 June 24, 2026 16:15

larstalian and others added 4 commits June 24, 2026 19:57

Merge remote-tracking branch 'origin/main' into pedantic-bouman-8eed1d

e5377ec

# Conflicts: # packages/openrange-pack-sdk/src/openrange_pack_sdk/_runtime.py

larstalian marked this pull request as ready for review June 25, 2026 03:27

larstalian merged commit ec54646 into main Jun 25, 2026
2 checks passed

larstalian deleted the pedantic-bouman-8eed1d branch June 25, 2026 03:42

github-actions Bot locked and limited conversation to collaborators Jun 25, 2026

larstalian merged 7 commits into main from pedantic-bouman-8eed1d
