One scaffold, two modes: harness in core + rLLM AgentTrainer adapter #352
Merged
A gym owns worlds and the grade; add the thin agent loop that turns a realized world into a graded trajectory. `run_agent` drives one episode (brief -> sample an action -> run it against the live world -> observe -> grade); training and evaluation share the loop and differ only in the injected Sampler. `arun_rollouts` overlaps many episodes on one shared EpisodeService.

- Move AgentSandbox into the core package (`openrange.core.sandbox`) as the shell primitive the agent acts through; reword its docs to stay domain-agnostic.
- Fold the task briefing (task + the live interface contract) out of examples into core so any harness can build it.

No mocks: the harness tests drive a real PROCESS-backed webapp over HTTP with a real reference-exploit policy and the real consequence verifier.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
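The loop described in this commit can be sketched as follows. This is a minimal illustration under stated assumptions, not OpenRange's actual code: `ToyWorld`, `Rollout`, `ScriptedSampler`, and every signature here are hypothetical stand-ins, and `asyncio.to_thread` is just one way to overlap episodes.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Protocol


class Sampler(Protocol):
    """Anything that picks the next action from the transcript so far."""

    def sample(self, transcript: list[str]) -> str: ...


@dataclass
class ToyWorld:
    """Hypothetical stand-in for a realized world; the action 'open' solves it."""

    solved: bool = False

    def brief(self) -> str:
        return "task: open the door"

    def run(self, action: str) -> str:
        if action == "open":
            self.solved = True
        return f"ran {action!r}"

    def grade(self) -> float:
        return 1.0 if self.solved else 0.0


@dataclass
class Rollout:
    turns: list[tuple[str, str]] = field(default_factory=list)
    reward: float = 0.0


def run_agent(world: ToyWorld, sampler: Sampler, max_turns: int = 4) -> Rollout:
    """One episode: brief -> sample -> run -> observe, then grade at the end."""
    rollout = Rollout()
    transcript = [world.brief()]
    for _ in range(max_turns):
        action = sampler.sample(transcript)
        observation = world.run(action)
        rollout.turns.append((action, observation))
        transcript += [action, observation]
        if world.solved:
            break
    rollout.reward = world.grade()
    return rollout


async def arun_rollouts(samplers: list[Sampler]) -> list[Rollout]:
    """Overlap episodes; each one gets its own world (an assumption of this sketch)."""
    jobs = (asyncio.to_thread(run_agent, ToyWorld(), s) for s in samplers)
    return await asyncio.gather(*jobs)


class ScriptedSampler:
    """Evaluation-style sampler: replays fixed actions, then idles with 'wait'."""

    def __init__(self, actions: list[str]) -> None:
        self._actions = iter(actions)

    def sample(self, transcript: list[str]) -> str:
        return next(self._actions, "wait")
```

Swapping `ScriptedSampler` for a sampler backed by a trainable policy would change nothing in `run_agent`, which is the point of injecting the sampler rather than hard-coding it.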
Add `openrange-rllm`, a thin, import-light adapter onto rLLM's RL trainer (OpenRange owns the world and grade; rLLM owns the loop):

- `agent_rollout_to_episode` maps one OpenRange rollout onto rLLM's Episode/Trajectory/Step (one step per turn, in call order).
- `make_rollout` wraps the harness as an `@rllm.rollout` flow; `make_evaluator` surfaces the verifier's grade as an `@rllm.evaluator`.
- `GatewaySampler` routes the policy through rLLM's gateway (reusing OpenRange's OpenAI-compatible backend), so rLLM captures token ids and logprobs and the train/eval split is just which endpoint it points at.
- `build_rllm_dataset_rows` / `snapshot_resolver` are the tested dataset seam (snapshot_id/task_id ride in Task.metadata).

The pack's dense subgoal reward flows straight through; no reward logic is added. `examples/rllm_grpo_cyber.py` wires a cyber world pool into `AgentTrainer` (CPU to build; `.train()` is the CUDA boundary).

Every rLLM contract was checked against the real source (rllm-org/rllm v0.3.0rc0); rLLM core types are CPU-importable, so the tests use the real types with no mocks. rLLM is not on PyPI, so it is installed from source and the live trainer is gated; CI mypy covers the adapter.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
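The "one step per turn, in call order" mapping might look roughly like the sketch below. The dataclasses are simplified stand-ins, not rLLM's real `Episode`/`Trajectory`/`Step` types, and `Turn`, `AgentRollout`, and all field names are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """Hypothetical OpenRange-side record of one model call."""

    prompt: str
    response: str


@dataclass
class AgentRollout:
    task_id: str
    snapshot_id: str
    turns: list[Turn]
    reward: float


# Simplified stand-ins for the trainer-side types (NOT rLLM's real classes).
@dataclass
class Step:
    observation: str
    model_response: str


@dataclass
class Trajectory:
    steps: list[Step]
    reward: float


@dataclass
class Episode:
    id: str
    trajectories: list[Trajectory]
    artifacts: dict[str, str] = field(default_factory=dict)


def rollout_to_episode(rollout: AgentRollout) -> Episode:
    """Map one rollout onto one episode: one step per turn, order preserved."""
    steps = [Step(t.prompt, t.response) for t in rollout.turns]
    return Episode(
        id=rollout.task_id,
        # The verifier's grade is passed through unchanged; no reward shaping here.
        trajectories=[Trajectory(steps=steps, reward=rollout.reward)],
        artifacts={"snapshot_id": rollout.snapshot_id},
    )
```

Keeping the mapping a pure function over plain data is what lets an adapter like this be tested on CPU without the training stack installed.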
03c1646 to 36ac320
End-to-end GPU validation (a 1.5B GRPO step completed over real cyber worlds on an A100 via our make_rollout) surfaced two things the Mac tests and the adversarial review both missed:

- agent_rollout_to_episode set Episode.termination_reason to a raw STRING, but rLLM's verl transform does `reason.value` (it expects the TerminationReason enum) and crashes on a str. Leave it unset (rLLM fills its own enum) and keep the value in artifacts. Regression test updated (it had asserted the buggy string).
- The example's documented run command was naive and would fail. Replace it with the validated single-GPU recipe and record the two gotchas that cost real debugging: LoRA needs the FLAT `lora_rank` key (the nested `lora.rank` is silently ignored -> full fine-tuning -> OOM), and OpenRange's Python 3.14 requirement (PEP 758 `except` syntax) conflicts with the verl GPU stack's Python 3.12-only wheels.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
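The first failure mode is easy to reproduce in isolation. The sketch below uses a hypothetical stand-in enum and a toy `transform`; it is not rLLM's `TerminationReason` or its verl transform, only an illustration of why a consumer that reads `.value` breaks on a plain string.

```python
from enum import Enum


class TerminationReason(Enum):
    """Hypothetical stand-in; member names and values are illustrative only."""

    ENV_DONE = "env_done"
    MAX_TURNS = "max_turns"


def transform(reason):
    """Mimics a downstream consumer that assumes an enum and reads `.value`."""
    return reason.value


def try_transform(reason) -> str:
    """Run the transform and report a crash as text instead of raising."""
    try:
        return transform(reason)
    except AttributeError as exc:
        return f"crash: {exc}"
```

`try_transform(TerminationReason.ENV_DONE)` returns the enum's value, while `try_transform("env_done")` reports an `AttributeError`, because `str` has no `value` attribute. A test that only asserts the stored string never exercises this path, which is consistent with the bug surfacing only in the end-to-end run.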
The rLLM/verl GPU training stack (torch/vLLM/flash-attn) ships only Python 3.12 wheels, but OpenRange pinned 3.14 and used the 3.14-only PEP 758 unparenthesized `except A, B:` syntax in 8 spots, so OpenRange and the trainer can't share a process and the training path can't run.

- Parenthesize those 8 excepts (behaviorally identical).
- Lower requires-python to >=3.12 across the workspace.
- Retarget ruff (py312) and mypy (3.12) so 3.14-only syntax is caught going forward.

Verified: compiles on a real Python 3.12 interpreter, and the suite still passes on 3.14 (957 passed, 88% coverage, boundary clean).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
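For reference, the portable form of a multi-exception handler is shown below. `parse_port` is a hypothetical helper invented for this illustration, not code from the PR; the relevant part is the parenthesized tuple in the `except` clause, which parses on both 3.12 and 3.14.

```python
def parse_port(raw):
    """Return the port as an int, or None if `raw` is missing or malformed."""
    try:
        return int(raw)
    # Parenthesized tuple: valid on Python 3.12 and 3.14 alike.
    # The PEP 758 spelling `except TypeError, ValueError:` only parses on 3.14+.
    except (TypeError, ValueError):
        return None
```

Both spellings catch the same exceptions, so the parenthesized form changes nothing at runtime; it only widens the set of interpreters that can parse the file.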
# Conflicts: # packages/openrange-pack-sdk/src/openrange_pack_sdk/_runtime.py
one-scaffold-two-modes.md was a multi-PR north-star, not this PR's design, and the PR diverged from it: it shipped the rLLM shim first, not the doc's TRL-first sequence (sec 10), and the doc's line-refs (sec 6/9) were already drifting (e.g. AgentSandbox is already in core). The shim's durable, user-facing essence already lives in the openrange-rllm README (the thin seam + the gateway logprob capture); add a pointer to the runnable example and drop the broader two-modes framing that doesn't serve this shim. The remaining roadmap (TRL re-point, async G1-G5, verl/skyrl shims) stays trackable as issues.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Removed three comments in test_rllm_shim.py that just restated their assertions (.rules: don't explain WHAT). Kept the shim's termination_reason comment: it is a genuine why, and I verified the root fix is the wrong trade. Mapping our terminal_reason to rLLM's TerminationReason enum would couple the import-light shim to rLLM's heavy engine import (numpy + the rollout engine chain) and break the CPU-only path the adapter and its tests rely on. The comment now records that constraint too.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The same framework-free agent loop now drives both evaluation and training; they differ only in the injected `Sampler`. This PR puts that loop in core and adds the first training-side adapter (rLLM's `AgentTrainer`).

Layers

- (1e87387) `run_agent` / `arun_rollouts`: brief → sample an action → run it against the live world → observe → grade. `AgentSandbox` moves into `openrange.core` as the shell primitive; the task briefing folds out of `examples/` into core. Domain-agnostic (boundary-clean).
- (03c1646) `packages/openrange-rllm` maps one rollout onto rLLM's `Episode`/`Trajectory`/`Step`, wraps the harness as an `@rllm.rollout` flow + an `@rllm.evaluator`, and ships a `GatewaySampler` that routes the policy through rLLM's gateway (reusing OpenRange's OpenAI-compatible backend) so rLLM captures token ids + logprobs. `build_rllm_dataset_rows` / `snapshot_resolver` are the tested dataset seam. `examples/rllm_grpo_cyber.py` wires a cyber world pool into `AgentTrainer`.

Reward

The pack's existing dense subgoal ladder (`reached_endpoint → extracted_anything → matched_flag`) flows straight through to rLLM's `EvalOutput`; no reward logic is added, and the submission-gated success / observed-but-not-rewarded-leak stance is unchanged.

Verification

- rLLM contracts checked against the real source (`rllm-org/rllm` v0.3.0rc0) plus an adversarial review pass: all 8 integration points confirmed.
- `ruff`, `mypy` (adapter added to CI scope), and the core/pack boundary check all pass.

Gated / out of scope

- `AgentTrainer.train()` needs a CUDA GPU (verl backend), so the example's training step is gated and not exercised in CI.

🤖 Generated with Claude Code