
Commit ec54646

One scaffold, two modes: harness in core + rLLM AgentTrainer adapter (#352)
1 parent 6ab2f91 commit ec54646

34 files changed

Lines changed: 2323 additions & 181 deletions


.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ jobs:
         run: bash scripts/check_boundary.sh
 
       - name: Typecheck
-        run: uv run mypy src tests examples main.py packages/openrange-trl/src
+        run: uv run mypy src tests examples main.py packages/openrange-trl/src packages/openrange-rllm/src
 
       - name: Test with coverage
         run: uv run coverage run -m pytest tests

examples/_briefing.py

Lines changed: 0 additions & 26 deletions
This file was deleted.

examples/codex_eval.py

Lines changed: 1 addition & 1 deletion
@@ -25,7 +25,7 @@
     TaskSpec,
 )
 
-from examples._briefing import agent_briefing
+from openrange.agent import agent_briefing
 from openrange.agent_backend import CodexAgentBackend
 from openrange.core import PACKS, auto_evolve, consequence_gate
 from openrange.core.episode import AgentTurn, EpisodeReport

examples/rllm_grpo_cyber.py

Lines changed: 162 additions & 0 deletions
@@ -0,0 +1,162 @@
+"""Train a cyber agent on an OpenRange world pool with rLLM's ``AgentTrainer``.
+
+This is the rLLM half of "one scaffold, two modes": the *same* agent loop that
+``examples/codex_eval.py`` evaluates with is trained here, swapping only the
+sampler. ``openrange_rllm`` maps each OpenRange episode onto rLLM's
+``Episode``/``Step`` and exposes the policy as an ``@rllm.rollout`` flow; rLLM's
+gateway captures token ids and logprobs, GRPO does the rest. The reward is the
+pack's own dense subgoal ladder (no reward logic here).
+
+A pool of command-injection "company" worlds becomes an rLLM dataset (one row per
+pentest task, carrying its ``snapshot_id``/``task_id``); ``snapshot_resolver``
+maps each sampled rLLM task back to its world. The agent reaches the live webapp
+over HTTP from a host shell (PROCESS backing) and composes ``curl`` itself.
+
+Run on one CUDA GPU through rLLM's verl backend. Validated end to end on an
+A100-40GB inside the maintainers' ``verlai/verl:vllm011.latest`` image (torch 2.8
+/ vLLM 0.11 / flash-attn)::
+
+    python -m examples.rllm_grpo_cyber \
+        rllm/backend=verl algorithm.adv_estimator=grpo \
+        +model.name=Qwen/Qwen2.5-7B-Instruct \
+        actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \
+        actor_rollout_ref.model.lora_rank=32 \
+        actor_rollout_ref.model.lora_alpha=32 \
+        actor_rollout_ref.actor.use_dynamic_bsz=True \
+        actor_rollout_ref.actor.ppo_max_token_len_per_gpu=16384 \
+        actor_rollout_ref.actor.use_kl_loss=False \
+        actor_rollout_ref.rollout.name=vllm \
+        actor_rollout_ref.rollout.mode=async \
+        actor_rollout_ref.rollout.enforce_eager=True \
+        actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \
+        actor_rollout_ref.rollout.n=4 \
+        actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
+        actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
+        trainer.n_gpus_per_node=1 data.train_batch_size=2 \
+        rllm.trainer.total_batches=1
+
+Gotchas (both cost real debugging):
+
+- LoRA uses the **flat** keys ``lora_rank`` / ``lora_alpha``. The nested
+  ``lora.rank`` is silently ignored, which means full fine-tuning — a 7B then
+  OOMs a 40GB card, whereas with LoRA on it fits comfortably.
+- OpenRange currently requires Python **3.14** (PEP 758 ``except`` syntax) but the
+  verl GPU stack ships only Python **3.12** wheels, so they cannot share a
+  process. Running this needs OpenRange made 3.12-compatible (a 3-line change) or
+  3.14 wheels for torch/vLLM/flash-attn (not available yet).
+
+Building the pool + registering the dataset is CPU-only; ``trainer.train()`` is
+the CUDA boundary (Ray + vLLM + FSDP under the verl backend).
+"""
+
+from __future__ import annotations
+
+import subprocess
+from collections.abc import Mapping
+from typing import Any
+
+from cyber_webapp import WebappPack
+from cyber_webapp.difficulty import world_difficulty
+from openrange_pack_sdk import Snapshot
+from openrange_rllm import (
+    build_rllm_dataset_rows,
+    make_evaluator,
+    make_rollout,
+    snapshot_resolver,
+)
+
+from openrange import EpisodeService, EvalPool, WorldPool
+from openrange.core.sandbox import CommandResult
+
+_DATASET = "openrange_cyber"
+_FAMILY = "webapp.pentest"
+_RUN_ROOT = "or-runs/cyber-rllm"
+
+
+def _company(seed: int) -> dict[str, Any]:
+    return {
+        "world": {"goal": "recover the hidden flag from the internal estate"},
+        "pack": {"id": "webapp"},
+        "runtime": {"tick": {"mode": "off"}},
+        "npc": [],
+        "seed": seed,
+        "topology": "chain",
+    }
+
+
+def _difficulty(snapshot: Snapshot) -> float:
+    return float(world_difficulty(snapshot.graph))
+
+
+class _HostRun:
+    def run(self, command: str, *, timeout: float = 120.0) -> CommandResult:
+        done = subprocess.run(
+            ["bash", "-lc", command],
+            capture_output=True,
+            text=True,
+            timeout=timeout,
+            check=False,
+        )
+        return CommandResult(done.returncode, done.stdout + done.stderr)
+
+    def close(self) -> None:
+        return None
+
+
+def _host_bind(_surface: Mapping[str, Any]) -> _HostRun:
+    return _HostRun()
+
+
+def main() -> None:
+    import hydra
+    from omegaconf import DictConfig
+    from rllm.data.dataset import DatasetRegistry
+    from rllm.trainer import AgentTrainer
+
+    @hydra.main(  # type: ignore[untyped-decorator]
+        config_path="pkg://rllm.trainer.config",
+        config_name="unified",
+        version_base=None,
+    )
+    def _train(config: DictConfig) -> None:
+        pack = WebappPack()
+        train_pool = WorldPool.seed(
+            pack,
+            [_company(seed) for seed in range(4)],
+            difficulty_fn=_difficulty,
+            family=_FAMILY,
+            max_size=8,
+        )
+        val_pool = EvalPool.seed(
+            pack,
+            [_company(seed) for seed in (7, 8)],
+            difficulty_fn=_difficulty,
+            family=_FAMILY,
+        )
+        DatasetRegistry.register_dataset(
+            _DATASET,
+            build_rllm_dataset_rows(train_pool.snapshots(), family=_FAMILY),
+            "train",
+        )
+        DatasetRegistry.register_dataset(
+            _DATASET,
+            build_rllm_dataset_rows(val_pool.snapshots(), family=_FAMILY),
+            "test",
+        )
+        resolve = snapshot_resolver([*train_pool.snapshots(), *val_pool.snapshots()])
+        service = EpisodeService(pack, _RUN_ROOT)
+        trainer = AgentTrainer(
+            backend=config.rllm.get("backend", "verl"),
+            agent_flow=make_rollout(service, resolve, bind_run=_host_bind),
+            evaluator=make_evaluator(),
+            config=config,
+            train_dataset=DatasetRegistry.load_dataset(_DATASET, "train"),
+            val_dataset=DatasetRegistry.load_dataset(_DATASET, "test"),
+        )
+        trainer.train()
+
+    _train()
+
+
+if __name__ == "__main__":  # pragma: no cover
+    main()
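
Review note: the module docstring says each dataset row carries its ``snapshot_id``/``task_id`` and that ``snapshot_resolver`` maps a sampled task back to its world. The real implementations live in ``openrange_rllm`` and are not shown in this part of the diff, so the following is only a minimal sketch of that round trip. ``FakeSnapshot``, ``build_rows`` and ``make_resolver`` are hypothetical stand-ins, not the adapter's actual API.

```python
from collections.abc import Callable, Iterable, Mapping
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FakeSnapshot:
    """Hypothetical stand-in for a world snapshot (not openrange_pack_sdk.Snapshot)."""

    snapshot_id: str
    task_ids: tuple[str, ...]


def build_rows(snapshots: Iterable[FakeSnapshot]) -> list[dict[str, str]]:
    """Emit one dataset row per task, tagged with the world it belongs to."""
    return [
        {"snapshot_id": snap.snapshot_id, "task_id": task_id}
        for snap in snapshots
        for task_id in snap.task_ids
    ]


def make_resolver(
    snapshots: Iterable[FakeSnapshot],
) -> Callable[[Mapping[str, Any]], FakeSnapshot]:
    """Index snapshots by id once, then resolve any sampled row back to its world."""
    by_id = {snap.snapshot_id: snap for snap in snapshots}

    def resolve(row: Mapping[str, Any]) -> FakeSnapshot:
        # A KeyError here means the row came from a pool the resolver never saw.
        return by_id[row["snapshot_id"]]

    return resolve
```

This is presumably why the example passes both the train and validation snapshots to ``snapshot_resolver``: a resolver built from only one pool could not map rows sampled from the other.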

examples/strands_eval.py

Lines changed: 1 addition & 1 deletion
@@ -14,7 +14,7 @@
 
 from openrange_pack_sdk import LLMResult, OpenRangeError, Snapshot, TaskSpec
 
-from examples._briefing import agent_briefing
+from openrange.agent import agent_briefing
 from openrange.core.episode import AgentTurn
 from openrange.runtime import EpisodeContext, OpenRangeRun, RunConfig
 

packages/graphschema/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ description = "Typed property-graph meta-model with declarative schemas."
 readme = "README.md"
 license = "MIT"
 license-files = ["LICENSE"]
-requires-python = ">=3.14"
+requires-python = ">=3.12"
 authors = [{ name = "Vecna AI" }]
 keywords = ["graph", "ontology", "schema", "property-graph", "typed-graph"]
 classifiers = [

packages/openrange-pack-sdk/pyproject.toml

Lines changed: 1 addition & 1 deletion
@@ -8,7 +8,7 @@ version = "0.1.0"
 description = "Pack-author SDK for OpenRange: Protocols, value types, and base errors that packs depend on. Zero runtime deps on OpenRange."
 readme = "README.md"
 license = "MIT"
-requires-python = ">=3.14"
+requires-python = ">=3.12"
 dependencies = [
     "graphschema",
 ]

packages/openrange-pack-sdk/src/openrange_pack_sdk/_runtime.py

Lines changed: 1 addition & 1 deletion
@@ -181,7 +181,7 @@ def _read_result(self) -> Mapping[str, Any]:
             return {}
         try:
             data = json.loads(result_path.read_text(encoding="utf-8"))
-        except OSError, ValueError:  # ValueError also covers a non-UTF-8 read
+        except (OSError, ValueError):  # ValueError also covers a non-UTF-8 read
             return {}
         return dict(data) if isinstance(data, Mapping) else {}
 
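
Review note: the unparenthesized ``except OSError, ValueError:`` form is only accepted from Python 3.14 (PEP 758), while the parenthesized tuple parses on every Python 3 release, which is what lets this package lower ``requires-python`` to 3.12. The sketch below is a standalone version of the same read-or-empty pattern. The free function ``read_result`` and its ``result_path`` parameter are illustrative assumptions, since the real method reads state from ``self``.

```python
import json
from collections.abc import Mapping
from pathlib import Path
from typing import Any


def read_result(result_path: Path) -> Mapping[str, Any]:
    """Return the JSON object stored at result_path, or {} if it cannot be read."""
    try:
        data = json.loads(result_path.read_text(encoding="utf-8"))
    except (OSError, ValueError):
        # OSError: the file is missing or unreadable.
        # ValueError: covers json.JSONDecodeError and UnicodeDecodeError,
        # both of which are ValueError subclasses.
        return {}
    # Valid JSON that is not an object (a list, a number) is also treated as empty.
    return dict(data) if isinstance(data, Mapping) else {}
```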

packages/openrange-rllm/README.md

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
+# openrange-rllm
+
+Optional [rLLM](https://github.com/rllm-org/rllm) `AgentTrainer` integration for
+OpenRange. OpenRange owns the world and the grade; rLLM owns the RL training
+loop. This adapter is the thin seam between them:
+
+- **`agent_rollout_to_episode`** — maps one OpenRange agent rollout onto rLLM's
+  `Episode` / `Trajectory` / `Step`, one step per harness turn in call order.
+- **`make_rollout`** — wraps the harness as an `@rllm.rollout` flow `(task,
+  config) -> Episode`; it runs one real episode on a shared `EpisodeService`.
+- **`make_evaluator`** — surfaces the verifier's grade as an `@rllm.evaluator`.
+- **`GatewaySampler`** — a `Sampler` that calls the policy at `config.base_url`
+  through OpenRange's own OpenAI-compatible backend. rLLM's gateway records token
+  ids and logprobs, so the rollout leaves those fields empty and rLLM's trace
+  enrichment fills them.
+
+`import openrange_rllm` pulls **no** rLLM — every rLLM import is local to the
+function that needs it. To run the real trainer, install the `train` extra and
+construct the trainer in your own script/notebook:
+
+```python
+from rllm.trainer import AgentTrainer
+from openrange_rllm import make_rollout, make_evaluator
+
+trainer = AgentTrainer(
+    config=config,  # a Hydra DictConfig (backend: verl|tinker)
+    agent_flow=make_rollout(service, resolve, bind_run=bind_run),
+    evaluator=make_evaluator(),
+    train_dataset=train_dataset,
+    val_dataset=val_dataset,
+    backend="verl",
+)
+trainer.train()
+```
+
+A complete, runnable example — building a world pool, registering the dataset,
+and the validated single-GPU run command — is in
+[`examples/rllm_grpo_cyber.py`](../../examples/rllm_grpo_cyber.py).
+
+rLLM is installed from source (`rllm-org/rllm`); the GPU backend (`rllm[verl]`)
+needs CUDA. The adapter itself, and its tests, run on CPU against rLLM's
+pydantic-only core types.
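
Review note: the README describes ``agent_rollout_to_episode`` as producing one step per harness turn in call order, with token ids and logprobs left empty for rLLM's gateway to fill in later. A minimal sketch of that shape follows. ``Turn``, ``Step``, ``Episode`` and ``rollout_to_episode`` are hypothetical stand-ins written for illustration; they are not rLLM's or OpenRange's real types, and the real adapter also produces a ``Trajectory`` layer that is omitted here.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """Hypothetical stand-in for one harness turn (prompt in, response out)."""

    prompt: str
    response: str


@dataclass
class Step:
    """Hypothetical stand-in for a training step; not rLLM's real Step."""

    observation: str
    action: str
    # Deliberately empty: in the README's design the gateway records these.
    token_ids: list[int] = field(default_factory=list)
    logprobs: list[float] = field(default_factory=list)


@dataclass
class Episode:
    """Hypothetical stand-in for an episode: ordered steps plus the final grade."""

    task_id: str
    steps: list[Step]
    reward: float


def rollout_to_episode(task_id: str, turns: list[Turn], reward: float) -> Episode:
    """Map turns onto steps one-to-one, preserving call order."""
    steps = [Step(observation=turn.prompt, action=turn.response) for turn in turns]
    return Episode(task_id=task_id, steps=steps, reward=reward)
```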
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
+[build-system]
+requires = ["hatchling>=1.27"]
+build-backend = "hatchling.build"
+
+[project]
+name = "openrange-rllm"
+version = "0.1.0"
+description = "Optional rLLM AgentTrainer integration for OpenRange. Maps an OpenRange agent rollout onto rLLM's Episode/Trajectory/Step and wraps the harness as an @rllm.rollout flow; OpenRange itself stays trainer-agnostic."
+readme = "README.md"
+license = "MIT"
+requires-python = ">=3.12"
+dependencies = [
+    "openrange",
+    "openrange-pack-sdk",
+]
+
+# ``import openrange_rllm`` pulls no rLLM: every rLLM import is local to the
+# function that needs it, so the module loads on a plain machine. The live trainer
+# (``rllm.trainer.AgentTrainer``) needs rLLM plus its ``verl`` GPU backend
+# (torch/vllm/verl) installed from source on a CUDA box — see README. rLLM is not
+# published on PyPI, so it is deliberately not declared as a dependency here
+# (doing so makes the universal workspace lock unsatisfiable).
+
+[tool.hatch.build.targets.wheel]
+packages = ["src/openrange_rllm"]
+
+[tool.uv.sources]
+openrange = { workspace = true }
+openrange-pack-sdk = { workspace = true }
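
Review note: the comment in this file relies on every rLLM import being local to the function that needs it, so the package can be imported without rLLM installed. A minimal sketch of that deferred-import pattern is below. ``load_optional`` is a hypothetical helper name and the error message is illustrative; the adapter's actual functions are not shown in this part of the diff.

```python
import importlib
from types import ModuleType


def load_optional(module_name: str) -> ModuleType:
    """Import an optional dependency at call time rather than at module import time.

    Defining this function never touches the dependency, so the enclosing
    module stays importable on machines where it is not installed.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # package lands here and is reported with an actionable message.
        raise RuntimeError(
            f"{module_name!r} is required for this operation but is not installed"
        ) from exc
```

A caller would resolve the heavy module only inside the code path that trains, for example ``load_optional("rllm.trainer")`` within the function that builds the trainer, which keeps CPU-only tests free of the GPU stack.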
