Online RL for agentic tool-use, using binary process reward signals from environment feedback.
The policy model is deployed as an OpenAI-compatible chat proxy. External environments (e.g. OpenClaw) send multi-turn conversations through this proxy. For each main-line turn, the system:
- Forwards the request to the policy model (served by SGLang) and collects the response along with per-token log-probabilities.
- When the next turn arrives, its user/environment message serves as the "next state" for the previous turn.
- A Process Reward Model (PRM) judges the previous response's quality given the next state (user or environment feedback). It produces m independent evaluations and aggregates them by majority vote, scoring each turn as +1 (good), -1 (bad), or 0 (neutral).
- The majority-voted score becomes the scalar reward for that turn.
- Turns that never receive a next state (i.e. the last turn in a session) are excluded from training (`loss_mask = 0`), unless they are the only turn in the session (at-least-one guarantee).
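The scoring and masking rules above can be sketched as follows (function names are illustrative, not identifiers from the codebase):

```python
from collections import Counter

def majority_vote(scores: list[int]) -> int:
    """Collapse m independent PRM judgments (+1 / -1 / 0) into one scalar
    reward by taking the most common score."""
    score, _count = Counter(scores).most_common(1)[0]
    return score

def build_loss_mask(num_turns: int) -> list[int]:
    """Exclude the final turn, which never receives a next state, unless it
    is the only turn in the session (at-least-one guarantee)."""
    if num_turns == 1:
        return [1]
    return [1] * (num_turns - 1) + [0]
```

For example, `majority_vote([1, 1, -1])` returns `1`, and `build_loss_mask(3)` returns `[1, 1, 0]`. Tie-breaking here simply favors the first score seen; the actual PRM aggregation may resolve ties differently.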
Advantages are computed using Group Relative Policy Optimization (GRPO). For each sample with scalar reward $r_i$, the advantage is broadcast uniformly to all response tokens:

$$\hat{A}_{i,t} = r_i \quad \text{for every response token } t$$

No reward normalization is applied (`--disable-rewards-normalization`), so the scalar reward is used directly rather than being mean/std-normalized within the group.
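With normalization disabled, the per-token advantage is just the sample's scalar reward repeated over its response length; a minimal sketch (the function name is illustrative, not from the codebase):

```python
def broadcast_advantages(rewards: list[float],
                         response_lengths: list[int]) -> list[list[float]]:
    """With --disable-rewards-normalization, each sample's scalar reward is
    used directly as its advantage and repeated over every response token."""
    return [[float(r)] * length
            for r, length in zip(rewards, response_lengths)]

# broadcast_advantages([1.0, -1.0], [3, 2])
# → [[1.0, 1.0, 1.0], [-1.0, -1.0]]
```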
Standard PPO-style clipped surrogate objective with asymmetric clipping:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_t\!\left[\min\!\big(\rho_t \hat{A}_t,\ \mathrm{clip}(\rho_t,\, 1-\varepsilon_{\text{low}},\, 1+\varepsilon_{\text{high}})\, \hat{A}_t\big)\right]$$

where $\rho_t = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}$ is the importance ratio, and $\varepsilon_{\text{low}} \neq \varepsilon_{\text{high}}$ are the asymmetric clip bounds.
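The objective can be sketched in plain Python. The default clip bounds below are illustrative assumptions, not values taken from the launch script:

```python
import math

def clipped_surrogate_loss(logprobs: list[float],
                           old_logprobs: list[float],
                           advantages: list[float],
                           eps_low: float = 0.2,
                           eps_high: float = 0.28) -> float:
    """PPO-style clipped surrogate with asymmetric clip range.
    Negated so that minimizing the loss maximizes the surrogate objective."""
    total = 0.0
    for lp, old_lp, adv in zip(logprobs, old_logprobs, advantages):
        ratio = math.exp(lp - old_lp)                      # importance ratio rho_t
        unclipped = ratio * adv
        clipped = max(1.0 - eps_low, min(ratio, 1.0 + eps_high)) * adv
        total += min(unclipped, clipped)
    return -total / len(advantages)

# When the policy is unchanged (ratio = 1), the loss is just -mean(advantage):
# clipped_surrogate_loss([0.0], [0.0], [1.0]) → -1.0
```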
```bash
cd slime
bash ../openclaw-rl/run_qwen3_4b_openclaw_rl.sh
```

```
openclaw-rl/
├── README.md
├── run_qwen3_4b_openclaw_rl.sh   # Launch script
├── openclaw_api_server.py        # FastAPI proxy + PRM scoring + sample submission
├── openclaw_rollout.py           # Async rollout worker (bridges API server ↔ SLIME trainer)
└── results/                      # Runtime records (auto-created)
```