We introduce Variance Controlled Off-Policy Optimization (VCPO), a framework that adds explicit variance-targeted controls for policy-gradient methods in the off-policy setting, enabling stable and scalable Async RL training.
- ✨ Seamlessly integrates into common policy-gradient methods like REINFORCE/RLOO/GRPO
- 🚀 **2.5x** faster Async RL training while matching synchronous RL performance
- 🧠 Robust training stability under high off-policy settings (**at least k=128** steps off-policy)
Async RL pipelines rollout generation with learning, significantly reducing end-to-end training time. But large speedups typically require high policy lag that can cause collapse.
Why? Highly stale rollouts make importance sampling ratios heavy-tailed, so a few trajectories dominate each update and the policy-gradient estimator becomes high-variance. Prior work has tried masking/clipping/whitening IS ratios, algorithmic changes, and system-side changes. These can delay collapse… but still fail at high asynchrony.
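To see concretely how staleness concentrates the update on a few trajectories, here is a small standalone illustration (not VCPO code): log importance ratios are drawn from a Gaussian whose width stands in for policy lag, and we track the effective sample size (ESS) and the weight share of the 10 largest ratios.

```python
import math
import random

# Illustration (not VCPO code): importance ratios pi_theta / pi_behavior become
# heavy-tailed as rollouts go stale. Wider log-ratio distributions mean a few
# trajectories soak up most of the update's weight, driving ESS down.
random.seed(0)
n = 1024

def ess(weights):
    """Effective sample size (sum w)^2 / (sum w^2) of importance weights."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

for staleness in (0.1, 1.0, 2.0):  # std of the log importance ratio
    ws = [math.exp(random.gauss(0.0, staleness)) for _ in range(n)]
    share = sum(sorted(ws)[-10:]) / sum(ws)  # mass on the 10 largest ratios
    print(f"staleness={staleness:.1f}  ESS={ess(ws):7.1f}/{n}  top-10 share={share:.2f}")
```

As the log-ratio spread grows, ESS collapses toward a handful of samples, which is exactly the regime where the plain policy-gradient estimator becomes high-variance.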
To address this, VCPO introduces two techniques to stabilize policy-gradient methods for asynchronous RL training:
1. **ESS-guided step scaling** to dampen unreliable updates, following sqrt scaling for AdamW-style optimizers.
2. **Closed-form off-policy optimal baseline (OPOB)** using gradient norm and importance ratios (no learned critic), implemented with minimal overhead and compatible with DPxTPxSP:
We use `k` to denote the maximum sampler–learner policy lag (i.e., `k` steps off-policy), following the PipelineRL setting. Across math, general reasoning, and tool-use tasks with model sizes from 1.5B to 7B, VCPO enables stable asynchronous training where prior stabilizers fail. In long-context multi-turn RL, VCPO delivers a **2.5×** end-to-end speedup while matching synchronous performance.
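The two controls above can be sketched in a few lines of plain Python. This is a minimal sketch, not VCPO's implementation: it assumes the standard ESS definition, a `sqrt(ESS/N)` learning-rate multiplier (per the sqrt-scaling note for AdamW-style optimizers), and the textbook variance-optimal baseline form weighted by squared importance ratios and per-trajectory gradient norms; the exact VCPO formulas are in the paper.

```python
import math

def ess(weights):
    """Effective sample size (sum w)^2 / (sum w^2) of importance ratios."""
    s = sum(weights)
    return s * s / sum(w * w for w in weights)

def scaled_lr(base_lr, weights):
    """ESS-guided step scaling: damp the learning rate by sqrt(ESS / N).

    The sqrt dependence follows the note above about AdamW-style optimizers;
    the exact schedule used by VCPO is an assumption here.
    """
    return base_lr * math.sqrt(ess(weights) / len(weights))

def opob(weights, grad_norms, returns):
    """Critic-free baseline from importance ratios and gradient norms.

    Textbook variance-optimal form, used as a stand-in for VCPO's OPOB:
        b* = sum_i w_i^2 g_i^2 R_i / sum_i w_i^2 g_i^2
    """
    num = sum(w * w * g * g * r for w, g, r in zip(weights, grad_norms, returns))
    den = sum(w * w * g * g for w, g in zip(weights, grad_norms))
    return num / den

# Toy batch: one stale trajectory with a large importance ratio.
ws = [0.9, 1.1, 1.0, 6.0]
print(scaled_lr(1e-6, ws))  # smaller than 1e-6: the stale batch takes a damped step
print(opob(ws, [1.0, 2.0, 1.5, 1.0], [0.0, 1.0, 1.0, 0.0]))
```

Both quantities are cheap batch statistics (no learned critic, no extra forward passes), which is what makes them easy to fold into an existing policy-gradient update.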
<p align="center">
<img src="figures/vcpo_results.png" width="85%" />
</p>
**End-to-end** training time vs. validation accuracy for synchronous (`k=0`) and asynchronous training (lag `k`).
Here, **Steps** denotes gradient update steps, and **GPU hours ↓** measures total wall-clock time across sampling + training GPUs.
Async RL already achieves its full speedups at fewer than 10 steps off-policy, but we stress-tested far beyond that and found VCPO remains stable up to at least **128 steps off-policy**.
<p align="center">
If you find this work useful, please consider citing:
```bibtex
@article{huang2026stable,
  title   = {Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs},
  author  = {Luke J. Huang and Zhuoyang Zhang and Qinghao Hu and Shang Yang and Song Han},
  year    = {2026},
  month   = {Feb},
  url     = {https://arxiv.org/abs/2602.17616}
}
```