
Commit c7f030f ("readme"), 1 parent 73640b0
File tree: 1 file changed (+22 −21 lines)

README.md (22 additions, 21 deletions)
```diff
@@ -1,6 +1,6 @@
 <div align="center">
 
-# VCPO: Variance Controlled Policy Optimization for Stable Asynchronous RL
+# Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs
 
 [![Paper](https://img.shields.io/badge/paper-A42C25?style=for-the-badge&logo=arxiv&logoColor=white)](https://arxiv.org/abs/2602.17616)
 [![Github](https://img.shields.io/badge/VCPO-000000?style=for-the-badge&logo=github&logoColor=000&logoColor=white)](https://github.com/mit-han-lab/vcpo)
```
```diff
@@ -22,17 +22,17 @@
 
 ## Overview
 
-We introduce Variance Controlled Off-Policy Optimization (VCPO), a framework that adds explicit variance-targeted controls for off-policy RL, enabling stable and scalable Async RL training.
+We introduce Variance Controlled Off-Policy Optimization (VCPO), a framework that adds explicit variance-targeted controls for policy-gradient methods in the off-policy setting, enabling stable and scalable Async RL training.
 
 - ✨ Seamlessly integrates into common policy-gradient methods like REINFORCE/RLOO/GRPO
-- 🚀 2.5x faster Async RL training while matching synchronous RL performance
-- 🧠 Robust training stability under high off-policy settings (at least 128 steps off-policy)
+- 🚀 **2.5x** faster Async RL training while matching synchronous RL performance
+- 🧠 Robust training stability under high off-policy settings (**at least k=128** steps off-policy)
 
-Async RL pipelines rollout generation with learning promises to achieve significant reductions in end-to-end training time. But achieving these speedups requires highly off-policy training, which often leads to collapse
+Async RL pipelines rollout generation with learning, significantly reducing end-to-end training time. But large speedups typically require high policy lag that can cause collapse.
 
-Why? Highly stale rollouts make importance sampling ratios heavy-tailed, so a few trajectories dominate each update and the policy-gradient estimator becomes high-variance. Previous work try masking/clipping/whitening IS ratios, algorithmic changes, and system-side changes. These can delay collapse… but still fail at high asynchrony.
+Why? Highly stale rollouts make importance sampling ratios heavy-tailed, so a few trajectories dominate each update and the policy-gradient estimator becomes high-variance. Previous work has tried masking/clipping/whitening IS ratios, algorithmic changes, and system-side changes. These can delay collapse… but still fail at high asynchrony.
 
-To address, VCPO introduces two techniques to stabilize policy-gradient methods for asynchronous RL training:
+To address this, VCPO introduces two techniques to stabilize policy-gradient methods for asynchronous RL training:
 
 1. **ESS-guided step scaling** to dampen unreliable updates, following sqrt scaling for AdamW-style optimizers.
 
```
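The ESS-scaling equation itself is elided between hunks in this diff, but the idea behind item 1 can be sketched. A minimal sketch, assuming the standard normalized effective-sample-size formula `(Σw)² / (N·Σw²)` and a square-root damping of the learning rate as described above; `ess_step_scale` and its arguments are hypothetical names, not the repo's API:

```python
import numpy as np

def ess_step_scale(log_ratios, base_lr):
    """Sketch of ESS-guided step scaling (hypothetical helper).

    `log_ratios` are per-trajectory log importance ratios
    log(pi_theta / pi_behavior). The normalized effective sample
    size ESS = (sum w)^2 / (N * sum w^2) lies in (0, 1]; the step
    size is damped by sqrt(ESS), matching the sqrt scaling for
    AdamW-style optimizers mentioned in the overview (assumption).
    """
    w = np.exp(log_ratios - np.max(log_ratios))  # stabilized weights
    ess = w.sum() ** 2 / (len(w) * (w ** 2).sum())  # in (0, 1]
    return base_lr * np.sqrt(ess)

# On-policy batch (all ratios = 1): full step size is kept.
print(ess_step_scale(np.zeros(8), 1e-5))  # 1e-05
# Heavy-tailed batch: one dominant weight shrinks the step.
print(ess_step_scale(np.array([5.0, 0, 0, 0, 0, 0, 0, 0]), 1e-5) < 1e-5)  # True
```

The damping kicks in exactly when staleness makes a few importance weights dominate, which is the failure mode described in the overview.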
```diff
@@ -43,27 +43,27 @@ $$
 $$
 
 
-2. **Closed-form off-policy optimal baseline (OPOB)** using gradient norm and importance ratios (no learned critic):
+2. **Closed-form off-policy optimal baseline (OPOB)** using gradient norm and importance ratios (no learned critic), implemented with minimal overhead and compatible with DPxTPxSP:
 
 $$
 b_{\text{OPOB}}^\star=\frac{\sum_{i=1}^N w_i^2 \|\nabla_\theta \log \pi_\theta(\tau_i)\|^2 R_i}{\sum_{i=1}^N w_i^2 \|\nabla_\theta \log \pi_\theta(\tau_i)\|^2}
 $$
 
 ## Results
-
-Across math/general reasoning/tool use tasks and model sizes from 1.5B to 7B, VCPO enables stable training where prior methods fail. In long-context multi-turn RL, this delivers a **2.5x** end-to-end speedup while matching synchronous performance.
+We use `k` to denote the maximum sampler–learner policy lag (i.e., `k` steps off-policy), following the PipelineRL setting. Across math, general reasoning, and tool-use tasks with model sizes from 1.5B to 7B, VCPO enables stable asynchronous training where prior stabilizers fail. In long-context multi-turn RL, VCPO delivers a **2.5×** end-to-end speedup while matching synchronous performance.
 
 <p align="center">
 <img src="figures/vcpo_results.png" width="85%" />
 </p>
 
-End-to-end training times and validation accuracy for synchronous vs. asynchronous training (lag `k`).
+**End-to-end** training time vs. validation accuracy for synchronous (`k=0`) and asynchronous training (lag `k`).
+Here, **Steps** denotes gradient update steps, and **GPU hours ↓** measures total wall-clock time across sampling + training GPUs.
 
 #### Countdown
 
 <div align="center">
 
-| Method | Countdown ↑ | Steps | GPU hours ↓ |
+| Method | Countdown Acc ↑ | Steps | GPU hours ↓ |
 | --- | ---: | ---: | ---: |
 | Base | 1.6% | -- | -- |
 | Sync (`k=0`) | 38.4% | 400 | 143.2 |
```
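The closed-form OPOB baseline above translates directly into a few lines of NumPy. A minimal sketch of that weighted average; the function and argument names are hypothetical, not the repo's API:

```python
import numpy as np

def opob_baseline(weights, grad_sq_norms, rewards):
    """Closed-form off-policy optimal baseline (OPOB) per the
    README's formula: a weighted average of trajectory rewards,
    with weights w_i^2 * ||grad log pi(tau_i)||^2 and no learned
    critic.

    weights        importance ratios w_i, one per trajectory
    grad_sq_norms  ||grad_theta log pi_theta(tau_i)||^2 per trajectory
    rewards        returns R_i per trajectory
    """
    coef = weights ** 2 * grad_sq_norms
    return (coef * rewards).sum() / coef.sum()

# With uniform weights and gradient norms, OPOB reduces to the mean reward.
print(opob_baseline(np.ones(4), np.ones(4), np.array([1.0, 0.0, 1.0, 0.0])))  # 0.5
```

Trajectories with large importance ratios or large score-function norms pull the baseline toward their rewards, which is what minimizes the variance of the off-policy gradient estimator.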
```diff
@@ -75,7 +75,7 @@ End-to-end training times and validation accuracy for synchronous vs. asynchrono
 
 <div align="center">
 
-| Method | MATH-500 ↑ | Steps | GPU hours ↓ |
+| Method | MATH-500 Acc ↑ | Steps | GPU hours ↓ |
 | --- | ---: | ---: | ---: |
 | Base | 40.2% | -- | -- |
 | Sync (`k=0`) | 72.0% | 400 | 134.4 |
```
```diff
@@ -87,16 +87,14 @@ End-to-end training times and validation accuracy for synchronous vs. asynchrono
 
 <div align="center">
 
-| Method | AIME 2025 ↑ | Steps | GPU hours ↓ |
+| Method | AIME 2025 Acc ↑ | Steps | GPU hours ↓ |
 | --- | ---: | ---: | ---: |
 | Base | 5.3% | -- | -- |
 | Sync (`k=0`) | 26.7% | 300 | 420.2 |
 | VCPO + Async (`k=2`) | **27.8%** | 220 | **168.9** |
 
 </div>
 
-
-
 Async RL already achieves its full speedups at <10 steps off-policy, but we stress-tested far beyond that and found VCPO remains stable up to at least **128 steps off-policy**.
 
 <p align="center">
```
````diff
@@ -166,11 +164,14 @@ If you find this work useful, please consider citing:
 
 ```bibtex
 @article{huang2026stable,
-  title = {Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs},
-  author = {Luke J. Huang and Zhuoyang Zhang and Qinghao Hu and Shang Yang and Song Han},
-  year = {2026},
-  month = {Feb},
-  url = {https://arxiv.org/abs/2602.17616}
+  title         = {Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs},
+  author        = {Luke J. Huang and Zhuoyang Zhang and Qinghao Hu and Shang Yang and Song Han},
+  year          = {2026},
+  month         = feb,
+  eprint        = {2602.17616},
+  archivePrefix = {arXiv},
+  primaryClass  = {cs.LG},
+  url           = {https://arxiv.org/abs/2602.17616}
 }
 ```
 
````