We introduce Variance Controlled Off-Policy Optimization (VCPO), a framework that adds explicit variance-targeted controls for policy-gradient methods in the off-policy setting, enabling stable and scalable Async RL training.
- ✨ Seamlessly integrates into common policy-gradient methods like REINFORCE/RLOO/GRPO
- 🚀 2.5x faster Async RL training while matching synchronous RL performance
- 🧠 Robust training stability under high off-policy settings (at least k=128 steps off-policy)
Async RL overlaps rollout generation with learning, significantly reducing end-to-end training time. But large speedups typically require high policy lag, which can cause training collapse.
Why? Highly stale rollouts make importance-sampling (IS) ratios heavy-tailed, so a few trajectories dominate each update and the policy-gradient estimator becomes high-variance. Previous work has tried masking/clipping/whitening IS ratios, algorithmic changes, and system-side changes. These can delay collapse, but they still fail at high asynchrony.
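A toy illustration (not from this repo) of why staleness is destructive: if we model the per-trajectory log importance ratio as Gaussian with a spread `sigma` that grows with policy lag, the normalized effective sample size (ESS) of the weights collapses rapidly, meaning a handful of trajectories carry almost all the gradient signal.

```python
import numpy as np

def normalized_ess(log_weights):
    """Normalized effective sample size (sum w)^2 / (n * sum w^2), in (0, 1]."""
    w = np.exp(log_weights - np.max(log_weights))  # shift for numerical stability
    return float(w.sum() ** 2 / (len(w) * (w ** 2).sum()))

# Toy model of staleness: treat log(pi_learner / pi_sampler) as N(0, sigma^2),
# with sigma growing as the sampler falls further behind the learner.
# As sigma grows, the weights become heavy-tailed and the ESS collapses.
rng = np.random.default_rng(0)
for sigma in (0.1, 0.5, 1.0, 2.0):
    ess = normalized_ess(rng.normal(0.0, sigma, size=100_000))
    print(f"sigma={sigma}: normalized ESS = {ess:.3f}")
```

For lognormal weights the normalized ESS decays roughly like exp(-sigma^2), so even moderate policy drift leaves only a small fraction of the batch effectively contributing to each update.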
To address this, VCPO introduces two techniques to stabilize policy-gradient methods for asynchronous RL training:
- ESS-guided step scaling to dampen unreliable updates, following sqrt scaling for AdamW-style optimizers.
- Closed-form off-policy optimal baseline (OPOB) using gradient norms and importance ratios (no learned critic), implemented with minimal overhead and compatible with DPxTPxSP.
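The paper's exact formulas are not reproduced here, but the two controls could be instantiated along the following lines. This is a minimal NumPy sketch under stated assumptions: `ess_scaled_lr` and `opob_baseline` are hypothetical names, and the OPOB form shown is the classical variance-minimizing baseline for an IS-weighted REINFORCE estimator, which may differ in detail from VCPO's.

```python
import numpy as np

def normalized_ess(ratios):
    # Normalized effective sample size of importance weights, in (0, 1].
    return float(ratios.sum() ** 2 / (len(ratios) * (ratios ** 2).sum()))

def ess_scaled_lr(base_lr, ratios):
    # Hypothetical ESS-guided step scaling: a low-ESS batch behaves like a
    # smaller effective batch, so shrink the AdamW step by sqrt(ESS),
    # mirroring the sqrt learning-rate/batch-size scaling of AdamW-style
    # optimizers.
    return base_lr * np.sqrt(normalized_ess(ratios))

def opob_baseline(ratios, grad_norms, returns):
    # Hypothetical closed-form off-policy baseline: weight each trajectory's
    # return by the squared (IS ratio x per-sample score-gradient norm).
    # This is the variance-minimizing baseline for an IS-weighted REINFORCE
    # estimator and needs no learned critic.
    w = (ratios * grad_norms) ** 2
    return float((w * returns).sum() / (w.sum() + 1e-12))
```

When all ratios equal 1 (fully on-policy) the scaled learning rate reduces to `base_lr`, and with uniform gradient norms the baseline reduces to the mean return, recovering the standard on-policy behavior.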
We use k to denote the maximum sampler–learner policy lag (i.e., k steps off-policy), following the PipelineRL setting. Across math, general reasoning, and tool-use tasks with model sizes from 1.5B to 7B, VCPO enables stable asynchronous training where prior stabilizers fail. In long-context multi-turn RL, VCPO delivers a 2.5× end-to-end speedup while matching synchronous performance.
End-to-end training time vs. validation accuracy for synchronous (k=0) and asynchronous training (lag k).
Here, Steps denotes gradient-update steps, and GPU hours ↓ measures total wall-clock GPU time across sampling and training GPUs.
| Method | Countdown Acc ↑ | Steps | GPU hours ↓ |
|---|---|---|---|
| Base | 1.6% | -- | -- |
| Sync (k=0) | 38.4% | 400 | 143.2 |
| VCPO + Async (k=10) | 41.9% | 400 | 89.6 |
| Method | MATH-500 Acc ↑ | Steps | GPU hours ↓ |
|---|---|---|---|
| Base | 40.2% | -- | -- |
| Sync (k=0) | 72.0% | 400 | 134.4 |
| VCPO + Async (k=10) | 71.6% | 400 | 92.8 |
| Method | AIME 2025 Acc ↑ | Steps | GPU hours ↓ |
|---|---|---|---|
| Base | 5.3% | -- | -- |
| Sync (k=0) | 26.7% | 300 | 420.2 |
| VCPO + Async (k=2) | 27.8% | 220 | 168.9 |
Async RL already achieves its full speedup at fewer than 10 steps off-policy, but we stress-tested far beyond that and found VCPO remains stable up to at least 128 steps off-policy.
VCPO is implemented for the Megatron backend, with core logic in megatron_actor.py, vcpo.py, and staleness_utils.py. Training scripts are under recipe/fully_async_policy/shell/vcpo/.
1. Install — follow the veRL documentation to set up the environment. Specifically, we use Megatron-Core 0.13.1 with vLLM 0.11.0 following the conda installation instructions.
2. Prepare data
```bash
hf download lukhuang/vcpo --repo-type dataset --local-dir data
```
3. Train
Edit the model and data paths in the script, then launch
GSM8K experiments use the Qwen2-1.5B model and the official train-test split.
```bash
# Synchronous (k=0)
bash recipe/fully_async_policy/shell/vcpo/gsm8k/synchronous.sh

# Fully asynchronous VCPO (k=12)
bash recipe/fully_async_policy/shell/vcpo/gsm8k/vcpo_k=12.sh
```

MATH experiments use the Qwen2.5-7B model and the official train-test split.
```bash
# Synchronous
bash recipe/fully_async_policy/shell/vcpo/math/synchronous.sh

# Fully asynchronous training + VCPO
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=10.sh

# Highly off-policy asynchronous training + VCPO
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=16.sh   # k=16 steps off-policy
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=32.sh   # k=32 steps off-policy
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=64.sh   # k=64 steps off-policy
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=128.sh  # k=128 steps off-policy
```

We evaluate long-horizon tool use in the SimpleTIR setting, where the model must interleave reasoning with external tool calls. We train on the DAPO dataset and evaluate on a held-out exam-style benchmark (AIME 2025).
```bash
# Synchronous
bash recipe/fully_async_policy/shell/vcpo/multiturn/synchronous.sh

# Fully asynchronous VCPO
bash recipe/fully_async_policy/shell/vcpo/multiturn/vcpo_k=2.sh
```

If you find this work useful, please consider citing:
```bibtex
@article{huang2026stable,
  title         = {Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs},
  author        = {Luke J. Huang and Zhuoyang Zhang and Qinghao Hu and Shang Yang and Song Han},
  year          = {2026},
  month         = feb,
  eprint        = {2602.17616},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2602.17616}
}
```

This repository was implemented on top of veRL at commit 15a9b0f58a8be2445417493ae7911439c9700cf2.
It is licensed under the Apache License, Version 2.0. See LICENSE for details.


