mit-han-lab/vcpo

Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs


Overview

We introduce Variance Controlled Off-Policy Optimization (VCPO), a framework that adds explicit variance-targeted controls for policy-gradient methods in the off-policy setting, enabling stable and scalable Async RL training.

  • ✨ Seamlessly integrates into common policy-gradient methods like REINFORCE/RLOO/GRPO
  • 🚀 2.5x faster Async RL training while matching synchronous RL performance
  • 🧠 Robust training stability under high off-policy settings (at least k=128 steps off-policy)

Async RL pipelines rollout generation with learning, significantly reducing end-to-end training time. But large speedups typically require high policy lag, which can cause training collapse.

Why? Highly stale rollouts make importance sampling ratios heavy-tailed, so a few trajectories dominate each update and the policy-gradient estimator becomes high-variance. Previous work has tried masking/clipping/whitening IS ratios, algorithmic changes, and system-side changes. These can delay collapse… but still fail at high asynchrony.
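As a toy illustration (our own, not from the paper): if per-trajectory log importance ratios are modeled as Gaussian with variance growing with policy lag, the normalized effective sample size of a batch collapses quickly as the tails get heavier:

```python
import numpy as np

def ess_ratio(w):
    """Normalized effective sample size rho_ess = ESS / B for importance weights w."""
    w = np.asarray(w, dtype=np.float64)
    return (w.sum() ** 2) / (len(w) * (w ** 2).sum())

# Toy staleness model: log importance ratios ~ N(0, sigma^2), with sigma
# growing as rollouts become more off-policy. Heavier tails mean a few
# trajectories dominate the batch, and rho_ess collapses toward 0.
rng = np.random.default_rng(0)
rho = {}
for sigma in (0.1, 1.0, 3.0):
    w = np.exp(rng.normal(0.0, sigma, size=4096))
    rho[sigma] = ess_ratio(w)
    print(f"sigma={sigma}: ESS/B = {rho[sigma]:.4f}")
```

With uniform weights (sigma → 0) the ratio is exactly 1; as sigma grows it falls by orders of magnitude, which is the variance blow-up VCPO targets.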

To address this, VCPO introduces two techniques to stabilize policy-gradient methods for asynchronous RL training:

  1. ESS-guided step scaling to dampen unreliable updates, following sqrt scaling for AdamW-style optimizers.

$$ \eta_{\text{eff}} \propto \sqrt{\rho_{\text{ess}}}, \qquad \rho_{\text{ess}} \triangleq \frac{\mathrm{ESS}}{B} \triangleq \frac{1}{B}\frac{\left(\sum_{i=1}^{B} w_i\right)^2}{\sum_{i=1}^{B} w_i^2} $$

  2. Closed-form off-policy optimal baseline (OPOB) computed from gradient norms and importance ratios (no learned critic), implemented with minimal overhead and compatible with DP×TP×SP parallelism:

$$ b_{\text{OPOB}}^\star=\frac{\sum_{i=1}^N w_i^2 \left\lVert\nabla_\theta \log \pi_\theta(\tau_i)\right\rVert^2 R_i}{\sum_{i=1}^N w_i^2 \left\lVert\nabla_\theta \log \pi_\theta(\tau_i)\right\rVert^2} $$
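The two controls above can be sketched in a few lines of NumPy. This is illustrative only; the repo implements them inside the Megatron training loop, and the function names here are our own:

```python
import numpy as np

def ess_scaled_lr(base_lr, w):
    """Scale the step size by sqrt(ESS/B): the sqrt rule for AdamW-style optimizers."""
    w = np.asarray(w, dtype=np.float64)
    rho_ess = (w.sum() ** 2) / (len(w) * (w ** 2).sum())
    return base_lr * np.sqrt(rho_ess)

def opob_baseline(w, grad_sq_norms, rewards):
    """Closed-form OPOB baseline: rewards averaged with weights w_i^2 * ||grad log pi(tau_i)||^2."""
    coef = np.asarray(w, dtype=np.float64) ** 2 * np.asarray(grad_sq_norms, dtype=np.float64)
    return float((coef * np.asarray(rewards)).sum() / coef.sum())

# With uniform importance weights both controls reduce to the on-policy case:
# ESS/B = 1 (no damping), and OPOB is a gradient-norm-weighted reward mean.
w = np.ones(4)
print(ess_scaled_lr(1e-3, w))                                 # 1e-3: full step
print(opob_baseline(w, [1, 1, 1, 1], [0.0, 1.0, 1.0, 0.0]))   # 0.5
```

As the weights concentrate on fewer trajectories, `ess_scaled_lr` shrinks the effective step, while `opob_baseline` keeps the advantage estimate centered without a learned critic.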

Results

We use k to denote the maximum sampler–learner policy lag (i.e., k steps off-policy), following the PipelineRL setting. Across math, general reasoning, and tool-use tasks with model sizes from 1.5B to 7B, VCPO enables stable asynchronous training where prior stabilizers fail. In long-context multi-turn RL, VCPO delivers a 2.5× end-to-end speedup while matching synchronous performance.

End-to-end training time vs. validation accuracy for synchronous (k=0) and asynchronous training (lag k). Here, Steps denotes gradient update steps, and GPU hours ↓ measures total wall-clock time across sampling and training GPUs.

Countdown

| Method | Countdown Acc ↑ | Steps | GPU hours ↓ |
| --- | --- | --- | --- |
| Base | 1.6% | -- | -- |
| Sync (k=0) | 38.4% | 400 | 143.2 |
| VCPO + Async (k=10) | 41.9% | 400 | 89.6 |

MATH-500

| Method | MATH-500 Acc ↑ | Steps | GPU hours ↓ |
| --- | --- | --- | --- |
| Base | 40.2% | -- | -- |
| Sync (k=0) | 72.0% | 400 | 134.4 |
| VCPO + Async (k=10) | 71.6% | 400 | 92.8 |

AIME 2025

| Method | AIME 2025 Acc ↑ | Steps | GPU hours ↓ |
| --- | --- | --- | --- |
| Base | 5.3% | -- | -- |
| Sync (k=0) | 26.7% | 300 | 420.2 |
| VCPO + Async (k=2) | 27.8% | 220 | 168.9 |

Async RL already achieves its full speedup at fewer than 10 steps off-policy, but we stress-tested far beyond that and found VCPO remains stable up to at least 128 steps off-policy.

Getting Started

VCPO is implemented for the Megatron backend, with core logic in megatron_actor.py, vcpo.py, and staleness_utils.py. Training scripts are under recipe/fully_async_policy/shell/vcpo/.

1. Install — follow the veRL documentation to set up the environment. Specifically, we use Megatron-Core 0.13.1 with vLLM 0.11.0 following the conda installation instructions.

2. Prepare data

hf download lukhuang/vcpo --repo-type dataset --local-dir data

3. Train

Edit the model and data paths in the script, then launch:

GSM8K and MATH-500 Experiments

GSM8K experiments use the Qwen2-1.5B model with the official train-test split.

# Synchronous (k=0)
bash recipe/fully_async_policy/shell/vcpo/gsm8k/synchronous.sh

# Fully asynchronous VCPO (k=12)
bash recipe/fully_async_policy/shell/vcpo/gsm8k/vcpo_k=12.sh

MATH experiments use the Qwen2.5-7B model with the official train-test split.

# Synchronous
bash recipe/fully_async_policy/shell/vcpo/math/synchronous.sh

# Fully asynchronous training + VCPO
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=10.sh

# Highly off-policy asynchronous training + VCPO
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=16.sh  # k=16 steps off-policy
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=32.sh  # k=32 steps off-policy
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=64.sh  # k=64 steps off-policy
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=128.sh # k=128 steps off-policy

Long-Horizon Tool-Use Experiments

We evaluate long-horizon tool use in the SimpleTIR setting, where the model must interleave reasoning with external tool calls. We train on the DAPO dataset and evaluate on a held-out exam-style benchmark (AIME 2025).

# Synchronous
bash recipe/fully_async_policy/shell/vcpo/multiturn/synchronous.sh

# Fully asynchronous VCPO
bash recipe/fully_async_policy/shell/vcpo/multiturn/vcpo_k=2.sh

Citation

If you find this work useful, please consider citing:

@article{huang2026stable,
  title         = {Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs},
  author        = {Luke J. Huang and Zhuoyang Zhang and Qinghao Hu and Shang Yang and Song Han},
  year          = {2026},
  month         = feb,
  eprint        = {2602.17616},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2602.17616}
}

License and Attribution

This repository was implemented on top of veRL at commit 15a9b0f58a8be2445417493ae7911439c9700cf2.

It is licensed under the Apache License, Version 2.0. See LICENSE for details.
