We introduce Variance Controlled Off-Policy Optimization (VCPO), a framework that adds explicit variance-targeted controls for policy-gradient methods in the off-policy setting, enabling stable and scalable Async RL training.
- ✨ Seamlessly integrates into common policy-gradient methods like REINFORCE/RLOO/GRPO
- 🚀 2.5x faster Async RL training while matching synchronous RL performance
- 🧠 Robust training stability under high off-policy settings (at least k=128 steps off-policy)
Async RL overlaps rollout generation with learning, significantly reducing end-to-end training time. But large speedups typically require high policy lag, which can cause training collapse.
Why? Highly stale rollouts make importance-sampling (IS) ratios heavy-tailed, so a few trajectories dominate each update and the policy-gradient estimator becomes high-variance. Previous work has tried masking/clipping/whitening IS ratios, algorithmic changes, and system-side changes. These can delay collapse, but they still fail at high asynchrony.
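A toy illustration (not from this repo) of why staleness is destructive: if we model the per-trajectory log importance ratio as Gaussian with a spread `sigma` that grows with policy lag, the normalized effective sample size (ESS) of the weights collapses rapidly, meaning a handful of trajectories carry almost all the gradient signal.

```python
import numpy as np

def normalized_ess(log_weights):
    """Normalized effective sample size (sum w)^2 / (n * sum w^2), in (0, 1]."""
    w = np.exp(log_weights - np.max(log_weights))  # shift for numerical stability
    return float(w.sum() ** 2 / (len(w) * (w ** 2).sum()))

# Toy model of staleness: treat log(pi_learner / pi_sampler) as N(0, sigma^2),
# with sigma growing as the sampler falls further behind the learner.
# As sigma grows, the weights become heavy-tailed and the ESS collapses.
rng = np.random.default_rng(0)
for sigma in (0.1, 0.5, 1.0, 2.0):
    ess = normalized_ess(rng.normal(0.0, sigma, size=100_000))
    print(f"sigma={sigma}: normalized ESS = {ess:.3f}")
```

For lognormal weights the normalized ESS decays roughly like exp(-sigma^2), so even moderate policy drift leaves only a small fraction of the batch effectively contributing to each update.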
To address this, VCPO introduces two techniques to stabilize policy-gradient methods for asynchronous RL training:
- ESS-guided step scaling to dampen unreliable updates, following sqrt scaling for AdamW-style optimizers.
- Closed-form off-policy optimal baseline (OPOB) using gradient norms and importance ratios (no learned critic), implemented with minimal overhead and compatible with DPxTPxSP.
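The paper's exact formulas are not reproduced here, but the two controls could be instantiated along the following lines. This is a minimal NumPy sketch under stated assumptions: `ess_scaled_lr` and `opob_baseline` are hypothetical names, and the OPOB form shown is the classical variance-minimizing baseline for an IS-weighted REINFORCE estimator, which may differ in detail from VCPO's.

```python
import numpy as np

def normalized_ess(ratios):
    # Normalized effective sample size of importance weights, in (0, 1].
    return float(ratios.sum() ** 2 / (len(ratios) * (ratios ** 2).sum()))

def ess_scaled_lr(base_lr, ratios):
    # Hypothetical ESS-guided step scaling: a low-ESS batch behaves like a
    # smaller effective batch, so shrink the AdamW step by sqrt(ESS),
    # mirroring the sqrt learning-rate/batch-size scaling of AdamW-style
    # optimizers.
    return base_lr * np.sqrt(normalized_ess(ratios))

def opob_baseline(ratios, grad_norms, returns):
    # Hypothetical closed-form off-policy baseline: weight each trajectory's
    # return by the squared (IS ratio x per-sample score-gradient norm).
    # This is the variance-minimizing baseline for an IS-weighted REINFORCE
    # estimator and needs no learned critic.
    w = (ratios * grad_norms) ** 2
    return float((w * returns).sum() / (w.sum() + 1e-12))
```

When all ratios equal 1 (fully on-policy) the scaled learning rate reduces to `base_lr`, and with uniform gradient norms the baseline reduces to the mean return, recovering the standard on-policy behavior.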
We use k to denote the maximum sampler–learner policy lag (i.e., k steps off-policy), following the PipelineRL setting. Across math, general reasoning, and tool-use tasks with model sizes from 1.5B to 7B, VCPO enables stable asynchronous training where prior stabilizers fail. In long-context multi-turn RL, VCPO delivers a 2.5× end-to-end speedup while matching synchronous performance.
End-to-end training time vs. validation accuracy for synchronous (k=0) and asynchronous training (lag k).
Here, Steps denotes gradient-update steps, and GPU hours ↓ measures total wall-clock GPU time across sampling and training GPUs.
| Method | Countdown Acc ↑ | Steps | GPU hours ↓ |
|---|---|---|---|
| Base | 1.6% | -- | -- |
| Sync (k=0) | 38.4% | 400 | 143.2 |
| VCPO + Async (k=10) | 41.9% | 400 | 89.6 |
| Method | MATH-500 Acc ↑ | Steps | GPU hours ↓ |
|---|---|---|---|
| Base | 40.2% | -- | -- |
| Sync (k=0) | 72.0% | 400 | 134.4 |
| VCPO + Async (k=10) | 71.6% | 400 | 92.8 |
| Method | AIME 2025 Acc ↑ | Steps | GPU hours ↓ |
|---|---|---|---|
| Base | 5.3% | -- | -- |
| Sync (k=0) | 26.7% | 300 | 420.2 |
| VCPO + Async (k=2) | 27.8% | 220 | 168.9 |
Async RL already achieves its full speedup at fewer than 10 steps off-policy, but we stress-tested far beyond that and found VCPO remains stable up to at least 128 steps off-policy.
VCPO is implemented for the Megatron backend, with core logic in megatron_actor.py, vcpo.py, and staleness_utils.py. Training scripts are under recipe/fully_async_policy/shell/vcpo/.
1. Install — follow the veRL documentation to set up the environment. Specifically, we use Megatron-Core 0.13.1 with vLLM 0.11.0 following the conda installation instructions.
2. Prepare data
```bash
hf download lukhuang/vcpo --repo-type dataset --local-dir data
```
3. Train
Edit the model and data paths in the script, then launch
GSM8K experiments use the Qwen2-1.5B model and the official train-test split.
```bash
# Synchronous (k=0)
bash recipe/fully_async_policy/shell/vcpo/gsm8k/synchronous.sh

# Fully asynchronous VCPO (k=12)
bash recipe/fully_async_policy/shell/vcpo/gsm8k/vcpo_k=12.sh
```

MATH experiments use the Qwen2.5-7B model and the official train-test split.
```bash
# Synchronous
bash recipe/fully_async_policy/shell/vcpo/math/synchronous.sh

# Fully asynchronous training + VCPO
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=10.sh

# Highly off-policy asynchronous training + VCPO
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=16.sh   # k=16 steps off-policy
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=32.sh   # k=32 steps off-policy
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=64.sh   # k=64 steps off-policy
bash recipe/fully_async_policy/shell/vcpo/math/vcpo_k=128.sh  # k=128 steps off-policy
```

We evaluate long-horizon tool use in the SimpleTIR setting, where the model must interleave reasoning with external tool calls. We train on the DAPO dataset and evaluate on a held-out exam-style benchmark (AIME 2025).
```bash
# Synchronous
bash recipe/fully_async_policy/shell/vcpo/multiturn/synchronous.sh

# Fully asynchronous VCPO
bash recipe/fully_async_policy/shell/vcpo/multiturn/vcpo_k=2.sh
```

If you find this work useful, please consider citing:
```bibtex
@article{huang2026stable,
  title         = {Stable Asynchrony: Variance-Controlled Off-Policy RL for LLMs},
  author        = {Luke J. Huang and Zhuoyang Zhang and Qinghao Hu and Shang Yang and Song Han},
  year          = {2026},
  month         = feb,
  eprint        = {2602.17616},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2602.17616}
}
```

This repository was implemented on top of veRL at commit 15a9b0f58a8be2445417493ae7911439c9700cf2.
It is licensed under the Apache License, Version 2.0. See LICENSE for details.


