
Commit cdd7d87

[BREAKING][recipe, ckpt] feat: support parameter sync by checkpoint-engine. only for fully_async mode. (#4427)
### What does this PR do?

Supports efficient parameter synchronization between the trainer and rollouter for fully_async mode.

> Add **concise** overview of what this PR aims to achieve or accomplish. Reference related GitHub issues and PRs that help with the review.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: ...
- [x] Format the PR title as `[{modules}] {type}: {description}` (This will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
  - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

> For changes that can not be tested by CI (e.g., algorithm implementation, new model support), validate by experiment(s) and show results like training curve plots, evaluation results, etc.

### API and Usage Example

> Demonstrate how the API changes if any, and provide usage example(s) if possible.

```python
# Add code snippet or script demonstrating how to use this
```

### Design & Code Changes

> Demonstrate the high-level design if this PR is complex, and list the specific changes.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
1 parent 16a6c47 commit cdd7d87


9 files changed: +848 −20 lines changed


docs/advance/fully_async.md

Lines changed: 39 additions & 3 deletions
@@ -2,7 +2,7 @@
**Author:** `https://github.com/meituan-search`

-Last updated: 10/18/2025.
+Last updated: 12/25/2025.

This document introduces a fully asynchronous PPO training system that completely decouples the Trainer and Rollouter,
supporting asynchronous sample generation and training.
@@ -46,8 +46,8 @@ can significantly improve training efficiency.
* **Parallel Generation and Training**: While the Trainer is training, the Rollouter is generating new samples.
* **Multi-step Asynchronous**: Compared to one step off policy, it supports asynchronous settings from 0.x steps to
  multiple steps, making the asynchronous solution more flexible.
-* **NCCL Parameter Synchronization**: Uses NCCL communication primitives for parameter communication between Rollouter
-  and Trainer.
+* **NCCL Parameter Synchronization**: Built on NCCL communication primitives and following [checkpoint-engine](https://github.com/MoonshotAI/checkpoint-engine),
+  it achieves efficient parameter synchronization between the Rollouter and the Trainer.
* **Stream Inference and Training**: Rollouter generates data sample by sample, and data transmission uses a single
  sample as the minimum transmission unit.
* **Asynchronous Training and Freshness Control**: By setting the parameter async_training.staleness_threshold, it
@@ -105,6 +105,9 @@ https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_a
| `async_training.partial_rollout` | Whether to perform partial_rollout |
| `async_training.use_rollout_log_probs` | Use log_probs generated by rollout |
| `async_training.compute_prox_log_prob` | Whether to compute log_prob using the training model's parameters during the training phase. |
+| `async_training.checkpoint_engine.enable` | Whether to use checkpoint_engine to accelerate parameter synchronization; default `True` |
+| `async_training.checkpoint_engine.overlap_broadcast_and_consume` | When using checkpoint_engine, whether to overlap broadcast and load_weights; default `False` |
+| `async_training.checkpoint_engine.device_buffer_size_M` | When using checkpoint_engine, the user-specified bucket size (MB); default `4096` |

**Further Explanation:**
@@ -172,6 +175,27 @@ https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_a
Additionally, when compute_prox_log_prob and Rollout Importance Sampling are enabled under mode d
(async stream pipeline with partial rollout), our implementation approximates `Areal's Decoupled PPO`.

+* `async_training.checkpoint_engine.enable`
+
+  Enabling the checkpoint engine generally reduces synchronization time by more than 60% compared to
+  the original per-tensor parameter synchronization method. However, assembling buckets incurs additional
+  temporary GPU memory overhead.
+
+* `async_training.checkpoint_engine.overlap_broadcast_and_consume`
+
+  Pipelining the broadcast and load_weights phases allocates additional GPU memory. Since most of the
+  parameter synchronization time is spent not in the broadcast and load_weights phases but in the
+  parameter generation phase (by Megatron or FSDP), this option is off by default.
+
+* `async_training.checkpoint_engine.device_buffer_size_M`
+
+  Controls the size of the device memory buffer used for synchronization when the checkpoint engine is enabled.
+  The actual `bucket_size` = `max(device_buffer_size_M, maximum parameter tensor size)`.
+  * When `overlap_broadcast_and_consume` is enabled, the additional device memory overhead is
+    `3 * bucket_size` per trainer rank and `2 * bucket_size` per rollout rank.
+  * When `overlap_broadcast_and_consume` is disabled, the additional device memory overhead is
+    `2 * bucket_size` per trainer rank and `1 * bucket_size` per rollout rank.

### Supported Modes

1. on policy pipeline:
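As a rough illustration of the sizing rule above, here is a minimal Python sketch of the `bucket_size` and per-rank memory-overhead arithmetic; `estimate_sync_overhead` and the example tensor sizes are hypothetical, not part of checkpoint-engine or verl:

```python
# Sketch of bucket sizing and temporary memory overhead for checkpoint-engine sync.
def estimate_sync_overhead(tensor_sizes_bytes, device_buffer_size_M=4096,
                           overlap_broadcast_and_consume=False):
    """Return (bucket_size, trainer_overhead, rollout_overhead) in bytes."""
    # bucket_size = max(device_buffer_size_M, largest parameter tensor)
    bucket_size = max(device_buffer_size_M * 1024 * 1024, max(tensor_sizes_bytes))
    if overlap_broadcast_and_consume:
        return bucket_size, 3 * bucket_size, 2 * bucket_size  # trainer / rollout ranks
    return bucket_size, 2 * bucket_size, 1 * bucket_size


# Illustrative tensor sizes only: a ~1 GiB embedding plus smaller weight tensors.
sizes = [1_090_000_000, 25_000_000, 12_500_000]
bucket, trainer_mem, rollout_mem = estimate_sync_overhead(sizes)
print(f"bucket_size ≈ {bucket / 2**30:.2f} GiB, "
      f"trainer overhead ≈ {trainer_mem / 2**30:.2f} GiB, "
      f"rollout overhead ≈ {rollout_mem / 2**30:.2f} GiB")
```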
@@ -437,6 +461,18 @@ future will be our next focus.
> source data: https://wandb.ai/hou-zg-meituan/fully-async-policy-30B?nw=nwuserhouzg

+### checkpoint-engine Ablation Experiment
+
+We measured the single-step parameter synchronization time of the checkpoint-engine on three models, Qwen2.5-Math-7B, Qwen3-30B-A3B, and Qwen3-235B-A22B, using the default checkpoint-engine configuration. All experiments were run on H20 machines with the Megatron training engine.
+
+| model | trainer ranks | rollout ranks | checkpoint-engine | total sync time |
+|:-----------------:|:--------:|:-------:|:--------------:|:--------------:|
+| Qwen2.5-Math-7B | 4 | 4 | False | 0.12s |
+| Qwen2.5-Math-7B | 4 | 4 | True | 0.02s |
+| Qwen3-30B-A3B | 16 | 16 | False | 15.76s |
+| Qwen3-30B-A3B | 16 | 16 | True | 4.38s |
+| Qwen3-235B-A22B | 64 | 64 | False | 58.57s |
+| Qwen3-235B-A22B | 64 | 64 | True | 23.70s |

## Multi-Turn Tool Calling

Referencing **recipe/retool** and **ToolAgentLoop**, we implemented **AsyncPartialToolAgentLoop**, a multi-turn

recipe/fully_async_policy/README.md

Lines changed: 40 additions & 3 deletions
@@ -2,7 +2,7 @@
**Author:** `https://github.com/meituan-search`

-Last updated: 10/18/2025.
+Last updated: 12/25/2025.

This document introduces a fully asynchronous PPO training system that completely decouples the Trainer and Rollouter,
supporting asynchronous sample generation and training.
@@ -46,8 +46,8 @@ can significantly improve training efficiency.
* **Parallel Generation and Training**: While the Trainer is training, the Rollouter is generating new samples.
* **Multi-step Asynchronous**: Compared to one step off policy, it supports asynchronous settings from 0.x steps to
  multiple steps, making the asynchronous solution more flexible.
-* **NCCL Parameter Synchronization**: Uses NCCL communication primitives for parameter communication between Rollouter
-  and Trainer.
+* **NCCL Parameter Synchronization**: Built on NCCL communication primitives and following [checkpoint-engine](https://github.com/MoonshotAI/checkpoint-engine),
+  it achieves efficient parameter synchronization between the Rollouter and the Trainer.
* **Stream Inference and Training**: Rollouter generates data sample by sample, and data transmission uses a single
  sample as the minimum transmission unit.
* **Asynchronous Training and Freshness Control**: By setting the parameter async_training.staleness_threshold, it
@@ -105,6 +105,9 @@ https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_a
| `async_training.partial_rollout` | Whether to perform partial_rollout |
| `async_training.use_rollout_log_probs` | Use log_probs generated by rollout |
| `async_training.compute_prox_log_prob` | Whether to compute log_prob using the training model's parameters during the training phase. |
+| `async_training.checkpoint_engine.enable` | Whether to use checkpoint_engine to accelerate parameter synchronization; default `True` |
+| `async_training.checkpoint_engine.overlap_broadcast_and_consume` | When using checkpoint_engine, whether to overlap broadcast and load_weights; default `False` |
+| `async_training.checkpoint_engine.device_buffer_size_M` | When using checkpoint_engine, the user-specified bucket size (MB); default `4096` |

**Further Explanation:**
@@ -172,6 +175,28 @@ https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_a
Additionally, when compute_prox_log_prob and Rollout Importance Sampling are enabled under mode d
(async stream pipeline with partial rollout), our implementation approximates `Areal's Decoupled PPO`.

+* `async_training.checkpoint_engine.enable`
+
+  Enabling the checkpoint engine generally reduces synchronization time by more than 60% compared to
+  the original per-tensor parameter synchronization method. However, assembling buckets incurs additional
+  temporary GPU memory overhead.
+
+* `async_training.checkpoint_engine.overlap_broadcast_and_consume`
+
+  Pipelining the broadcast and load_weights phases allocates additional GPU memory. Since most of the
+  parameter synchronization time is spent not in the broadcast and load_weights phases but in the
+  parameter generation phase (by Megatron or FSDP), this option is off by default.
+
+* `async_training.checkpoint_engine.device_buffer_size_M`
+
+  Controls the size of the device memory buffer used for synchronization when the checkpoint engine is enabled.
+  The actual `bucket_size` = `max(device_buffer_size_M, maximum parameter tensor size)`.
+  * When `overlap_broadcast_and_consume` is enabled, the additional device memory overhead is
+    `3 * bucket_size` per trainer rank and `2 * bucket_size` per rollout rank.
+  * When `overlap_broadcast_and_consume` is disabled, the additional device memory overhead is
+    `2 * bucket_size` per trainer rank and `1 * bucket_size` per rollout rank.

### Supported Modes

1. on policy pipeline:
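For intuition about `overlap_broadcast_and_consume`, the following schematic Python sketch shows a double-buffered loop in which the broadcast of bucket `i+1` overlaps with consuming bucket `i`; `broadcast_bucket` and `load_weights` are hypothetical stand-ins simulated with `time.sleep`, not the checkpoint-engine API:

```python
# Schematic double-buffering: broadcast the next bucket while the current one is consumed.
import time
from concurrent.futures import ThreadPoolExecutor

def broadcast_bucket(bucket_id: int) -> int:
    time.sleep(0.05)  # stand-in for an NCCL broadcast of one parameter bucket
    return bucket_id

def load_weights(bucket_id: int) -> None:
    time.sleep(0.05)  # stand-in for the rollout engine consuming the bucket

def sync_overlapped(num_buckets: int) -> None:
    with ThreadPoolExecutor(max_workers=1) as pool:
        inflight = pool.submit(broadcast_bucket, 0)              # prefetch bucket 0
        for i in range(num_buckets):
            ready = inflight.result()                            # wait for bucket i
            if i + 1 < num_buckets:
                inflight = pool.submit(broadcast_bucket, i + 1)  # start bucket i+1
            load_weights(ready)                                  # consume bucket i meanwhile

sync_overlapped(num_buckets=4)
```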
@@ -437,6 +462,18 @@ future will be our next focus.
> source data: https://wandb.ai/hou-zg-meituan/fully-async-policy-30B?nw=nwuserhouzg

+### checkpoint-engine Ablation Experiment
+
+We measured the single-step parameter synchronization time of the checkpoint-engine on three models, Qwen2.5-Math-7B, Qwen3-30B-A3B, and Qwen3-235B-A22B, using the default checkpoint-engine configuration. All experiments were run on H20 machines with the Megatron training engine.
+
+| model | trainer ranks | rollout ranks | checkpoint-engine | total sync time |
+|:-----------------:|:--------:|:-------:|:--------------:|:--------------:|
+| Qwen2.5-Math-7B | 4 | 4 | False | 0.12s |
+| Qwen2.5-Math-7B | 4 | 4 | True | 0.02s |
+| Qwen3-30B-A3B | 16 | 16 | False | 15.76s |
+| Qwen3-30B-A3B | 16 | 16 | True | 4.38s |
+| Qwen3-235B-A22B | 64 | 64 | False | 58.57s |
+| Qwen3-235B-A22B | 64 | 64 | True | 23.70s |

## Multi-Turn Tool Calling

Referencing **recipe/retool** and **ToolAgentLoop**, we implemented **AsyncPartialToolAgentLoop**, a multi-turn

recipe/fully_async_policy/README_zh.md

Lines changed: 30 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -2,7 +2,7 @@
22

33
**Author:** `https://github.com/meituan-search`
44

5-
Last updated: 10/17/2025.
5+
Last updated: 12/15/2025.
66

77
本文档介绍了完全异步PPO训练系统,该系统实现了 Trainer 和 Rollouter 的完全解耦,支持异步样本生成和训练。
88
在该系统下,我们使用128卡训练qwen2.5-7B模型取得了2.35x-2.67x的性能提升,同时效果没有显著受到影响。
@@ -33,7 +33,7 @@ rollout的训练, 通过合理设置资源分配情况、参数同步频率等
3333
* **资源隔离**:与使用hybrid_engine不同,Rollouter和Trainer使用分离的计算资源,需要分别指定所占用的资源。
3434
* **生成与训练并行**:Trainer在训练的同时,Rollouter在生成新的样本。
3535
* **多步异步**: 相比 one step off policy 支持0.x步到多步的异步设定,异步方案更加灵活。
36-
* **nccl参数同步**使用nccl通信原语进行Rollouter与Trainer参数的通信
36+
* **nccl参数同步**基于nccl通信原语,参考[checkpoint-engine](https://github.com/MoonshotAI/checkpoint-engine)实现Rollouter与Trainer间的高效参数同步
3737
* **Stream推理与训练**:Rollouter逐样本生成数据,同时数据传输以单个sample为最小传输单位。
3838
* **异步训练与新鲜度控制**:通过设置参数async_training.staleness_threshold,支持使用旧参数生成的样本进行训练。
3939
* **PartialRollout**: Rollouter推理过程支持partial rollout逻辑,通过参数同步时,添加`sleep()``resume()`
@@ -82,6 +82,9 @@ https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_a
8282
| `async_training.partial_rollout` | 是否进行partial_rollout |
8383
| `async_training.use_rollout_log_probs` | 使用rollout产生的log_probs |
8484
| `async_training.compute_prox_log_prob`(experimental) | 是否在train阶段,使用train模型的参数计算token的 log_prob |
85+
| `async_training.checkpoint_engine.enable`| 是否开启checkpoint_engine模式的加速,默认值True |
86+
| `async_training.checkpoint_engine.overlap_broadcast_and_consume` | 启动checkpoint_engine时,是否在参数同步时在broadcast和加载之间使用流水,默认值False|
87+
| `async_training.checkpoint_engine.device_buffer_size_M` | 启动checkpoint_engine时,组装的bucket的大小(MB),默认为4096 |
8588

8689
**进一步的解释:**
8790

@@ -140,6 +143,20 @@ https://github.com/ArronHZG/verl-community/blob/recipe/async_policy/docs/fully_a
Additionally, when `compute_prox_log_prob` and `Rollout Importance Sampling` are enabled under mode d
(async stream pipeline with partial rollout), our implementation approximates Areal's `Decoupled PPO`.

+* `async_training.checkpoint_engine.enable`
+
+  Enabling the checkpoint engine generally reduces synchronization time by more than 60% compared to the original per-tensor parameter synchronization method, but assembling buckets incurs additional temporary GPU memory overhead.
+
+* `async_training.checkpoint_engine.overlap_broadcast_and_consume`
+
+  Pipelining the parameter broadcast and load_weights phases allocates further additional GPU memory. Since our analysis shows that most of the parameter synchronization time is spent not in the broadcast and load_weights phases but in the parameter generation phase (by Megatron or FSDP), this option is off by default.
+
+* `async_training.checkpoint_engine.device_buffer_size_M`
+
+  Controls the size of the device memory buffer used for synchronization when the checkpoint engine is enabled. The actual `bucket_size` = `max(device_buffer_size_M, maximum parameter tensor size)`.
+  * When `overlap_broadcast_and_consume` is enabled, the temporary additional device memory overhead is `3 * bucket_size` per trainer rank and `2 * bucket_size` per rollout rank.
+  * When `overlap_broadcast_and_consume` is disabled, the temporary additional device memory overhead is `2 * bucket_size` per trainer rank and `1 * bucket_size` per rollout rank.

### Supported Modes

1. on policy pipeline:
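As a reading aid for the options documented above, here is a minimal sketch of how the `async_training.checkpoint_engine` settings could be expressed with OmegaConf; the surrounding structure is assumed from this document rather than taken from a specific verl config file:

```python
from omegaconf import OmegaConf

# Hypothetical programmatic view of the checkpoint_engine options documented above.
async_training = OmegaConf.create({
    "checkpoint_engine": {
        "enable": True,                          # bucketed parameter sync (default True)
        "overlap_broadcast_and_consume": False,  # pipeline broadcast and load_weights (default False)
        "device_buffer_size_M": 4096,            # bucket size in MB (default 4096)
    }
})
print(OmegaConf.to_yaml(async_training))
```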
@@ -374,6 +391,17 @@ divisible by the number of GPUs, which limits the flexibility of resource adjustment. In addition, as

> source data: https://wandb.ai/hou-zg-meituan/fully-async-policy-30B?nw=nwuserhouzg

+### checkpoint-engine Parameter Synchronization Ablation Experiment
+
+We measured the single-step parameter synchronization time of checkpoint-engine-based synchronization on Qwen2.5-Math-7B, Qwen3-30B-A3B, and Qwen3-235B-A22B, using the default configuration. All experiments were run on H20 machines with the Megatron training engine.
+
+| model | trainer ranks | rollout ranks | checkpoint-engine | total sync time |
+|:-----------------:|:--------:|:-------:|:--------------:|:--------------:|
+| Qwen2.5-Math-7B | 4 | 4 | False | 0.12s |
+| Qwen2.5-Math-7B | 4 | 4 | True | 0.02s |
+| Qwen3-30B-A3B | 16 | 16 | False | 15.76s |
+| Qwen3-30B-A3B | 16 | 16 | True | 4.38s |
+| Qwen3-235B-A22B | 64 | 64 | False | 58.57s |
+| Qwen3-235B-A22B | 64 | 64 | True | 23.70s |

## Multi-Turn Tool Calling

Referencing **recipe/retool** and **ToolAgentLoop**, we implemented for **fully_async_policy** a multi-turn tool-calling loop that supports partial rollout *
