feat: add StatelessProcessGroup to extend collective library #66
HubertZhang merged 2 commits into MoonshotAI:main
Conversation
force-pushed from 1b27b3f to f989a80
@weixiao-huang @HubertZhang Please review this PR. It has been tested on both NPU and CUDA, and the same model was also tested with the default torch.distributed module.
It seems this PR has to depend on vLLM, which is heavy and not an elegant approach, I think.
The default communication path is still torch.distributed; StatelessProcessGroup is only needed when communicating across resource groups. Without this support it cannot be merged into verl 😆, since the architecture that separates training and rollout would not be supported.
Would it be better to design a protocol `DistributedLib` and pass a `dist: DistributedLib` into ps? The current import-based approach does not feel isolated enough.
A new module has been added.
In most cases, the logic remains consistent with before. We only need to depend on vLLM when the custom distributed module is required; this does not change the fact that checkpoint-engine is a lightweight component.
I tried it locally:

```python
# import torch.distributed as dist
import checkpoint_engine.distributed as dist

dist.init_process_group()
dist.all_reduce()
dist.xxxx()
```

If you need to use
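To make the drop-in idea above concrete, here is a hypothetical sketch of how such a facade module could forward `torch.distributed`-style calls to a pluggable backend. All names here (`DistFacade`, `RecordingBackend`) are illustrative and not the real checkpoint-engine API.

```python
# Hypothetical sketch: call sites keep the torch.distributed-style API
# while every call is forwarded to whichever backend was selected.

class RecordingBackend:
    """Stand-in backend that records calls; a real backend would wrap
    torch.distributed or a StatelessProcessGroup-based library."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def init_process_group(self, **kwargs) -> None:
        self.calls.append(("init_process_group", kwargs))

    def all_reduce(self, tensor):
        self.calls.append(("all_reduce", tensor))
        return tensor


class DistFacade:
    """Forwards attribute lookups to the active backend, so
    dist.all_reduce(...) works regardless of which backend is chosen."""

    def __init__(self, backend) -> None:
        self._backend = backend

    def __getattr__(self, name):
        return getattr(self._backend, name)


dist = DistFacade(RecordingBackend())
dist.init_process_group(backend="nccl")
result = dist.all_reduce([1, 2, 3])
```

In a real implementation the backend would be chosen once at import or init time (e.g. torch.distributed by default, the custom stateless path only for cross-resource groups).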
If there are no further review comments, can this be merged?
By the way, should we abstract StatelessProcessGroup rather than dist? Using well-encapsulated high-level methods directly in ps looks much more convenient. I imagine the sub-group part might be a bit more complex, but everything else should be much simpler.
I don't quite follow this part. Collective communication is implemented by NCCLLibrary/HCCLLibrary inside PyNcclCommunicator (PyHcclCommunicator); StatelessProcessGroup is only used by the Communicator at init time.
I took a closer look. What about something like:

```python
class Distributed(ABC):
    ...

    @abstractmethod
    def sub_group(self, ranks: list[int]) -> "AbstractProcessGroup":
        ...
```

The communication-related parts of these two functions would become much simpler, directly using the passed-in
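The suggested protocol could be filled out as a minimal runnable sketch. Everything beyond the two abstract signatures is illustrative: `LocalGroup` and `LocalDistributed` are toy single-process stand-ins, not real NCCL/HCCL backends.

```python
from abc import ABC, abstractmethod


class AbstractProcessGroup(ABC):
    """High-level group handle that ps would call into directly."""

    @abstractmethod
    def broadcast(self, obj, src: int = 0): ...


class Distributed(ABC):
    @abstractmethod
    def sub_group(self, ranks: list[int]) -> "AbstractProcessGroup": ...


class LocalGroup(AbstractProcessGroup):
    """Trivial single-process group: broadcast is the identity."""

    def __init__(self, ranks: list[int]) -> None:
        self.ranks = ranks

    def broadcast(self, obj, src: int = 0):
        return obj


class LocalDistributed(Distributed):
    def sub_group(self, ranks: list[int]) -> AbstractProcessGroup:
        return LocalGroup(ranks)


group = LocalDistributed().sub_group([0, 1, 2])
```

The design question in this thread is whether ps should hold such a high-level group object, or keep calling a `dist`-style module API as collective-communication code conventionally does.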
In my understanding, the current abstraction follows the conventional usage pattern of collective communication; changing it this way would not match users' habits.
In vLLM, StatelessProcessGroup is directly used
In vLLM, StatelessProcessGroup is only used to transfer metadata: https://github.com/vllm-project/vllm/blob/main/vllm/distributed/utils.py#L146. Data-plane transfer still goes through pynccl: https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/rlhf_utils.py#L15
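The control-plane/data-plane split described here can be sketched without any vLLM dependency. `MetadataStore` and `FakeDataPlane` below are made-up stand-ins: the first plays the role of StatelessProcessGroup's key/value metadata exchange, the second the role of a pynccl/pyhccl communicator that moves the actual tensor bytes.

```python
class MetadataStore:
    """Stand-in for StatelessProcessGroup's key/value metadata exchange."""

    def __init__(self) -> None:
        self._kv: dict[str, object] = {}

    def put(self, key: str, value) -> None:
        self._kv[key] = value

    def get(self, key: str):
        return self._kv[key]


class FakeDataPlane:
    """Stand-in for an NCCL/HCCL communicator; just echoes the buffer."""

    def send(self, buf: bytes) -> bytes:
        return bytes(buf)


# Control plane: publish shape/dtype so the receiver can size its buffer.
store = MetadataStore()
store.put("w1.meta", {"shape": (2, 3), "dtype": "float32"})

# Data plane: move exactly the number of bytes the metadata describes.
meta = store.get("w1.meta")
nbytes = meta["shape"][0] * meta["shape"][1] * 4  # float32 = 4 bytes
payload = FakeDataPlane().send(b"\x00" * nbytes)
```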
Oh, I am not saying StatelessProcessGroup should be used for data-plane transfer; I mean the data-plane transfer interface. In other words, I was hoping
force-pushed from b0c6ca0 to 47a2561
force-pushed from 6ad7671 to 0901e9f
I tested the rest; vllm_nccl should be fine. Could you rebase, and then squash the commits roughly by feature?
force-pushed from 75268a4 to d455a21
I have also tested the hccl part and squashed everything into a single commit. Please check whether it can be merged.
Resolves #71
### What does this PR do?

Based on the ckpt engine abstraction [add checkpoint-engine abstraction](#4775), this PR adds the kimi_ckpt_engine backend to support both GPU and Huawei Ascend NPU. Since establishing communication domains across trainer and rollout workers is required, this PR also depends on the [newly added communication domain support](MoonshotAI/checkpoint-engine#66) in kimi_ckpt_engine.

TODO:
- [x] Add detailed performance testing results in the checkpoint engine README.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: [add Hccl ckpt engine backend](#4885)
- [x] Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
    - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

We have verified the functionality on both GPU and NPU. Performance benchmarks in a 32-NPU environment show promising results; however, due to a lack of available GPU resources, GPU performance data is still pending.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [ ] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.

---------

Co-authored-by: kip-cxj <cuixiaojin@huawei.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Motivation
Add a stateless communication group, to enable more flexible creation of communication groups and to resolve compatibility issues with other programs that also use torch.distributed. vLLM is currently supported, while SGLang does not yet support pyhccl; that part depends on adding pyhccl to SGLang. If the current approach is acceptable, we will provide an SGLang version soon.
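The compatibility point can be illustrated with a small sketch: torch.distributed holds a single global default process group per process, so two libraries both calling `init_process_group` can conflict, whereas a stateless group is an ordinary object and several can coexist. `StatelessGroup` below is illustrative, not the actual class from this PR.

```python
# A stateless group carries its own rank/world_size instead of relying
# on process-global state, so independent groups can live side by side.

class StatelessGroup:
    def __init__(self, rank: int, world_size: int) -> None:
        self.rank = rank
        self.world_size = world_size


# A trainer-internal group and a wider trainer+rollout transfer group
# are created independently, without touching any global state.
trainer_group = StatelessGroup(rank=0, world_size=8)
transfer_group = StatelessGroup(rank=0, world_size=16)
```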
Primary development and architectural design by @x1314aq.
Refinement based on community input and bug fixes by @kip-cxj.
Co-authored-by: x1314aq <x1314aq@gmail.com>