feat: add StatelessProcessGroup to extend collective library #66
HubertZhang merged 2 commits into MoonshotAI:main
Conversation
force-pushed from 1b27b3f to f989a80
@weixiao-huang @HubertZhang Please review this PR. It has been tested on both NPU and CUDA, and the same model was also tested with the default torch.distributed module.
It seems this PR has to depend on vLLM, which is heavy and not an elegant approach, I think.
The default communication path is still torch.distributed; StatelessProcessGroup is only needed when communicating across resource groups. Without this support it cannot be merged into verl 😆, since the architecture that separates training and rollout would not be supported.
Would it be better to design a protocol `DistributedLib` and pass a `dist: DistributedLib` into ps? The current import-based approach does not feel isolated enough.
A new module has been added.
In most cases, the logic remains consistent with before. We only need to depend on vLLM when the custom distributed module is required; this does not change the fact that checkpoint-engine is a lightweight component.
I tried it locally:

```python
# import torch.distributed as dist
import checkpoint_engine.distributed as dist

dist.init_process_group()
dist.all_reduce()
dist.xxxx()
```

If you need to use
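To make the drop-in idea above concrete, here is a hypothetical sketch of how such a facade module could forward `torch.distributed`-style calls to a pluggable backend. All names here (`DistFacade`, `RecordingBackend`) are illustrative and not the real checkpoint-engine API.

```python
# Hypothetical sketch: call sites keep the torch.distributed-style API
# while every call is forwarded to whichever backend was selected.

class RecordingBackend:
    """Stand-in backend that records calls; a real backend would wrap
    torch.distributed or a StatelessProcessGroup-based library."""

    def __init__(self) -> None:
        self.calls: list[tuple] = []

    def init_process_group(self, **kwargs) -> None:
        self.calls.append(("init_process_group", kwargs))

    def all_reduce(self, tensor):
        self.calls.append(("all_reduce", tensor))
        return tensor


class DistFacade:
    """Forwards attribute lookups to the active backend, so
    dist.all_reduce(...) works regardless of which backend is chosen."""

    def __init__(self, backend) -> None:
        self._backend = backend

    def __getattr__(self, name):
        return getattr(self._backend, name)


dist = DistFacade(RecordingBackend())
dist.init_process_group(backend="nccl")
result = dist.all_reduce([1, 2, 3])
```

In a real implementation the backend would be chosen once at import or init time (e.g. torch.distributed by default, the custom stateless path only for cross-resource groups).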
If there are no further review comments, can this be merged?
By the way, should we abstract StatelessProcessGroup rather than dist? Using well-encapsulated high-level methods directly in ps looks much more convenient. I imagine the sub-group part might be a bit more complex, but everything else should be much simpler.
I don't quite follow this part. Collective communication is implemented by NCCLLibrary/HCCLLibrary inside PyNcclCommunicator (PyHcclCommunicator); StatelessProcessGroup is only used by the Communicator at init time.
I took a closer look. What about something like:

```python
class Distributed(ABC):
    ...

    @abstractmethod
    def sub_group(self, ranks: list[int]) -> "AbstractProcessGroup":
        ...
```

The communication-related parts of these two functions would become much simpler, directly using the passed-in
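The suggested protocol could be filled out as a minimal runnable sketch. Everything beyond the two abstract signatures is illustrative: `LocalGroup` and `LocalDistributed` are toy single-process stand-ins, not real NCCL/HCCL backends.

```python
from abc import ABC, abstractmethod


class AbstractProcessGroup(ABC):
    """High-level group handle that ps would call into directly."""

    @abstractmethod
    def broadcast(self, obj, src: int = 0): ...


class Distributed(ABC):
    @abstractmethod
    def sub_group(self, ranks: list[int]) -> "AbstractProcessGroup": ...


class LocalGroup(AbstractProcessGroup):
    """Trivial single-process group: broadcast is the identity."""

    def __init__(self, ranks: list[int]) -> None:
        self.ranks = ranks

    def broadcast(self, obj, src: int = 0):
        return obj


class LocalDistributed(Distributed):
    def sub_group(self, ranks: list[int]) -> AbstractProcessGroup:
        return LocalGroup(ranks)


group = LocalDistributed().sub_group([0, 1, 2])
```

The design question in this thread is whether ps should hold such a high-level group object, or keep calling a `dist`-style module API as collective-communication code conventionally does.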
In my understanding, the current abstraction follows the conventional usage pattern of collective communication; changing it this way would not match users' habits.
In vLLM, StatelessProcessGroup is directly used
In vLLM, StatelessProcessGroup is only used to transfer metadata: https://github.com/vllm-project/vllm/blob/main/vllm/distributed/utils.py#L146. Data-plane transfer still goes through pynccl: https://github.com/vllm-project/vllm/blob/main/examples/offline_inference/rlhf_utils.py#L15
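The control-plane/data-plane split described here can be sketched without any vLLM dependency. `MetadataStore` and `FakeDataPlane` below are made-up stand-ins: the first plays the role of StatelessProcessGroup's key/value metadata exchange, the second the role of a pynccl/pyhccl communicator that moves the actual tensor bytes.

```python
class MetadataStore:
    """Stand-in for StatelessProcessGroup's key/value metadata exchange."""

    def __init__(self) -> None:
        self._kv: dict[str, object] = {}

    def put(self, key: str, value) -> None:
        self._kv[key] = value

    def get(self, key: str):
        return self._kv[key]


class FakeDataPlane:
    """Stand-in for an NCCL/HCCL communicator; just echoes the buffer."""

    def send(self, buf: bytes) -> bytes:
        return bytes(buf)


# Control plane: publish shape/dtype so the receiver can size its buffer.
store = MetadataStore()
store.put("w1.meta", {"shape": (2, 3), "dtype": "float32"})

# Data plane: move exactly the number of bytes the metadata describes.
meta = store.get("w1.meta")
nbytes = meta["shape"][0] * meta["shape"][1] * 4  # float32 = 4 bytes
payload = FakeDataPlane().send(b"\x00" * nbytes)
```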
Oh, I am not saying StatelessProcessGroup should be used for data-plane transfer; I mean the data-plane transfer interface. In other words, I was hoping
force-pushed from b0c6ca0 to 47a2561
force-pushed from 6ad7671 to 0901e9f
I tested the rest; vllm_nccl should be fine. Could you rebase, and then squash the commits roughly by feature?
force-pushed from 75268a4 to d455a21
I have also tested the hccl part and squashed everything into a single commit. Please check whether it can be merged.
Resolves #71
### What does this PR do?

Based on the ckpt engine abstraction [add checkpoint-engine abstraction](#4775), this PR adds the kimi_ckpt_engine backend to support both GPU and Huawei Ascend NPU. Since establishing communication domains across trainer and rollout workers is required, this PR also depends on the [newly added communication domain support](MoonshotAI/checkpoint-engine#66) in kimi_ckpt_engine.

TODO:
- [x] Add detailed performance testing results in the checkpoint engine README.

### Checklist Before Starting

- [x] Search for similar PRs. Paste at least one query link here: [add Hccl ckpt engine backend](#4885)
- [x] Format the PR title as `[{modules}] {type}: {description}` (this will be checked by the CI)
  - `{modules}` include `fsdp`, `megatron`, `veomni`, `sglang`, `vllm`, `rollout`, `trainer`, `ci`, `training_utils`, `recipe`, `hardware`, `deployment`, `ray`, `worker`, `single_controller`, `misc`, `perf`, `model`, `algo`, `env`, `tool`, `ckpt`, `doc`, `data`, `cfg`, `reward`
  - If this PR involves multiple modules, separate them with `,` like `[megatron, fsdp, doc]`
  - `{type}` is in `feat`, `fix`, `refactor`, `chore`, `test`
  - If this PR breaks any API (CLI arguments, config, function signature, etc.), add `[BREAKING]` to the beginning of the title.
    - Example: `[BREAKING][fsdp, megatron] feat: dynamic batching`

### Test

We have verified the functionality on both GPU and NPU. Performance benchmarks in a 32-NPU environment show promising results; however, due to a lack of available GPU resources, GPU performance data is still pending.

### Checklist Before Submitting

> [!IMPORTANT]
> Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

- [x] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [x] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting): `pre-commit install && pre-commit run --all-files --show-diff-on-failure --color=always`
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [ ] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: ...
- [ ] Once your PR is ready for CI, send a message in [the `ci-request` channel](https://verl-project.slack.com/archives/C091TCESWB1) in [the `verl` Slack workspace](https://join.slack.com/t/verl-project/shared_invite/zt-3855yhg8g-CTkqXu~hKojPCmo7k_yXTQ). (If not accessible, please try [the Feishu group (飞书群)](https://applink.larkoffice.com/client/chat/chatter/add_by_link?link_token=772jd4f1-cd91-441e-a820-498c6614126a).)
- [ ] If your PR is related to the `recipe` submodule, please also update the reference to the submodule commit via `git submodule update --remote` or `cd recipe && git pull origin main`.

---------

Co-authored-by: kip-cxj <cuixiaojin@huawei.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Motivation
Add a stateless communication group, to enable more flexible creation of communication groups and to resolve compatibility issues with other programs that also use torch.distributed. vLLM is currently supported, while SGLang does not yet support pyhccl; that part depends on adding pyhccl to SGLang. If the current approach is acceptable, we will provide an SGLang version soon.
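The compatibility point can be illustrated with a small sketch: torch.distributed holds a single global default process group per process, so two libraries both calling `init_process_group` can conflict, whereas a stateless group is an ordinary object and several can coexist. `StatelessGroup` below is illustrative, not the actual class from this PR.

```python
# A stateless group carries its own rank/world_size instead of relying
# on process-global state, so independent groups can live side by side.

class StatelessGroup:
    def __init__(self, rank: int, world_size: int) -> None:
        self.rank = rank
        self.world_size = world_size


# A trainer-internal group and a wider trainer+rollout transfer group
# are created independently, without touching any global state.
trainer_group = StatelessGroup(rank=0, world_size=8)
transfer_group = StatelessGroup(rank=0, world_size=16)
```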
Primary development and architectural design by @x1314aq.
Refinement based on community input and bug fixes by @kip-cxj.
Co-authored-by: x1314aq <x1314aq@gmail.com>