[DeepEP] support M2N#75582

Merged
zhoutianzi666 merged 129 commits into PaddlePaddle:develop from zhoutianzi666:m2n_dev
Oct 13, 2025
Conversation

Contributor

@zhoutianzi666 zhoutianzi666 commented Sep 28, 2025

PR Category

Inference

PR Types

New features

Description

Adds support for M2N-style All2All, for use in the AFD disaggregated architecture, where attention (A) ranks and expert/FFN (E) ranks run on separate groups of devices.

Four APIs are provided:
- A2E dispatch: send (A side) and receive (E side)
- E2A combine: send (E side) and receive (A side)

The exact API names can be found in python/paddle/distributed/communication/deep_ep/buffer.py.
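As a rough mental model of the routing these four APIs implement (pure Python, not the real M2NBuffer API; the function names `a2e_dispatch` and `e2a_combine` here are hypothetical): each of the M attention ranks sends each token to the expert rank that owns the token's target expert, and the expert ranks send processed tokens back to their source attention ranks.

```python
# Conceptual sketch of M2N all-to-all routing (no Paddle/DeepEP involved).
# Names and data shapes are illustrative only.

def a2e_dispatch(tokens_per_a_rank, expert_owner):
    """Each attention (A) rank sends each token to the expert (E) rank
    owning that token's expert; returns the per-E-rank inboxes."""
    inbox = {}
    for a_rank, tokens in tokens_per_a_rank.items():
        for token_id, expert in tokens:
            e_rank = expert_owner[expert]
            inbox.setdefault(e_rank, []).append((a_rank, token_id, expert))
    return inbox

def e2a_combine(inbox):
    """Each E rank sends every processed token back to its source A rank."""
    outbox = {}
    for e_rank, items in inbox.items():
        for a_rank, token_id, expert in items:
            outbox.setdefault(a_rank, []).append((token_id, expert))
    return outbox

# 2 attention ranks, 2 expert ranks, 4 experts (experts 0-1 on E rank 0, 2-3 on E rank 1).
expert_owner = {0: 0, 1: 0, 2: 1, 3: 1}
tokens = {0: [(0, 1), (1, 3)], 1: [(2, 0)]}
inbox = a2e_dispatch(tokens, expert_owner)
back = e2a_combine(inbox)
# After the round trip, every A rank recovers exactly its own tokens.
```

The real kernels additionally move hidden states (optionally FP8-quantized) and run over NVLink/RDMA, but the routing invariant is the same: combine is the inverse of dispatch.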

Usage example, excerpted from FastDeploy:

@singleton
class EPMegaRunner:

    def __init__(self, fd_config):
        rank = paddle.distributed.get_rank()
        num_ranks = paddle.distributed.get_world_size()

        self.group = paddle.distributed.new_group(range(num_ranks))

        # Rank layout: attention (A) ranks first, expert (E) ranks after them.
        self.a_start_rank = 0
        self.a_num_ranks = fd_config.parallel_config.attn_group.nranks
        self.e_start_rank = self.a_start_rank + self.a_num_ranks
        self.e_num_ranks = num_ranks - self.a_num_ranks

        self.hidden = 8192
        self.top_k = 8
        self.num_experts = 64
        self.num_max_tokens = 256
        self.use_fp8 = True
        self.rank = rank
        self.num_ranks = num_ranks

        num_rdma_ranks = num_ranks // 8  # assumes 8 GPUs per node

        # Query the buffer sizes required by the two-stage low-latency path.
        num_rdma_bytes = deep_ep.M2NBuffer.get_low_latency_rdma_size_hint_two_stage(
            self.num_max_tokens, self.hidden, self.num_ranks,
            self.a_num_ranks, self.e_num_ranks, self.num_experts, self.top_k
        )
        num_nvl_bytes = deep_ep.M2NBuffer.get_low_latency_nvl_size_hint_two_stage(
            self.num_max_tokens, self.hidden, self.num_ranks,
            self.a_num_ranks, self.e_num_ranks, self.num_experts, self.top_k,
            self.use_fp8
        )

        paddle.distributed.barrier()

        self.buffer = deep_ep.M2NBuffer(
            self.group,
            self.a_start_rank,
            self.a_num_ranks,
            self.e_start_rank,
            self.e_num_ranks,
            num_nvl_bytes=num_nvl_bytes,
            num_rdma_bytes=num_rdma_bytes,
            low_latency_mode=True,
            num_qps_per_rank=num_rdma_ranks)
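The A/E rank split computed in `__init__` above is plain arithmetic and can be checked in isolation. This small helper (hypothetical, not part of the PR) mirrors that layout:

```python
def m2n_rank_layout(num_ranks, a_num_ranks):
    """Mirror the A/E rank split from EPMegaRunner.__init__:
    attention ranks occupy [0, a_num_ranks), expert ranks the rest."""
    a_start_rank = 0
    e_start_rank = a_start_rank + a_num_ranks
    e_num_ranks = num_ranks - a_num_ranks
    assert e_num_ranks > 0, "need at least one expert rank"
    return (a_start_rank, a_num_ranks), (e_start_rank, e_num_ranks)

# e.g. 16 total ranks with 4 attention ranks: experts occupy ranks 4..15
a_span, e_span = m2n_rank_layout(16, 4)
```

Both spans are contiguous, which is what lets the M2NBuffer constructor take just a start rank and a count for each side.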

P-card-71501

@zhoutianzi666 zhoutianzi666 changed the title from "M2n dev" to "[DeepEP] support M2N" on Oct 9, 2025
Review comment threads on:
- paddle/fluid/distributed/collective/deep_ep/config.hpp
- paddle/fluid/distributed/collective/deep_ep/kernels/configs.cuh
- paddle/fluid/pybind/deep_ep_api.cc
Contributor

@carryyu carryyu left a comment

The main code could later be consolidated with internode_ll_two_stage.cu to share the implementation.

Collaborator

@tianshuo78520a tianshuo78520a left a comment

LGTM for approval

@zhoutianzi666 zhoutianzi666 merged commit 1990bcc into PaddlePaddle:develop Oct 13, 2025
117 of 131 checks passed
SigureMo pushed a commit to cattidea/Paddle that referenced this pull request Oct 14, 2025