add fuse ar rms op by crazy-JiangDongHua · Pull Request #48 · Tencent/hpc-ops

crazy-JiangDongHua · 2026-06-02T09:45:07Z

Fuses tensor-parallel AllReduce, residual add, and RMSNorm into a single NVLink-native op, RMSNorm(AllReduce(x) + residual, weight), avoiding extra kernel launches and HBM round-trips. The op is built on CUDA multicast (multimem) and P2P, and supports bfloat16 on single-node multi-GPU setups. Two modes are provided, both built on a two-shot (reduce-scatter + all-gather) schedule:

High-throughput mode (fuse_allreduce_rmsnorm_high_throughput): a single fused kernel that performs the reduction over NVSwitch multicast. Best for large token counts (prefill).
Low-latency mode (fuse_allreduce_rmsnorm_low_latency): a Lamport-style P2P exchange split into two kernels that are overlapped via PDL (Programmatic Dependent Launch). Best for small token counts (decode); requires a power-of-two world size.
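
For readers unfamiliar with the two-shot schedule, here is a minimal pure-Python sketch of the math the fused op computes, simulated on the host for a single token. It is not the PR's kernel code: the names `rmsnorm`, `reference`, `two_shot`, and the `EPS` value are illustrative assumptions, and the real kernels work in bfloat16 over multimem/P2P. Where the actual kernels apply the residual add and the normalization relative to the all-gather is not stated in the description, so the placement below is only one plausible arrangement.

```python
import math

EPS = 1e-6  # assumed epsilon; the PR description does not state one


def rmsnorm(row, weight, eps=EPS):
    """RMSNorm over one token's hidden vector."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
    return [v * inv_rms * w for v, w in zip(row, weight)]


def reference(xs, residual, weight):
    """Unfused reference: RMSNorm(AllReduce(x) + residual, weight)."""
    summed = [sum(vals) for vals in zip(*xs)]  # AllReduce (sum over ranks)
    return rmsnorm([s + r for s, r in zip(summed, residual)], weight)


def two_shot(xs, residual, weight):
    """Two-shot schedule: reduce-scatter, then all-gather (host simulation)."""
    world, hidden = len(xs), len(xs[0])
    assert hidden % world == 0, "sketch assumes hidden divides evenly"
    chunk = hidden // world
    # Shot 1 (reduce-scatter): rank r reduces only its own slice of the
    # hidden dimension and adds the matching slice of the residual.
    shards = []
    for r in range(world):
        lo, hi = r * chunk, (r + 1) * chunk
        shards.append(
            [sum(x[i] for x in xs) + residual[i] for i in range(lo, hi)]
        )
    # Shot 2 (all-gather): every rank receives all reduced slices.
    gathered = [v for shard in shards for v in shard]
    # Each rank then normalizes the full row, so all ranks hold the same output.
    return [rmsnorm(gathered, weight) for _ in range(world)]
```

Calling `two_shot(xs, residual, weight)` with one hidden vector per rank in `xs` returns one output row per rank, and each row should match `reference(xs, residual, weight)`.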

See benchmark/bench_allreduce_rmsnorm.py for an 8-GPU comparison against NCCL and FlashInfer.

Co-authored-by: lvjx04 <2108244896@qq.com>

crazy-JiangDongHua force-pushed the Feature/Fuse_AR_RMSNorm branch from 60c508d to d1da6b5 on June 3, 2026 11:49

add fuse ar rms op

8a49bd1

Co-authored-by: lvjx04 <2108244896@qq.com>

crazy-JiangDongHua force-pushed the Feature/Fuse_AR_RMSNorm branch from d1da6b5 to 8a49bd1 on June 3, 2026 11:51

NayezPasPeur added 5 commits June 4, 2026 12:15

[Benchmark]: tmp support benchmark

1e0690b

--amend

20a9a64

[benchmark]: support benchmark

fa61fb6

[benchmark]: fix filename

8163abf

Merge pull request #1 from crazy-JiangDongHua/Feature/Fuse_AR_RMSNorm…

b8877b7

…_bench [benchmark]: support fuse ar + rms

crazy-JiangDongHua wants to merge 6 commits into Tencent:main from crazy-JiangDongHua:Feature/Fuse_AR_RMSNorm
