add fused fp8 moe kernel for low-latency llm inference by VAthree · Pull Request #49 · Tencent/hpc-ops

VAthree · 2026-06-02T19:06:59Z

Summary

This PR adds a fused per-tensor FP8 MoE operator for LLM MoE inference.

The operator fuses routing, Gate-Up GEMM, activation quantization, Down GEMM, and top-k weighted reduction into one pipelined execution path. Existing implementations commonly follow a gather-then-GEMM design: tokens are first sorted by expert and gathered into contiguous memory, then grouped GEMM is launched per expert. Their SM90 kernels usually rely on TMA plus Warp Specialization in a persistent mode to overlap data movement and compute within each CTA. In low-latency scenarios, the extra gather traffic and manually staged pipeline become major overheads.
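For readers unfamiliar with the baseline, the gather-then-GEMM dataflow described above can be sketched in plain Python. This is an illustrative model only, not code from vLLM, SGLang, or this PR: `matvec` and `gather_then_gemm` are hypothetical names, routing is top-1 for brevity, and a per-token matrix-vector product stands in for the grouped GEMM.

```python
def matvec(weight, x):
    """Multiply an (out x in) weight matrix by one token vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def gather_then_gemm(tokens, expert_ids, weights):
    """Baseline dataflow: sort by expert, gather, then one GEMM per expert.

    tokens:     list of token vectors
    expert_ids: expert chosen for each token (top-1 here, for brevity)
    weights:    one weight matrix per expert
    """
    # Step 1: sort token indices by expert (stable, so token order is kept).
    order = sorted(range(len(tokens)), key=lambda i: expert_ids[i])
    # Step 2: gather into a contiguous, expert-ordered buffer (the extra copy).
    gathered = [list(tokens[i]) for i in order]
    # Step 3: one "grouped GEMM" pass per expert over its contiguous slice.
    out = [None] * len(tokens)
    start = 0
    for e, w in enumerate(weights):
        count = sum(1 for x in expert_ids if x == e)
        for row in range(start, start + count):
            out[order[row]] = matvec(w, gathered[row])
        start += count
    return out


tokens = [[1, 2], [3, 4], [5, 6]]
expert_ids = [1, 0, 1]
weights = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]  # expert 0: identity, expert 1: 2x
baseline_out = gather_then_gemm(tokens, expert_ids, weights)
```

Step 2 is the copy that adds memory traffic: every routed token is written once into the gathered buffer and read again by the GEMM.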

This implementation restructures the full pipeline:

Routing and index preprocessing avoid per-token atomic adds to global counters. Instead, each block first accumulates expert counts in shared memory and reserves contiguous output ranges for each expert, reducing index construction overhead at scale.
Gate-Up GEMM reads the original input directly through routing indices, removing the standalone gather step.
The GEMM path removes Warp Specialization. The same warp group performs both data movement and compute, shifting latency hiding from an intra-CTA software pipeline to hardware scheduling across CTAs and increasing CTA residency per SM.
Activation quantization writes compact expert-ordered output for direct consumption by Down GEMM.
The final stage performs top-k weighted reduction.
The five stages are chained with PDL (Programmatic Dependent Launch) to reduce kernel launch overhead, and use SM90-specific tiling and launch configurations.
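The routing, indexed-read, and reduction steps above can be modeled on the host in plain Python. This is a minimal sketch under stated assumptions, not the PR's kernel code: `build_routing_index` and `moe_forward` are hypothetical names, blocks run sequentially instead of concurrently, each expert is reduced to a scalar multiply in place of the Gate-Up and Down GEMMs, and quantization is omitted.

```python
def build_routing_index(topk_ids, num_experts, block_size):
    """Return (offsets, slot_to_pair) describing expert-ordered slots.

    topk_ids[t] lists the experts selected for token t.
    slot_to_pair[s] = (token, k) tells the GEMM which input row to read.
    """
    pairs = [(t, k) for t, ids in enumerate(topk_ids) for k in range(len(ids))]
    blocks = [pairs[i:i + block_size] for i in range(0, len(pairs), block_size)]

    # Phase 1: each block counts its experts locally (shared-memory stand-in).
    local_counts = []
    for block in blocks:
        counts = [0] * num_experts
        for t, k in block:
            counts[topk_ids[t][k]] += 1
        local_counts.append(counts)

    # Phase 2: an exclusive prefix sum gives each expert a contiguous range.
    offsets = [0] * (num_experts + 1)
    for e in range(num_experts):
        offsets[e + 1] = offsets[e] + sum(c[e] for c in local_counts)

    # Phase 3: each block reserves one sub-range per expert, then fills it.
    slot_to_pair = [None] * len(pairs)
    cursor = offsets[:num_experts]  # next free slot per expert
    for block, counts in zip(blocks, local_counts):
        base = list(cursor)  # one reservation per expert, not one per token
        cursor = [c + n for c, n in zip(cursor, counts)]
        for t, k in block:
            e = topk_ids[t][k]
            slot_to_pair[base[e]] = (t, k)
            base[e] += 1
    return offsets, slot_to_pair


def moe_forward(tokens, topk_ids, topk_weights, expert_scale, block_size=2):
    """Toy MoE: each expert multiplies its input by a scalar."""
    offsets, slot_to_pair = build_routing_index(
        topk_ids, len(expert_scale), block_size)
    out = [0.0] * len(tokens)
    for e, scale in enumerate(expert_scale):
        for slot in range(offsets[e], offsets[e + 1]):
            t, k = slot_to_pair[slot]
            y = scale * tokens[t]             # indexed read of the original input
            out[t] += topk_weights[t][k] * y  # top-k weighted reduction
    return out


topk_ids = [[0, 1], [1, 2], [0, 2]]
topk_weights = [[0.5, 0.5], [0.25, 0.75], [1.0, 0.0]]
offsets, slot_to_pair = build_routing_index(topk_ids, num_experts=3, block_size=2)
result = moe_forward([1.0, 2.0, 3.0], topk_ids, topk_weights, [1.0, 2.0, 3.0])
```

In this model the number of range reservations scales with blocks times experts rather than with routed tokens, which is the property the first list item describes. The expert loop reads `tokens[t]` through `slot_to_pair`, so no gathered copy of the input is ever built.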

Benchmark

Benchmark scripts are included under:

bench/fused_moe

On NVIDIA H20 with CUDA 13 and PyTorch 2.11.0+cu130, this per-tensor FP8 path was benchmarked against recent vLLM CUTLASS, vLLM Triton, and SGLang backends across DeepSeek-V3, Hunyuan-V3, and Qwen3-235B shapes.

Relative to the median of the compared backends:

TP=8 EP=1: about 1.5x to 1.6x faster
TP=1 EP=8: about 1.2x to 1.5x faster
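One plausible reading of "relative to the median" is the median latency of the compared backends divided by this kernel's latency. The sketch below assumes that definition; `speedup_vs_median` is a hypothetical helper, and the latencies are placeholders chosen for easy arithmetic, not measurements from this PR.

```python
from statistics import median


def speedup_vs_median(ours_us, backends_us):
    """Median latency of the compared backends divided by our latency."""
    return median(backends_us.values()) / ours_us


# Placeholder latencies in microseconds, NOT measurements from this PR.
placeholder_backends = {"backend_a": 140.0, "backend_b": 150.0, "backend_c": 180.0}
example_speedup = speedup_vs_median(100.0, placeholder_backends)
```

Using the median rather than the best or worst backend keeps a single outlier from dominating the reported ratio.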

The tests show no accuracy regression.

Tests

make format-check
python3 setup.py build_ext
pytest -q tests/test_fuse_moe_cp_async.py tests/test_group_gemm_cp_async.py
pytest -q tests/test_fuse_moe_pertensor.py tests/test_group_gemm_pertensor.py
pytest -q tests/test_fuse_moe_blockwise.py tests/test_group_gemm_blockwise.py

ZelinMa557 · 2026-06-03T09:54:48Z

Does this version improve performance compared with the fused_moe interface provided by the previous version of hpc ops?

VAthree · 2026-06-07T12:39:21Z

> Does this version improve performance compared with the fused_moe interface provided by the previous version of hpc ops?

Yes. This version keeps the FusedMoE interface compatible while improving the underlying implementation. The performance gain is especially noticeable in TP-oriented low-latency scenarios. We have also provided a reproducible benchmark for users to validate the results directly. More detailed numbers will follow. Stay tuned.

VAthree force-pushed the add-fused-fp8-moe-kernel branch 5 times, most recently from 3249469 to 2c6778b, on June 3, 2026 08:07

add fused fp8 moe kernel for low-latency llm inference

09886e3

VAthree force-pushed the add-fused-fp8-moe-kernel branch from 2c6778b to 09886e3 on June 3, 2026 12:49

VAthree wants to merge 1 commit into Tencent:main from VAthree:add-fused-fp8-moe-kernel
