add fused fp8 moe kernel for low-latency llm inference #49

Open
VAthree wants to merge 1 commit into Tencent:main from VAthree:add-fused-fp8-moe-kernel
VAthree commented Jun 2, 2026

Summary

This PR adds a fused per-tensor FP8 MoE operator for LLM MoE inference.

The operator fuses routing, Gate-Up GEMM, activation quantization, Down GEMM, and top-k weighted reduction into one pipelined execution path. Existing implementations commonly follow a gather-then-GEMM design: tokens are first sorted by expert and gathered into contiguous memory, then grouped GEMM is launched per expert. Their SM90 kernels usually rely on TMA plus Warp Specialization in a persistent mode to overlap data movement and compute within each CTA. In low-latency scenarios, the extra gather traffic and manually staged pipeline become major overheads.

This implementation restructures the full pipeline:

  • Routing and index preprocessing avoid per-token atomic adds to global counters. Instead, each block first accumulates expert counts in shared memory and reserves contiguous output ranges for each expert, reducing index construction overhead at scale.
  • Gate-Up GEMM reads the original input directly through routing indices, removing the standalone gather step.
  • The GEMM path removes Warp Specialization. The same warp group performs both data movement and compute, shifting latency hiding from an intra-CTA software pipeline to hardware scheduling across CTAs and increasing CTA residency per SM.
  • Activation quantization writes compact expert-ordered output for direct consumption by Down GEMM.
  • The final stage performs top-k weighted reduction.
  • The five stages are chained with PDL to reduce kernel launch overhead, with SM90-specific tiling and launch configurations.
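
The index-construction and gather-free read steps listed above can be illustrated with a small NumPy reference. This is only a sketch of the idea, not the PR's CUDA code: the function name `build_expert_index`, the `block_size` parameter, and all shapes are hypothetical, and each loop iteration stands in for one thread block. Every block builds a private histogram (the shared-memory counts), a prefix sum reserves one contiguous range per (block, expert), and the GEMM then reads the original input through the resulting indices.

```python
import numpy as np

def build_expert_index(topk_ids, num_experts, block_size=4):
    """Group flat (token, slot) pairs by expert using block-local counts.

    Hypothetical helper for illustration only.
    topk_ids: int array of shape (num_tokens, top_k) holding expert ids.
    Returns (pair_ids, expert_offsets); pair_ids[offsets[e]:offsets[e+1]]
    lists the flat ids (token * top_k + slot) routed to expert e.
    """
    flat = topk_ids.reshape(-1)
    num_pairs = flat.size
    num_blocks = (num_pairs + block_size - 1) // block_size

    # Step 1: each block builds a private histogram (stand-in for shared memory).
    block_counts = np.zeros((num_blocks, num_experts), dtype=np.int64)
    for b in range(num_blocks):
        chunk = flat[b * block_size:(b + 1) * block_size]
        block_counts[b] = np.bincount(chunk, minlength=num_experts)

    # Step 2: reserve one contiguous range per (block, expert),
    # instead of one global atomic add per token.
    expert_offsets = np.concatenate(([0], np.cumsum(block_counts.sum(axis=0))))
    exclusive = np.cumsum(block_counts, axis=0) - block_counts
    block_base = expert_offsets[:-1] + exclusive

    # Step 3: each block writes its pairs into the ranges it reserved.
    pair_ids = np.empty(num_pairs, dtype=np.int64)
    for b in range(num_blocks):
        cursor = block_base[b].copy()
        for i in range(b * block_size, min((b + 1) * block_size, num_pairs)):
            e = flat[i]
            pair_ids[cursor[e]] = i
            cursor[e] += 1
    return pair_ids, expert_offsets

# Toy problem (all sizes are made up for the example).
rng = np.random.default_rng(0)
num_tokens, top_k, num_experts, hidden, inter = 6, 2, 4, 8, 3
topk_ids = np.stack([rng.permutation(num_experts)[:top_k] for _ in range(num_tokens)])
x = rng.standard_normal((num_tokens, hidden)).astype(np.float32)
w = rng.standard_normal((num_experts, hidden, inter)).astype(np.float32)

pair_ids, offsets = build_expert_index(topk_ids, num_experts)
token_ids = pair_ids // top_k  # row of x that each routed pair reads

# Per-expert GEMM that indexes x directly, with no separate gathered buffer.
# (NumPy fancy indexing still copies; a kernel would load from x in place.)
out = np.empty((pair_ids.size, inter), dtype=np.float32)
for e in range(num_experts):
    lo, hi = offsets[e], offsets[e + 1]
    out[lo:hi] = x[token_ids[lo:hi]] @ w[e]

# Reference: compute every routed pair independently and compare.
flat_ids = topk_ids.reshape(-1)
ref = np.stack([x[p // top_k] @ w[flat_ids[p]] for p in pair_ids])
assert np.allclose(out, ref, atol=1e-5)
```

The number of range reservations here scales with blocks times experts rather than with routed tokens, which is the property the routing bullet describes. How the actual kernel sizes its blocks and performs the reservation is not shown in this description.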

Benchmark

Benchmark scripts are included under:

bench/fused_moe

On NVIDIA H20 with CUDA 13 and PyTorch 2.11.0+cu130, this per-tensor FP8 path was benchmarked against recent vLLM CUTLASS, vLLM Triton, and SGLang backends across DeepSeek-V3, Hunyuan-V3, and Qwen3-235B shapes.

Relative to the median of the compared backends:

  • TP=8 EP=1: about 1.5x to 1.6x faster
  • TP=1 EP=8: about 1.2x to 1.5x faster

The tests show no accuracy regression.

Tests

  • make format-check
  • python3 setup.py build_ext
  • pytest -q tests/test_fuse_moe_cp_async.py tests/test_group_gemm_cp_async.py
  • pytest -q tests/test_fuse_moe_pertensor.py tests/test_group_gemm_pertensor.py
  • pytest -q tests/test_fuse_moe_blockwise.py tests/test_group_gemm_blockwise.py

VAthree force-pushed the add-fused-fp8-moe-kernel branch 5 times, most recently from 3249469 to 2c6778b (June 3, 2026 08:07)
@ZelinMa557

Does this version improve performance compared with the fused_moe interface provided by the previous version of hpc ops?

VAthree force-pushed the add-fused-fp8-moe-kernel branch from 2c6778b to 09886e3 (June 3, 2026 12:49)
VAthree (Author) commented Jun 7, 2026

> Does this version improve performance compared with the fused_moe interface provided by the previous version of hpc ops?

Yes. This version keeps the FusedMoE interface compatible while improving the underlying implementation. The performance gain is especially noticeable in TP-oriented low-latency scenarios. We have also provided a reproducible benchmark for users to validate the results directly. More detailed numbers will follow. Stay tuned.
