<!-- .github/pull_request_template.md -->
## 📌 Description
@HumansAnd
<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->
flashinfer-ai#2505 implements MXFP8 for the TRT-LLM backend.
However, in SGLang, `--moe-runner-backend flashinfer_trtllm` bypasses
SGLang's top-k implementation and does not work with expert-routing
replay in MoE RL.
This PR implements `mxfp8 x mxfp8` for `cutlass_fused_moe`, which does
work with MoE RL training.
It mainly reuses the existing code path for `WMxfp4AMxfp8Quant`:
https://github.com/flashinfer-ai/flashinfer/blob/952b6ab2838d676b4257fcc23bb00f67fdd38efc/csrc/fused_moe/cutlass_backend/flashinfer_cutlass_fused_moe_binding.cu#L1191
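For context, MXFP8 here means FP8 (E4M3) values sharing one power-of-two (E8M0) scale per 32-element block, per the OCP Microscaling format. Below is a minimal pure-Python sketch of the quantize/dequantize round trip — an illustration of the numerics only, not FlashInfer's CUDA implementation (the `quantize_e4m3` helper here emulates round-to-nearest for normal values and does not model FP8 subnormals):

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in OCP FP8 E4M3

def quantize_e4m3(x: float) -> float:
    """Emulate round-to-nearest FP8 E4M3 (1 implicit + 3 mantissa bits).
    Subnormals are not modeled in this sketch."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)        # x = m * 2**e, with 0.5 <= |m| < 1
    m = round(m * 16) / 16      # keep 4 significant bits
    return max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, math.ldexp(m, e)))

def mxfp8_quant_dequant(xs, block=32):
    """Quantize-dequantize a sequence using one shared power-of-two
    (E8M0-style) scale per `block` elements, as in MX formats."""
    out = []
    for i in range(0, len(xs), block):
        chunk = xs[i:i + block]
        amax = max(abs(v) for v in chunk)
        # Smallest power of two mapping the block max into E4M3 range.
        exp = math.ceil(math.log2(amax / FP8_E4M3_MAX)) if amax > 0 else 0
        scale = 2.0 ** exp
        out.extend(quantize_e4m3(v / scale) * scale for v in chunk)
    return out
```

In `mxfp8 x mxfp8`, both the weights and the activations are stored this way, so the GEMM consumes two E4M3 tensors plus their per-block E8M0 scales.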
## 🔍 Related Issues
<!-- Link any related issues here -->
- miles MXFP8/NVFP4 RL roadmap: radixark/miles#615
- SGLang FlashInfer MXFP8 integration: sgl-project/sglang#18945
## 🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.
### ✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.
> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).
## 🧪 Tests
- [x] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).
## Reviewer Notes
<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->
<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit
* **New Features**
  * Toggleable MXFPX/MXFP8 activation scaling across MoE inference,
updating workspace sizing, kernel selection, block scaling, and dispatch
to enable MXFP8-aware execution and validation.
  * Added an MXFP8×MXFP8 quantization mode and emitted MXFPX-aware
GEMM/kernel variants; public APIs now expose an MXFPX/activation-scaling
flag.
* **Tests**
  * Added unit tests and helpers for MXFP8 quantization,
packing/dequantization, and end-to-end MXFP8×MXFP8 MoE inference
validation.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Signed-off-by: Amey Naik <212485788+ameynaik-hub@users.noreply.github.com>