Update bench_mixed_attention.py by happierpig · Pull Request #3 · yzh119/flashinfer-dev

happierpig · 2025-04-14T00:34:20Z

No description provided.

## 📌 Description To fix the following bug: When the CuteDSL MoE kernels were ported from TensorRT-LLM to FlashInfer, the mPtrPermutedIdxToExpandedIdx field was accidentally dropped from the routing kernel's DataBase struct in RoutingKernel.h. TRT-LLM's routing kernel produces three reverse-mapping outputs: 1. mPtrExpandedIdxToPermutedIdx[expandedIdx] = permutedIdx — forward mapping 2. mPtrPermutedIdxToExpandedIdx[permutedIdx] = expandedIdx — reverse to expanded index (token_idx * topk + k) 3. mPtrPermutedIdxToTokenIdx[permutedIdx] = tokenIdx — reverse to token index only FlashInfer's port kept only #1 and #3, dropping #2. The binding in moe_utils_binding.cu then had to wire the Python buffer permuted_idx_to_expanded_idx to the only available reverse-mapping field — mPtrPermutedIdxToTokenIdx — which writes plain tokenIdx instead of expandedIdx. The Impact The CuteDSL kernels (GEMM1 gather, moe_output_memset, GEMM2 finalize) all expect expanded indices and derive the token index via expanded_idx // topk. When they received plain tokenIdx instead, they computed tokenIdx // topk — yielding the wrong A row for gather, wrong zero-init for memset, and wrong scatter position + wrong routing scale for finalize.  ## 🔍 Related Issues  ## 🚀 Pull Request Checklist Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete. ### ✅ Pre-commit Checks - [ ] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method). - [ ] I have installed the hooks with `pre-commit install`. - [ ] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues. > If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/). ## 🧪 Tests - [ ] Tests have been added or updated as needed. - [ ] All tests are passing (`unittest`, etc.). ## Reviewer Notes   ## Summary by CodeRabbit * **Refactor** * Refined MOE (Mixture of Experts) routing infrastructure by extending index mapping capabilities across multiple kernel implementations to improve internal data flow consistency. * **Tests** * Strengthened accuracy validation thresholds from 0.925 to 0.97 with adjusted error tolerance parameters, ensuring more rigorous testing of MOE operations under FP4 quantization conditions.

happierpig added 3 commits April 13, 2025 17:34

Update bench_mixed_attention.py

53c331e

Update bench_mixed_attention.py

6f5f546

Update bench_mixed_attention.py

4a08441

yzh119 approved these changes Apr 14, 2025

View reviewed changes

yzh119 merged commit 1ca0ec2 into yzh119:holistic-persistent Apr 14, 2025
1 check failed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update bench_mixed_attention.py#3

Update bench_mixed_attention.py#3
yzh119 merged 3 commits intoyzh119:holistic-persistentfrom
happierpig:patch-3

happierpig commented Apr 14, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

happierpig commented Apr 14, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants