Skip to content

Update bench_mixed_attention.py#3

Merged
yzh119 merged 3 commits intoyzh119:holistic-persistentfrom
happierpig:patch-3
Apr 14, 2025
Merged

Update bench_mixed_attention.py#3
yzh119 merged 3 commits intoyzh119:holistic-persistentfrom
happierpig:patch-3

Conversation

@happierpig
Copy link
Copy Markdown
Collaborator

No description provided.

@yzh119 yzh119 merged commit 1ca0ec2 into yzh119:holistic-persistent Apr 14, 2025
1 check failed
cyx-6 pushed a commit that referenced this pull request Feb 28, 2026
<!-- .github/pull_request_template.md -->

## 📌 Description

To fix the following bug:
When the CuteDSL MoE kernels were ported from TensorRT-LLM to
FlashInfer, the mPtrPermutedIdxToExpandedIdx field was accidentally
dropped from the routing kernel's DataBase struct in RoutingKernel.h.
TRT-LLM's routing kernel produces three reverse-mapping outputs:

1. mPtrExpandedIdxToPermutedIdx[expandedIdx] = permutedIdx — forward
mapping
2. mPtrPermutedIdxToExpandedIdx[permutedIdx] = expandedIdx — reverse to
expanded index (token_idx * topk + k)
3. mPtrPermutedIdxToTokenIdx[permutedIdx] = tokenIdx — reverse to token
index only

FlashInfer's port kept only #1 and #3, dropping #2. The binding in
moe_utils_binding.cu then had to wire the Python buffer
permuted_idx_to_expanded_idx to the only available reverse-mapping field
— mPtrPermutedIdxToTokenIdx — which writes plain tokenIdx instead of
expandedIdx.
The Impact
The CuteDSL kernels (GEMM1 gather, moe_output_memset, GEMM2 finalize)
all expect expanded indices and derive the token index via expanded_idx
// topk. When they received plain tokenIdx instead, they computed
tokenIdx // topk — yielding the wrong A row for gather, wrong zero-init
for memset, and wrong scatter position + wrong routing scale for
finalize.

<!-- What does this PR do? Briefly describe the changes and why they’re
needed. -->

## 🔍 Related Issues

<!-- Link any related issues here -->

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [ ] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [ ] I have installed the hooks with `pre-commit install`.
- [ ] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [ ] Tests have been added or updated as needed.
- [ ] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Refactor**
* Refined MOE (Mixture of Experts) routing infrastructure by extending
index mapping capabilities across multiple kernel implementations to
improve internal data flow consistency.

* **Tests**
* Strengthened accuracy validation thresholds from 0.925 to 0.97 with
adjusted error tolerance parameters, ensuring more rigorous testing of
MOE operations under FP4 quantization conditions.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants