Enable Flashinfer TRTLLM-GEN-MoE FP8 blockwise kernel for Qwen3-Next on Blackwell #12543
Conversation
Summary of Changes

Hello @samuellees, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request integrates the Flashinfer TRTLLM-GEN-MoE FP8 blockwise kernel to optimize the performance of Qwen3-Next models on Blackwell GPUs. The changes introduce dynamic routing method selection for Mixture-of-Experts (MoE) layers and show notable gains in output token throughput in the included benchmarks. This enhancement leverages specialized hardware capabilities for more efficient inference.
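As a rough illustration of the routing method selection described above, here is a minimal sketch (not the PR's actual code): the FusedMoE constructor is simplified, the helper name resolve_routing_method is hypothetical, and only routing_method_type, RoutingMethodType, and the DeepSeekV3 fallback are taken from the diff.

from flashinfer import RoutingMethodType


class FusedMoE:
    def __init__(self, num_experts, routing_method_type=None):
        self.num_experts = num_experts
        # A model such as Qwen3-Next can pass its own routing method here;
        # models that do not set it leave the attribute as None.
        self.routing_method_type = routing_method_type


def resolve_routing_method(layer):
    # Read the attribute off the layer and fall back to the previously
    # hardcoded DeepSeekV3 routing when it is absent or None.
    routing_method_type = getattr(layer, "routing_method_type", None)
    if routing_method_type is None:
        routing_method_type = RoutingMethodType.DeepSeekV3
    return routing_method_type

The value resolved this way replaces the previously hardcoded routing method passed to trtllm_fp8_block_scale_moe.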
Code Review
This pull request enables the Flashinfer TRTLLM-GEN-MoE FP8 blockwise kernel, specifically for Qwen3-Next on Blackwell hardware, which results in significant performance improvements as shown in the benchmarks. The changes are well-structured, introducing a routing_method_type parameter that is plumbed through the MoE layers to the kernel call. This design is flexible and maintains backward compatibility. The implementation looks solid. I have one suggestion regarding import statements to improve code organization and maintainability.
from flashinfer import RoutingMethodType
from flashinfer.fused_moe import trtllm_fp8_block_scale_moe
For better code organization and to avoid a potential ImportError, it's recommended to move these flashinfer imports to the top of the file. The current local import can cause a runtime crash if flashinfer is not installed, as the logic in should_use_flashinfer_trtllm_moe might incorrectly evaluate to True in that scenario.
Guarding the imports at the top level with is_flashinfer_available would make the code more robust. For example:
# In python/sglang/srt/layers/quantization/fp8.py
# First, add is_flashinfer_available to the import from sglang.srt.utils
from sglang.srt.utils import (
    # ... existing imports
    is_flashinfer_available,
)
if is_flashinfer_available():
    from flashinfer import RoutingMethodType
    from flashinfer.fused_moe import trtllm_fp8_block_scale_moe
else:
    RoutingMethodType = None
    trtllm_fp8_block_scale_moe = None

Then, you can remove the local imports from this method. This change centralizes dependency management and improves code readability and robustness.
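In the same spirit, the availability check could also live inside the helper mentioned above. A minimal sketch, assuming should_use_flashinfer_trtllm_moe can consult is_flashinfer_available; the body is illustrative, not the helper's actual implementation:

from sglang.srt.utils import is_flashinfer_available


def should_use_flashinfer_trtllm_moe():
    # Short-circuit when flashinfer is missing so the guarded imports above
    # are never exercised while still bound to None.
    if not is_flashinfer_available():
        return False
    # ... the existing hardware / configuration checks would follow here ...
    return True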
Code Review
This pull request enables the Flashinfer TRTLLM-GEN-MoE FP8 blockwise kernel for Qwen3-Next models on Blackwell hardware, which shows significant performance improvements in the provided benchmarks. The changes involve adding a routing_method_type to configure the MoE routing method, allowing different models like Qwen to use their specific routing logic. The implementation is clean and correctly applies the new routing type for Qwen models.
I have one suggestion to make the code more robust by ensuring the routing_method_type does not default to None, which could cause runtime errors. Other than that, the changes look good and are a valuable performance enhancement.
routing_method_type = getattr(
    layer, "routing_method_type", RoutingMethodType.DeepSeekV3
)
The current logic for getting routing_method_type can result in None being passed to the trtllm_fp8_block_scale_moe kernel. If a model using FlashInferFusedMoE does not specify routing_method_type, it defaults to None in FusedMoE.__init__, and getattr(layer, "routing_method_type", ...) then returns None because the attribute exists.
The call site previously used a hardcoded routing method, and the kernel likely does not handle None, which could lead to a runtime error. To make this more robust and preserve backward compatibility, it's better to explicitly check for None and fall back to RoutingMethodType.DeepSeekV3.
Suggested change:
- routing_method_type = getattr(
-     layer, "routing_method_type", RoutingMethodType.DeepSeekV3
- )
+ routing_method_type = getattr(layer, "routing_method_type", None)
+ if routing_method_type is None:
+     routing_method_type = RoutingMethodType.DeepSeekV3
Dependency
Require flashinfer-python >= 0.5.0
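For example, the requirement can be satisfied with (illustrative command):

pip install "flashinfer-python>=0.5.0"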
Usage
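A minimal launch sketch; the model path is a placeholder, and the --enable-flashinfer-trtllm-moe flag is assumed to be the switch for this kernel path rather than confirmed by this excerpt:

# Sketch only: placeholder model path; the backend flag is an assumption.
python -m sglang.launch_server \
    --model-path <qwen3-next-fp8-checkpoint> \
    --enable-flashinfer-trtllm-moe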
Accuracy Tests
Triton
Flashinfer TRTLLM-GEN-MoE
Benchmarking and Profiling
Triton
Flashinfer TRTLLM-GEN-MoE
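The throughput comparison itself is summarized above; a hedged sketch of how such a serving benchmark is typically driven against the running server, assuming sglang.bench_serving with a random dataset (request counts and sequence lengths below are illustrative, not the PR's exact configuration):

# Illustrative benchmark invocation; lengths and prompt count are placeholders.
python -m sglang.bench_serving \
    --backend sglang \
    --dataset-name random \
    --random-input-len 1024 \
    --random-output-len 1024 \
    --num-prompts 512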
Motivation
Modifications
Checklist