Skip to content

fix: use correct SMEM capacity for SM120 consumer Blackwell GPUs#2835

Closed
brandonmmusic-max wants to merge 1 commit intoflashinfer-ai:mainfrom
brandonmmusic-max:fix/sm120-smem-capacity
Closed

fix: use correct SMEM capacity for SM120 consumer Blackwell GPUs#2835
brandonmmusic-max wants to merge 1 commit intoflashinfer-ai:mainfrom
brandonmmusic-max:fix/sm120-smem-capacity

Conversation

@brandonmmusic-max
Copy link
Copy Markdown
Contributor

@brandonmmusic-max brandonmmusic-max commented Mar 20, 2026

Summary

SM120 consumer Blackwell GPUs (RTX PRO 6000, RTX 5090, RTX 5080) have 99KB shared memory, but the CuTe DSL MoE kernels hardcode "sm_100" for the SMEM capacity lookup, which returns 227KB (the SM100/B200 capacity). This causes _compute_stages() to compute pipeline stage counts based on 2.3x more SMEM than physically available on SM120.

The Bug

# In all 4 Blackwell CuTe DSL grouped GEMM kernels:
self.num_smem_capacity = utils.get_smem_capacity_in_bytes("sm_100")
#                                                         ^^^^^^
# SM100 (B200/B300) = 227KB SMEM
# SM120 (RTX PRO 6000, RTX 5090) = 99KB SMEM

SMEM capacity by architecture:

GPU SM SMEM
B200, B300, GB200 SM100 227KB
RTX PRO 6000 SM120 99KB
RTX 5090 SM120 99KB
RTX 5080 SM120 99KB

The Fix

Add get_blackwell_smem_arch() helper in blackwell/utils.py that auto-detects SM120 vs SM100 via torch.cuda.get_device_capability() and returns the correct architecture string. All 4 affected kernel files now use dynamic detection.

Files Changed

  • flashinfer/fused_moe/cute_dsl/blackwell/utils.py — new get_blackwell_smem_arch() helper
  • flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_grouped_gemm.py
  • flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
  • flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_grouped_gemm_swiglu_fusion.py
  • flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py

Impact

This fix is only relevant for SM120 consumer Blackwell GPUs. SM100 data center GPUs are unaffected (the helper returns "sm_100" for them, preserving existing behavior).

On SM120, _compute_stages() will now correctly compute 3-5 pipeline stages instead of requesting 7-12 stages that overflow 99KB SMEM. This should improve MoE GEMM throughput for users running NVFP4 models (Qwen3.5-397B, DeepSeek, etc.) on RTX PRO 6000 and RTX 5090 hardware.

Testing

Tested on 4x NVIDIA RTX PRO 6000 Blackwell (SM120, 96GB GDDR7) with Qwen3.5-397B-A17B-NVFP4:

  • get_smem_capacity_in_bytes("sm_100") returns 232448 (227KB)
  • get_smem_capacity_in_bytes("sm_120") returns 101376 (99KB)
  • get_blackwell_smem_arch() correctly returns "sm_120" on this hardware

Related

Summary by CodeRabbit

  • Bug Fixes
    • Improved Blackwell GPU kernel shared-memory configuration to dynamically detect device architecture capabilities at runtime instead of using fixed settings, enhancing compatibility across different GPU compute capabilities.

SM120 consumer Blackwell GPUs (RTX PRO 6000, RTX 5090) have 99KB shared
memory, but the CuTe DSL MoE kernels hardcode sm_100 (227KB) for the SMEM
capacity lookup. This causes _compute_stages to over-allocate pipeline
stages on SM120, leading to suboptimal performance.

Add get_blackwell_smem_arch() helper that auto-detects SM120 vs SM100 and
returns the correct architecture string. All 4 Blackwell grouped GEMM
kernels now use dynamic detection instead of hardcoded sm_100.

Affected hardware: RTX PRO 6000 (SM120), RTX 5090 (SM120), RTX 5080 (SM120)
Not affected: B200, B300, GB200 (SM100) — these already have 227KB SMEM

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Mar 20, 2026

📝 Walkthrough

Walkthrough

This PR introduces a new get_blackwell_smem_arch() helper function that dynamically selects Blackwell shared-memory architecture strings based on CUDA device capability, and applies it across four kernel modules to replace hardcoded "sm_100" values in memory capacity calculations.

Changes

Cohort / File(s) Summary
Shared-memory architecture detection utility
flashinfer/fused_moe/cute_dsl/blackwell/utils.py
Introduced get_blackwell_smem_arch() function that checks CUDA availability and device compute capability, returning "sm_120" for compute capability 12 and "sm_100" as fallback.
Grouped GEMM kernel modules
flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_grouped_gemm.py, blockscaled_contiguous_grouped_gemm_finalize_fusion.py, blockscaled_contiguous_grouped_gemm_swiglu_fusion.py, blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
Updated shared-memory capacity selection in kernel __init__ methods to use get_blackwell_smem_arch() instead of hardcoded "sm_100", enabling device-specific SMEM sizing.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

Possibly related PRs

Suggested labels

run-ci, op: moe

Suggested reviewers

  • yzh119
  • nv-yunzheq
  • IwakuraRein
  • jiahanc
  • aleozlx

Poem

🐇 A Blackwell rabbit hops with glee,
With SM versions one-two-zero, one-zero-zero,
Dynamic SMEM capacity—no more hardcoded foe,
Device-aware kernels steal the show! ✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main change: fixing SMEM capacity detection for SM120 consumer Blackwell GPUs by replacing hardcoded SM100 values with dynamic detection.
Description check ✅ Passed The description is comprehensive, including the problem statement, the fix explanation, files changed, impact analysis, and testing verification. All key sections are well-documented.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Important

Merge conflicts detected (Beta)

  • Resolve merge conflict in branch fix/sm120-smem-capacity
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

You can disable poems in the walkthrough.

Disable the reviews.poem setting to disable the poems in the walkthrough.

@gemini-code-assist
Copy link
Copy Markdown
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical performance issue on SM120 consumer Blackwell GPUs by correcting the shared memory capacity detection within CuTe DSL MoE kernels. Previously, these kernels assumed a larger shared memory capacity, leading to inefficient resource allocation. The changes introduce dynamic GPU architecture detection, ensuring that kernels utilize the accurate shared memory size, thereby optimizing performance for relevant hardware without impacting data center GPUs.

Highlights

  • Bug Fix for SM120 GPUs: The CuTe DSL MoE kernels incorrectly hardcoded 'sm_100' for shared memory (SMEM) capacity lookup, leading to an overestimation of SMEM (227KB instead of the actual 99KB) on SM120 consumer Blackwell GPUs (RTX PRO 6000, RTX 5090, RTX 5080).
  • Dynamic Architecture Detection: A new helper function, get_blackwell_smem_arch(), was added to blackwell/utils.py. This function dynamically detects the GPU's SM architecture (SM100 or SM120) using torch.cuda.get_device_capability() and returns the correct architecture string.
  • Kernel Updates: All four affected CuTe DSL grouped GEMM kernels have been updated to use the new get_blackwell_smem_arch() function, ensuring that the correct SMEM capacity is used for pipeline stage computations.
  • Performance Improvement: This fix ensures that _compute_stages() correctly calculates pipeline stages based on the actual 99KB SMEM on SM120 GPUs, preventing SMEM overflow and significantly improving MoE GEMM throughput for users running NVFP4 models on these consumer Blackwell GPUs.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request correctly addresses a bug where the SMEM capacity for SM120 consumer Blackwell GPUs was hardcoded incorrectly. The introduction of the get_blackwell_smem_arch helper function to dynamically detect the SM architecture is a good solution. The changes are applied consistently across all affected kernel files. I've added one suggestion to cache the result of the new helper function to avoid redundant calls, which can improve performance.

Comment on lines +61 to +77
def get_blackwell_smem_arch() -> str:
"""Return the correct SM architecture string for SMEM capacity lookup.

SM100 (B200/B300 data center) has 227KB shared memory.
SM120/SM121 (RTX PRO 6000/RTX 5090 consumer) has 99KB shared memory.

Using the wrong capacity causes _compute_stages to over-allocate pipeline
stages that don't fit in physical SMEM, degrading performance on SM120.
"""
import torch

if not torch.cuda.is_available():
return "sm_100" # fallback
major, minor = torch.cuda.get_device_capability()
if major == 12:
return "sm_120"
return "sm_100"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This function may be called multiple times within the same process. To improve performance by avoiding redundant calls to torch.cuda.get_device_capability(), it's a good practice to cache its result. The device capability will not change during the execution of the program. You can use a decorator for this.

Suggested change
def get_blackwell_smem_arch() -> str:
"""Return the correct SM architecture string for SMEM capacity lookup.
SM100 (B200/B300 data center) has 227KB shared memory.
SM120/SM121 (RTX PRO 6000/RTX 5090 consumer) has 99KB shared memory.
Using the wrong capacity causes _compute_stages to over-allocate pipeline
stages that don't fit in physical SMEM, degrading performance on SM120.
"""
import torch
if not torch.cuda.is_available():
return "sm_100" # fallback
major, minor = torch.cuda.get_device_capability()
if major == 12:
return "sm_120"
return "sm_100"
@__import__("functools").lru_cache(maxsize=None)
def get_blackwell_smem_arch() -> str:
"""Return the correct SM architecture string for SMEM capacity lookup.
SM100 (B200/B300 data center) has 227KB shared memory.
SM120/SM121 (RTX PRO 6000/RTX 5090 consumer) has 99KB shared memory.
Using the wrong capacity causes _compute_stages to over-allocate pipeline
stages that don't fit in physical SMEM, degrading performance on SM120.
"""
import torch
if not torch.cuda.is_available():
return "sm_100" # fallback
major, minor = torch.cuda.get_device_capability()
if major == 12:
return "sm_120"
return "sm_100"

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/fused_moe/cute_dsl/blackwell/utils.py`:
- Line 74: The assignment unpacks torch.cuda.get_device_capability() into major
and minor but minor is unused; update the unpacking in the code that calls
torch.cuda.get_device_capability() (the line assigning major, minor) to use an
underscore for the unused value (e.g., major, _minor or major, _) so linters
stop complaining while keeping the major variable intact.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4a4eccaf-212a-4a64-afda-a3d9f0f405e1

📥 Commits

Reviewing files that changed from the base of the PR and between ad893cf and 862f04a.

📒 Files selected for processing (5)
  • flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
  • flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_grouped_gemm.py
  • flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
  • flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_grouped_gemm_swiglu_fusion.py
  • flashinfer/fused_moe/cute_dsl/blackwell/utils.py


if not torch.cuda.is_available():
return "sm_100" # fallback
major, minor = torch.cuda.get_device_capability()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Avoid the unused variable warning at Line 74.

minor is unpacked but never used; rename it to _minor (or _) to keep lint clean.

🔧 Minimal fix
-    major, minor = torch.cuda.get_device_capability()
+    major, _minor = torch.cuda.get_device_capability()
🧰 Tools
🪛 Ruff (0.15.6)

[warning] 74-74: Unpacked variable minor is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/fused_moe/cute_dsl/blackwell/utils.py` at line 74, The assignment
unpacks torch.cuda.get_device_capability() into major and minor but minor is
unused; update the unpacking in the code that calls
torch.cuda.get_device_capability() (the line assigning major, minor) to use an
underscore for the unused value (e.g., major, _minor or major, _) so linters
stop complaining while keeping the major variable intact.

@bkryu
Copy link
Copy Markdown
Collaborator

bkryu commented Mar 25, 2026

Hi @brandonmmusic-max, I appreciate the PR, but the cute-dsl MoE kernels are intended for SM100 and SM103 only and are not compatible for SM120.

It is not just the shared memory size differences. The tensor core architectures are inherently different, which makes the MoE API not usable for SM120. We may add an SM120 CuTe DSL MoE in the future, but it will require an entire rewrite of existing kernels

@brandonmmusic-max
Copy link
Copy Markdown
Contributor Author

Hi @brandonmmusic-max, I appreciate the PR, but the cute-dsl MoE kernels are intended for SM100 and SM103 only and are not compatible for SM120.

It is not just the shared memory size differences. The tensor core architectures are inherently different, which makes the MoE API not usable for SM120. We may add an SM120 CuTe DSL MoE in the future, but it will require an entire rewrite of existing kernels

thank you for the response! Just trying to be helpful to open source community. I'll close this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants