fix: use correct SMEM capacity for SM120 consumer Blackwell GPUs by brandonmmusic-max · Pull Request #2835 · flashinfer-ai/flashinfer

brandonmmusic-max · 2026-03-20T04:49:28Z

Summary

SM120 consumer Blackwell GPUs (RTX PRO 6000, RTX 5090, RTX 5080) have 99KB shared memory, but the CuTe DSL MoE kernels hardcode "sm_100" for the SMEM capacity lookup, which returns 227KB (the SM100/B200 capacity). This causes _compute_stages() to compute pipeline stage counts based on 2.3x more SMEM than physically available on SM120.

The Bug

# In all 4 Blackwell CuTe DSL grouped GEMM kernels:
self.num_smem_capacity = utils.get_smem_capacity_in_bytes("sm_100")
#                                                         ^^^^^^
# SM100 (B200/B300) = 227KB SMEM
# SM120 (RTX PRO 6000, RTX 5090) = 99KB SMEM

SMEM capacity by architecture:

GPU	SM	SMEM
B200, B300, GB200	SM100	227KB
RTX PRO 6000	SM120	99KB
RTX 5090	SM120	99KB
RTX 5080	SM120	99KB

The Fix

Add get_blackwell_smem_arch() helper in blackwell/utils.py that auto-detects SM120 vs SM100 via torch.cuda.get_device_capability() and returns the correct architecture string. All 4 affected kernel files now use dynamic detection.

Files Changed

flashinfer/fused_moe/cute_dsl/blackwell/utils.py — new get_blackwell_smem_arch() helper
flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_grouped_gemm.py
flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_grouped_gemm_swiglu_fusion.py
flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py

Impact

This fix is only relevant for SM120 consumer Blackwell GPUs. SM100 data center GPUs are unaffected (the helper returns "sm_100" for them, preserving existing behavior).

On SM120, _compute_stages() will now correctly compute 3-5 pipeline stages instead of requesting 7-12 stages that overflow 99KB SMEM. This should improve MoE GEMM throughput for users running NVFP4 models (Qwen3.5-397B, DeepSeek, etc.) on RTX PRO 6000 and RTX 5090 hardware.

Testing

Tested on 4x NVIDIA RTX PRO 6000 Blackwell (SM120, 96GB GDDR7) with Qwen3.5-397B-A17B-NVFP4:

get_smem_capacity_in_bytes("sm_100") returns 232448 (227KB)
get_smem_capacity_in_bytes("sm_120") returns 101376 (99KB)
get_blackwell_smem_arch() correctly returns "sm_120" on this hardware

CUTLASS [Feat] flashinfer.segmented_top_k: Segmented TopK Implementation from CCCL #3096 — SM120 NVFP4 MoE patches (K=64 tile shapes)
FlashInfer fix: Add SM120 (RTX Blackwell desktop) support for NVFP4 MoE kernels #2725 — SM120 MoE capability check fix
FlashInfer feat: K=64 block-scaled MoE GEMM for SM120 (RTX PRO 6000) #2786 — K=64 CUTLASS kernel tiles for SM120

Summary by CodeRabbit

Bug Fixes
- Improved Blackwell GPU kernel shared-memory configuration to dynamically detect device architecture capabilities at runtime instead of using fixed settings, enhancing compatibility across different GPU compute capabilities.

SM120 consumer Blackwell GPUs (RTX PRO 6000, RTX 5090) have 99KB shared memory, but the CuTe DSL MoE kernels hardcode sm_100 (227KB) for the SMEM capacity lookup. This causes _compute_stages to over-allocate pipeline stages on SM120, leading to suboptimal performance. Add get_blackwell_smem_arch() helper that auto-detects SM120 vs SM100 and returns the correct architecture string. All 4 Blackwell grouped GEMM kernels now use dynamic detection instead of hardcoded sm_100. Affected hardware: RTX PRO 6000 (SM120), RTX 5090 (SM120), RTX 5080 (SM120) Not affected: B200, B300, GB200 (SM100) — these already have 227KB SMEM Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai · 2026-03-20T04:49:47Z

📝 Walkthrough

Walkthrough

This PR introduces a new get_blackwell_smem_arch() helper function that dynamically selects Blackwell shared-memory architecture strings based on CUDA device capability, and applies it across four kernel modules to replace hardcoded "sm_100" values in memory capacity calculations.

Changes

Cohort / File(s)	Summary
Shared-memory architecture detection utility `flashinfer/fused_moe/cute_dsl/blackwell/utils.py`	Introduced `get_blackwell_smem_arch()` function that checks CUDA availability and device compute capability, returning `"sm_120"` for compute capability 12 and `"sm_100"` as fallback.
Grouped GEMM kernel modules `flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_grouped_gemm.py`, `blockscaled_contiguous_grouped_gemm_finalize_fusion.py`, `blockscaled_contiguous_grouped_gemm_swiglu_fusion.py`, `blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py`	Updated shared-memory capacity selection in kernel `__init__` methods to use `get_blackwell_smem_arch()` instead of hardcoded `"sm_100"`, enabling device-specific SMEM sizing.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related issues

SM120 (RTX Blackwell) NVFP4 MoE: CUTLASS Grouped Block-Scaled GEMM Produces Invalid Output; FlashInfer Requires Extensive Patching #2723: The new get_blackwell_smem_arch() helper directly enables SM120/Blackwell support for grouped GEMM kernels via runtime device capability detection.

Possibly related PRs

Fix/dsl smem query #2178: Both PRs replace hardcoded "sm_100" architecture strings with dynamic, device-specific SMEM capacity lookup during kernel initialization.
fix: Add SM120 (RTX Blackwell desktop) support for NVFP4 MoE kernels #2725: Both PRs enable SM120/Blackwell support by moving from fixed sm_100 assumptions to SM120-aware architecture selection logic.

Suggested labels

run-ci, op: moe

Suggested reviewers

yzh119
nv-yunzheq
IwakuraRein
jiahanc
aleozlx

Poem

🐇 A Blackwell rabbit hops with glee,
With SM versions one-two-zero, one-zero-zero,
Dynamic SMEM capacity—no more hardcoded foe,
Device-aware kernels steal the show! ✨

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately describes the main change: fixing SMEM capacity detection for SM120 consumer Blackwell GPUs by replacing hardcoded SM100 values with dynamic detection.
Description check	✅ Passed	The description is comprehensive, including the problem statement, the fix explanation, files changed, impact analysis, and testing verification. All key sections are well-documented.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

Important

Merge conflicts detected (Beta)

Resolve merge conflict in branch fix/sm120-smem-capacity

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

📝 Coding Plan

Generate coding plan for human review comments

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

Tip

You can disable poems in the walkthrough.

Disable the reviews.poem setting to disable the poems in the walkthrough.

gemini-code-assist · 2026-03-20T04:49:48Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request resolves a critical performance issue on SM120 consumer Blackwell GPUs by correcting the shared memory capacity detection within CuTe DSL MoE kernels. Previously, these kernels assumed a larger shared memory capacity, leading to inefficient resource allocation. The changes introduce dynamic GPU architecture detection, ensuring that kernels utilize the accurate shared memory size, thereby optimizing performance for relevant hardware without impacting data center GPUs.

Highlights

Bug Fix for SM120 GPUs: The CuTe DSL MoE kernels incorrectly hardcoded 'sm_100' for shared memory (SMEM) capacity lookup, leading to an overestimation of SMEM (227KB instead of the actual 99KB) on SM120 consumer Blackwell GPUs (RTX PRO 6000, RTX 5090, RTX 5080).
Dynamic Architecture Detection: A new helper function, get_blackwell_smem_arch(), was added to blackwell/utils.py. This function dynamically detects the GPU's SM architecture (SM100 or SM120) using torch.cuda.get_device_capability() and returns the correct architecture string.
Kernel Updates: All four affected CuTe DSL grouped GEMM kernels have been updated to use the new get_blackwell_smem_arch() function, ensuring that the correct SMEM capacity is used for pipeline stage computations.
Performance Improvement: This fix ensures that _compute_stages() correctly calculates pipeline stages based on the actual 99KB SMEM on SM120 GPUs, preventing SMEM overflow and significantly improving MoE GEMM throughput for users running NVFP4 models on these consumer Blackwell GPUs.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request correctly addresses a bug where the SMEM capacity for SM120 consumer Blackwell GPUs was hardcoded incorrectly. The introduction of the get_blackwell_smem_arch helper function to dynamically detect the SM architecture is a good solution. The changes are applied consistently across all affected kernel files. I've added one suggestion to cache the result of the new helper function to avoid redundant calls, which can improve performance.

gemini-code-assist · 2026-03-20T04:52:53Z

+def get_blackwell_smem_arch() -> str:
+    """Return the correct SM architecture string for SMEM capacity lookup.
+
+    SM100 (B200/B300 data center) has 227KB shared memory.
+    SM120/SM121 (RTX PRO 6000/RTX 5090 consumer) has 99KB shared memory.
+
+    Using the wrong capacity causes _compute_stages to over-allocate pipeline
+    stages that don't fit in physical SMEM, degrading performance on SM120.
+    """
+    import torch
+
+    if not torch.cuda.is_available():
+        return "sm_100"  # fallback
+    major, minor = torch.cuda.get_device_capability()
+    if major == 12:
+        return "sm_120"
+    return "sm_100"


This function may be called multiple times within the same process. To improve performance by avoiding redundant calls to torch.cuda.get_device_capability(), it's a good practice to cache its result. The device capability will not change during the execution of the program. You can use a decorator for this.

Suggested change

def get_blackwell_smem_arch() -> str:

"""Return the correct SM architecture string for SMEM capacity lookup.

SM100 (B200/B300 data center) has 227KB shared memory.

SM120/SM121 (RTX PRO 6000/RTX 5090 consumer) has 99KB shared memory.

Using the wrong capacity causes _compute_stages to over-allocate pipeline

stages that don't fit in physical SMEM, degrading performance on SM120.

"""

import torch

if not torch.cuda.is_available():

return "sm_100" # fallback

major, minor = torch.cuda.get_device_capability()

if major == 12:

return "sm_120"

return "sm_100"

@__import__("functools").lru_cache(maxsize=None)

def get_blackwell_smem_arch() -> str:

"""Return the correct SM architecture string for SMEM capacity lookup.

SM100 (B200/B300 data center) has 227KB shared memory.

SM120/SM121 (RTX PRO 6000/RTX 5090 consumer) has 99KB shared memory.

Using the wrong capacity causes _compute_stages to over-allocate pipeline

stages that don't fit in physical SMEM, degrading performance on SM120.

"""

import torch

if not torch.cuda.is_available():

return "sm_100" # fallback

major, minor = torch.cuda.get_device_capability()

if major == 12:

return "sm_120"

return "sm_100"

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/fused_moe/cute_dsl/blackwell/utils.py`:
- Line 74: The assignment unpacks torch.cuda.get_device_capability() into major
and minor but minor is unused; update the unpacking in the code that calls
torch.cuda.get_device_capability() (the line assigning major, minor) to use an
underscore for the unused value (e.g., major, _minor or major, _) so linters
stop complaining while keeping the major variable intact.

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 4a4eccaf-212a-4a64-afda-a3d9f0f405e1

📥 Commits

Reviewing files that changed from the base of the PR and between ad893cf and 862f04a.

📒 Files selected for processing (5)

flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_gather_grouped_gemm_swiglu_fusion.py
flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_grouped_gemm.py
flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_grouped_gemm_finalize_fusion.py
flashinfer/fused_moe/cute_dsl/blackwell/blockscaled_contiguous_grouped_gemm_swiglu_fusion.py
flashinfer/fused_moe/cute_dsl/blackwell/utils.py

coderabbitai · 2026-03-20T04:53:47Z

+
+    if not torch.cuda.is_available():
+        return "sm_100"  # fallback
+    major, minor = torch.cuda.get_device_capability()


⚠️ Potential issue | 🟡 Minor

Avoid the unused variable warning at Line 74.

minor is unpacked but never used; rename it to _minor (or _) to keep lint clean.

🔧 Minimal fix

- major, minor = torch.cuda.get_device_capability() + major, _minor = torch.cuda.get_device_capability()

🧰 Tools

🪛 Ruff (0.15.6)

[warning] 74-74: Unpacked variable minor is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@flashinfer/fused_moe/cute_dsl/blackwell/utils.py` at line 74, The assignment unpacks torch.cuda.get_device_capability() into major and minor but minor is unused; update the unpacking in the code that calls torch.cuda.get_device_capability() (the line assigning major, minor) to use an underscore for the unused value (e.g., major, _minor or major, _) so linters stop complaining while keeping the major variable intact.

bkryu · 2026-03-25T05:22:26Z

Hi @brandonmmusic-max, I appreciate the PR, but the cute-dsl MoE kernels are intended for SM100 and SM103 only and are not compatible for SM120.

It is not just the shared memory size differences. The tensor core architectures are inherently different, which makes the MoE API not usable for SM120. We may add an SM120 CuTe DSL MoE in the future, but it will require an entire rewrite of existing kernels

brandonmmusic-max · 2026-03-25T13:00:03Z

Hi @brandonmmusic-max, I appreciate the PR, but the cute-dsl MoE kernels are intended for SM100 and SM103 only and are not compatible for SM120.

It is not just the shared memory size differences. The tensor core architectures are inherently different, which makes the MoE API not usable for SM120. We may add an SM120 CuTe DSL MoE in the future, but it will require an entire rewrite of existing kernels

thank you for the response! Just trying to be helpful to open source community. I'll close this!

brandonmmusic-max requested review from IwakuraRein, aleozlx, jiahanc, nv-yunzheq and yzh119 as code owners March 20, 2026 04:49

gemini-code-assist Bot reviewed Mar 20, 2026

View reviewed changes

coderabbitai Bot reviewed Mar 20, 2026

View reviewed changes

brandonmmusic-max closed this Mar 25, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: use correct SMEM capacity for SM120 consumer Blackwell GPUs#2835

fix: use correct SMEM capacity for SM120 consumer Blackwell GPUs#2835
brandonmmusic-max wants to merge 1 commit intoflashinfer-ai:mainfrom
brandonmmusic-max:fix/sm120-smem-capacity

brandonmmusic-max commented Mar 20, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Mar 20, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Merge conflicts detected (Beta)

Uh oh!

gemini-code-assist Bot commented Mar 20, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Mar 20, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot Mar 20, 2026

Uh oh!

bkryu commented Mar 25, 2026

Uh oh!

brandonmmusic-max commented Mar 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

brandonmmusic-max commented Mar 20, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The Bug

The Fix

Files Changed

Impact

Testing

Related

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Mar 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Suggested labels

Suggested reviewers

Poem

Merge conflicts detected (Beta)

Uh oh!

gemini-code-assist Bot commented Mar 20, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

bkryu commented Mar 25, 2026

Uh oh!

brandonmmusic-max commented Mar 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

brandonmmusic-max commented Mar 20, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Mar 20, 2026 •

edited

Loading