
fix(gdn): use physical SM count for SM100 persistent prefill kernel#3155

Merged
kahyunnam merged 4 commits into flashinfer-ai:main from arpera:fix-backwell-gdn
Apr 25, 2026

Conversation

@arpera
Contributor

@arpera arpera commented Apr 23, 2026

📌 Description

Fixes the num_sm issue that CodeRabbit flagged on #3001 but that was not applied before merge: #3001 (comment)

The raw HardwareInfo().get_max_active_clusters(1) call returns 0 or stale values in spawned subprocesses (e.g. vLLM's EngineCore workers) where the CUDA driver API context has not yet been made current. The persistent tile scheduler then leaves some CTAs without any work, and the kernel deadlocks on the first call. This PR switches to get_num_sm(q.device), matching the SM120 MoE dispatch.
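A minimal sketch of the failure mode and the fix. The helper names and the SM count here are hypothetical stand-ins; the real change lives in flashinfer/gdn_kernels/blackwell/gdn_prefill.py:

```python
# Hypothetical stand-in for the dispatch logic described above. The probe
# value models HardwareInfo().get_max_active_clusters(1), which can return 0
# in a spawned subprocess where no CUDA context has been made current yet.

def grid_size_before_fix(probed_clusters: int, physical_sm_count: int) -> int:
    # Trusting the probe: a stale 0 propagates into the persistent
    # scheduler's grid, leaving some CTAs with no work (deadlock).
    return min(probed_clusters, physical_sm_count) if probed_clusters else 0

def grid_size_after_fix(probed_clusters: int, physical_sm_count: int) -> int:
    # The fix: size the persistent grid from the physical SM count
    # (get_num_sm(q.device)) and ignore the context-dependent probe.
    return physical_sm_count

# In a healthy parent process both paths agree; in a spawned worker only
# the fixed path produces a usable grid.
assert grid_size_before_fix(0, 148) == 0      # broken: zero-sized grid
assert grid_size_after_fix(0, 148) == 148     # fixed: full physical SM count
```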

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Refactor
    • Kernel compilation now derives device-specific SM and cluster counts at runtime, improving GPU resource allocation and leading to more consistent performance across different CUDA devices.

@coderabbitai
Contributor

coderabbitai Bot commented Apr 23, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a8731a58-4907-459e-a8b5-4c8df701c516

📥 Commits

Reviewing files that changed from the base of the PR and between 1f337a3 and c7f1020.

📒 Files selected for processing (1)
  • flashinfer/gdn_kernels/blackwell/gdn_prefill.py

📝 Walkthrough

Walkthrough

Kernel compilation for the Blackwell GDN prefill now derives num_sm from get_num_sm(q.device) and sets max_active_clusters equal to that during the initial compile-once path; compile/cache-and-replay control flow and execution logic are unchanged.

Changes

Cohort / File(s) | Summary
  • Kernel compilation update (flashinfer/gdn_kernels/blackwell/gdn_prefill.py): Replace use of cutlass.utils.HardwareInfo for SM-related values: derive num_sm with get_num_sm(q.device) and set max_active_clusters to the same value for the initial compile-once inputs to GatedDeltaNetChunkedKernel. Control flow and execution remain unchanged.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~8 minutes

Possibly related PRs

Suggested reviewers

  • yzh119
  • bkryu

Poem

🐰 I nudged the kernel, counted SMs with care,
No hardware gossip, just numbers to spare.
Clusters matched neatly, compile-once in sight,
Cached like a burrow, ready for flight. ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name | Status | Explanation
  • Title check: ✅ Passed. The title clearly and specifically describes the main change: fixing SM count selection for the SM100 persistent prefill kernel by using the physical SM count instead of HardwareInfo.
  • Description check: ✅ Passed. The description includes a detailed explanation of the fix, references the related issue, and confirms completion of pre-commit checks and tests, meeting the template requirements.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.
  • Linked Issues check: ✅ Passed. Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: ✅ Passed. Check skipped because no linked issues were found for this pull request.


Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request updates the hardware information retrieval in the Blackwell GDN prefill kernel by replacing cutlass_utils.HardwareInfo with specialized utility functions. A potential issue was identified where get_max_active_clusters could return zero in certain environments (e.g., spawned subprocesses), which would lead to kernel launch failures. A suggestion was made to provide a fallback to the total number of SMs in such cases.

Comment thread flashinfer/gdn_kernels/blackwell/gdn_prefill.py Outdated
Contributor

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/gdn_kernels/blackwell/gdn_prefill.py`:
- Line 38: Remove the unused import get_max_active_clusters and change the
persistent scheduler grid cap logic to use num_sm directly (instead of
min(get_max_active_clusters(1), num_sm)) so the grid shape never becomes
(0,1,1); update the code locations referencing get_max_active_clusters and the
persistent scheduler grid shape (search for get_max_active_clusters and the
variable num_sm and where the persistent scheduler grid is constructed) to
compute cap = num_sm and use that cap when forming the scheduler grid to avoid a
zero dimension.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 2b94f14f-a888-4860-b268-b68dbacc313e

📥 Commits

Reviewing files that changed from the base of the PR and between 9f7adfb and 66e738c.

📒 Files selected for processing (1)
  • flashinfer/gdn_kernels/blackwell/gdn_prefill.py

Comment thread flashinfer/gdn_kernels/blackwell/gdn_prefill.py Outdated
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Contributor

@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
flashinfer/gdn_kernels/blackwell/gdn_prefill.py (1)

38-38: ⚠️ Potential issue | 🔴 Critical

Avoid the stale active-cluster probe entirely.

Line 162 still calls get_max_active_clusters(1). The or num_sm fallback handles 0/None, but a stale positive value still survives through min(...) and can under-cap the persistent scheduler. For this SM100 path, use the physical SM count directly and drop the import.

Proposed fix:

```diff
-from flashinfer.cute_dsl.utils import get_max_active_clusters, get_num_sm
+from flashinfer.cute_dsl.utils import get_num_sm
@@
         # --- First call: compile the kernel ---
         num_sm = get_num_sm(q.device)
-        max_active_clusters = min(get_max_active_clusters(1) or num_sm, num_sm)
+        max_active_clusters = num_sm
```

Also applies to: 161-162

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/gdn_kernels/blackwell/gdn_prefill.py` at line 38, Remove the stale
active-cluster probe by deleting the import of get_max_active_clusters (and
get_num_sm if unused) and change the logic that currently calls
get_max_active_clusters(1) (and then uses min(...) with num_sm) to use the
physical SM count directly (the num_sm variable / physical SM-count provider)
when computing the persistent scheduler capacity; ensure any min(...) uses only
the real SM count and adjust variable names accordingly so the persistent
scheduler is never capped by a stale probe value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@flashinfer/gdn_kernels/blackwell/gdn_prefill.py`:
- Line 38: Remove the stale active-cluster probe by deleting the import of
get_max_active_clusters (and get_num_sm if unused) and change the logic that
currently calls get_max_active_clusters(1) (and then uses min(...) with num_sm)
to use the physical SM count directly (the num_sm variable / physical SM-count
provider) when computing the persistent scheduler capacity; ensure any min(...)
uses only the real SM count and adjust variable names accordingly so the
persistent scheduler is never capped by a stale probe value.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 241377df-4d17-435a-857e-c6f54aca9b61

📥 Commits

Reviewing files that changed from the base of the PR and between 66e738c and 1f337a3.

📒 Files selected for processing (1)
  • flashinfer/gdn_kernels/blackwell/gdn_prefill.py

@jiahanc jiahanc added the run-ci label Apr 23, 2026
@jiahanc
Collaborator

jiahanc commented Apr 23, 2026

/bot run

Collaborator

@jiahanc jiahanc left a comment


Thanks for the fix!

@flashinfer-bot
Collaborator

GitLab MR !589 has been created, and the CI pipeline #49300979 is currently running. I'll report back once the pipeline job completes.

@vadiklyutiy
Contributor

@jiahanc any idea why we didn't catch it with unit tests?

@jiahanc
Collaborator

jiahanc commented Apr 24, 2026

@jiahanc any idea why we didn't catch it with unit tests?

could be because the unit test doesn't put as much pressure on the kernel as the framework side does, so the bug was not exposed

Member

@kahyunnam kahyunnam left a comment


lgtm, thanks!

@kahyunnam kahyunnam merged commit 5e1318c into flashinfer-ai:main Apr 25, 2026
28 of 34 checks passed
kahyunnam pushed a commit that referenced this pull request Apr 27, 2026
## 📌 Description

Addresses the two remaining CodeRabbit findings on
[#3001](#3001) that
weren't applied before merge:

* **Normalize `scale=0.0` to the default `1/sqrt(d_k)`** before backend
dispatch so the same call gives matching numerics on SM90 and SM100. The
SM90 C++ kernel treats `0.0` as a sentinel for "use default", but the
SM100 CuTe-DSL kernel forwarded the literal `0.0` → zeroed QK → broken
attention.

* **Don't eagerly allocate `output_state`** on the SM100 path when
`output_final_state=False`. The CuTe-DSL kernel drops the buffer anyway,
so the old code wasted a full `[num_seqs, H, 128, 128]` float32 scratch
per call. SM90 still allocates unconditionally because its C++ kernel
always writes into `output_state`.

Dispatcher callsites now pass `output_state` directly on both branches
(no inline `output_state if output_final_state else None`), so SM90 and
SM100 read identically.
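The two fixes above can be sketched as follows. This is a hedged stand-alone illustration; `normalize_scale` and `sm100_scratch_bytes` are hypothetical helper names, not the actual functions in the dispatcher:

```python
import math

def normalize_scale(scale: float, d_k: int) -> float:
    # The SM90 C++ kernel treats 0.0 as a sentinel for "use the default
    # 1/sqrt(d_k)"; normalizing before backend dispatch keeps the SM100
    # CuTe-DSL path from forwarding a literal 0.0, which would zero QK.
    return 1.0 / math.sqrt(d_k) if scale == 0.0 else scale

def sm100_scratch_bytes(num_seqs: int, num_heads: int,
                        output_final_state: bool) -> int:
    # The [num_seqs, H, 128, 128] float32 output_state scratch is only
    # needed when the caller asks for the final state; skipping it
    # otherwise avoids a wasted per-call allocation on the SM100 path.
    return num_seqs * num_heads * 128 * 128 * 4 if output_final_state else 0

assert normalize_scale(0.0, 64) == 1.0 / math.sqrt(64)  # sentinel -> default
assert normalize_scale(0.5, 64) == 0.5                  # explicit value kept
assert sm100_scratch_bytes(4, 8, False) == 0            # no eager scratch
assert sm100_scratch_bytes(4, 8, True) == 4 * 8 * 128 * 128 * 4
```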


## 🔍 Related Issues

* [[feat] Add blackwell GDN prefill
kernel](#3001)
* [fix(gdn): use physical SM count for SM100 persistent prefill kernel #3155](#3155)
* [[fix] fix blackwell gdn accuracy issue #3156](#3156)

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes




## Summary by CodeRabbit

* **Bug Fixes**
* Fixed scale parameter handling to correctly interpret explicit values
and apply default scaling behavior.
* Improved memory efficiency by avoiding unnecessary state allocations
in certain configurations.

* **Improvements**
* Enhanced consistency in kernel invocation logic across different
hardware architectures.
