[NVIDIA] [GDN] Add FlashInfer prefill support for SM100+ (Blackwell) #22921
kaixih wants to merge 3 commits into sgl-project:main
Conversation
The model has a repeated block pattern of 3× linear attention (GDN) + 1× full attention.
The GDN kernel itself is ~19% faster with FlashInfer; the modest system-level gain (~5%) is expected, since the GDN kernels account for only a fraction of end-to-end prefill time.

FlashInfer GDN prefill — kernel breakdown (per layer, 11 launches)
Triton GDN prefill — kernel breakdown (per layer, 12 launches)
The ~80 µs gap between summed kernel times and wall time reflects Python-level kernel launch overhead.
This PR is ready for review.

The CuteDSL kernel's performance is limited by low parallelism when the batch size and number of heads are small, as the kernel benchmark in flashinfer-ai/flashinfer#3001 clearly shows. Depending on how the prefill benchmark is configured, the e2e speedup varies a lot. For example, with 1k or 8k ISL, --chunked-prefill-size 163840, and TP4, the effective batch sizes are 160 and 20 respectively, which hits the higher end of the speedup; with --chunked-prefill-size 8192 the effective batch size is smaller and hits the lower end. In practice, the real speedup will depend on the actual ISL of the workload, and we likely won't see much speedup for long-ISL workloads. The arithmetic is sketched below.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Force-pushed from 23b04c0 to b6c0d39.
q_fi = l2norm_fwd(q[0].contiguous())
k_fi = l2norm_fwd(k[0].contiguous())
We could modify the Triton l2norm_fwd kernel to support strided inputs and eliminate the .contiguous() calls; a sketch follows.
/tag-and-rerun-ci

/rerun-failed-ci
# SM100+ FlashInfer GDN prefill requires CUDA 13+ (CuTe DSL kernel)
# for correctness and best performance.
prefill = self.linear_attn_prefill_backend or self.linear_attn_backend
if (
We should add bf16 state-dtype validation for the SM100+ FlashInfer prefill backend, just as the SM100+ FlashInfer decode backend does:
if (
    decode == "flashinfer"
    and self.mamba_ssm_dtype != "bfloat16"
    and torch.cuda.is_available()
    and torch.cuda.get_device_capability()[0] >= 10
):
Otherwise, a user could run SM100+ FlashInfer prefill with a float32 state, which is unsupported (the module docstring states "SM100+: decode and prefill with bf16 state"), likely causing kernel errors or incorrect results at runtime.
The FlashInfer prefill kernel actually supports fp32, so the current status from FlashInfer is:

prefill: fp32/bf16
decode: bf16

Note that I'm referring to the "fast" kernels we recommend for Blackwell (there are also some "legacy" kernels that are not the focus of this PR).

So with the current code:

- fp32 states: prefill works, but decode will complain
- bf16 states: both prefill and decode work
/rerun-failed-ci

/rerun-failed-ci
[GDN] Add FlashInfer prefill support for SM100+ (Blackwell)
Summary
Extends FlashInfer GDN kernel support to cover the prefill/extend path on SM100+
(Blackwell) hardware, which previously raised NotImplementedError. SM90 (Hopper) prefill was already supported; this PR completes SM100+ coverage.
Accuracy (Qwen3.5-397B-A17B-NVFP4, B200)
gsm8k (200 examples, baseline threshold: 0.95)
GPQA diamond (198 examples, repeat=8, temperature=0.6)
Throughput Benchmark (B200, Qwen3.5-397B-A17B-NVFP4, TP=8)
More detailed perf numbers in the PR comments below.
Server settings:
--tp-size 8 --max-running-requests 256 --chunked-prefill-size 163840
--mamba-ssm-dtype bfloat16 --mamba-scheduler-strategy no_buffer --mamba-track-interval 128
--attention-backend trtllm_mha --linear-attn-decode-backend flashinfer
--linear-attn-prefill-backend <triton|flashinfer> (varied per run)
--disable-radix-cache --quantization modelopt_fp4

Benchmark settings:
--dataset-name random --random-input-len 8192 --random-output-len 128
--max-concurrency 256 --num-prompts 512
Requirements

- chunk_gated_delta_rule (SM100 path)
- nvidia-cutlass-dsl[cu13] >= 4.4.2 (SM100+ only)
- CUDA 13+ (_cuda_major >= 13)