Use FI tinygemm for faster BF16 GEMMs #39921
askliar wants to merge 7 commits into vllm-project:main
Conversation
…ion) For batch size 1 (and up to M=8), replace cuBLAS BF16 GEMM with FlashInfer's tinygemm_bf16 kernel which is latency-optimized for small M using warp-specialized TMA + HMMA design from TRT-LLM. This affects all unquantized BF16 linear layers: attention QKV/O projections, dense MLP up/down projections, shared MoE experts, and latent MoE projections. Requires SM90+ (Hopper/Blackwell). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
askliar force-pushed from e821244 to f6c2425
Code Review
This pull request introduces support for B12X NVFP4 dense linear and fused MoE kernels for Blackwell (SM120/SM121) GPUs and optimizes unquantized GEMM operations for small batch sizes on SM90+ hardware using FlashInfer's tinygemm kernel. A critical correctness issue was identified in the default_unquantized_gemm implementation where multi-dimensional input tensors were not correctly handled, potentially leading to shape mismatches or incorrect kernel triggering. A code suggestion was provided to flatten the tokens and correctly calculate the number of tokens for the threshold check and output tensor allocation.
```python
if (
    _tinygemm_bf16_available()
    and x.shape[0] <= 8
    and x.dtype == torch.bfloat16
    and weight.dtype == torch.bfloat16
    and weight.shape[0] % 16 == 0
    and x.is_contiguous()
    and weight.is_contiguous()
    and (bias is None or bias.dtype == torch.bfloat16)
):
    from vllm.utils.flashinfer import flashinfer_tinygemm_bf16

    out = torch.empty(
        x.shape[0], weight.shape[0], dtype=torch.bfloat16, device=x.device
    )
    flashinfer_tinygemm_bf16(x, weight, out, bias=bias)
    return out
```
The current implementation of default_unquantized_gemm using tinygemm_bf16 has a correctness issue when the input tensor x is multi-dimensional (e.g., [batch, seq, hidden]).
- Incorrect Tiny Detection: `x.shape[0] <= 8` only checks the first dimension. For a 3D tensor like `[1, 1024, hidden]`, this would incorrectly trigger the tiny GEMM path for 1024 tokens because `shape[0]` is 1.
- Shape Mismatch: The output tensor `out` is allocated as `[x.shape[0], weight.shape[0]]`, which ignores intermediate dimensions (like `seq`). This will lead to a shape mismatch or crash when returning to the caller.
- Kernel Application: The `tinygemm` kernel should be applied to a flattened view of the tokens to correctly handle multi-dimensional inputs.
I suggest calculating the total number of tokens M and using it for both the threshold check and the tensor views.
```diff
-if (
-    _tinygemm_bf16_available()
-    and x.shape[0] <= 8
-    and x.dtype == torch.bfloat16
-    and weight.dtype == torch.bfloat16
-    and weight.shape[0] % 16 == 0
-    and x.is_contiguous()
-    and weight.is_contiguous()
-    and (bias is None or bias.dtype == torch.bfloat16)
-):
-    from vllm.utils.flashinfer import flashinfer_tinygemm_bf16
-
-    out = torch.empty(
-        x.shape[0], weight.shape[0], dtype=torch.bfloat16, device=x.device
-    )
-    flashinfer_tinygemm_bf16(x, weight, out, bias=bias)
-    return out
+num_tokens = x.numel() // x.shape[-1]
+if (
+    _tinygemm_bf16_available()
+    and num_tokens <= 8
+    and x.dtype == torch.bfloat16
+    and weight.dtype == torch.bfloat16
+    and weight.shape[0] % 16 == 0
+    and x.is_contiguous()
+    and weight.is_contiguous()
+    and (bias is None or bias.dtype == torch.bfloat16)
+):
+    from vllm.utils.flashinfer import flashinfer_tinygemm_bf16
+
+    out = x.new_empty((*x.shape[:-1], weight.shape[0]))
+    flashinfer_tinygemm_bf16(x.view(num_tokens, -1), weight,
+                             out.view(num_tokens, -1), bias=bias)
+    return out
```
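As a quick sanity check of the flattening logic, this standalone snippet (illustrative shapes, not part of the PR) shows why `num_tokens` rather than `x.shape[0]` is the right quantity:

```python
import torch

x = torch.randn(1, 4, 1024, dtype=torch.bfloat16)  # [batch, seq, hidden]
weight = torch.randn(2048, 1024, dtype=torch.bfloat16)

num_tokens = x.numel() // x.shape[-1]  # 4 tokens, even though x.shape[0] == 1
out = x.new_empty((*x.shape[:-1], weight.shape[0]))

assert num_tokens == 4
assert out.shape == (1, 4, 2048)                   # leading dims preserved
assert x.view(num_tokens, -1).shape == (4, 1024)   # 2D view passed to the kernel
```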
Handle inputs that may be 3D (batch, seq, hidden) by computing the total token count and reshaping for tinygemm, preserving the original shape in the output. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add @torch._dynamo.assume_constant_result to _tinygemm_bf16_available() so dynamo doesn't trace into importlib/shutil calls that it can't handle. The function returns a hardware-capability constant, so this is safe. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve tinygemm availability at import time and call torch.ops.flashinfer.tinygemm2_op directly instead of going through the lazy wrapper. This avoids importlib calls that dynamo cannot trace in fullgraph AOT compilation mode.

- _init_tinygemm() runs at import: checks SM90+, JIT-compiles and registers the flashinfer::tinygemm2_op custom op
- dispatch_unquantized_gemm() reads a plain bool global
- _tinygemm_unquantized_gemm() only uses tensor ops and the registered custom op — all dynamo-safe

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Follow the established vLLM pattern for FlashInfer ops in compiled code (matching scaled_mm/flashinfer.py):

- Register torch.ops.vllm.tinygemm_bf16 via direct_register_custom_op
- Real impl calls FlashInfer lazily (runs at execution time)
- Fake impl returns empty tensor (used during graph tracing)
- Compiled graph captures the custom op as an opaque FX node

This avoids calling importlib/FlashInfer wrappers during dynamo tracing while keeping the actual FlashInfer kernel in the graph.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
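For readers unfamiliar with that pattern, here is a minimal sketch of what the registration looks like. The import path and exact direct_register_custom_op signature are assumptions based on how other vLLM FlashInfer ops are registered, not the PR's literal code:

```python
import torch
from vllm.utils import direct_register_custom_op  # assumed import path

def _tinygemm_bf16(x: torch.Tensor, weight: torch.Tensor,
                   bias: torch.Tensor | None) -> torch.Tensor:
    # Real impl: the FlashInfer import happens lazily at execution time,
    # so dynamo never traces through importlib.
    from vllm.utils.flashinfer import flashinfer_tinygemm_bf16
    out = x.new_empty((x.shape[0], weight.shape[0]))
    flashinfer_tinygemm_bf16(x, weight, out, bias=bias)
    return out

def _tinygemm_bf16_fake(x: torch.Tensor, weight: torch.Tensor,
                        bias: torch.Tensor | None) -> torch.Tensor:
    # Fake impl: only produces correct shape/dtype metadata for tracing.
    return x.new_empty((x.shape[0], weight.shape[0]))

direct_register_custom_op(
    op_name="tinygemm_bf16",
    op_func=_tinygemm_bf16,
    mutates_args=[],
    fake_impl=_tinygemm_bf16_fake,
)
# Compiled code then calls torch.ops.vllm.tinygemm_bf16(...) as an opaque FX node.
```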
Hi @WoosukKwon @youkaichao @robertgshaw2 @mgoin @tlrmchlsmth @houseroad @hmellor @yewentao256 @ProExpertProg @pavanimajety - this PR is ready for review!

Please review when you have time. Thanks!
@Himan-D do you know if this tinygemm is PDL enabled? We are interested in a general gemm PDL replacement for low latency.
```python
pass


_init_tinygemm()
```
We should not import tinygemm on import of this file. It should be delayed until the first call of the unquantized GEMM function itself.
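A minimal sketch of the deferred-init shape this suggests (hypothetical helper; assumes the PR's _init_tinygemm() can be made to return whether registration succeeded):

```python
_TINYGEMM_AVAILABLE: bool | None = None  # resolved on first use, not at import

def _tinygemm_available() -> bool:
    global _TINYGEMM_AVAILABLE
    if _TINYGEMM_AVAILABLE is None:
        # The first GEMM call pays the one-time capability check / JIT
        # compile; module import stays cheap.
        _TINYGEMM_AVAILABLE = _init_tinygemm()
    return _TINYGEMM_AVAILABLE
```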
```python
capability = current_platform.get_device_capability()
if capability is None or capability[0] < 9:
```
Can we do the capability check only for arches that we have validated? I am wary of enabling this on all NVIDIA GPUs >= Hopper without benchmarks on each platform.
I will do microbenchmarking across multiple platforms to make sure we are honest here.
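One way to express that gating, sketched below with an illustrative allowlist (the actual set would come from the promised microbenchmarks):

```python
from vllm.platforms import current_platform

# Illustrative allowlist of validated (major, minor) compute capabilities,
# e.g. Hopper (9, 0) plus whichever Blackwell parts get benchmarked.
VALIDATED_CAPABILITIES = {(9, 0), (10, 0)}

def _tinygemm_arch_validated() -> bool:
    capability = current_platform.get_device_capability()
    return capability is not None and tuple(capability) in VALIDATED_CAPABILITIES
```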
```python
if bias is None:
    bias = torch.zeros(
        weight.shape[0], dtype=torch.bfloat16, device=x.device,
    )
```
It seems very suboptimal to always be allocating and filling a bias on a low-latency path. I would avoid this.
Seems non-optional (the kernel appears to require a bias tensor). Not sure what a better solution looks like, but agree that filling the bias every time is a pain.
Maybe we can use a pattern like the workspace buffer for FlashInfer attention?
Yes, indeed. I will talk to the FlashInfer kernel author to see if that is something that can be changed.
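A minimal sketch of that workspace-buffer-style idea, assuming the kernel keeps requiring a bias tensor (the cache helper below is hypothetical, not part of the PR):

```python
import torch

# Allocate the all-zeros bias once per (device, out_features) and reuse it,
# so the low-latency path never re-allocates or re-fills it.
_ZERO_BIAS_CACHE: dict[tuple[torch.device, int], torch.Tensor] = {}

def _get_zero_bias(out_features: int, device: torch.device) -> torch.Tensor:
    key = (device, out_features)
    buf = _ZERO_BIAS_CACHE.get(key)
    if buf is None:
        buf = torch.zeros(out_features, dtype=torch.bfloat16, device=device)
        _ZERO_BIAS_CACHE[key] = buf
    return buf
```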
```python
num_tokens <= 8
and x.dtype == torch.bfloat16
and weight.dtype == torch.bfloat16
and weight.shape[0] % 16 == 0
```
I would like to see a microbenchmark script to make sure this GEMM is good across a range of weight shapes; currently this will apply to most weights, small or large.
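A sketch of what such a script could look like (shapes and harness are illustrative; the commented-out lines assume the PR's flashinfer_tinygemm_bf16 wrapper is importable):

```python
import torch
import torch.nn.functional as F

def bench_ms(fn, iters=200, warmup=20):
    # CUDA-event timing; assumes a CUDA device is available.
    for _ in range(warmup):
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

for m in (1, 4, 8):
    for n, k in [(2048, 2048), (4096, 4096), (6144, 4096), (4096, 14336)]:
        x = torch.randn(m, k, dtype=torch.bfloat16, device="cuda")
        w = torch.randn(n, k, dtype=torch.bfloat16, device="cuda")
        t_ref = bench_ms(lambda: F.linear(x, w))
        # To compare against the tinygemm path:
        # from vllm.utils.flashinfer import flashinfer_tinygemm_bf16
        # out = torch.empty(m, n, dtype=torch.bfloat16, device="cuda")
        # t_tiny = bench_ms(lambda: flashinfer_tinygemm_bf16(x, w, out, bias=None))
        print(f"M={m} N={n} K={k}: cuBLAS {t_ref * 1e3:.1f} us")
```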
Hi @askliar, the pre-commit checks have failed. Please run:

```bash
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
@mgoin tinygemm has PDL enabled. Check here: https://github.com/askliar/flashinfer/blob/main/csrc/tinygemm2.cu
Is M <= 8 an arbitrary heuristic or is this the recommended choice? Is this backed by benchmarks?
Hi @askliar, the pre-commit checks have failed. Please run:

```bash
uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files
```

Then, commit the changes and push to your branch.
vadiklyutiy left a comment:
What do you think about routing all BF16 GEMMs through FlashInfer?
We do it for fp4 and fp8.
FlashInfer has autotuning and can choose the best kernel, including this tinygemm.
Also, I found earlier some perf gaps for medium-size batches (M from 257 to 271) where FlashInfer can provide an improvement. See #35467 and #27173.
(But it needs two small FlashInfer modifications, flashinfer-ai/flashinfer#2958 and flashinfer-ai/flashinfer#2914 - both ready to be merged.)
Thanks for the review feedback! A few updates on the current implementation:
Would you like us to make any additional changes?
Thanks for the review! Here are the updates:
Let me know if you have more feedback!
Great question! Yes, the FlashInfer tinygemm_bf16 kernel supports PDL (Programmatic Dependent Launch) via the use_pdl parameter. Looking at the FlashInfer source (flashinfer/gemm/routergemm.py): when use_pdl is set, the kernel overlaps DMA with compute from preceding kernels - ideal for low-latency decode scenarios. Currently our implementation doesn't enable PDL - we could add this as an option. Would you like us to:

This would be particularly useful for PD (Prefill-Decode) disaggregation where we want to overlap the GEMM with previous kernels.
Hey @mgoin - to follow up on your question about PDL: yes, the FlashInfer tinygemm_bf16 supports PDL (Programmatic Dependent Launch) via the use_pdl parameter. Currently we have it disabled (use_pdl=False). For PD workloads, we could add PDL support - would you like us to:

This would help overlap GEMM compute with previous kernel DMA for lower latency.
Good question! Yes, the FlashInfer tinygemm supports PDL via the use_pdl parameter. Currently use_pdl=False. For PD workloads we could add PDL support - want us to add it as a configurable option?
Following up on the PDL question - yes, FlashInfer tinygemm supports PDL (the use_pdl param). Currently use_pdl=False. Could add it as a configurable option for PD workloads.
We should use PDL.
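If the wrapper exposes the flag, this would be a one-argument change at the call site (parameter name use_pdl per the thread above; whether the vLLM wrapper forwards it is an assumption):

```python
# Assumes flashinfer_tinygemm_bf16 forwards use_pdl to the kernel launch.
flashinfer_tinygemm_bf16(x, weight, out, bias=bias, use_pdl=True)
```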
I think if we can just call
Purpose
Route small-M BF16 linears (M ≤ 8) through FlashInfer's tinygemm_bf16 kernel to speed up decode-phase QKV/O, MLP up/down, shared MoE, and latent MoE projections. Uses TRT-LLM's warp-specialized TMA + HMMA design tuned for latency-bound GEMVs. Requires SM90+ and FlashInfer; falls back to F.linear otherwise.

Changes
- Add flashinfer_tinygemm_bf16 to the FlashInfer lazy-import registry.
- Register torch.ops.vllm.tinygemm_bf16 via direct_register_custom_op (real + fake impls) for torch.compile fullgraph compatibility.
- Add _tinygemm_unquantized_gemm with a small-M BF16 predicate and multi-dim input handling; wire it into dispatch_unquantized_gemm() behind an import-time _TINYGEMM_AVAILABLE flag.

Test Plan
- lm_eval accuracy parity on a BF16 model.
- torch.compile fullgraph decode (no graph breaks).