
[CuTeDSL] Add BF16 Grouped GEMM example for Hopper (SM90) #3059

Closed

vruga wants to merge 1 commit into NVIDIA:main from vruga:feat/cutedsl-hopper-bf16-grouped-gemm

Conversation


@vruga vruga commented Feb 23, 2026

Summary

Adds examples/python/CuTeDSL/hopper/grouped_gemm.py, a Python/CuTeDSL grouped GEMM kernel for the NVIDIA Hopper SM90 architecture with BF16 (and Float16) input support.

This is the CuTeDSL equivalent of examples/57_hopper_grouped_gemm, which:

  • exists only in C++, and
  • only supports FP8.

This PR adds a Python implementation that also covers the BF16 use case — the precision most commonly required for accuracy-sensitive workloads like Mixture-of-Experts (MoE) serving on Hopper.
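For readers unfamiliar with the operation, the semantics a grouped GEMM computes can be sketched in plain NumPy (an illustrative sketch, not code from this PR; all names here are hypothetical): each group g carries its own problem shape (M_g, N_g, K_g), and the kernel processes every group in a single launch instead of issuing one GEMM launch per group.

```python
# Hypothetical reference for what a grouped GEMM computes (not the PR's code).
# Each group has an independent problem shape; a single kernel launch would
# cover all of them, which is the point of the grouped variant for MoE serving.
import numpy as np

def grouped_gemm_reference(As, Bs):
    """Per-group matmul: Cs[g] = As[g] @ Bs[g]."""
    return [a @ b for a, b in zip(As, Bs)]

rng = np.random.default_rng(0)
shapes = [(128, 256, 64), (64, 128, 32), (256, 64, 128)]  # (M_g, N_g, K_g)
As = [rng.standard_normal((m, k), dtype=np.float32) for m, _, k in shapes]
Bs = [rng.standard_normal((k, n), dtype=np.float32) for _, n, k in shapes]
Cs = grouped_gemm_reference(As, Bs)
```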

Key design points:
- Uses SM90 WGMMA (warp group MMA) instead of SM100 tcgen05.
- Register-based accumulators (make_rmem_tensor) — SM90 has no TMEM.
- PipelineTmaAsync for the A/B mainloop pipeline.
- Warp specialization: DMA warp group (TMA loads + tensormap A/B updates)
  and one or two MMA warp groups (WGMMA + epilogue + tensormap C updates).
- TensorMapManager (arch-agnostic) for per-group TMA descriptor updates,
  supporting both SMEM and GMEM update modes.
- StaticPersistentGroupTileScheduler for persistent multi-group scheduling.
- Supports BF16/Float16 inputs; Float16/BFloat16/Float32 outputs.
- Includes host-side helpers, reference-check, and benchmarking harness.

Closes NVIDIA#3040

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@vruga vruga closed this Feb 23, 2026
@vruga vruga deleted the feat/cutedsl-hopper-bf16-grouped-gemm branch February 23, 2026 19:21
@jiuzhengWang

What happened to this PR? Is there anything wrong with the implementation?


vruga commented Mar 14, 2026

I resubmitted the same PR after verifying it.
