[CuTeDSL] Add BF16 Grouped GEMM example for Hopper (SM90)#3059
Closed
vruga wants to merge 1 commit into NVIDIA:main from
Conversation
Adds `examples/python/CuTeDSL/hopper/grouped_gemm.py`, a Python/CuTeDSL grouped GEMM kernel targeting the NVIDIA Hopper SM90 architecture. This is the CuTeDSL equivalent of the C++ example `57_hopper_grouped_gemm`, which only supports FP8. The new example adds BF16 (and Float16) support and a full Python implementation.

Key design points:

- Uses SM90 WGMMA (warp group MMA) instead of SM100 tcgen05.
- Register-based accumulators (`make_rmem_tensor`); SM90 has no TMEM.
- `PipelineTmaAsync` for the A/B mainloop pipeline.
- Warp specialization: a DMA warp group (TMA loads + tensormap A/B updates) and one or two MMA warp groups (WGMMA + epilogue + tensormap C updates).
- `TensorMapManager` (arch-agnostic) for per-group TMA descriptor updates, supporting both SMEM and GMEM update modes.
- `StaticPersistentGroupTileScheduler` for persistent multi-group scheduling.
- Supports BF16/Float16 inputs; Float16/BFloat16/Float32 outputs.
- Includes host-side helpers, a reference check, and a benchmarking harness.

Closes NVIDIA#3040

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
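To illustrate what the kernel computes (not the CuTeDSL implementation itself), here is a minimal NumPy reference sketch of a grouped GEMM. The per-group problem sizes below are hypothetical; a real MoE workload would have one group per expert, each with its own token count. BF16 is emulated with float32 here, since the actual kernel consumes BF16/FP16 inputs and accumulates in FP32.

```python
import numpy as np

# Hypothetical per-group problem sizes (M, N, K). Each group is an
# independent GEMM; M typically differs per group (e.g. tokens per expert).
problems = [(128, 256, 64), (96, 256, 64), (200, 256, 64)]

rng = np.random.default_rng(0)
As = [rng.standard_normal((m, k)).astype(np.float32) for m, n, k in problems]
Bs = [rng.standard_normal((k, n)).astype(np.float32) for m, n, k in problems]

# Grouped GEMM semantics: C_i = A_i @ B_i for every group i, launched as
# one persistent kernel rather than one kernel call per group.
Cs = [a @ b for a, b in zip(As, Bs)]

for (m, n, k), c in zip(problems, Cs):
    assert c.shape == (m, n)
```

A reference check like the one the PR mentions would compare the device results group by group against such a host-side loop.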
What happened to this PR? Is there anything wrong with the implementation?
Author
I submitted the same PR again after verifying it.
Summary
Adds `examples/python/CuTeDSL/hopper/grouped_gemm.py`, a Python/CuTeDSL grouped GEMM kernel for the NVIDIA Hopper SM90 architecture with BF16 (and Float16) input support. This is the CuTeDSL equivalent of `examples/57_hopper_grouped_gemm`, which only supports FP8. This PR adds a Python implementation that also covers the BF16 use case, the precision most commonly required for accuracy-sensitive workloads like Mixture-of-Experts (MoE) serving on Hopper.