
SM90 Hopper Grouped GEMM Kernels #37

Open

Knarf04 wants to merge 11 commits into tgale96:main from Knarf04:pr-sm90

Conversation


@Knarf04 Knarf04 commented Mar 3, 2026

Overview

This fork adds SM90 (Hopper) support for grouped GEMM via CUTLASS 4.0+, including both cooperative and pingpong kernel schedules. These kernels target MoE (Mixture-of-Experts) workloads where expert routing creates variable-sized GEMMs.
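For readers unfamiliar with the operation: a grouped GEMM fuses many independent GEMMs, each with its own M dimension (tokens routed to that expert) but shared K and N, into a single launch. A minimal NumPy reference of the semantics (the helper name `grouped_gemm_ref` is hypothetical, for illustration only; the PR implements this on-device with CUTLASS):

```python
import numpy as np

def grouped_gemm_ref(x, weights, group_sizes):
    """Reference grouped GEMM: expert e multiplies its slice of tokens
    x[start:start+m_e] (shape [m_e, K]) by its weight weights[e]
    (shape [K, N]). Group sizes vary per expert, as with MoE routing."""
    outputs = []
    start = 0
    for e, m_e in enumerate(group_sizes):
        outputs.append(x[start:start + m_e] @ weights[e])
        start += m_e
    return np.concatenate(outputs, axis=0)

# Tiny example: 3 "experts", K=4, N=2, uneven token counts.
rng = np.random.default_rng(0)
group_sizes = [5, 0, 3]          # an expert may receive zero tokens
x = rng.standard_normal((sum(group_sizes), 4))
w = rng.standard_normal((3, 4, 2))
y = grouped_gemm_ref(x, w, group_sizes)
print(y.shape)  # (8, 2)
```

Launching one kernel over all groups is what lets the scheduler balance work across experts, rather than paying per-GEMM launch and tail-effect costs as a sequential loop of cuBLAS calls would.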

Benchmark

Hardware: NVIDIA H100
Config: up_proj from Qwen/Qwen3-30B-A3B (128 experts, K=2048, N=1536, 65536 tokens)

| Backend | Uniform (TFLOPs) | Mild Skew (TFLOPs) | Extreme Skew (TFLOPs) |
| --- | --- | --- | --- |
| cuBLAS (base) | 129.5 | 118.0 | 112.3 |
| CUTLASS SM80 | 116.9 | 109.2 | 113.2 |
| cuBLAS SM90 (batched) | 85.2 | 95.1 | 97.8 |
| CUTLASS SM90 cooperative | 160.5 | 151.5 | 154.3 |
| CUTLASS SM90 pingpong | 134.0 | 124.7 | 121.1 |

Key takeaways

  • SM90 cooperative achieves up to 1.37x speedup over cuBLAS (under extreme skew) and up to 1.37x over CUTLASS SM80 (under uniform routing).
  • Performance advantage grows under skewed workloads (realistic MoE routing), where load imbalance across experts degrades sequential-launch strategies more than the grouped kernel.
  • SM90 pingpong offers a middle ground, outperforming SM80 CUTLASS across all distributions.
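The skew regimes can be made concrete with an illustrative generator for per-expert token counts. This is an assumption about how such distributions might be produced (Dirichlet-based, with a hypothetical `skewed_group_sizes` helper); it is not necessarily how `benchmark.py` constructs its workloads:

```python
import numpy as np

def skewed_group_sizes(num_experts, total_tokens, alpha, seed=0):
    """Draw per-expert token counts from a Dirichlet distribution.
    Large alpha -> near-uniform routing; small alpha -> heavy skew,
    mimicking imbalanced MoE expert assignment. Illustrative only."""
    rng = np.random.default_rng(seed)
    probs = rng.dirichlet([alpha] * num_experts)
    sizes = np.floor(probs * total_tokens).astype(int)
    sizes[0] += total_tokens - sizes.sum()  # assign flooring remainder
    return sizes

# Matches the benchmark's scale: 128 experts, 65536 tokens.
uniform = skewed_group_sizes(128, 65536, alpha=100.0)
extreme = skewed_group_sizes(128, 65536, alpha=0.1)
print(uniform.sum(), extreme.sum())  # both 65536
```

Under skew, a few experts receive most of the tokens; a sequential per-expert launch strategy then serializes behind the largest GEMMs, while a grouped kernel can keep all SMs busy across groups.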

Building for SM90

```shell
TORCH_CUDA_ARCH_LIST=9.0 GROUPED_GEMM_CUTLASS=1 pip install .
```

Running benchmarks

```shell
python benchmark.py
```
