
SM90 Hopper Grouped GEMM Kernels #37

Open

Knarf04 wants to merge 11 commits into tgale96:main from Knarf04:pr-sm90

Conversation


@Knarf04 Knarf04 commented Mar 3, 2026

Overview

This fork adds SM90 (Hopper) support for grouped GEMM via CUTLASS 4.0+, including both cooperative and pingpong kernel schedules. These kernels target MoE (Mixture-of-Experts) workloads where expert routing creates variable-sized GEMMs.
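For readers unfamiliar with the operation: a grouped GEMM fuses many independent GEMMs, each with its own M dimension (tokens routed to that expert) but shared K and N, into a single launch. A minimal NumPy reference of the semantics (the helper name `grouped_gemm_ref` is hypothetical, for illustration only; the PR implements this on-device with CUTLASS):

```python
import numpy as np

def grouped_gemm_ref(x, weights, group_sizes):
    """Reference grouped GEMM: expert e multiplies its slice of tokens
    x[start:start+m_e] (shape [m_e, K]) by its weight weights[e]
    (shape [K, N]). Group sizes vary per expert, as with MoE routing."""
    outputs = []
    start = 0
    for e, m_e in enumerate(group_sizes):
        outputs.append(x[start:start + m_e] @ weights[e])
        start += m_e
    return np.concatenate(outputs, axis=0)

# Tiny example: 3 "experts", K=4, N=2, uneven token counts.
rng = np.random.default_rng(0)
group_sizes = [5, 0, 3]          # an expert may receive zero tokens
x = rng.standard_normal((sum(group_sizes), 4))
w = rng.standard_normal((3, 4, 2))
y = grouped_gemm_ref(x, w, group_sizes)
print(y.shape)  # (8, 2)
```

Launching one kernel over all groups is what lets the scheduler balance work across experts, rather than paying per-GEMM launch and tail-effect costs as a sequential loop of cuBLAS calls would.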

Benchmark

Hardware: NVIDIA H100
Config: up_proj from Qwen/Qwen3-30B-A3B (128 experts, K=2048, N=1536, 65536 tokens)

| Backend | Uniform (TFLOPs) | Mild Skew (TFLOPs) | Extreme Skew (TFLOPs) |
| --- | --- | --- | --- |
| cuBLAS (base) | 129.5 | 118.0 | 112.3 |
| CUTLASS SM80 | 116.9 | 109.2 | 113.2 |
| cuBLAS SM90 (batched) | 85.2 | 95.1 | 97.8 |
| CUTLASS SM90 cooperative | 160.5 | 151.5 | 154.3 |
| CUTLASS SM90 pingpong | 134.0 | 124.7 | 121.1 |

Key takeaways

  • SM90 cooperative achieves up to 1.37x speedup over cuBLAS (under extreme skew) and up to 1.37x over CUTLASS SM80 (under uniform routing).
  • Performance advantage grows under skewed workloads (realistic MoE routing), where load imbalance across experts degrades sequential-launch strategies more than the grouped kernel.
  • SM90 pingpong offers a middle ground, outperforming SM80 CUTLASS across all distributions.
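The skew regimes can be made concrete with an illustrative generator for per-expert token counts. This is an assumption about how such distributions might be produced (Dirichlet-based, with a hypothetical `skewed_group_sizes` helper); it is not necessarily how `benchmark.py` constructs its workloads:

```python
import numpy as np

def skewed_group_sizes(num_experts, total_tokens, alpha, seed=0):
    """Draw per-expert token counts from a Dirichlet distribution.
    Large alpha -> near-uniform routing; small alpha -> heavy skew,
    mimicking imbalanced MoE expert assignment. Illustrative only."""
    rng = np.random.default_rng(seed)
    probs = rng.dirichlet([alpha] * num_experts)
    sizes = np.floor(probs * total_tokens).astype(int)
    sizes[0] += total_tokens - sizes.sum()  # assign flooring remainder
    return sizes

# Matches the benchmark's scale: 128 experts, 65536 tokens.
uniform = skewed_group_sizes(128, 65536, alpha=100.0)
extreme = skewed_group_sizes(128, 65536, alpha=0.1)
print(uniform.sum(), extreme.sum())  # both 65536
```

Under skew, a few experts receive most of the tokens; a sequential per-expert launch strategy then serializes behind the largest GEMMs, while a grouped kernel can keep all SMs busy across groups.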

Building for SM90

```shell
TORCH_CUDA_ARCH_LIST=9.0 GROUPED_GEMM_CUTLASS=1 pip install .
```

Running benchmarks

```shell
python benchmark.py
```
