CUTLASS 4.2.0 #2649

hwu36 · 2025-09-18T03:32:52Z

hwu36
Sep 18, 2025
Maintainer

CuTe DSL

More Python versions are now supported for both x86-64 and aarch64, including
- Python 3.10, 3.11, 3.12, and 3.13
Added new example and updated notebook to get started with CuTe DSL
- Call kernels with dlpack bypassed
- Updates on TensorSSA demonstration
  - Added a section for introducing the broadcast
API updates
- Please refer to DSL API changelog for details
Bug fixings and improvements
- Fixed cute.print_tensor for coordinate tensor
- Fixed cute.print for tuple of layouts
- Fixed frozen object is not properly updated after fully assigned in dynamic control flow
- Fixed assign tuple/list element in a dynamic control flow may cause compilation failure
- Improved error message when CUDA context is not initialized
- Improved docstring of congruent and weakly_congruent

CUTLASS C++

Support for Blackwell SM103 kernels for B300 GPUs.
- Collective mainloop codes: Blockscaled datatypes with support for dense GEMM mainloop
- New GEMM and epilogue dispatch policies for collectives, kernel layers, and builders.
- Kernel codes: Blockscaled datatypes with support for dense GEMM kernel.
Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM103 architecture:
- Blockscaled ultra fp4 dense GEMM.
- Blockscaled ultra fp4 dense grouped GEMM.
Set of unit tests that demonstrate the usage of Blackwell SM103 blockscaled GEMM
- Unit test files with prefix name of sm103_ under GEMM device unit tests.
Support for Blackwell SM121 kernels for DGX Spark GPUs.
- Share the major codes with Blackwell SM120 kernels.
Add support for heuristics-based kernel filtering and autotuning using nvidia-matmul-heuristics to find the best kernels for a given scenario.
- Details please refer to heuristics doc.
Further enhance Blackwell SM100 Attention kernels in example 77.
- Add fused reduction kernel support for cutlass MLA.
- Add softmax skip correction.
- Support for GQA in FMHA backward kernel.
- Fix an issue where get_unmasked_trip_count may return a negative value.
- Fix an issue where mbarriers are initialized with a zero arrival count.
- Fix a corner case issue where the sequence length of q is not a multiple of tile_q.
- Remove tma padding for forward kernel inputs.
Add Blackwell SM100 kernels for MoEs (focusing on Low-Latency inference performance): example 92. It uses TMA (for weights) and CPASYNC (for tokens) to load input matrices and allow only one problem dimension to vary across groups/experts, unlike general Grouped GEMMs. Note: further API simplifications and kernel improvements are upcoming. Any feedback on API is welcome.
Further enhance blockwise and groupwise GEMMs on Hopper and Blackwell
- On Blackwell SM120, a blockwise gemm kernel is added: example 87.
- On Hopper, add K major scale factor support for SM90 blockwise kernels.
- On Hopper, relax the restriction that the k dimension of the problem size has to be the multiple of the k dimension of the tile size.
- On Hopper, grouped version supports the case when k = 0.
Support for Blackwell SM100 fp4 gemv kernels.
- Kernel codes: Gemv kernel.
- Example codes: example 91
Support for Blackwell SM100 legacy mixed input GEMM kernels.
- Collective mainloop codes: Mixed input mainloop.
- Kernel codes: Mixed input kernel.
- Example codes: example 86.
Support for Blackwell SM100 cpasync kernel.
- Collective mainloop codes: cpasync mainloop.
- Kernel codes: cpasync kernel.
Support Blackwell SM120 mixed input blockscaled grouped GEMM.
Instantiating more Blackwell kernels in profiler.
- Blackwell SM100 and SM103 kernels support CUTLASS_LIBRARY_INSTANTIATION_LEVEL to instantiate all possible combinations.
- To use this feature, CUTLASS_LIBRARY_KERNELS must be non-empty. Profiler will combine CUTLASS_LIBRARY_KERNELS and CUTLASS_LIBRARY_INSTANTIATION_LEVEL to instantiate specific kernels.
- Details please check Profiler Doc.
Fix some profiler issues:
- Modify default cluster callback values to none 0 to avoid profiler failure when these values are not set in command line.
- Fix some no output and timeout issues.
- Fix Pingpong Blockwise Hopper library generation.
From CUDA 13.0, the Blackwell SM101 for Thor GPUs is renamed to SM110.
- For CUDA toolkit version < 13.0, SM101 is still used for Thor GPUs.
- For CUDA toolkit version >= 13.0, SM110 is used for Thor GPUs and SM101 is no longer valid.
Rename legacy Python API package from cutlass to cutlass_cppgen and add Blackwell EVT support to legacy Python interface.
- Restructuring the C++ Blackwell SM100 Collective Epilogue Builder to work with the Python interface's EpilogueDescriptors.
- Added Blackwell SM100 EVT Emitter on the Python side and routed most emission through Hopper SM90 Emitter.
- Added some support for running SM100 kernels via the Python interface.
CuTe changes:
- Fix inaccurate GridDim calculation under CuTe tutorial.
- Add movmatrix support.
- Fix smallest MMA-N allowed for Blackwell fp8 and fp16 gemm kernels.
- Support fp16 accmulator for sm89 fp8 mma.
- Shorten nullspace implementation.
- Isolate and comment on cosize hacks.
- Important documentation correction: E<0,1> == 1@0@1.
Fix some kernel issues:
- Fix Hopper SM90 group gemm kernel to only use the commit group and wait group instead of also waiting on mbarriers.
- Fix a tiny bug when K is large for Blackwell SM103 fp4 grouped GEMM kernel.
Add following unit tests:
- fp16 accmulator for sm89 fp8 mma
- movmatrix test
- fp8 narrow mma n and fp16 narrow mma n
Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
Optimal code generation with CUDA toolkit versions 13.0U1.

This discussion was created from the release CUTLASS 4.2.0.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

CUTLASS 4.2.0 #2649

Uh oh!

{{title}}

Uh oh!

Replies: 0 comments

Select a reply

Uh oh!

CUTLASS 4.2.0 #2649

Uh oh!

hwu36 Sep 18, 2025 Maintainer

CuTe DSL

CUTLASS C++

Replies: 0 comments

hwu36
Sep 18, 2025
Maintainer