Releases · NVIDIA/cutlass
CUTLASS 4.2.1
CuTe DSL
- Bug fixes and improvements
 - Fixed an issue when running DSL code with cuda-python 13.0
 - Fixed an issue when running Inductor with DSL code
 - Fixed an issue with unexpected logging when running DSL code in FlashInfer
 - Fixed the issue reported in #2647
 - Fixed an issue with conditionally defining variables outside of dynamic control flow
 
 
CUTLASS C++
- Bypass EVT for no-smem blockwise kernels on Blackwell.
 - Rename cutlass/python/cutlass directory to cutlass/python/cutlass_cppgen.
 
CUTLASS 4.2.0
CuTe DSL
- More Python versions are now supported for both x86-64 and aarch64: Python 3.10, 3.11, 3.12, and 3.13
 
 - Added a new example and updated the getting-started notebook for CuTe DSL
 - Call kernels with DLPack bypassed
 - Updates to the TensorSSA demonstration
 - Added a section introducing broadcasting
 
 
 - API updates
- Please refer to DSL API changelog for details
 
- Bug fixes and improvements
 - Fixed `cute.print_tensor` for coordinate tensors
 - Fixed `cute.print` for tuples of layouts
 - Fixed frozen objects not being properly updated after full assignment in dynamic control flow
 - Fixed a compilation failure when assigning a tuple/list element in dynamic control flow
 - Improved error message when the CUDA context is not initialized
 - Improved docstrings of `congruent` and `weakly_congruent`
 
 
CUTLASS C++
- Support for Blackwell SM103 kernels for B300 GPUs.
 - Collective mainloop code: Blockscaled datatypes with support for the dense GEMM mainloop.
 - New GEMM and epilogue dispatch policies for collectives, kernel layers, and builders.
 - Kernel code: Blockscaled datatypes with support for the dense GEMM kernel.
 
 - Set of examples that demonstrate the usage of the 3.x API for targeting the Blackwell SM103 architecture.
 - Set of unit tests that demonstrate the usage of Blackwell SM103 blockscaled GEMM.
 - Unit test files prefixed with `sm103_` under the GEMM device unit tests.
 - Support for Blackwell SM121 kernels for DGX Spark GPUs.
 - Shares most of its code with the Blackwell SM120 kernels.
 
 - Add support for heuristics-based kernel filtering and autotuning using `nvidia-matmul-heuristics` to find the best kernels for a given scenario.
 - For details, please refer to the heuristics doc.
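A minimal install sketch for the heuristics package named above, assuming it is published on PyPI under the same name:

```bash
# Assumed PyPI package name, matching the tool named in this item.
pip install nvidia-matmul-heuristics
```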
 
 - Further enhance Blackwell SM100 Attention kernels in example 77.
 - Add fused reduction kernel support for CUTLASS MLA.
 - Add softmax skip correction.
 - Support for GQA in the FMHA backward kernel.
 - Fix an issue where `get_unmasked_trip_count` may return a negative value.
 - Fix an issue where mbarriers are initialized with a zero arrival count.
 - Fix a corner case where the sequence length of Q is not a multiple of `tile_q`.
 - Remove TMA padding for forward kernel inputs.
 
 - Add Blackwell SM100 kernels for MoEs (focusing on low-latency inference performance): example 92. It uses TMA (for weights) and CPASYNC (for tokens) to load input matrices and allows only one problem dimension to vary across groups/experts, unlike general grouped GEMMs. Note: further API simplifications and kernel improvements are upcoming. Any feedback on the API is welcome.
 - Further enhance blockwise and groupwise GEMMs on Hopper and Blackwell
 - On Blackwell SM120, a blockwise GEMM kernel is added: example 87.
 - On Hopper, add K-major scale factor support for SM90 blockwise kernels.
 - On Hopper, relax the restriction that the K dimension of the problem size has to be a multiple of the K dimension of the tile size.
 - On Hopper, the grouped version supports the case when k = 0.
 
 - Support for Blackwell SM100 FP4 GEMV kernels.
 - Kernel code: GEMV kernel.
 - Example code: example 91.
 
 - Support for Blackwell SM100 legacy mixed input GEMM kernels.
 - Collective mainloop code: Mixed input mainloop.
 - Kernel code: Mixed input kernel.
 - Example code: example 86.
 
 - Support for the Blackwell SM100 cpasync kernel.
 - Collective mainloop code: cpasync mainloop.
 - Kernel code: cpasync kernel.
 
 - Support Blackwell SM120 mixed input blockscaled grouped GEMM.
 - Instantiating more Blackwell kernels in the profiler.
 - Blackwell SM100 and SM103 kernels support `CUTLASS_LIBRARY_INSTANTIATION_LEVEL` to instantiate all possible combinations.
 - To use this feature, `CUTLASS_LIBRARY_KERNELS` must be non-empty; the profiler combines `CUTLASS_LIBRARY_KERNELS` and `CUTLASS_LIBRARY_INSTANTIATION_LEVEL` to instantiate specific kernels.
 - For details, please check the profiler doc.
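A minimal configure sketch for the instantiation-level feature above, assuming a CMake build of the profiler; the architecture, kernel filter, and level values shown are illustrative only:

```bash
# A non-empty kernel filter is required; the profiler build then combines the
# filter with the instantiation level to decide which kernels to instantiate.
cmake .. \
  -DCUTLASS_NVCC_ARCHS=100a \
  -DCUTLASS_LIBRARY_KERNELS=gemm \
  -DCUTLASS_LIBRARY_INSTANTIATION_LEVEL=max
```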
 
 - Fix some profiler issues:
 - Change default cluster fallback values to be non-zero to avoid profiler failures when these values are not set on the command line.
 - Fix some no-output and timeout issues.
 - Fix Pingpong Blockwise Hopper library generation.
 
 - Starting with CUDA 13.0, the Blackwell SM101 architecture for Thor GPUs is renamed to SM110.
- For CUDA toolkit version < 13.0, SM101 is still used for Thor GPUs.
 - For CUDA toolkit version >= 13.0, SM110 is used for Thor GPUs and SM101 is no longer valid.
 
 - Rename the legacy Python API package from `cutlass` to `cutlass_cppgen` and add Blackwell EVT support to the legacy Python interface.
 - Restructure the C++ Blackwell SM100 Collective Epilogue Builder to work with the Python interface's `EpilogueDescriptors`.
 - Added a Blackwell SM100 EVT emitter on the Python side and routed most emission through the Hopper SM90 emitter.
 - Added some support for running SM100 kernels via the Python interface.
 
 - CuTe changes:
 - Fix inaccurate GridDim calculation in the CuTe tutorial.
 - Add `movmatrix` support.
 - Fix the smallest MMA-N allowed for Blackwell FP8 and FP16 GEMM kernels.
 - Support FP16 accumulator for SM89 FP8 MMA.
 - Shorten the `nullspace` implementation.
 - Isolate and comment on `cosize` hacks.
 - Important documentation correction: `E<0,1> == 1@0@1`.
 - Fix some kernel issues:
 - Fix the Hopper SM90 grouped GEMM kernel to only use commit group and wait group instead of also waiting on mbarriers.
 - Fix a tiny bug when K is large in the Blackwell SM103 FP4 grouped GEMM kernel.
 
 - Add the following unit tests.
 - Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
 - Optimal code generation with CUDA toolkit version 13.0U1.
 
CUTLASS 4.1.0
CuTe DSL
- Add aarch64 support; you can now `pip install nvidia-cutlass-dsl` on GB200 systems (see the install command after this list)!
 - More examples demonstrating how to use CuTe DSL to write peak-performance kernels
 - API updates
- Please refer to FUNCTIONALITY.md for details
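Concretely, installing the DSL on a GB200 (aarch64) system is now the same one-liner as on x86-64:

```bash
# The aarch64 wheel is selected automatically on GB200 systems.
pip install nvidia-cutlass-dsl
```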
 
 
CUTLASS C++
- Further enhance Blackwell SM100 Attention kernels in example 77.
- Add variable sequence length support for FMHA Backward kernel.
 - Add varlen test support to Backward runner.
 - Support empty batch sequences.
 
 - Replace `subbyte_iterator` with `cute::recast_ptr` when constructing logical iterators/arrays.
 - CuTe changes:
 - Rewrite ArithTuple and ScaledBasis for robustness and clarity.
 - Remove buggy and kludgy `get_layoutA|B|C_MN` and friends from Atoms/TiledX.
 - Factor out `print_latex` and friends and rewrite.
 - Factor out `print_svg` and friends and rewrite.
 - Support Blackwell SM100 SIMT packed fp32x2 kernels.
 - Support residual add for implicit GEMM kernels.
 - Various fixes for CUTLASS C++ Python interface's EVT tracer:
 - Add a verifier for SM90 to report invalid input.
 - When adding an edge to the graph, if the edge already exists, add an identity compute node to avoid having multiple parallel edges.
 - Register the tanh, sigmoid, exp, and gelu operations with the Python AST frontend.
 - Replace the NotImplementedError by packing all nodes into a single topological visitor node as a fallback.
 
 - Fix profiler bugs in exhaustive perf search.
- Fix incorrect cluster shape output issue when doing exhaustive search.
 - Fix a bug in profiler grouped GEMM for setting tile scheduler swizzles, cluster shapes, and raster orders.
 
 - Fix some profiler issues:
 - Complete the reference for Blackwell blockwise GEMM kernels.
 - Fix incorrect regex logic for the L1 test.
 
 
CUTLASS 4.0.0
CuTe DSL
CuTe DSL is a Python DSL centered around CuTe's abstractions:
- Enables authoring kernels in Python to reach peak performance on NVIDIA GPUs
 - Core DSL implementation files
 - DSL quick start
 - DSL Overview
 - Educational notebooks for getting started with CuTe DSL
 
CUTLASS C++
- Support for Family-Specific Architecture Features, introduced in CUDA 12.9
 - Further improved Blockwise and Groupwise GEMMs on Hopper and Blackwell
 - Enhance Blackwell SM100 Attention kernels in example 77
 - Add Blackwell SM100 implicit GEMM conv fprop/dgrad/wgrad unit tests
 - New Hopper SM90 FMHA example, similar in design to the existing Blackwell FMHA
 - CuTe enhancements: CuTe C++ reduce op
 - Other functional and performance enhancements
 
CUTLASS 3.9.2
CUTLASS 3.9.1
CUTLASS 3.9.0
- Support for Blackwell SM120 kernels for GeForce GPUs in CUTLASS 3.x API:
- Collective mainloops.
 - New GEMM and epilogue dispatch policies for collectives, kernel layers, and builders.
 - Blackwell SM120 epilogue and full set of EVT fusions.
 
 - Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM120 architecture:
- Blockscaled GEMM with NVFP4 input datatype and BF16 output tensor.
 - Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor with scale factor generation.
 - Blockscaled GEMM with mixed input datatype (MXFP8 and MXFP6) and BF16 output tensor.
 - Grouped GEMM with NVFP4 datatype.
 - Sparse Blockscaled GEMM with MXFP8 input datatype and BF16 output tensor.
 - Sparse Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor.
 
 - Set of unit tests that demonstrate the usage of both sparse and dense Blackwell SM120 blockscaled GEMM.
 - Support for Blackwell SM100 Sparse kernels:
- Collective mainloops.
 
 - Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM100 sparse GEMM.
 - Set of unit tests that demonstrate the usage of sparse and blockscaled sparse Blackwell SM100 GEMM.
 - A new Multi-head Latent Attention (MLA) kernel for the SM100 Blackwell architecture; the CUTLASS example covers the FlashMLA-like weight-absorbed decoding use case.
 - A new FMHA backward kernel for the SM100 Blackwell architecture extends the CUTLASS example to show how the five backward-pass MMAs can be fused into a single kernel to achieve high performance.
 - A new distributed GEMM example for SM100 Blackwell architecture.
 - Enhancement and new support of block-wise and group-wise GEMM for Hopper and Blackwell architectures:
- Enhancement of blockwise GEMM for Hopper architecture.
 - Enhancement of groupwise GEMM for Hopper architecture.
 - Support for grouped GEMM with blockwise and groupwise scaling for Hopper architecture.
 - Support for groupwise GEMM in the CUTLASS profiler.
 - Support for blockwise GEMM for Blackwell architecture.
 - Support for groupwise GEMM for Blackwell architecture.
 - Support for grouped GEMM with blockwise and groupwise scaling for Blackwell architecture.
 
 - Added support for enhanced kernel performance search (auto-tuning) in CUTLASS profiler:
- Sorting performance results by GFLOPs/second: Users can now sort the final performance report based on GFLOPs/second, making it easier to identify the most efficient kernels.
 - Exhaustive search for best kernel performance in GFLOPs/second: The profiler now searches for the best-performing kernel across a range of problem sizes, swizzle sizes, rasterization orders, and dynamic cluster configurations to maximize performance.
 - Performance search under a fixed GEMM shape: Enables exhaustive tuning within a fixed GEMM shape, exploring various kernel parameters to find the best configuration.
 - More detailed introductions and examples for leveraging this feature can be found in profiler.md.
 
 - Support `void` as the D element in SM100 kernel epilogues.
CUTLASS 3.8.0
CUTLASS 3.8 is the first release that supports the NVIDIA Blackwell SM100 architecture.
For a background on Blackwell's new features, please consult the PTX documentation for CUDA 12.8.
- Support for new CuTe building blocks specifically for Blackwell SM100 architecture:
- 5th generation Blackwell Tensor Core instructions (TCGen05) via CuTe MMA atoms.
 - Extensions to Tensor Memory Accelerator via CuTe Copy atoms.
 - Exposure of Blackwell's new tensor memory (note: distinct from TMA) as `tmem` across CuTe as a first-class data locale.
 - Exposure of `tmem -> rmem`, `rmem -> tmem`, and `smem -> tmem` data movement instructions as copy atoms in CuTe.
 - `make_tmem_copy()` utility method to ease creation of tiled copies for tmem copy atoms.
 - Support for new variants of LDSM on Blackwell via CuTe Copy atoms.
 
 - Support for new CUTLASS building blocks specifically for Blackwell SM100 architecture:
- Various narrow precision FP4, FP6, and FP8 formats as well as their block-scaled variants NVFP4, MXFP4, MXFP6, and MXFP8
 - Pipelines that implement Blackwell specific synchronization.
 - Cluster launch control API supporting preferred and fallback cluster shapes.
 - Data types including NVFP4, MXFP4, MXFP6, and MXFP8 and all their supported element and scale factor types.
 - Tile schedulers using Blackwell's Cluster Launch Control (CLC) feature to implement dynamic persistence scheduling for GEMMs, and stream-K.
 - Extensions to testbeds and reference check code for unit tests and CUTLASS profiler.
 
 - Full support for Blackwell SM100 kernels in CUTLASS 3.x API:
- Blackwell-specific kernel layers that:
- Implement a new warp-specialization recipe tuned specifically for Blackwell SM100 architecture.
 - Leverage all the new features such as CLC based tile scheduling, preferred cluster, and TMEM based double buffering of accumulators.
 - Support stream-K load balancing for all kernel types everywhere via composable scheduler support.
 
 - Blackwell collective mainloops that target the TCGen05 MMA instructions (both SS and TS) for
- Non-block scaled data types without support for pointer array and grouped GEMM with TMA
 - Non-block scaled data types with support for pointer array and grouped GEMM with TMA
 - Block scaled data types without support for pointer array and grouped GEMM with TMA
 - Block scaled data types with support for pointer array and grouped GEMM with TMA
 
 - Blackwell collective mainloop for convolution kernels supporting non-block scaled data types for fprop, dgrad, and wgrad.
 - New GEMM, convolution, and epilogue dispatch policies for collectives, kernel layers, and builders.
 - Blackwell epilogue that supports loading accumulators from `tmem` and a full set of EVT fusions.
 - CUTLASS library and profiler integration for block scaled data types for kernel emission, profiling, and verification.
- Support for preferred and fallback cluster shapes via profiler command-line argument parsing to set dynamic cluster shapes.
 - Support for dynamic data types via profiler command-line argument parsing to set the dynamic datatype setting in TCGen05 MMA instruction descriptors.
 - Support for mixed input GEMM kernels on Hopper in the profiler.
 
 - New CUTLASS profiler flag `use-cuda-graphs` to reduce overheads when benchmarking launch-bound kernels.
 - A new 3.x version of grouped GEMM in the CUTLASS library that generates kernels for Hopper and Blackwell. Grouped GEMM support is now enabled in the CUTLASS profiler (`./cutlass_profiler --operation=GroupedGemm --help` for details; see also the sketch after this list).
 - Set of examples that demonstrate the usage of the 3.x API for targeting the Blackwell SM100 architecture:
- Basic FP16 and FP8 GEMMs with minimal changes from Hopper examples, demonstrating ease of migration for off the shelf kernels using the 3.x collective builder API.
 - GEMM with opt-in collective builder schedules showcasing available recipes for Blackwell.
 - Block scaled data type GEMMs targeting Blackwell's native block scaled Tensor Cores.
 - GEMM example demonstrating Blackwell's new preferred cluster support via dynamic cluster shapes for increased occupancy.
 - GEMM with CLC based StreamK scheduler for load balancing.
 - Grouped GEMM for vanilla FP8 data inputs and NVFP4 block scaled inputs.
 - Convolution kernels for fprop, dgrad, and wgrad.
 - Fused multi-head attention fprop kernel supporting FP16/BF16/FP8 data types across head dims of 32, 64, and 128.
 - A new BF16x9 GEMM kernel that emulates FP32 GEMM (SGEMM) using BF16 operations.
 
 - Set of examples that demonstrate the usage of the 3.x API for targeting Hopper architecture:
- A set of new Hopper grouped GEMM kernels that support mixed A and B datatypes.
 - A new Hopper FP8 GEMM with groupwise scaling.
 
 - Documentation updates:
- Quickstart - instantiating a Blackwell block-scaled GEMM.
 - Detailed Blackwell block-scaled GEMM functionality documentation
 - New functionality documentation specifically for the 3.x API, comprehensively documenting all supported kernel types, data types, kernel features, and minimum CUDA toolkit support for 3.x-supported architectures.
 - Updates to compatibility section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures, and Target Architecture.
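Putting the two profiler items above together (grouped GEMM support and the `use-cuda-graphs` flag), a sketch of a profiling session; the exact flag spelling is an assumption, so verify it against the `--help` output:

```bash
# Discover the grouped-GEMM options (command quoted in the notes above).
./cutlass_profiler --operation=GroupedGemm --help

# Hypothetical run with CUDA graphs enabled to reduce launch overheads when
# benchmarking launch-bound kernels; flag syntax is illustrative.
./cutlass_profiler --operation=GroupedGemm --use-cuda-graphs=true
```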
 
 
Note: CUTLASS 3.x builds are known to be failing on Windows platforms for all CUDA toolkits.
The CUTLASS team is working on a fix.
CUTLASS 3.7.0
- A new Hopper blockwise scaling FP8 GEMM where the operands and block scaling tensor are staged via shared memory.
 - Distributed GEMM is an experimental pipelined Tensor Parallelism implementation that utilizes existing CUTLASS kernels and CUDA runtime features, and can hide most of the communication behind computation.
 - Improved persistent grid launch for Hopper kernels with large cluster sizes (>= 4) using the new `make_kernel_hardware_info` API, as shown in example 48.
 - Enabled high-precision accumulation for Hopper FP8 sparse GEMM.
 
CUTLASS 3.6.0
- Hopper structured sparse GEMM.
 - A refactor of the CUTLASS 3.x convolution `kernel::ConvUniversal` API to bring it in line with `gemm::GemmUniversal`. The 3.x convolution API is no longer considered a beta API.
 - An improved mixed input GEMM and a lookup table implementation for the `INT4 x FP8` scale-only mode.
 - EVT nodes for Top-K selection and softmax, and a GEMM example using them.
 - Programmatic Dependent Launch (PDL), which leverages a new Hopper feature to speed up two back-to-back kernels, and its corresponding documentation.
 - A new debugging tool, synclog, for dumping out all synchronization events from within a kernel to a file. Please see the synclog documentation for details, and the enablement sketch at the end of these notes.
 - A new TMA-enabled epilogue for grouped GEMM that brings significant performance improvement, as well as its EVT support.
 - A SIMT-enabled pointer-array epilogue.
 - A new Ping-Pong kernel schedule for Grouped GEMM and some other optimizations.
 - A new instantiation strategy for CUTLASS profiler kernels along with improved documentation for instantiation level in CUTLASS profiler.
 - New hardware support for comparisons and computations of `cutlass::bfloat16_t`.
 - Fixed the use of `isnan` on Windows for `half_t`.
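For the synclog tool mentioned above, a build-time enablement sketch; the `CUTLASS_ENABLE_SYNCLOG` option name is an assumption drawn from the synclog documentation, so treat it as illustrative:

```bash
# Assumed CMake switch that compiles kernels with synchronization logging,
# which synclog then dumps from within the kernel to a file.
cmake .. -DCUTLASS_ENABLE_SYNCLOG=ON
```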