Releases · NVIDIA/cutlass
CUTLASS 4.2.1
CuTe DSL
- Bug fixes and improvements
 - Fixed an issue when running DSL code with cuda-python 13.0
 - Fixed an issue when running Inductor with DSL code
 - Fixed an issue with unexpected logging when running DSL code in FlashInfer
 - Fixed the issue reported in #2647
 - Fixed an issue with conditionally defining variables outside of dynamic control flow
 
 
CUTLASS C++
- Bypass EVT for no-smem blockwise kernels on Blackwell.
 - Rename cutlass/python/cutlass directory to cutlass/python/cutlass_cppgen.
 
CUTLASS 4.2.0
CuTe DSL
- More Python versions are now supported for both x86-64 and aarch64: Python 3.10, 3.11, 3.12, and 3.13
 
 - Added a new example and updated the getting-started notebook for CuTe DSL
 - Call kernels with DLPack bypassed
 - Updates to the TensorSSA demonstration
 - Added a section introducing broadcasting
 
 
 - API updates
- Please refer to DSL API changelog for details
 
- Bug fixes and improvements
 - Fixed `cute.print_tensor` for coordinate tensors
 - Fixed `cute.print` for tuples of layouts
 - Fixed frozen objects not being properly updated after full assignment in dynamic control flow
 - Fixed a compilation failure when assigning a tuple/list element in dynamic control flow
 - Improved error message when the CUDA context is not initialized
 - Improved docstrings of `congruent` and `weakly_congruent`
 
 
CUTLASS C++
- Support for Blackwell SM103 kernels for B300 GPUs.
 - Collective mainloop code: Blockscaled datatypes with support for the dense GEMM mainloop.
 - New GEMM and epilogue dispatch policies for collectives, kernel layers, and builders.
 - Kernel code: Blockscaled datatypes with support for the dense GEMM kernel.
 
 - Set of examples that demonstrate the usage of the 3.x API for targeting the Blackwell SM103 architecture.
 - Set of unit tests that demonstrate the usage of Blackwell SM103 blockscaled GEMM.
 - Unit test files prefixed with `sm103_` under the GEMM device unit tests.
 - Support for Blackwell SM121 kernels for DGX Spark GPUs.
 - Shares most of its code with the Blackwell SM120 kernels.
 
 - Add support for heuristics-based kernel filtering and autotuning using `nvidia-matmul-heuristics` to find the best kernels for a given scenario.
 - For details, please refer to the heuristics doc.
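A minimal install sketch for the heuristics package named above, assuming it is published on PyPI under the same name:

```bash
# Assumed PyPI package name, matching the tool named in this item.
pip install nvidia-matmul-heuristics
```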
 
 - Further enhance Blackwell SM100 Attention kernels in example 77.
 - Add fused reduction kernel support for CUTLASS MLA.
 - Add softmax skip correction.
 - Support for GQA in the FMHA backward kernel.
 - Fix an issue where `get_unmasked_trip_count` may return a negative value.
 - Fix an issue where mbarriers are initialized with a zero arrival count.
 - Fix a corner case where the sequence length of Q is not a multiple of `tile_q`.
 - Remove TMA padding for forward kernel inputs.
 
 - Add Blackwell SM100 kernels for MoEs (focusing on low-latency inference performance): example 92. It uses TMA (for weights) and CPASYNC (for tokens) to load input matrices and allows only one problem dimension to vary across groups/experts, unlike general grouped GEMMs. Note: further API simplifications and kernel improvements are upcoming. Any feedback on the API is welcome.
 - Further enhance blockwise and groupwise GEMMs on Hopper and Blackwell
 - On Blackwell SM120, a blockwise GEMM kernel is added: example 87.
 - On Hopper, add K-major scale factor support for SM90 blockwise kernels.
 - On Hopper, relax the restriction that the K dimension of the problem size has to be a multiple of the K dimension of the tile size.
 - On Hopper, the grouped version supports the case when k = 0.
 
 - Support for Blackwell SM100 FP4 GEMV kernels.
 - Kernel code: GEMV kernel.
 - Example code: example 91.
 
 - Support for Blackwell SM100 legacy mixed input GEMM kernels.
 - Collective mainloop code: Mixed input mainloop.
 - Kernel code: Mixed input kernel.
 - Example code: example 86.
 
 - Support for the Blackwell SM100 cpasync kernel.
 - Collective mainloop code: cpasync mainloop.
 - Kernel code: cpasync kernel.
 
 - Support Blackwell SM120 mixed input blockscaled grouped GEMM.
 - Instantiating more Blackwell kernels in the profiler.
 - Blackwell SM100 and SM103 kernels support `CUTLASS_LIBRARY_INSTANTIATION_LEVEL` to instantiate all possible combinations.
 - To use this feature, `CUTLASS_LIBRARY_KERNELS` must be non-empty; the profiler combines `CUTLASS_LIBRARY_KERNELS` and `CUTLASS_LIBRARY_INSTANTIATION_LEVEL` to instantiate specific kernels.
 - For details, please check the profiler doc.
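A minimal configure sketch for the instantiation-level feature above, assuming a CMake build of the profiler; the architecture, kernel filter, and level values shown are illustrative only:

```bash
# A non-empty kernel filter is required; the profiler build then combines the
# filter with the instantiation level to decide which kernels to instantiate.
cmake .. \
  -DCUTLASS_NVCC_ARCHS=100a \
  -DCUTLASS_LIBRARY_KERNELS=gemm \
  -DCUTLASS_LIBRARY_INSTANTIATION_LEVEL=max
```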
 
 - Fix some profiler issues:
 - Change default cluster fallback values to be non-zero to avoid profiler failures when these values are not set on the command line.
 - Fix some no-output and timeout issues.
 - Fix Pingpong Blockwise Hopper library generation.
 
 - Starting with CUDA 13.0, the Blackwell SM101 architecture for Thor GPUs is renamed to SM110.
- For CUDA toolkit version < 13.0, SM101 is still used for Thor GPUs.
 - For CUDA toolkit version >= 13.0, SM110 is used for Thor GPUs and SM101 is no longer valid.
 
 - Rename the legacy Python API package from `cutlass` to `cutlass_cppgen` and add Blackwell EVT support to the legacy Python interface.
 - Restructure the C++ Blackwell SM100 Collective Epilogue Builder to work with the Python interface's `EpilogueDescriptors`.
 - Added a Blackwell SM100 EVT emitter on the Python side and routed most emission through the Hopper SM90 emitter.
 - Added some support for running SM100 kernels via the Python interface.
 
 - CuTe changes:
 - Fix inaccurate GridDim calculation in the CuTe tutorial.
 - Add `movmatrix` support.
 - Fix the smallest MMA-N allowed for Blackwell FP8 and FP16 GEMM kernels.
 - Support FP16 accumulator for SM89 FP8 MMA.
 - Shorten the `nullspace` implementation.
 - Isolate and comment on `cosize` hacks.
 - Important documentation correction: `E<0,1> == 1@0@1`.
 - Fix some kernel issues:
 - Fix the Hopper SM90 grouped GEMM kernel to only use commit group and wait group instead of also waiting on mbarriers.
 - Fix a tiny bug when K is large in the Blackwell SM103 FP4 grouped GEMM kernel.
 
 - Add the following unit tests.
 - Various improvements and fixes from the community and CUTLASS team. Thanks to everyone who submitted PRs!
 - Optimal code generation with CUDA toolkit version 13.0U1.
 
CUTLASS 4.1.0
CuTe DSL
- Add aarch64 support; you can now `pip install nvidia-cutlass-dsl` on GB200 systems (see the install command after this list)!
 - More examples demonstrating how to use CuTe DSL to write peak-performance kernels
 - API updates
- Please refer to FUNCTIONALITY.md for details
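Concretely, installing the DSL on a GB200 (aarch64) system is now the same one-liner as on x86-64:

```bash
# The aarch64 wheel is selected automatically on GB200 systems.
pip install nvidia-cutlass-dsl
```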
 
 
CUTLASS C++
- Further enhance Blackwell SM100 Attention kernels in example 77.
- Add variable sequence length support for FMHA Backward kernel.
 - Add varlen test support to Backward runner.
 - Support empty batch sequences.
 
 - Replace `subbyte_iterator` with `cute::recast_ptr` when constructing logical iterators/arrays.
 - CuTe changes:
 - Rewrite ArithTuple and ScaledBasis for robustness and clarity.
 - Remove buggy and kludgy `get_layoutA|B|C_MN` and friends from Atoms/TiledX.
 - Factor out `print_latex` and friends and rewrite.
 - Factor out `print_svg` and friends and rewrite.
 - Support Blackwell SM100 SIMT packed fp32x2 kernels.
 - Support residual add for implicit GEMM kernels.
 - Various fixes for CUTLASS C++ Python interface's EVT tracer:
 - Add a verifier for SM90 to report invalid input.
 - When adding an edge to the graph, if the edge already exists, add an identity compute node to avoid having multiple parallel edges.
 - Register the tanh, sigmoid, exp, and gelu operations with the Python AST frontend.
 - Replace the NotImplementedError by packing all nodes into a single topological visitor node as a fallback.
 
 - Fix profiler bugs in exhaustive perf search.
- Fix incorrect cluster shape output issue when doing exhaustive search.
 - Fix a bug in profiler grouped GEMM for setting tile scheduler swizzles, cluster shapes, and raster orders.
 
 - Fix some profiler issues:
 - Complete the reference for Blackwell blockwise GEMM kernels.
 - Fix incorrect regex logic for the L1 test.
 
 
CUTLASS 4.0.0
CuTe DSL
CuTe DSL is a Python DSL centered around CuTe's abstractions:
- Enables authoring kernels in Python to reach peak performance on NVIDIA GPUs
 - Core DSL implementation files
 - DSL quick start
 - DSL Overview
 - Educational notebooks for getting started with CuTe DSL
 
CUTLASS C++
- Support for Family-Specific Architecture Features, introduced in CUDA 12.9
 - Further improved Blockwise and Groupwise GEMMs on Hopper and Blackwell
 - Enhance Blackwell SM100 Attention kernels in example 77
 - Add Blackwell SM100 implicit GEMM conv fprop/dgrad/wgrad unit tests
 - New Hopper SM90 FMHA example, similar in design to the existing Blackwell FMHA
 - CuTe enhancements: CuTe C++ reduce op
 - Other functional and performance enhancements
 
CUTLASS 3.9.2
CUTLASS 3.9.1
CUTLASS 3.9.0
- Support for Blackwell SM120 kernels for GeForce GPUs in CUTLASS 3.x API:
- Collective mainloops.
 - New GEMM and epilogue dispatch policies for collectives, kernel layers, and builders.
 - Blackwell SM120 epilogue and full set of EVT fusions.
 
 - Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM120 architecture:
- Blockscaled GEMM with NVFP4 input datatype and BF16 output tensor.
 - Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor with scale factor generation.
 - Blockscaled GEMM with mixed input datatype (MXFP8 and MXFP6) and BF16 output tensor.
 - Grouped GEMM with NVFP4 datatype.
 - Sparse Blockscaled GEMM with MXFP8 input datatype and BF16 output tensor.
 - Sparse Blockscaled GEMM with NVFP4 input datatype and NVFP4 output tensor.
 
 - Set of unit tests that demonstrate the usage of both sparse and dense Blackwell SM120 blockscaled GEMM.
 - Support for Blackwell SM100 Sparse kernels:
- Collective mainloops.
 
 - Set of examples that demonstrate the usage of the 3.x API for targeting Blackwell SM100 sparse GEMM.
 - Set of unit tests that demonstrate the usage of sparse and blockscaled sparse Blackwell SM100 GEMM.
 - A new Multi-head Latent Attention (MLA) kernel for the SM100 Blackwell architecture; the CUTLASS example covers the FlashMLA-like weight-absorbed decoding use case.
 - A new FMHA backward kernel for the SM100 Blackwell architecture extends the CUTLASS example to show how the five backward-pass MMAs can be fused into a single kernel to achieve high performance.
 - A new distributed GEMM example for SM100 Blackwell architecture.
 - Enhancement and new support of block-wise and group-wise GEMM for Hopper and Blackwell architectures:
- Enhancement of blockwise GEMM for Hopper architecture.
 - Enhancement of groupwise GEMM for Hopper architecture.
 - Support for grouped GEMM with blockwise and groupwise scaling for Hopper architecture.
 - Support for groupwise GEMM in the CUTLASS profiler.
 - Support for blockwise GEMM for Blackwell architecture.
 - Support for groupwise GEMM for Blackwell architecture.
 - Support for grouped GEMM with blockwise and groupwise scaling for Blackwell architecture.
 
 - Added support for enhanced kernel performance search (auto-tuning) in CUTLASS profiler:
- Sorting performance results by GFLOPs/second: Users can now sort the final performance report based on GFLOPs/second, making it easier to identify the most efficient kernels.
 - Exhaustive search for best kernel performance in GFLOPs/second: The profiler now searches for the best-performing kernel across a range of problem sizes, swizzle sizes, rasterization orders, and dynamic cluster configurations to maximize performance.
 - Performance search under a fixed GEMM shape: Enables exhaustive tuning within a fixed GEMM shape, exploring various kernel parameters to find the best configuration.
 - More detailed introductions and examples for leveraging this feature can be found in profiler.md.
 
 - Support `void` as the D element in SM100 kernel epilogues.
CUTLASS 3.8.0
CUTLASS 3.8 is the first release that supports the NVIDIA Blackwell SM100 architecture.
For a background on Blackwell's new features, please consult the PTX documentation for CUDA 12.8.
- Support for new CuTe building blocks specifically for Blackwell SM100 architecture:
- 5th generation Blackwell Tensor Core instructions (TCGen05) via CuTe MMA atoms.
 - Extensions to Tensor Memory Accelerator via CuTe Copy atoms.
 - Exposure of Blackwell's new tensor memory (note: distinct from TMA) as `tmem` across CuTe as a first-class data locale.
 - Exposure of `tmem -> rmem`, `rmem -> tmem`, and `smem -> tmem` data movement instructions as copy atoms in CuTe.
 - `make_tmem_copy()` utility method to ease creation of tiled copies for tmem copy atoms.
 - Support for new variants of LDSM on Blackwell via CuTe Copy atoms.
 
 - Support for new CUTLASS building blocks specifically for Blackwell SM100 architecture:
- Various narrow precision FP4, FP6, and FP8 formats as well as their block-scaled variants NVFP4, MXFP4, MXFP6, and MXFP8
 - Pipelines that implement Blackwell specific synchronization.
 - Cluster launch control API supporting preferred and fallback cluster shapes.
 - Data types including NVFP4, MXFP4, MXFP6, and MXFP8 and all their supported element and scale factor types.
 - Tile schedulers using Blackwell's Cluster Launch Control (CLC) feature to implement dynamic persistence scheduling for GEMMs, and stream-K.
 - Extensions to testbeds and reference check code for unit tests and CUTLASS profiler.
 
 - Full support for Blackwell SM100 kernels in CUTLASS 3.x API:
- Blackwell-specific kernel layers that:
- Implement a new warp-specialization recipe tuned specifically for Blackwell SM100 architecture.
 - Leverage all the new features such as CLC based tile scheduling, preferred cluster, and TMEM based double buffering of accumulators.
 - Support stream-K load balancing for all kernel types everywhere via composable scheduler support.
 
 - Blackwell collective mainloops that target the TCGen05 MMA instructions (both SS and TS) for
- Non-block scaled data types without support for pointer array and grouped GEMM with TMA
 - Non-block scaled data types with support for pointer array and grouped GEMM with TMA
 - Block scaled data types without support for pointer array and grouped GEMM with TMA
 - Block scaled data types with support for pointer array and grouped GEMM with TMA
 
 - Blackwell collective mainloop for convolution kernels supporting non-block scaled data types for fprop, dgrad, and wgrad.
 - New GEMM, convolution, and epilogue dispatch policies for collectives, kernel layers, and builders.
 - Blackwell epilogue that supports loading accumulators from `tmem` and a full set of EVT fusions.
 - CUTLASS library and profiler integration for block scaled data types for kernel emission, profiling, and verification.
- Support for preferred and fallback cluster shapes via profiler command-line argument parsing to set dynamic cluster shapes.
 - Support for dynamic data types via profiler command-line argument parsing to set the dynamic datatype setting in TCGen05 MMA instruction descriptors.
 - Support for mixed input GEMM kernels on Hopper in the profiler.
 
 - New CUTLASS profiler flag `use-cuda-graphs` to reduce overheads when benchmarking launch-bound kernels.
 - A new 3.x version of grouped GEMM in the CUTLASS library that generates kernels for Hopper and Blackwell. Grouped GEMM support is now enabled in the CUTLASS profiler (`./cutlass_profiler --operation=GroupedGemm --help` for details; see also the sketch after this list).
 - Set of examples that demonstrate the usage of the 3.x API for targeting the Blackwell SM100 architecture:
- Basic FP16 and FP8 GEMMs with minimal changes from Hopper examples, demonstrating ease of migration for off the shelf kernels using the 3.x collective builder API.
 - GEMM with opt-in collective builder schedules showcasing available recipes for Blackwell.
 - Block scaled data type GEMMs targeting Blackwell's native block scaled Tensor Cores.
 - GEMM example demonstrating Blackwell's new preferred cluster support via dynamic cluster shapes for increased occupancy.
 - GEMM with CLC based StreamK scheduler for load balancing.
 - Grouped GEMM for vanilla FP8 data inputs and NVFP4 block scaled inputs.
 - Convolution kernels for fprop, dgrad, and wgrad.
 - Fused multi-head attention fprop kernel supporting FP16/BF16/FP8 data types across head dims of 32, 64, and 128.
 - A new BF16x9 GEMM kernel that emulates FP32 GEMM (SGEMM) using BF16 operations.
 
 - Set of examples that demonstrate the usage of the 3.x API for targeting Hopper architecture:
- A set of new Hopper grouped GEMM kernels that support mixed A and B datatypes.
 - A new Hopper FP8 GEMM with groupwise scaling.
 
 - Documentation updates:
- Quickstart - instantiating a Blackwell block-scaled GEMM.
 - Detailed Blackwell block-scaled GEMM functionality documentation
 - New functionality documentation specifically for the 3.x API, comprehensively documenting all supported kernel types, data types, kernel features, and minimum CUDA toolkit support for 3.x-supported architectures.
 - Updates to compatibility section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures, and Target Architecture.
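Putting the two profiler items above together (grouped GEMM support and the `use-cuda-graphs` flag), a sketch of a profiling session; the exact flag spelling is an assumption, so verify it against the `--help` output:

```bash
# Discover the grouped-GEMM options (command quoted in the notes above).
./cutlass_profiler --operation=GroupedGemm --help

# Hypothetical run with CUDA graphs enabled to reduce launch overheads when
# benchmarking launch-bound kernels; flag syntax is illustrative.
./cutlass_profiler --operation=GroupedGemm --use-cuda-graphs=true
```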
 
 
Note: CUTLASS 3.x builds are known to be failing on Windows platforms for all CUDA toolkits.
The CUTLASS team is working on a fix.
CUTLASS 3.7.0
- A new Hopper blockwise scaling FP8 GEMM where the operands and block scaling tensor are staged via shared memory.
 - Distributed GEMM is an experimental pipelined Tensor Parallelism implementation that utilizes existing CUTLASS kernels and CUDA runtime features, and can hide most of the communication behind computation.
 - Improved persistent grid launch for Hopper kernels with large cluster sizes (>= 4) using the new `make_kernel_hardware_info` API, as shown in example 48.
 - Enabled high-precision accumulation for Hopper FP8 sparse GEMM.
 
CUTLASS 3.6.0
- Hopper structured sparse GEMM.
 - A refactor of the CUTLASS 3.x convolution `kernel::ConvUniversal` API to bring it in line with `gemm::GemmUniversal`. The 3.x convolution API is no longer considered a beta API.
 - An improved mixed input GEMM and a lookup table implementation for the `INT4 x FP8` scale-only mode.
 - EVT nodes for Top-K selection and softmax, and a GEMM example using them.
 - Programmatic Dependent Launch (PDL), which leverages a new Hopper feature to speed up two back-to-back kernels, and its corresponding documentation.
 - A new debugging tool, synclog, for dumping out all synchronization events from within a kernel to a file. Please see the synclog documentation for details, and the enablement sketch at the end of these notes.
 - A new TMA-enabled epilogue for grouped GEMM that brings significant performance improvement, as well as its EVT support.
 - A SIMT-enabled pointer-array epilogue.
 - A new Ping-Pong kernel schedule for Grouped GEMM and some other optimizations.
 - A new instantiation strategy for CUTLASS profiler kernels along with improved documentation for instantiation level in CUTLASS profiler.
 - New hardware support for comparisons and computations of `cutlass::bfloat16_t`.
 - Fixed the use of `isnan` on Windows for `half_t`.
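For the synclog tool mentioned above, a build-time enablement sketch; the `CUTLASS_ENABLE_SYNCLOG` option name is an assumption drawn from the synclog documentation, so treat it as illustrative:

```bash
# Assumed CMake switch that compiles kernels with synchronization logging,
# which synclog then dumps from within the kernel to a file.
cmake .. -DCUTLASS_ENABLE_SYNCLOG=ON
```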