Releases · NVIDIA/cutlass
CUTLASS 3.5.1
- Minimal SM90 WGMMA + TMA GEMM example in 100 lines of code.
 - Exposure of L2 cache_hints in TMA copy atoms.
 - Exposure of raster order and tile swizzle extent in the CUTLASS library profiler and example 48.
 - TMA store based and EVT supported epilogues for Hopper pointer array batched kernels.
 - A new GemmSparseUniversal API for CUTLASS 2.x Ampere kernels to enable serial and parallel split-k for sparse tensor cores, and new tiny tile sizes to better support LLM inference.
 - CUDA host adapter extensions to support TMA descriptor construction driver APIs (a driver-level sketch follows this list).
 - Inclusion of more Hopper fprop, dgrad, and wgrad convolution kernels in CUTLASS library and profiler.
 - Support for residual add (beta != 0) in convolution kernels.
 - A new convolution epilogue for CUTLASS 2.x to support non-packed NHWC output.
 - A refactor of include files throughout CUTLASS core directories to reduce circular dependencies and tests to guard against them.
 - A guide for setting up VSCode to work well with CUTLASS and expanded code style guide.
 - Better support for MSVC as a host compiler.
 - Many performance optimizations, improvements, and bug fixes including fixes for FlashAttention-2.
 - Optimal code generation with CUDA toolkit versions 12.4 and 12.5u1.
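
The host adapter item above refers to the CUDA driver call that encodes TMA descriptors. Below is a minimal hedged sketch of that call, cuTensorMapEncodeTiled, for a 2D row-major fp32 tensor; the extents, tile box, and the helper name make_tma_desc_2d are illustrative, not CUTLASS API.

```cpp
// Hedged sketch: the CUDA driver call that encodes a TMA descriptor, which the
// new host adapter extensions wrap. Extents, tile box, and the helper name
// make_tma_desc_2d are illustrative only. Error checking omitted for brevity.
#include <cuda.h>

CUtensorMap make_tma_desc_2d(void* gmem_ptr) {
  CUtensorMap tma_desc;
  cuuint64_t global_dim[2]     = {1024, 1024};            // extents in elements
  cuuint64_t global_strides[1] = {1024 * sizeof(float)};  // byte stride of dim 1
  cuuint32_t box_dim[2]        = {64, 64};                // tile per TMA operation
  cuuint32_t elem_strides[2]   = {1, 1};                  // dense within the box
  cuTensorMapEncodeTiled(
      &tma_desc, CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
      /*tensorRank=*/2, gmem_ptr, global_dim, global_strides,
      box_dim, elem_strides,
      CU_TENSOR_MAP_INTERLEAVE_NONE, CU_TENSOR_MAP_SWIZZLE_NONE,
      CU_TENSOR_MAP_L2_PROMOTION_NONE, CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
  return tma_desc;
}
```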
 - NOTICE:
   - The upcoming CUTLASS 3.6 release will include a breaking refactor to the CUTLASS 3.x convolution kernel::ConvUniversal API to bring it in line with gemm::GemmUniversal. After this, the 3.x convolution API will no longer be considered a beta API.
   - The upcoming CUTLASS 3.6 release will include a breaking refactor to the Hopper TMA pointer array batched epilogue in order to support grouped GEMMs.

CUTLASS 3.5.0
- Implicit GEMM Convolutions targeting Hopper SM90A via WGMMA + TMA im2col.
- Native implementation in CUTLASS 3.x using CuTe, mirroring the same design hierarchy as that of GEMMs.
 - Support for 1D, 2D, and 3D convolutions in a rank-agnostic fashion.
 - Support for Fprop, Dgrad, and Wgrad algorithms.
 - CUTLASS profiler support for 2D and 3D convolutions implemented via the 3.x API.
 - NOTE: this is a beta release. Further updates to CUTLASS will include major performance improvements, feature enablement, and possible breaking changes to the API until the 3.7 release. Your feedback is welcome on the design!
 
 - Support for Ada (SM89) FP8 tensor cores via the 2.x API. Requires CUDA 12.4 or newer.
 - Ampere gather/scatter convolution example in CuTe and CUTLASS 3.x.
- Showcasing how custom kernels can be written and optimized using CUTLASS 3.x and CuTe and the general strategy for implementing convolutions as specializations of GETTs.
 - Implementation of a coarse grained sparse gather/scatter kernel achieving peak performance on Ampere class tensor cores.
 
 - 32x and 16x tile sizes are added to CUTLASS 2.x to improve the performance of narrow-tall and wide-short matrices.
 - Updates to CuTe documentation for cute::Tensor<>, MMA atoms, and an overhauled CuTe GEMM tutorial series (see the cute::Tensor sketch after this list).
 - Extensions to CuTe to support L2 prefetching and TMA store+reductions.
 - Remove C++11 requirement on a few CUTLASS 2.x API header files. All CUTLASS files now require C++17.
 - Fixes to greatly reduce build warnings.
 - Updates and bugfixes from the community (thanks!)
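
As a companion to the cute::Tensor documentation item above, here is a minimal hedged sketch of creating and indexing a CuTe tensor on the host. It assumes a CUTLASS 3.x checkout on the include path; make_shape, make_stride, make_layout, and make_tensor are the core CuTe vocabulary functions.

```cpp
// Hedged sketch: a cute::Tensor composes an engine (here, a raw pointer) with
// a cute::Layout that maps logical coordinates to linear offsets.
#include <cute/tensor.hpp>
#include <cassert>

int main() {
  using namespace cute;
  float data[32] = {};
  // A 4x8 row-major layout: shape (4,8) with strides (8,1).
  auto layout = make_layout(make_shape(4, 8), make_stride(8, 1));
  auto tensor = make_tensor(&data[0], layout);
  tensor(1, 2) = 3.0f;            // coordinate (1,2) -> offset 1*8 + 2*1 = 10
  assert(data[10] == 3.0f);
  return 0;
}
```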
 
CUTLASS 3.4.1
- Statically available CUTLASS version macros that allow users to handle API changes between CUTLASS releases (a usage sketch follows this list).
 - Improvements for Hopper Group-GEMMs and Pointer-Array Batched GEMMs.
 - Updates and bugfixes from the community (thanks!).
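
A hedged sketch of using such macros for compile-time dispatch; the macro names follow cutlass/version.h, but verify them in your checkout.

```cpp
// Hedged sketch: guarding code against API changes between CUTLASS releases.
// Macro names follow cutlass/version.h; verify them in your checkout.
#include "cutlass/version.h"

#if (CUTLASS_MAJOR > 3) || (CUTLASS_MAJOR == 3 && CUTLASS_MINOR >= 4)
  // Code path for the 3.4+ API.
#else
  // Fallback for older releases.
#endif
```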
 
CUTLASS 3.4.0
- Improved Mixed-input Hopper GEMMs supporting {16-bit, 8-bit} x {8-bit, 4-bit} input types with fast numerical converters and group scaling factors tuned for optimal performance on Hopper H100.
 - Beta release of Pointer-Array Batched GEMMs utilizing TMA and Hopper H100 tensor cores now available. (Requires CUDA 12.3 or above)
 - Beta release of Group-GEMM, commonly used in the optimization of Mixture-of-Experts models, now available on Hopper GPUs, taking advantage of TMA and Hopper H100 tensor cores. (Requires CUDA 12.3 or above.)
 - Ampere Sparse GEMM now supports the Epilogue Visitor Tree (EVT).
 - Improvements to NamedBarriers, including details of the ReservedNamedBarriers used within the CUTLASS library.
 - Improved CuTe documentation including improved clarity and depth of Quickstart, CuTe Layout, and CuTe Layout Algebra. Associated code comments, post-conditions, and details in CuTe Core Unit Tests also improved.
 
CUTLASS 3.3.0
- New Mixed-input Hopper GEMMs support covering 16-bit x 8-bit input types with optimal performance.
 - New Mixed-input Ampere GEMMs with support for canonical layouts (TN). The implementation supports upcast on operand B ({fp16, bf16} x {s8, u8}) and upcast on operand A ({s8, u8} x {fp16, bf16}). They also include fast numeric conversion recipes and warp-level shuffles to achieve optimal performance (a converter sketch follows this list).
 - New Copy-Async based Hopper GEMMs, which support input tensors with less than 16B alignment (across s8/fp8/fp16/bf16/tf32 types) with optimal performance. As part of this, new kernel schedules and the Copy Ops SM80_CP_ASYNC_CACHE_* were also added.
 - EVT Support for RELU with Aux bitmap tensor store (used in dRELU). See SM90 EVT fusions for details.
 - Various subbyte enhancements like tagged device ptrs, support for vectorized copy, various operators to treat subbyte iterators as pointers, and full-fledged CuTe Tensor support.
 - Support for Clang as a host compiler.
 - Support for void-C kernels and SM80 mixed-input GEMMs in the CUTLASS Python interface.
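
The mixed-input items above lean on fast vectorized numeric conversion. A hedged sketch of the long-standing cutlass::NumericArrayConverter utility doing an s8 to fp16 upcast of the kind those mainloops perform internally; the packet width 8 is illustrative.

```cpp
// Hedged sketch: vectorized s8 -> fp16 upcast of the kind mixed-input
// mainloops apply to operand fragments. The packet width 8 is illustrative.
#include "cutlass/array.h"
#include "cutlass/numeric_types.h"
#include "cutlass/numeric_conversion.h"

cutlass::Array<cutlass::half_t, 8> upcast(cutlass::Array<int8_t, 8> const& src) {
  cutlass::NumericArrayConverter<cutlass::half_t, int8_t, 8> converter;
  return converter(src);
}
```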
 
CUTLASS 3.2.2
Bug fix for an illegal memory access issue hit by Flash Attention tests in PyTorch. See #1138 for details.
CUTLASS 3.2.1
- Python support for the SM90 Epilogue Visitor Tree (EVT), on top of the C++ support released in 3.2.0.
 - SM80 EVT support in C++ and Python.
 - Other SM90 epilogue improvements.
 - Splitting CUTLASS library into smaller units based on operation, arch and datatypes. See #1105 for details.
 - Making tools/library/scripts packageable - tools/library/scripts is now moving to python/cutlass_library. See the Python README for details.
 - SM90 TF32 kernel improvements for all layouts.
 - SM90 rasterization direction support in the CUTLASS profiler.
 - Improvement for CUTLASS profiler build times.
 - Remove Python-C++ bindings.
 
CUTLASS 3.2
- New warp-specialized persistent FP8 GEMM kernel schedules and mainloops targeting the Hopper architecture that achieve great performance with TMA, WGMMA, and threadblock clusters. An example showcasing Hopper warp-specialized FP8 GEMMs is included.
 - New Epilogue Visitor Tree (EVT) support for Hopper TMA epilogues. EVTs allow user-defined, customized epilogue fusion patterns without having to write a new epilogue (a small composition sketch follows this list).
 - Stream-K feature for Hopper. Note that this is only a functional implementation of stream-K, and should not be used for performance comparison. Optimizations are expected in a future release.
 - Improved CTA rasterization and support for CTA swizzling for Hopper kernels using the Tile Scheduler.
 - Improved performance for warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
 - Hopper GEMM+Permute, an example of fusing tensor reordering (permutation) with GEMM mainloop or epilogue.
 - New CUTLASS 2D Convolution Python interface, with a new example.
 - Support for Windows (MSVC) builds.
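
A hedged sketch of what composing an EVT looks like, using node types from the cutlass::epilogue::fusion namespace (Sm90EVT, Sm90Compute, Sm90AccFetch, Sm90ScalarBroadcast); header paths and template signatures have shifted across 3.x releases, so treat this as illustrative of the tree-of-visitors idea rather than a pinned API.

```cpp
// Hedged sketch: an EVT is a compile-time tree whose leaves fetch operands
// (accumulator, scalar/row/column broadcasts) and whose interior nodes
// compute. This tree expresses D = alpha * acc. Node names are from the
// cutlass::epilogue::fusion namespace; verify headers and signatures against
// your checkout.
#include "cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp"

namespace fusion = cutlass::epilogue::fusion;

using Alpha = fusion::Sm90ScalarBroadcast<float>;    // leaf: runtime scalar alpha
using Acc   = fusion::Sm90AccFetch;                  // leaf: accumulator fragment
using Mul   = fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float,
                                  cutlass::FloatRoundStyle::round_to_nearest>;
using EVT   = fusion::Sm90EVT<Mul, Alpha, Acc>;      // root: alpha * acc
```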
 
CUTLASS 3.1
- New CUTLASS Python interface that aims to provide an ease-of-use interface for instantiating, emitting, compiling, and running CUTLASS kernels via Python, with new examples.
 - New efficient epilogues using TMA for Hopper.
 - Support for fused epilogues, such as Bias, ReLU, and GELU, using the new efficient epilogues.
 - New warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
 - New warp-specialized persistent cooperative kernel design that allows for larger tile sizes and improves performance on Hopper.
 - An example showcasing GEMM-Like Tensor-Tensor Contraction (GETT) capability on Hopper.
 - Epilogue builders. Similar to mainloop builders (see example 49), epilogue builders aim to generate the best-possible epilogue while exposing incremental opt-ins for greater customization (a builder sketch follows this list).
 - Profiler support for overriding kernel and epilogue builder auto schedules for 3.x API kernels, allowing specific policies to be run in the CUTLASS profiler.
 - Performance optimizations for the warp-specialized persistent ping-pong kernel.
 - Changes to the 3.x GEMM API, involving the host-facing arguments and the underlying Params structs.
 - FMHA Backward Pass from Meta xFormers.
 - Stream-K GEMM with Broadcast enables epilogue broadcast with Stream-K GEMM.
 - Batched B2B GEMM can now run multiple Back-to-Back GEMMs with the same problem size in parallel.
 - Batched Strided GEMV supports both row-major and column-major input matrices.
 - Permute + GEMM fusion can now fuse Permute with the following GEMM. Previously, only fusing GEMM with Permute in the epilogue was supported.
 - Row Broadcast can be fused in the epilogue.
 - The GitHub branch is renamed from master to main in this release.
 - Optimal performance using CUDA 12.1.
 - Updates and bugfixes from the community (thanks!)
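
A hedged sketch of instantiating an epilogue through the builder mentioned above; the template parameter list follows the 3.x cutlass::epilogue::collective::CollectiveBuilder but has evolved across releases, so treat the types and ordering as illustrative.

```cpp
// Hedged sketch: requesting an epilogue from the builder with Auto opt-ins,
// letting CUTLASS pick the schedule and epilogue tile. Parameter ordering is
// illustrative of the 3.x API and may differ in your release.
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/numeric_types.h"
#include "cutlass/layout/matrix.h"

using TileShape    = cute::Shape<cute::_128, cute::_128, cute::_64>;
using ClusterShape = cute::Shape<cute::_1, cute::_2, cute::_1>;

using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    TileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    float, float,                                       // accumulator, compute
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,   // C: type, layout, alignment
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,   // D: type, layout, alignment
    cutlass::epilogue::collective::EpilogueScheduleAuto
  >::CollectiveOp;
```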
 
CUTLASS 3.0
3.0.0 (2023-01-23)
- CuTe, a new core library and backend for CUTLASS 3.0 that defines a single Layout vocabulary type and an associated algebra of layouts for a much more expressive and composable abstraction for tensors, sets of parallel agents, and operations by said agents on tensors.
 - A new conceptual operation hierarchy that replaces the architecture-centric hierarchy of CUTLASS 2.x and documentation for CUTLASS 3.0's GEMM API changes.
 - Strict API backwards compatibility that exposes both 2.x and 3.x API kernels through the same device::GemmUniversalAdapter and kernel::GemmUniversal types, allowing users to include both APIs in the same translation units. More information can be found in the 3.x backwards compatibility section.
 - Updates to the Functionality documentation, which directs users on which kernels are supported via CUTLASS-2 and CUTLASS-3.
 - Updates to the Compatibility section regarding supported compilers, operating systems, CUDA Toolkits, hardware architectures, and target architectures.
 - New warp-specialized GEMM kernel schedules and mainloops targeting Hopper architecture that achieve great performance with TMA, WGMMA, and threadblock clusters.
 - Extensions to CUTLASS profiler to support threadblock cluster shapes in library and profiler tile configurations.
 - CUTLASS library integration for 3.x API kernels built through the new CollectiveBuilder API, enabling the CUTLASS profiler (a builder sketch follows this list).
 - Support for Hopper GEMMs through the new 3.0 API with CuTe-based exposure of the Hopper Tensor Memory Accelerator and WGMMA Tensor Core features.
 - Set of examples that demonstrate the usage of the new 3.0 API to easily build GEMM kernels targeting Hopper: examples 48, 49, and 50.
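
A hedged sketch of assembling a Hopper GEMM from builder-produced collectives and wrapping it in the shared device::GemmUniversalAdapter. The mainloop builder's parameter list follows the 3.x examples (e.g. example 49) but may differ in your release; the epilogue reuses the builder sketched under 3.1, which was itself introduced after 3.0.

```cpp
// Hedged sketch: building a Hopper mainloop collective with CollectiveBuilder
// and composing it into a kernel that device::GemmUniversalAdapter can run.
// Parameter ordering follows the 3.x examples; verify against your release.
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"

using TileShape    = cute::Shape<cute::_128, cute::_128, cute::_64>;
using ClusterShape = cute::Shape<cute::_1, cute::_1, cute::_1>;

using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    cutlass::half_t, cutlass::layout::RowMajor,    8,   // A: type, layout, alignment
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,   // B: type, layout, alignment
    float,                                              // accumulator
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAuto,
    cutlass::gemm::collective::KernelScheduleAuto
  >::CollectiveOp;

using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    TileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    float, float,
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,
    cutlass::epilogue::collective::EpilogueScheduleAuto
  >::CollectiveOp;

using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    cute::Shape<int, int, int, int>,                    // problem shape (M, N, K, L)
    CollectiveMainloop, CollectiveEpilogue>;

using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
```

The resulting Gemm type is then launched through the adapter's usual Arguments/run() flow, mirroring the 2.x device-level API that the same adapter also accepts.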