Releases · NVIDIA/cutlass
CUTLASS 3.5.1
- Minimal SM90 WGMMA + TMA GEMM example in 100 lines of code.
 - Exposure of L2 cache_hints in TMA copy atoms.
 - Exposure of raster order and tile swizzle extent in the CUTLASS library profiler and example 48.
 - TMA store based and EVT supported epilogues for Hopper pointer array batched kernels.
 - A new GemmSparseUniversal API for CUTLASS 2.x Ampere kernels to enable serial and parallel split-k for sparse tensor cores, and new tiny tile sizes to better support LLM inference.
 - CUDA host adapter extensions to support TMA descriptor construction driver APIs (a driver-level sketch follows this list).
 - Inclusion of more Hopper fprop, dgrad, and wgrad convolution kernels in CUTLASS library and profiler.
 - Support for residual add (beta != 0) in convolution kernels.
 - A new convolution epilogue for CUTLASS 2.x to support non-packed NHWC output.
 - A refactor of include files throughout CUTLASS core directories to reduce circular dependencies and tests to guard against them.
 - A guide for setting up VSCode to work well with CUTLASS and expanded code style guide.
 - Better support for MSVC as a host compiler.
 - Many performance optimizations, improvements, and bug fixes including fixes for FlashAttention-2.
 - Optimal code generation with CUDA toolkit versions 12.4 and 12.5u1.
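
The host adapter item above refers to the CUDA driver call that encodes TMA descriptors. Below is a minimal hedged sketch of that call, cuTensorMapEncodeTiled, for a 2D row-major fp32 tensor; the extents, tile box, and the helper name make_tma_desc_2d are illustrative, not CUTLASS API.

```cpp
// Hedged sketch: the CUDA driver call that encodes a TMA descriptor, which the
// new host adapter extensions wrap. Extents, tile box, and the helper name
// make_tma_desc_2d are illustrative only. Error checking omitted for brevity.
#include <cuda.h>

CUtensorMap make_tma_desc_2d(void* gmem_ptr) {
  CUtensorMap tma_desc;
  cuuint64_t global_dim[2]     = {1024, 1024};            // extents in elements
  cuuint64_t global_strides[1] = {1024 * sizeof(float)};  // byte stride of dim 1
  cuuint32_t box_dim[2]        = {64, 64};                // tile per TMA operation
  cuuint32_t elem_strides[2]   = {1, 1};                  // dense within the box
  cuTensorMapEncodeTiled(
      &tma_desc, CU_TENSOR_MAP_DATA_TYPE_FLOAT32,
      /*tensorRank=*/2, gmem_ptr, global_dim, global_strides,
      box_dim, elem_strides,
      CU_TENSOR_MAP_INTERLEAVE_NONE, CU_TENSOR_MAP_SWIZZLE_NONE,
      CU_TENSOR_MAP_L2_PROMOTION_NONE, CU_TENSOR_MAP_FLOAT_OOB_FILL_NONE);
  return tma_desc;
}
```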
 - NOTICE:
   - The upcoming CUTLASS 3.6 release will include a breaking refactor to the CUTLASS 3.x convolution kernel::ConvUniversal API to bring it in line with gemm::GemmUniversal. After this, the 3.x convolution API will no longer be considered a beta API.
   - The upcoming CUTLASS 3.6 release will include a breaking refactor to the Hopper TMA pointer array batched epilogue in order to support grouped GEMMs.

CUTLASS 3.5.0
- Implicit GEMM Convolutions targeting Hopper SM90A via WGMMA + TMA im2col.
- Native implementation in CUTLASS 3.x using CuTe, mirroring the same design hierarchy as that of GEMMs.
 - Support for 1D, 2D, and 3D convolutions in a rank-agnostic fashion.
 - Support for Fprop, Dgrad, and Wgrad algorithms.
 - CUTLASS profiler support for 2D and 3D convolutions implemented via the 3.x API.
 - NOTE: this is a beta release. Further updates to CUTLASS will include major performance improvements, feature enablement, and possible breaking changes to the API until the 3.7 release. Your feedback is welcome on the design!
 
 - Support for Ada (SM89) FP8 tensor cores via the 2.x API. Requires CUDA 12.4 or newer.
 - Ampere gather/scatter convolution example in CuTe and CUTLASS 3.x.
- Showcasing how custom kernels can be written and optimized using CUTLASS 3.x and CuTe and the general strategy for implementing convolutions as specializations of GETTs.
 - Implementation of a coarse grained sparse gather/scatter kernel achieving peak performance on Ampere class tensor cores.
 
 - 32x and 16x tile sizes are added to CUTLASS 2.x to improve the performance of narrow-tall and wide-short matrices.
 - Updates to CuTe documentation for cute::Tensor<>, MMA atoms, and an overhauled CuTe GEMM tutorial series (see the cute::Tensor sketch after this list).
 - Extensions to CuTe to support L2 prefetching and TMA store+reductions.
 - Remove C++11 requirement on a few CUTLASS 2.x API header files. All CUTLASS files now require C++17.
 - Fixes to greatly reduce build warnings.
 - Updates and bugfixes from the community (thanks!)
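
As a companion to the cute::Tensor documentation item above, here is a minimal hedged sketch of creating and indexing a CuTe tensor on the host. It assumes a CUTLASS 3.x checkout on the include path; make_shape, make_stride, make_layout, and make_tensor are the core CuTe vocabulary functions.

```cpp
// Hedged sketch: a cute::Tensor composes an engine (here, a raw pointer) with
// a cute::Layout that maps logical coordinates to linear offsets.
#include <cute/tensor.hpp>
#include <cassert>

int main() {
  using namespace cute;
  float data[32] = {};
  // A 4x8 row-major layout: shape (4,8) with strides (8,1).
  auto layout = make_layout(make_shape(4, 8), make_stride(8, 1));
  auto tensor = make_tensor(&data[0], layout);
  tensor(1, 2) = 3.0f;            // coordinate (1,2) -> offset 1*8 + 2*1 = 10
  assert(data[10] == 3.0f);
  return 0;
}
```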
 
CUTLASS 3.4.1
- Statically available CUTLASS version macros that allow users to handle API changes between CUTLASS releases (a usage sketch follows this list).
 - Improvements for Hopper Group-GEMMs and Pointer-Array Batched GEMMs.
 - Updates and bugfixes from the community (thanks!).
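
A hedged sketch of using such macros for compile-time dispatch; the macro names follow cutlass/version.h, but verify them in your checkout.

```cpp
// Hedged sketch: guarding code against API changes between CUTLASS releases.
// Macro names follow cutlass/version.h; verify them in your checkout.
#include "cutlass/version.h"

#if (CUTLASS_MAJOR > 3) || (CUTLASS_MAJOR == 3 && CUTLASS_MINOR >= 4)
  // Code path for the 3.4+ API.
#else
  // Fallback for older releases.
#endif
```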
 
CUTLASS 3.4.0
- Improved Mixed-input Hopper GEMMs supporting {16-bit, 8-bit} x {8-bit, 4-bit} input types with fast numerical converters and group scaling factors tuned for optimal performance on Hopper H100.
 - Beta release of Pointer-Array Batched GEMMs utilizing TMA and Hopper H100 tensor cores now available. (Requires CUDA 12.3 or above)
 - Beta release of Group-GEMM, commonly used in the optimization of Mixture-of-Experts models, now available on Hopper GPUs, taking advantage of TMA and Hopper H100 tensor cores. (Requires CUDA 12.3 or above.)
 - Ampere Sparse GEMM now supports the Epilogue Visitor Tree (EVT).
 - Improvements to NamedBarriers, including details of the ReservedNamedBarriers used within the CUTLASS library.
 - Improved CuTe documentation including improved clarity and depth of Quickstart, CuTe Layout, and CuTe Layout Algebra. Associated code comments, post-conditions, and details in CuTe Core Unit Tests also improved.
 
CUTLASS 3.3.0
- New Mixed-input Hopper GEMMs support covering 16-bit x 8-bit input types with optimal performance.
 - New Mixed-input Ampere GEMMs with support for canonical layouts (TN). The implementation supports upcast on operand B ({fp16, bf16} x {s8, u8}) and upcast on operand A ({s8, u8} x {fp16, bf16}). They also include fast numeric conversion recipes and warp-level shuffles to achieve optimal performance (a converter sketch follows this list).
 - New Copy-Async based Hopper GEMMs, which support input tensors with less than 16B alignment (across s8/fp8/fp16/bf16/tf32 types) with optimal performance. As part of this, new kernel schedules and the Copy Ops SM80_CP_ASYNC_CACHE_* were also added.
 - EVT Support for RELU with Aux bitmap tensor store (used in dRELU). See SM90 EVT fusions for details.
 - Various subbyte enhancements like tagged device ptrs, support for vectorized copy, various operators to treat subbyte iterators as pointers, and full-fledged CuTe Tensor support.
 - Support for Clang as a host compiler.
 - Support for void-C kernels and SM80 mixed-input GEMMs in the CUTLASS Python interface.
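
The mixed-input items above lean on fast vectorized numeric conversion. A hedged sketch of the long-standing cutlass::NumericArrayConverter utility doing an s8 to fp16 upcast of the kind those mainloops perform internally; the packet width 8 is illustrative.

```cpp
// Hedged sketch: vectorized s8 -> fp16 upcast of the kind mixed-input
// mainloops apply to operand fragments. The packet width 8 is illustrative.
#include "cutlass/array.h"
#include "cutlass/numeric_types.h"
#include "cutlass/numeric_conversion.h"

cutlass::Array<cutlass::half_t, 8> upcast(cutlass::Array<int8_t, 8> const& src) {
  cutlass::NumericArrayConverter<cutlass::half_t, int8_t, 8> converter;
  return converter(src);
}
```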
 
CUTLASS 3.2.2
Bug fix for an illegal memory access issue hit by Flash Attention tests in PyTorch. See #1138 for details.
CUTLASS 3.2.1
- Python support for the SM90 Epilogue Visitor Tree (EVT), on top of the C++ support released in 3.2.0.
 - SM80 EVT support in C++ and Python.
 - Other SM90 epilogue improvements.
 - Splitting CUTLASS library into smaller units based on operation, arch and datatypes. See #1105 for details.
 - Making tools/library/scripts packageable - tools/library/scripts is now moving to python/cutlass_library. See the Python README for details.
 - SM90 TF32 kernel improvements for all layouts.
 - SM90 rasterization direction support in the CUTLASS profiler.
 - Improvement for CUTLASS profiler build times.
 - Remove Python-C++ bindings.
 
CUTLASS 3.2
- New warp-specialized persistent FP8 GEMM kernel schedules and mainloops targeting the Hopper architecture that achieve great performance with TMA, WGMMA, and threadblock clusters. An example showcasing Hopper warp-specialized FP8 GEMMs is included.
 - New Epilogue Visitor Tree (EVT) support for Hopper TMA epilogues. EVTs allow user-defined, customized epilogue fusion patterns without having to write a new epilogue (a small composition sketch follows this list).
 - Stream-K feature for Hopper. Note that this is only a functional implementation of stream-K, and should not be used for performance comparison. Optimizations are expected in a future release.
 - Improved CTA rasterization and support for CTA swizzling for Hopper kernels using the Tile Scheduler.
 - Improved performance for warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
 - Hopper GEMM+Permute, an example of fusing tensor reordering (permutation) with GEMM mainloop or epilogue.
 - New CUTLASS 2D Convolution Python interface, with a new example.
 - Support for Windows (MSVC) builds.
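
A hedged sketch of what composing an EVT looks like, using node types from the cutlass::epilogue::fusion namespace (Sm90EVT, Sm90Compute, Sm90AccFetch, Sm90ScalarBroadcast); header paths and template signatures have shifted across 3.x releases, so treat this as illustrative of the tree-of-visitors idea rather than a pinned API.

```cpp
// Hedged sketch: an EVT is a compile-time tree whose leaves fetch operands
// (accumulator, scalar/row/column broadcasts) and whose interior nodes
// compute. This tree expresses D = alpha * acc. Node names are from the
// cutlass::epilogue::fusion namespace; verify headers and signatures against
// your checkout.
#include "cutlass/epilogue/fusion/sm90_callbacks_tma_warpspecialized.hpp"

namespace fusion = cutlass::epilogue::fusion;

using Alpha = fusion::Sm90ScalarBroadcast<float>;    // leaf: runtime scalar alpha
using Acc   = fusion::Sm90AccFetch;                  // leaf: accumulator fragment
using Mul   = fusion::Sm90Compute<cutlass::multiplies, cutlass::half_t, float,
                                  cutlass::FloatRoundStyle::round_to_nearest>;
using EVT   = fusion::Sm90EVT<Mul, Alpha, Acc>;      // root: alpha * acc
```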
 
CUTLASS 3.1
- New CUTLASS Python interface that aims to provide an ease-of-use interface for instantiating, emitting, compiling, and running CUTLASS kernels via Python, with new examples.
 - New efficient epilogues using TMA for Hopper.
 - Support for fused epilogues, such as Bias, ReLU, and GELU, using the new efficient epilogues.
 - New warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
 - New warp-specialized persistent cooperative kernel design that allows for larger tile sizes and improves performance on Hopper.
 - An example showcasing GEMM-Like Tensor-Tensor Contraction (GETT) capability on Hopper.
 - Epilogue builders. Similar to mainloop builders (see example 49), epilogue builders aim to generate the best-possible epilogue while exposing incremental opt-ins for greater customization (a builder sketch follows this list).
 - Profiler support for overriding kernel and epilogue builder auto schedules for 3.x API kernels, allowing specific policies to be run in the CUTLASS profiler.
 - Performance optimizations for the warp-specialized persistent ping-pong kernel.
 - Changes to the 3.x GEMM API, involving the host-facing arguments and the underlying Params structs.
 - FMHA Backward Pass from Meta xFormers.
 - Stream-K GEMM with Broadcast enables epilogue broadcast with Stream-K GEMM.
 - Batched B2B GEMM can now run multiple Back-to-Back GEMMs with the same problem size in parallel.
 - Batched Strided GEMV supports both row-major and column-major input matrices.
 - Permute + GEMM fusion can now fuse Permute with the following GEMM. Previously, only fusing GEMM with Permute in the epilogue was supported.
 - Row Broadcast can be fused in the epilogue.
 - The GitHub branch is renamed from master to main in this release.
 - Optimal performance using CUDA 12.1.
 - Updates and bugfixes from the community (thanks!)
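
A hedged sketch of instantiating an epilogue through the builder mentioned above; the template parameter list follows the 3.x cutlass::epilogue::collective::CollectiveBuilder but has evolved across releases, so treat the types and ordering as illustrative.

```cpp
// Hedged sketch: requesting an epilogue from the builder with Auto opt-ins,
// letting CUTLASS pick the schedule and epilogue tile. Parameter ordering is
// illustrative of the 3.x API and may differ in your release.
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/numeric_types.h"
#include "cutlass/layout/matrix.h"

using TileShape    = cute::Shape<cute::_128, cute::_128, cute::_64>;
using ClusterShape = cute::Shape<cute::_1, cute::_2, cute::_1>;

using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    TileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    float, float,                                       // accumulator, compute
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,   // C: type, layout, alignment
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,   // D: type, layout, alignment
    cutlass::epilogue::collective::EpilogueScheduleAuto
  >::CollectiveOp;
```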
 
CUTLASS 3.0
3.0.0 (2023-01-23)
- CuTe, a new core library and backend for CUTLASS 3.0 that defines a single Layout vocabulary type and an associated algebra of layouts for a much more expressive and composable abstraction for tensors, sets of parallel agents, and operations by said agents on tensors.
 - A new conceptual operation hierarchy that replaces the architecture-centric hierarchy of CUTLASS 2.x and documentation for CUTLASS 3.0's GEMM API changes.
 - Strict API backwards compatibility that exposes both 2.x and 3.x API kernels through the same device::GemmUniversalAdapter and kernel::GemmUniversal types, allowing users to include both APIs in the same translation units. More information can be found in the 3.x backwards compatibility section.
 - Updates to the Functionality documentation, which directs users on which kernels are supported via CUTLASS-2 and CUTLASS-3.
 - Updates to the Compatibility section regarding supported compilers, operating systems, CUDA Toolkits, hardware architectures, and target architectures.
 - New warp-specialized GEMM kernel schedules and mainloops targeting Hopper architecture that achieve great performance with TMA, WGMMA, and threadblock clusters.
 - Extensions to CUTLASS profiler to support threadblock cluster shapes in library and profiler tile configurations.
 - CUTLASS library integration for 3.x API kernels built through the new CollectiveBuilder API, enabling the CUTLASS profiler (a builder sketch follows this list).
 - Support for Hopper GEMMs through the new 3.0 API with CuTe-based exposure of the Hopper Tensor Memory Accelerator and WGMMA Tensor Core features.
 - Set of examples that demonstrate the usage of the new 3.0 API to easily build GEMM kernels targeting Hopper: examples 48, 49, and 50.
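
A hedged sketch of assembling a Hopper GEMM from builder-produced collectives and wrapping it in the shared device::GemmUniversalAdapter. The mainloop builder's parameter list follows the 3.x examples (e.g. example 49) but may differ in your release; the epilogue reuses the builder sketched under 3.1, which was itself introduced after 3.0.

```cpp
// Hedged sketch: building a Hopper mainloop collective with CollectiveBuilder
// and composing it into a kernel that device::GemmUniversalAdapter can run.
// Parameter ordering follows the 3.x examples; verify against your release.
#include "cutlass/gemm/collective/collective_builder.hpp"
#include "cutlass/epilogue/collective/collective_builder.hpp"
#include "cutlass/gemm/kernel/gemm_universal.hpp"
#include "cutlass/gemm/device/gemm_universal_adapter.h"

using TileShape    = cute::Shape<cute::_128, cute::_128, cute::_64>;
using ClusterShape = cute::Shape<cute::_1, cute::_1, cute::_1>;

using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    cutlass::half_t, cutlass::layout::RowMajor,    8,   // A: type, layout, alignment
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,   // B: type, layout, alignment
    float,                                              // accumulator
    TileShape, ClusterShape,
    cutlass::gemm::collective::StageCountAuto,
    cutlass::gemm::collective::KernelScheduleAuto
  >::CollectiveOp;

using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm90, cutlass::arch::OpClassTensorOp,
    TileShape, ClusterShape,
    cutlass::epilogue::collective::EpilogueTileAuto,
    float, float,
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,
    cutlass::half_t, cutlass::layout::ColumnMajor, 8,
    cutlass::epilogue::collective::EpilogueScheduleAuto
  >::CollectiveOp;

using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    cute::Shape<int, int, int, int>,                    // problem shape (M, N, K, L)
    CollectiveMainloop, CollectiveEpilogue>;

using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
```

The resulting Gemm type is then launched through the adapter's usual Arguments/run() flow, mirroring the 2.x device-level API that the same adapter also accepts.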