cudnn frontend v1.15 release notes
cudnn frontend v1.15 is the preferred cudnn frontend version for cuDNN version 9.13.1 and above.
New API
- Introduced a new `cudnn.Graph` API that enables interoperability between `torch.Tensor`s and the cudnn frontend API. Sample code for performing a matmul with bias addition:
```python
import cudnn
import torch

# A cuDNN handle is required to execute the graph.
handle = cudnn.create_handle()

B, M, N, K = 16, 128, 128, 512
a_gpu = torch.randn(B, M, K, device="cuda", dtype=torch.bfloat16)
b_gpu = torch.randn(B, K, N, device="cuda", dtype=torch.bfloat16)
d_gpu = torch.randn(1, M, N, device="cuda", dtype=torch.bfloat16)

with cudnn.Graph(
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
    inputs=["mm::A", "mm::B", "bias::bias"],
    outputs=["bias::OUT_0"],
) as graph:
    AB = graph.matmul(
        name="mm",
        A=a_gpu,
        B=b_gpu,
    )
    C = graph.bias(name="bias", input=AB, bias=d_gpu)
    C.set_output(True)

c_gpu = graph(a_gpu, b_gpu, d_gpu, handle=handle)
```
All notebooks under `samples/python` have been updated to showcase the flexibility of this API.
- cudnn frontend now supports building editable pip wheels in place.
- The cudnn frontend `Graph` now includes a `warmup` method that triggers kernel loading by performing a fake graph capture. This reduces the startup time for the first kernel launch in the actual run and prevents deadlocks when used with other modules (e.g., NCCL).
Improvements
SDPA
- Introduced `set_score_max` and `set_score_sum_exp` to allow the kernel to output the max attention score and the sum of exponents.
- Updated support surface checks. (SDPA bprop does not support the combination of `s_q == 1` and `s_kv == 1`.)
- SDPA bprop now automatically applies a padding mask if the sequence length is not a multiple of the tile size.
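As a rough illustration of the automatic padding behavior, the effective sequence length can be thought of as rounded up to the next multiple of the kernel's tile size, with the extra tail positions masked out. The helper below is a minimal sketch of that arithmetic; `tile_size` and the function names are illustrative, not part of the frontend API:

```python
def padded_length(seq_len: int, tile_size: int) -> int:
    """Round seq_len up to the next multiple of tile_size."""
    return ((seq_len + tile_size - 1) // tile_size) * tile_size

def padding_mask(seq_len: int, tile_size: int) -> list:
    """1 for valid positions, 0 for the padded tail."""
    total = padded_length(seq_len, tile_size)
    return [1 if i < seq_len else 0 for i in range(total)]

# A sequence of length 100 with a tile size of 64 is padded to 128;
# positions 100..127 carry a mask value of 0.
mask = padding_mask(100, 64)
```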
Matmul
- Added support for the `COMPLEX_FP32` and `COMPLEX_FP64` data types. (Requires cuDNN v9.14.0 or later.)
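For reference, a matmul over the complex data types follows ordinary complex arithmetic. The sketch below computes the same mathematical result in pure Python as a ground-truth check; it is illustrative only and not the cuDNN API:

```python
def complex_matmul(a, b):
    """Reference matmul over lists-of-lists of Python complex numbers."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        [sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]

# (1+2j)*2 + (3-1j)*1j = (2+4j) + (1+3j) = 3+7j
a = [[1 + 2j, 3 - 1j]]
b = [[2 + 0j], [0 + 1j]]
c = complex_matmul(a, b)
```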
Normalizations
- Updated samples to prioritize `fe::HeurMode_t::A` over `fe::HeurMode_t::FALLBACK`.
Others
- Added support for a new parameter to enable negative scales in the Block Scale DeQuantize operation.
- Improved logging to clearly illustrate the different stages of graph creation.
- The `swish` activation function now accepts a `swish_beta` parameter.
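For context, `swish_beta` corresponds to the β in the standard swish definition, swish(x) = x · sigmoid(βx); β = 1 recovers SiLU. A pure-Python sketch of the math (not the frontend API):

```python
import math

def swish(x: float, beta: float = 1.0) -> float:
    """swish(x) = x * sigmoid(beta * x); beta = 1.0 is SiLU."""
    return x / (1.0 + math.exp(-beta * x))

# Larger beta sharpens the sigmoid gate, pushing swish toward ReLU.
y_silu = swish(2.0, beta=1.0)
y_sharp = swish(2.0, beta=10.0)
```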
Samples
- Added samples demonstrating how to perform sink attention forward and backward propagation with the C++ API. (Requires cuDNN v9.13.0 or later.)
- Added samples demonstrating "Block Scale Matmul Quantize". (Requires cuDNN v9.14.0 or later.)
- Added a sample demonstrating how ragged (packed) tensors work with cuDNN SDPA (`test_sdpa_with_caching.py`). The sample also demonstrates simple caching and graph-capture techniques that can improve execution time.
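The caching idea in that sample can be sketched independently of cuDNN: build a graph once per unique input-shape key and reuse it on subsequent calls. A minimal shape-keyed cache, where `build_graph` is a stand-in for actual graph construction:

```python
_graph_cache: dict = {}

def get_or_build_graph(shapes: tuple, build_graph):
    """Return a cached graph for these input shapes, building on first use."""
    key = tuple(shapes)
    if key not in _graph_cache:
        _graph_cache[key] = build_graph(key)
    return _graph_cache[key]

# Usage: the (expensive) builder runs only once per distinct shape key.
calls = []
def builder(key):
    calls.append(key)
    return f"graph-for-{key}"

g1 = get_or_build_graph((16, 128, 512), builder)
g2 = get_or_build_graph((16, 128, 512), builder)  # cache hit, no rebuild
```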
Bug Fixes
- Fixed an issue where the SDPA node was accessing tensor dimensions before they were inferred, leading to a crash.
Benchmarks
- Updated results with cuDNN 9.13.1 for B200 and GB300.