v1.15.0-release

@Anerudhan released this 10 Oct 18:30
· 1 commit to main since this release
0b1577c

cudnn frontend v1.15 release notes

cudnn frontend v1.15 is the preferred cudnn frontend version for cuDNN version 9.13.1 and above.

New API

  • Introduced a new cudnn.Graph API that enables interoperability between torch.Tensor objects and the cudnn frontend API. Sample code for performing a matmul with bias addition:
import cudnn
import torch

# A cuDNN handle is required to execute the graph below.
handle = cudnn.create_handle()

B, M, N, K = 16, 128, 128, 512

a_gpu = torch.randn(B, M, K, device="cuda", dtype=torch.bfloat16)
b_gpu = torch.randn(B, K, N, device="cuda", dtype=torch.bfloat16)
d_gpu = torch.randn(1, M, N, device="cuda", dtype=torch.bfloat16)

with cudnn.Graph(
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
    inputs=["mm::A", "mm::B", "bias::bias"],
    outputs=["bias::OUT_0"],
) as graph:
    AB = graph.matmul(
        name="mm",
        A=a_gpu,
        B=b_gpu,
    )
    C = graph.bias(name="bias", input=AB, bias=d_gpu)
    C.set_output(True)

c_gpu = graph(a_gpu, b_gpu, d_gpu, handle=handle)

All notebooks under samples/python have been updated to showcase the flexibility of this API.

  • cudnn frontend now supports editable, in-place pip installs (e.g., pip install -e .).
  • The cudnn frontend Graph now includes a warmup method that triggers kernel loading by performing a fake graph capture. This improves the startup time for running the initial kernel in the actual run and prevents deadlocks when used with other modules (e.g., NCCL).

Improvements

SDPA

  • Introduced set_score_max and set_score_sum_exp to allow the kernel to output the row-wise maximum attention score and the sum of exponentials (the softmax statistics).
  • Updated support surface checks. (SDPA bprop does not support the combination of s_q==1 and s_kv==1.)
  • SDPA bprop now automatically applies a padding mask if the sequence length is not a multiple of the tile size.
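For reference, the two statistics exposed by set_score_max and set_score_sum_exp correspond to the standard per-row softmax statistics. A minimal numpy sketch (illustrative only; it assumes the sum of exponentials is computed against the row maximum, the usual numerically stable convention, which is not confirmed by these notes):

```python
import numpy as np

def softmax_stats(scores):
    """Per-row softmax statistics of an attention score matrix.

    Returns the row-wise maximum score and the row-wise sum of
    exponentials, computed relative to that maximum for stability.
    """
    row_max = scores.max(axis=-1, keepdims=True)
    sum_exp = np.exp(scores - row_max).sum(axis=-1, keepdims=True)
    return row_max, sum_exp

scores = np.array([[0.0, 1.0, 2.0]])
m, s = softmax_stats(scores)
# The attention probabilities can be recovered from the two statistics:
probs = np.exp(scores - m) / s
```

Having these statistics available makes it possible to reconstruct or renormalize the attention probabilities outside the kernel without rerunning the softmax.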

Matmul

  • Added support for COMPLEX_FP32 and COMPLEX_FP64 datatypes. (Requires cuDNN v9.14.0 or later.)
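As a point of comparison, the semantics of a complex matmul are the ordinary complex matrix product. A numpy analogue (illustrative only; numpy's complex64 holds two float32 components, mirroring COMPLEX_FP32):

```python
import numpy as np

# Complex matrix product, the operation a COMPLEX_FP32 matmul performs.
a = np.array([[1 + 2j, 3 - 1j]], dtype=np.complex64)
b = np.array([[2 + 0j], [0 + 1j]], dtype=np.complex64)
c = a @ b  # (1+2j)*2 + (3-1j)*1j = 3 + 7j
```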

Normalizations

  • Updated samples to prioritize fe::HeurMode_t::A over fe::HeurMode_t::FALLBACK.

Others

  • Added support for a new parameter to enable negative scales in the Block Scale DeQuantize operation.
  • Improved logging to clearly illustrate the different stages of graph creation.
  • The swish function now accepts a swish_beta parameter.
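The swish activation with a beta parameter is x * sigmoid(beta * x); with beta = 1 it reduces to the standard swish/SiLU. A minimal numpy sketch of the formula (illustrative only; the mapping of the swish_beta parameter onto this beta is an assumption based on the common definition):

```python
import numpy as np

def swish(x, beta=1.0):
    # Swish with a tunable beta: x * sigmoid(beta * x).
    # beta = 1.0 gives the standard swish/SiLU activation.
    return x * (1.0 / (1.0 + np.exp(-beta * x)))

y = swish(np.array([0.0, 10.0]))
```

Larger beta values sharpen the activation toward ReLU, while beta = 0 degenerates to x / 2.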

Samples

  • Added samples demonstrating how to perform sink attention forward and backward propagation with the C++ API. (Requires cuDNN v9.13.0 or later.)
  • Added samples demonstrating "Block Scale Matmul Quantize". (Requires cuDNN v9.14.0 or later.)
  • Added a sample demonstrating how ragged (packed) tensors work with cuDNN SDPA (test_sdpa_with_caching.py). The sample also demonstrates simple caching and graph capture techniques that can improve execution time.

Bug Fixes

  • Fixed an issue where the SDPA node was accessing tensor dimensions before they were inferred, leading to a crash.

Benchmarks

  • Updated results with cuDNN 9.13.1 for B200 and GB300.

Issues Resolved