# cudnn frontend v1.15 release notes

cudnn frontend v1.15 is the preferred cudnn frontend version for [cuDNN version 9.13.1](https://docs.nvidia.com/deeplearning/cudnn/backend/latest/release-notes.html#cudnn-9-13-1) and above.

## New API

- Introduced a new `cudnn.Graph` API that enables interoperability between `torch.Tensor` objects and the cudnn frontend API. Sample code for performing a matmul with bias addition:
```
import torch
import cudnn  # cudnn frontend python bindings

B, M, N, K = 16, 128, 128, 512

a_gpu = torch.randn(B, M, K, device="cuda", dtype=torch.bfloat16)
b_gpu = torch.randn(B, K, N, device="cuda", dtype=torch.bfloat16)
d_gpu = torch.randn(1, M, N, device="cuda", dtype=torch.bfloat16)

# Create the cudnn handle that the graph executes against.
handle = cudnn.create_handle()

with cudnn.Graph(
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
    inputs=["mm::A", "mm::B", "bias::bias"],
    outputs=["bias::OUT_0"],
) as graph:
    # Batched matmul: (B, M, K) x (B, K, N) -> (B, M, N)
    AB = graph.matmul(
        name="mm",
        A=a_gpu,
        B=b_gpu,
    )
    # Bias addition, broadcast over the batch dimension.
    C = graph.bias(name="bias", input=AB, bias=d_gpu)
    C.set_output(True)

c_gpu = graph(a_gpu, b_gpu, d_gpu, handle=handle)
```

All notebooks under [samples/python](samples/python) have been updated to showcase the flexibility of this API.

- cudnn frontend now supports building editable pip wheels in place (e.g., via `pip install -e .`).
- The cudnn frontend `Graph` now includes a `warmup` method that triggers kernel loading by performing a fake graph capture. This improves startup time for the first kernel launch in the actual run and prevents deadlocks when used alongside other modules (e.g., NCCL); see the sketch below.
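The notes name `warmup` but not its exact signature, so the following is a minimal sketch under the assumption that it takes no required arguments and can reuse the graph built in the sample above:

```
# Assumed usage of the new warmup method; the argument-free call is an
# assumption, not a documented signature -- check the v1.15 API reference.
graph.warmup()  # fake graph capture forces eager kernel loading

# Later real executions no longer pay the first-launch kernel-loading cost,
# and kernel loading can no longer deadlock against e.g. NCCL initialization.
c_gpu = graph(a_gpu, b_gpu, d_gpu, handle=handle)
```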

## Improvements

### SDPA

- Introduced `set_score_max` and `set_score_sum_exp` to allow the kernel to output the maximum attention score and the sum of exponents (a hypothetical sketch follows this list).
- Updated support surface checks. (SDPA bprop does not support the combination of `s_q == 1` and `s_kv == 1`.)
- SDPA bprop now automatically applies a padding mask if the sequence length is not a multiple of the tile size.
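Only the setter names appear in the notes; the Python shape below is hypothetical (the node object and boolean-toggle signatures are assumptions, not the documented binding):

```
# Hypothetical sketch: the setter names are from the release notes, but how the
# Python binding exposes them (node object, toggles, return values) is assumed.
# q_gpu, k_gpu, v_gpu are torch tensors in the usual (B, H, S, D) layout.
sdpa_node = graph.sdpa(name="sdpa", q=q_gpu, k=k_gpu, v=v_gpu, is_inference=False)

# Request the two new auxiliary outputs from the attention kernel:
sdpa_node.set_score_max(True)      # row-wise maximum attention score
sdpa_node.set_score_sum_exp(True)  # row-wise sum of exponents from the softmax
```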

### Matmul

- Added support for `COMPLEX_FP32` and `COMPLEX_FP64` datatypes. (Requires cuDNN v9.14.0 or later; a hedged sketch follows.)
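A minimal sketch of a complex matmul, assuming the enum spelling `cudnn.data_type.COMPLEX_FP32` mirrors the release-note name, that the torch interop maps `torch.complex64` onto it, and that the output port follows the `mm::OUT_0` naming pattern (all three are assumptions):

```
import torch
import cudnn

B, M, N, K = 4, 64, 64, 128

# torch.complex64 pairs two FP32 values per element; its mapping to
# cudnn.data_type.COMPLEX_FP32 is an assumption, not documented behavior.
a_gpu = torch.randn(B, M, K, device="cuda", dtype=torch.complex64)
b_gpu = torch.randn(B, K, N, device="cuda", dtype=torch.complex64)

handle = cudnn.create_handle()

with cudnn.Graph(
    compute_data_type=cudnn.data_type.COMPLEX_FP32,  # assumed enum name
    inputs=["mm::A", "mm::B"],
    outputs=["mm::OUT_0"],
) as graph:
    C = graph.matmul(name="mm", A=a_gpu, B=b_gpu)
    C.set_output(True)

c_gpu = graph(a_gpu, b_gpu, handle=handle)  # requires cuDNN v9.14.0+
```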

### Normalizations

- Updated samples to prioritize `fe::HeurMode_t::A` over `fe::HeurMode_t::FALLBACK` (see the Python equivalent sketched below).
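In the Python bindings the same ordering is expressed by the list passed to `build`; this sketch assumes the long-standing `cudnn.pygraph` builder API and the `cudnn.heur_mode` enum:

```
import cudnn

# Assumes the classic pygraph builder API; the graph would be populated with
# normalization ops before building.
graph = cudnn.pygraph(
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)
# ... add normalization ops here ...

# Heuristic modes are tried in list order: prefer A, fall back only if needed.
graph.build([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])
```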

### Others

- Added support for a new parameter to enable negative scales in the Block Scale DeQuantize operation.
- Improved logging to clearly illustrate the different stages of graph creation.
- The `swish` function now accepts a `swish_beta` parameter (a hypothetical sketch follows this list).
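Swish is defined as swish(x) = x * sigmoid(beta * x), so `swish_beta` sets the beta in that formula. The call below is hypothetical: the op name `graph.swish` and the keyword spelling are assumptions based on the release note:

```
# Hypothetical: op name and keyword are assumed from the release note.
# swish(x) = x * sigmoid(swish_beta * x); beta = 1.0 recovers SiLU.
Y = graph.swish(name="swish", input=X, swish_beta=1.702)
Y.set_output(True)
```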

## Samples

- Added samples demonstrating how to perform sink attention forward and backward propagation with the C++ API. (Requires cuDNN v9.13.0 or later.)
- Added samples demonstrating "Block Scale Matmul Quantize". (Requires cuDNN v9.14.0 or later.)
- Added a sample demonstrating how ragged (packed) tensors work with cuDNN SDPA ([test_sdpa_with_caching.py](test/python/test_sdpa_with_caching.py)). The sample also demonstrates simple caching and graph capture techniques that can improve execution time.

## Bug Fixes

- Fixed an issue where the SDPA node was accessing tensor dimensions before they were inferred, leading to a crash.

## Benchmarks

- Updated results with cuDNN 9.13.1 for B200 and GB300.

## Issues Resolved

- https://github.com/NVIDIA/cudnn-frontend/issues/160
- https://github.com/NVIDIA/cudnn-frontend/issues/152