# cudnn frontend v1.15 release notes

cudnn frontend v1.15 is the preferred cudnn frontend version for [cuDNN version 9.13.1](https://docs.nvidia.com/deeplearning/cudnn/backend/latest/release-notes.html#cudnn-9-13-1) and above.

## New API

- Introduced a new `cudnn.Graph` API that enables interoperability between `torch.Tensor` objects and the cudnn frontend API. Sample code for performing a matmul with bias addition:
```
import torch
import cudnn  # cudnn frontend python bindings

B, M, N, K = 16, 128, 128, 512

a_gpu = torch.randn(B, M, K, device="cuda", dtype=torch.bfloat16)
b_gpu = torch.randn(B, K, N, device="cuda", dtype=torch.bfloat16)
d_gpu = torch.randn(1, M, N, device="cuda", dtype=torch.bfloat16)

# Create the cudnn handle that the graph executes against.
handle = cudnn.create_handle()

with cudnn.Graph(
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
    inputs=["mm::A", "mm::B", "bias::bias"],
    outputs=["bias::OUT_0"],
) as graph:
    # Batched matmul: (B, M, K) x (B, K, N) -> (B, M, N)
    AB = graph.matmul(
        name="mm",
        A=a_gpu,
        B=b_gpu,
    )
    # Bias addition, broadcast over the batch dimension.
    C = graph.bias(name="bias", input=AB, bias=d_gpu)
    C.set_output(True)

c_gpu = graph(a_gpu, b_gpu, d_gpu, handle=handle)
```

All notebooks under [samples/python](samples/python) have been updated to showcase the flexibility of this API.

- cudnn frontend now supports building editable pip wheels in place (e.g., via `pip install -e .`).
- The cudnn frontend `Graph` now includes a `warmup` method that triggers kernel loading by performing a fake graph capture. This improves startup time for the first kernel launch in the actual run and prevents deadlocks when used alongside other modules (e.g., NCCL); see the sketch below.
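The notes name `warmup` but not its exact signature, so the following is a minimal sketch under the assumption that it takes no required arguments and can reuse the graph built in the sample above:

```
# Assumed usage of the new warmup method; the argument-free call is an
# assumption, not a documented signature -- check the v1.15 API reference.
graph.warmup()  # fake graph capture forces eager kernel loading

# Later real executions no longer pay the first-launch kernel-loading cost,
# and kernel loading can no longer deadlock against e.g. NCCL initialization.
c_gpu = graph(a_gpu, b_gpu, d_gpu, handle=handle)
```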

## Improvements

### SDPA

- Introduced `set_score_max` and `set_score_sum_exp` to allow the kernel to output the maximum attention score and the sum of exponents (a hypothetical sketch follows this list).
- Updated support surface checks. (SDPA bprop does not support the combination of `s_q == 1` and `s_kv == 1`.)
- SDPA bprop now automatically applies a padding mask if the sequence length is not a multiple of the tile size.
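Only the setter names appear in the notes; the Python shape below is hypothetical (the node object and boolean-toggle signatures are assumptions, not the documented binding):

```
# Hypothetical sketch: the setter names are from the release notes, but how the
# Python binding exposes them (node object, toggles, return values) is assumed.
# q_gpu, k_gpu, v_gpu are torch tensors in the usual (B, H, S, D) layout.
sdpa_node = graph.sdpa(name="sdpa", q=q_gpu, k=k_gpu, v=v_gpu, is_inference=False)

# Request the two new auxiliary outputs from the attention kernel:
sdpa_node.set_score_max(True)      # row-wise maximum attention score
sdpa_node.set_score_sum_exp(True)  # row-wise sum of exponents from the softmax
```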

### Matmul

- Added support for `COMPLEX_FP32` and `COMPLEX_FP64` datatypes. (Requires cuDNN v9.14.0 or later; a hedged sketch follows.)
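A minimal sketch of a complex matmul, assuming the enum spelling `cudnn.data_type.COMPLEX_FP32` mirrors the release-note name, that the torch interop maps `torch.complex64` onto it, and that the output port follows the `mm::OUT_0` naming pattern (all three are assumptions):

```
import torch
import cudnn

B, M, N, K = 4, 64, 64, 128

# torch.complex64 pairs two FP32 values per element; its mapping to
# cudnn.data_type.COMPLEX_FP32 is an assumption, not documented behavior.
a_gpu = torch.randn(B, M, K, device="cuda", dtype=torch.complex64)
b_gpu = torch.randn(B, K, N, device="cuda", dtype=torch.complex64)

handle = cudnn.create_handle()

with cudnn.Graph(
    compute_data_type=cudnn.data_type.COMPLEX_FP32,  # assumed enum name
    inputs=["mm::A", "mm::B"],
    outputs=["mm::OUT_0"],
) as graph:
    C = graph.matmul(name="mm", A=a_gpu, B=b_gpu)
    C.set_output(True)

c_gpu = graph(a_gpu, b_gpu, handle=handle)  # requires cuDNN v9.14.0+
```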

### Normalizations

- Updated samples to prioritize `fe::HeurMode_t::A` over `fe::HeurMode_t::FALLBACK` (see the Python equivalent sketched below).
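In the Python bindings the same ordering is expressed by the list passed to `build`; this sketch assumes the long-standing `cudnn.pygraph` builder API and the `cudnn.heur_mode` enum:

```
import cudnn

# Assumes the classic pygraph builder API; the graph would be populated with
# normalization ops before building.
graph = cudnn.pygraph(
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)
# ... add normalization ops here ...

# Heuristic modes are tried in list order: prefer A, fall back only if needed.
graph.build([cudnn.heur_mode.A, cudnn.heur_mode.FALLBACK])
```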

### Others

- Added support for a new parameter to enable negative scales in the Block Scale DeQuantize operation.
- Improved logging to clearly illustrate the different stages of graph creation.
- The `swish` function now accepts a `swish_beta` parameter (a hypothetical sketch follows this list).
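Swish is defined as swish(x) = x * sigmoid(beta * x), so `swish_beta` sets the beta in that formula. The call below is hypothetical: the op name `graph.swish` and the keyword spelling are assumptions based on the release note:

```
# Hypothetical: op name and keyword are assumed from the release note.
# swish(x) = x * sigmoid(swish_beta * x); beta = 1.0 recovers SiLU.
Y = graph.swish(name="swish", input=X, swish_beta=1.702)
Y.set_output(True)
```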

## Samples

- Added samples demonstrating how to perform sink attention forward and backward propagation with the C++ API. (Requires cuDNN v9.13.0 or later.)
- Added samples demonstrating "Block Scale Matmul Quantize". (Requires cuDNN v9.14.0 or later.)
- Added a sample demonstrating how ragged (packed) tensors work with cuDNN SDPA ([test_sdpa_with_caching.py](test/python/test_sdpa_with_caching.py)). The sample also demonstrates simple caching and graph capture techniques that can improve execution time.

## Bug Fixes

- Fixed an issue where the SDPA node was accessing tensor dimensions before they were inferred, leading to a crash.

## Benchmarks

- Updated results with cuDNN 9.13.1 for B200 and GB300.

## Issues Resolved

- https://github.com/NVIDIA/cudnn-frontend/issues/160
- https://github.com/NVIDIA/cudnn-frontend/issues/152