Conversation

@Anerudhan (Collaborator) commented Jul 17, 2025

cudnn frontend v1.13 release notes

cudnn frontend v1.13 is the preferred cudnn frontend version for [cudnn version 9.11.0](https://docs.nvidia.com/deeplearning/cudnn/backend/latest/release-notes.html#cudnn-9-11-0) and above.

New API

Introduces a device descriptor, which allows device-less (ahead-of-time) compilation of a cudnn graph for a target GPU. See the newly added [sample](samples/cpp/misc/deviceless_aot_compilation.cpp) and documentation.
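
A minimal Python sketch of the intended flow. The device-descriptor entry point and build keyword below are assumptions, not the actual API (the authoritative usage is the linked C++ sample); only the `pygraph` scaffolding is established:

```python
import cudnn

# Assumption: a descriptor naming the target GPU (e.g., by SM architecture)
# so the graph can be compiled without that device being present. The real
# spelling lives in samples/cpp/misc/deviceless_aot_compilation.cpp.
device_desc = cudnn.create_device_descriptor(sm_version=90)  # hypothetical

graph = cudnn.pygraph(
    io_data_type=cudnn.data_type.HALF,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)
# ... declare tensors and operations here ...

# Assumption: pass the descriptor at build time so plans are compiled
# ahead of time for the described GPU rather than the current one.
graph.build([cudnn.heur_mode.A], device_descriptor=device_desc)  # hypothetical kwarg
```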

Improvements

SDPA

  • Introduced `generate_stats` as an alias for `is_inference`. `generate_stats` now controls whether the softmax stats tensor is produced; `is_inference` is deprecated (see the sketch after this list).

  • Improved support checks for left and right diagonal bands in conjunction with the diagonal alignment.

  • Improved error handling for large head dimensions (d > 128) in SDPA bprop.
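
A minimal sketch of the renamed knob in the Python bindings, assuming `generate_stats=True` requests the stats tensor (i.e., the inverse polarity of the deprecated `is_inference=True`); tensor shapes are illustrative:

```python
import cudnn

graph = cudnn.pygraph(
    io_data_type=cudnn.data_type.HALF,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)

b, h, s, d = 4, 8, 1024, 64
q = graph.tensor(name="Q", dim=[b, h, s, d], stride=[h * s * d, s * d, d, 1])
k = graph.tensor(name="K", dim=[b, h, s, d], stride=[h * s * d, s * d, d, 1])
v = graph.tensor(name="V", dim=[b, h, s, d], stride=[h * s * d, s * d, d, 1])

# Previously: is_inference=False to get the softmax stats needed by bprop.
# v1.13 (assumed polarity): generate_stats=True requests the same stats dump.
o, stats = graph.sdpa(
    name="sdpa",
    q=q, k=k, v=v,
    generate_stats=True,  # replaces the deprecated is_inference=False
    attn_scale=1.0 / (d ** 0.5),
    use_causal_mask=True,
)
o.set_output(True)
stats.set_output(True)
```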

Normalizations

  • Added support for fused LayerNorm with ReLU, plus a sample for [Layernorm with relu bitmask dump](samples/cpp/norm/layernorm_bitmask_relu.cpp); a hedged sketch follows below.
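
A minimal Python sketch of the pattern, assuming the existing `layernorm` and `relu` pybinds compose into the fused pattern described above; the bitmask-dump variant is only shown in the linked C++ sample, so no Python spelling is assumed for it:

```python
import cudnn

graph = cudnn.pygraph(
    io_data_type=cudnn.data_type.HALF,
    intermediate_data_type=cudnn.data_type.FLOAT,
    compute_data_type=cudnn.data_type.FLOAT,
)

n, c = 16, 1024
x     = graph.tensor(name="X",     dim=[n, c, 1, 1], stride=[c, 1, 1, 1])
scale = graph.tensor(name="scale", dim=[1, c, 1, 1], stride=[c, 1, 1, 1],
                     data_type=cudnn.data_type.FLOAT)
bias  = graph.tensor(name="bias",  dim=[1, c, 1, 1], stride=[c, 1, 1, 1],
                     data_type=cudnn.data_type.FLOAT)
epsilon = graph.tensor(name="epsilon", dim=[1, 1, 1, 1], stride=[1, 1, 1, 1],
                       data_type=cudnn.data_type.FLOAT, is_pass_by_value=True)

# LayerNorm in training mode, followed by a pointwise ReLU that v1.13 can
# fuse into the same kernel per the release note above.
ln_out, mean, inv_var = graph.layernorm(
    name="LN",
    norm_forward_phase=cudnn.norm_forward_phase.TRAINING,
    input=x, scale=scale, bias=bias, epsilon=epsilon,
)
relu_out = graph.relu(name="relu", input=ln_out)
relu_out.set_output(True)
```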

Others

  • Published improved SDPA training benchmarks for fp8 and fp16/bf16 graph patterns.

  • Enabled int4 weight-only quantization for matmul. See the [example](samples/cpp/int4_woq_matmul.cpp); a combined sketch of the low-precision and reduction items below follows this list.

  • Allowed block scale dequantize (required for low-precision matmul) to take a 2-D scale factor.

  • Allowed reductions to accept `deterministic` as an attribute.

  • Added pybinds for block scale dequantize.
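
A combined Python sketch of the three items above. The `block_scale_dequantize` spelling, the int4 enum, and the `deterministic` keyword are assumptions inferred from the notes; the linked int4 WOQ example is the authoritative reference:

```python
import cudnn

graph = cudnn.pygraph(compute_data_type=cudnn.data_type.FLOAT)

m, k, n, block = 128, 4096, 4096, 32
a = graph.tensor(name="A", dim=[1, m, k], stride=[m * k, k, 1],
                 data_type=cudnn.data_type.HALF)
# int4 weights plus a 2-D (per-block) scale factor, as enabled above;
# the leading 1 is just the batch dimension.
w_q = graph.tensor(name="W_q", dim=[1, k, n], stride=[k * n, n, 1],
                   data_type=cudnn.data_type.INT4)  # assumed enum spelling
scale = graph.tensor(name="scale", dim=[1, k // block, n],
                     stride=[(k // block) * n, n, 1],
                     data_type=cudnn.data_type.FLOAT)

# Assumed pybind spelling for the newly exposed dequantize op.
w = graph.block_scale_dequantize(name="dq", input=w_q, scale=scale,
                                 block_size=block)
c = graph.matmul(name="matmul", A=a, B=w)

# Assumed keyword for the new reduction attribute.
row_sum = graph.reduction(input=c, mode=cudnn.reduction_mode.ADD,
                          deterministic=True)
row_sum.set_output(True).set_dim([1, m, 1])
```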

Bug Fixes

  • Fixed the sliding window `attn_score_modifier` function, allowing it to set a true negative infinity.

@Anerudhan merged commit 9793df5 into main on Jul 17, 2025
1 check passed