@Anerudhan
Collaborator


cuDNN Frontend v1.16.0 is the recommended version for [cuDNN 9.15.0](https://docs.nvidia.com/deeplearning/cudnn/backend/latest/release-notes.html#cudnn-9-15-0) and later releases.

This release introduces open-source implementations of commonly requested fused kernels for select architectures (Blackwell). These experimental kernels may require additional dependencies such as CuteDSL. The initial release includes:

- [GEMM + Amax](gemm_fusions/gemm_amax.md)
- [GEMM + SwiGLU](gemm_fusions/gemm_swiglu.md)

These optional dependencies can be installed with `pip install nvidia-cudnn-frontend[cutedsl]`. Usage examples and detailed documentation are available in the [test/python/fe_api](test/python/fe_api) directory.

Please file an issue for additional kernel requests or bug reports.

- **Block Mask Support**: Starting with cuDNN 9.14.0, SDPA attributes now support block masks to exclude tiles that do not require computation. Refer to the [sample implementation](samples/cpp/sdpa/fp16_fwd_with_block_mask.cpp) for usage details.

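A rough sketch of where such a mask plugs in during graph construction. The graph, tensor, and SDPA builders below are the standard v1.x frontend API; the block-mask attribute itself is left as a commented-out placeholder, since its exact setter name is defined by the linked sample rather than guessed here.

```cpp
#include <cstdint>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Sketch: build an SDPA forward graph that a block mask would attach to.
// The tensor and SDPA builders are standard v1.x frontend API; the
// block-mask setter is commented out because its real name lives in
// samples/cpp/sdpa/fp16_fwd_with_block_mask.cpp.
void build_sdpa_fwd(fe::graph::Graph& graph, int64_t b, int64_t h,
                    int64_t s_q, int64_t s_kv, int64_t d) {
    graph.set_io_data_type(fe::DataType_t::HALF)
        .set_intermediate_data_type(fe::DataType_t::FLOAT)
        .set_compute_data_type(fe::DataType_t::FLOAT);

    auto Q = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("Q")
                              .set_dim({b, h, s_q, d})
                              .set_stride({h * s_q * d, s_q * d, d, 1}));
    auto K = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("K")
                              .set_dim({b, h, s_kv, d})
                              .set_stride({h * s_kv * d, s_kv * d, d, 1}));
    auto V = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("V")
                              .set_dim({b, h, s_kv, d})
                              .set_stride({h * s_kv * d, s_kv * d, d, 1}));

    auto opts = fe::graph::SDPA_attributes()
                    .set_name("sdpa_fwd")
                    .set_is_inference(true);
    // Hypothetical placeholder: attach a per-tile mask so that tiles
    // requiring no computation are skipped (cuDNN 9.14.0+):
    // opts.set_block_mask(block_mask_tensor);

    auto [O, stats] = graph.sdpa(Q, K, V, opts);
    O->set_output(true);
    (void)stats;  // null when is_inference is true
}
```
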
- **Bug Fix**: Resolved an invalid memory access (IMA) issue in SDPA backward propagation (fixed in cuDNN backend version 9.15.1 and later) that occurred when `s_kv` is not a multiple of 128, padding mask is disabled, and operations are performed in CUDA graph replay mode.

- **CUDA Graph Compatibility**: Added `BehaviorNote_t::CUDNN_BEHAVIOR_NOTE_CUBLASLT_DEPENDENCY` as a behavior note. Available starting with cuDNN 9.15.0, it enables filtering out engine configurations (execution plans) that use cuBLAS as a backend, as sketched below.

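For instance, cuBLASLt-dependent plans can be excluded up front. `deselect_behavior_notes` is the frontend's existing note-based filtering hook, so only the new enum value is specific to this release; treat the exact build sequence as a sketch.

```cpp
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Sketch: drop execution plans that depend on cuBLASLt before building,
// e.g. to keep plans CUDA-graph friendly. deselect_behavior_notes() is
// the frontend's existing note-based filter; the CUBLASLT note value
// requires cuDNN 9.15.0 or newer. Error handling is elided for brevity.
void build_without_cublaslt(fe::graph::Graph& graph, cudnnHandle_t handle) {
    graph.deselect_behavior_notes(
        {fe::BehaviorNote_t::CUDNN_BEHAVIOR_NOTE_CUBLASLT_DEPENDENCY});

    // Standard build sequence; plans carrying the deselected note are
    // dropped from consideration.
    graph.validate();
    graph.build_operation_graph(handle);
    graph.create_execution_plans({fe::HeurMode_t::A});
    graph.check_support(handle);
    graph.build_plans(handle);
}
```
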
- **Block Scale Quantization**: Added Python bindings for block scale quantize operations (#173). Refer to the [sample implementation](test/python/test_block_scale_quantize.py) for usage details.

- **Dependency Optimization**: PyTorch is no longer a required dependency for cuDNN Frontend (#177).

- **Tensor Alignment**: Enhanced the tensor descriptor API to accept alignment as an attribute (#153); a sketch follows below.

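A minimal sketch of a tensor declaration; the builder chain is the standard API, while the alignment setter's name is my assumption (see #153 for the authoritative spelling), so it is shown commented out.

```cpp
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Sketch: declare a tensor with the usual builder chain. set_alignment()
// is an ASSUMED name for the new attribute -- #153 has the actual API.
void declare_aligned_tensor(fe::graph::Graph& graph) {
    auto X = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("X")
                              .set_dim({8, 64, 128})
                              .set_stride({64 * 128, 128, 1})
                              .set_data_type(fe::DataType_t::HALF));
    // Hypothetical: promise the device pointer backing X is 16-byte
    // aligned so the backend can select wider vectorized kernels.
    // X->set_alignment(16);
    (void)X;
}
```
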
- **Plan Generation Control**: Updated the `cudnnGetPlan` API to accept an optional maximum-plan-count parameter, enabling users to limit the number of plans built and autotuned; see the sketch below.

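The note names `cudnnGetPlan`; the closest step I can illustrate in the C++ frontend is `build_plans`, so the sketch below uses that and marks the maximum-plan-count argument as an assumed spelling of the new optional parameter.

```cpp
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Sketch only. build_plans() is the existing C++ build entry point shown
// here; the capped variant below is an ASSUMED spelling of the new
// optional maximum-plan-count parameter.
void build_with_plan_cap(fe::graph::Graph& graph, cudnnHandle_t handle) {
    // Existing behavior: build every plan the chosen policy yields.
    graph.build_plans(handle, fe::BuildPlanPolicy_t::ALL);

    // Assumed new form: stop after at most five plans are built and
    // autotuned (exact signature may differ):
    // graph.build_plans(handle, fe::BuildPlanPolicy_t::ALL, /*max_plans=*/5);
}
```
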
- Updated [benchmark/sdpa_benchmark_training/benchmark_single_sdpa.py](benchmark/sdpa_benchmark_training/benchmark_single_sdpa.py) to use the correct parameter names and to fix the FLOPS calculation, giving accurate performance measurements.

Resolved GitHub issues:

- #153 - Tensor descriptor alignment support
- #173 - Block scale quantize Python bindings
- #177 - PyTorch dependency removal