@Anerudhan
Collaborator


cuDNN Frontend v1.16.0 is the recommended version for [cuDNN 9.15.0](https://docs.nvidia.com/deeplearning/cudnn/backend/latest/release-notes.html#cudnn-9-15-0) and later releases.

This release introduces open-source implementations of commonly requested fused kernels for select architectures (Blackwell). These experimental kernels may require additional dependencies such as CuteDSL. The initial release includes:

- [GEMM + Amax](gemm_fusions/gemm_amax.md)
- [GEMM + SwiGLU](gemm_fusions/gemm_swiglu.md)

These optional dependencies can be installed with `pip install nvidia-cudnn-frontend[cutedsl]`. Usage examples and detailed documentation are available in the [test/python/fe_api](test/python/fe_api) directory.

Please file an issue for additional kernel requests or bug reports.

- **Block Mask Support**: Starting with cuDNN 9.14.0, SDPA attributes now support block masks to exclude tiles that do not require computation. Refer to the [sample implementation](samples/cpp/sdpa/fp16_fwd_with_block_mask.cpp) for usage details.

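A rough sketch of where such a mask plugs in during graph construction. The graph, tensor, and SDPA builders below are the standard v1.x frontend API; the block-mask attribute itself is left as a commented-out placeholder, since its exact setter name is defined by the linked sample rather than guessed here.

```cpp
#include <cstdint>
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Sketch: build an SDPA forward graph that a block mask would attach to.
// The tensor and SDPA builders are standard v1.x frontend API; the
// block-mask setter is commented out because its real name lives in
// samples/cpp/sdpa/fp16_fwd_with_block_mask.cpp.
void build_sdpa_fwd(fe::graph::Graph& graph, int64_t b, int64_t h,
                    int64_t s_q, int64_t s_kv, int64_t d) {
    graph.set_io_data_type(fe::DataType_t::HALF)
        .set_intermediate_data_type(fe::DataType_t::FLOAT)
        .set_compute_data_type(fe::DataType_t::FLOAT);

    auto Q = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("Q")
                              .set_dim({b, h, s_q, d})
                              .set_stride({h * s_q * d, s_q * d, d, 1}));
    auto K = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("K")
                              .set_dim({b, h, s_kv, d})
                              .set_stride({h * s_kv * d, s_kv * d, d, 1}));
    auto V = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("V")
                              .set_dim({b, h, s_kv, d})
                              .set_stride({h * s_kv * d, s_kv * d, d, 1}));

    auto opts = fe::graph::SDPA_attributes()
                    .set_name("sdpa_fwd")
                    .set_is_inference(true);
    // Hypothetical placeholder: attach a per-tile mask so that tiles
    // requiring no computation are skipped (cuDNN 9.14.0+):
    // opts.set_block_mask(block_mask_tensor);

    auto [O, stats] = graph.sdpa(Q, K, V, opts);
    O->set_output(true);
    (void)stats;  // null when is_inference is true
}
```
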
- **Bug Fix**: Resolved an invalid memory access (IMA) issue in SDPA backward propagation (fixed in cuDNN backend version 9.15.1 and later) that occurred when `s_kv` is not a multiple of 128, padding mask is disabled, and operations are performed in CUDA graph replay mode.

- **CUDA Graph Compatibility**: Added `BehaviorNote_t::CUDNN_BEHAVIOR_NOTE_CUBLASLT_DEPENDENCY` as a behavior note. Available starting with cuDNN 9.15.0, it enables filtering out engine configurations (execution plans) that use cuBLAS as a backend, as sketched below.

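For instance, cuBLASLt-dependent plans can be excluded up front. `deselect_behavior_notes` is the frontend's existing note-based filtering hook, so only the new enum value is specific to this release; treat the exact build sequence as a sketch.

```cpp
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Sketch: drop execution plans that depend on cuBLASLt before building,
// e.g. to keep plans CUDA-graph friendly. deselect_behavior_notes() is
// the frontend's existing note-based filter; the CUBLASLT note value
// requires cuDNN 9.15.0 or newer. Error handling is elided for brevity.
void build_without_cublaslt(fe::graph::Graph& graph, cudnnHandle_t handle) {
    graph.deselect_behavior_notes(
        {fe::BehaviorNote_t::CUDNN_BEHAVIOR_NOTE_CUBLASLT_DEPENDENCY});

    // Standard build sequence; plans carrying the deselected note are
    // dropped from consideration.
    graph.validate();
    graph.build_operation_graph(handle);
    graph.create_execution_plans({fe::HeurMode_t::A});
    graph.check_support(handle);
    graph.build_plans(handle);
}
```
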
- **Block Scale Quantization**: Added Python bindings for block scale quantize operations (#173). Refer to the [sample implementation](test/python/test_block_scale_quantize.py) for usage details.

- **Dependency Optimization**: PyTorch is no longer a required dependency for cuDNN Frontend (#177).

- **Tensor Alignment**: Enhanced the tensor descriptor API to accept alignment as an attribute (#153); a sketch follows below.

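A minimal sketch of a tensor declaration; the builder chain is the standard API, while the alignment setter's name is my assumption (see #153 for the authoritative spelling), so it is shown commented out.

```cpp
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Sketch: declare a tensor with the usual builder chain. set_alignment()
// is an ASSUMED name for the new attribute -- #153 has the actual API.
void declare_aligned_tensor(fe::graph::Graph& graph) {
    auto X = graph.tensor(fe::graph::Tensor_attributes()
                              .set_name("X")
                              .set_dim({8, 64, 128})
                              .set_stride({64 * 128, 128, 1})
                              .set_data_type(fe::DataType_t::HALF));
    // Hypothetical: promise the device pointer backing X is 16-byte
    // aligned so the backend can select wider vectorized kernels.
    // X->set_alignment(16);
    (void)X;
}
```
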
- **Plan Generation Control**: Updated the `cudnnGetPlan` API to accept an optional maximum-plan-count parameter, enabling users to limit the number of plans built and autotuned; see the sketch below.

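The note names `cudnnGetPlan`; the closest step I can illustrate in the C++ frontend is `build_plans`, so the sketch below uses that and marks the maximum-plan-count argument as an assumed spelling of the new optional parameter.

```cpp
#include <cudnn_frontend.h>

namespace fe = cudnn_frontend;

// Sketch only. build_plans() is the existing C++ build entry point shown
// here; the capped variant below is an ASSUMED spelling of the new
// optional maximum-plan-count parameter.
void build_with_plan_cap(fe::graph::Graph& graph, cudnnHandle_t handle) {
    // Existing behavior: build every plan the chosen policy yields.
    graph.build_plans(handle, fe::BuildPlanPolicy_t::ALL);

    // Assumed new form: stop after at most five plans are built and
    // autotuned (exact signature may differ):
    // graph.build_plans(handle, fe::BuildPlanPolicy_t::ALL, /*max_plans=*/5);
}
```
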
- Updated [benchmark/sdpa_benchmark_training/benchmark_single_sdpa.py](benchmark/sdpa_benchmark_training/benchmark_single_sdpa.py) to use the correct parameter names and to fix the FLOPS calculation, giving accurate performance measurements.

Resolved GitHub issues:

- #153 - Tensor descriptor alignment support
- #173 - Block scale quantize Python bindings
- #177 - PyTorch dependency removal