This release introduces open-source implementations of commonly requested fused kernels for select architectures (Blackwell). These experimental kernels may require additional dependencies such as CuteDSL. The initial release includes:
- [GEMM + Amax](gemm_fusions/gemm_amax.md)
- [GEMM + SwiGLU](gemm_fusions/gemm_swiglu.md)
The optional dependencies can be installed with `pip install nvidia-cudnn-frontend[cutedsl]`. Usage examples and detailed documentation are available in the [test/python/fe_api](test/python/fe_api) directory.
Please file an issue for additional kernel requests or bug reports.
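For orientation, the sketch below is a plain NumPy reference of the math the two fused kernels compute in a single pass. It is only an illustration of the fusion patterns, not the frontend API; the actual entry points, argument names, and layout conventions are the ones documented under [test/python/fe_api](test/python/fe_api). In particular, the split of the GEMM output into gate and up halves for SwiGLU is an assumption about the usual convention.

```python
import numpy as np

def gemm_amax_reference(a, b):
    """Unfused reference for the GEMM + Amax pattern: a matmul whose
    epilogue also returns the absolute maximum of the output (commonly
    used to derive FP8 scaling factors)."""
    c = a @ b
    amax = np.max(np.abs(c))
    return c, amax

def gemm_swiglu_reference(a, b):
    """Unfused reference for the GEMM + SwiGLU pattern. Assumption: the
    GEMM output is split in half along the last dimension into a gate
    part and a linear ("up") part; SwiGLU = SiLU(gate) * up."""
    c = a @ b
    gate, up = np.split(c, 2, axis=-1)
    silu = gate / (1.0 + np.exp(-gate))  # SiLU(x) = x * sigmoid(x)
    return silu * up

# Example shapes; the fused kernels target Blackwell GPUs, but this
# reference runs anywhere.
a = np.random.randn(128, 256).astype(np.float32)
b = np.random.randn(256, 512).astype(np.float32)
c, amax = gemm_amax_reference(a, b)
out = gemm_swiglu_reference(a, b)
print(c.shape, float(amax), out.shape)  # (128, 512) ... (128, 256)
```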
- **Block Mask Support**: Starting with cuDNN 9.14.0, SDPA attributes now support block masks to exclude tiles that do not require computation (a plain-Python illustration of the block-mask semantics follows this list). Refer to the [sample implementation](samples/cpp/sdpa/fp16_fwd_with_block_mask.cpp) for usage details.
- **Bug Fix**: Resolved an invalid memory access (IMA) issue in SDPA backward propagation (fixed in cuDNN backend version 9.15.1 and later) that occurred when `s_kv` is not a multiple of 128, padding mask is disabled, and operations are performed in CUDA graph replay mode.
- **CUDA Graph Compatibility**: Added `BehaviorNote_t::CUDNN_BEHAVIOR_NOTE_CUBLASLT_DEPENDENCY` as a behavior note. This enables filtering of engine configurations (execution plans) that use cuBLAS as a backend, available starting with cuDNN version 9.15.0.
- **Block Scale Quantization**: Added Python bindings for block scale quantize operations ([#173](#173)). Refer to the [sample implementation](test/python/test_block_scale_quantize.py) for usage details.
- **Dependency Optimization**: PyTorch is no longer a required dependency for cuDNN Frontend ([#177](#177)).
- **Tensor Alignment**: Enhanced tensor descriptor API to accept alignment as an attribute ([#153](#153)).
- **Plan Generation Control**: Updated `cudnnGetPlan` API to accept an optional maximum plan count parameter, enabling users to limit the number of plans built and autotuned.
- Updated [benchmark/sdpa_benchmark_training/benchmark_single_sdpa.py](benchmark/sdpa_benchmark_training/benchmark_single_sdpa.py) to use correct parameter names and fixed FLOPS calculations for accurate performance measurements.
- [#153](#153) - Tensor descriptor alignment support
- [#173](#173) - Block scale quantize Python bindings
- [#177](#177) - PyTorch dependency removal
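As noted in the block mask item above, the idea is to mark whole (query-tile, key/value-tile) blocks as skippable so the kernel never computes them. The sketch below illustrates that semantics in plain NumPy by expanding a tile-level mask into an element-level mask before softmax; it is not the cuDNN frontend API (see the linked C++ sample for that), and the tile size of 64 is an arbitrary choice for the example.

```python
import numpy as np

def sdpa_with_block_mask_reference(q, k, v, block_mask, tile=64):
    """Reference semantics of a block mask in SDPA: block_mask has shape
    (num_q_tiles, num_kv_tiles); a False entry means the whole
    (tile x tile) score block is excluded from the computation."""
    s_q, d = q.shape
    s_kv = k.shape[0]
    scores = (q @ k.T) / np.sqrt(d)                       # (s_q, s_kv)

    # Expand the tile-level mask to element granularity.
    elem_mask = np.kron(block_mask, np.ones((tile, tile), dtype=bool))
    elem_mask = elem_mask[:s_q, :s_kv]
    scores = np.where(elem_mask, scores, -np.inf)

    # Row-wise softmax; fully masked rows would need extra care in practice.
    scores -= scores.max(axis=-1, keepdims=True)
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)
    return p @ v                                          # (s_q, d)

# Tiny example: 2x2 tiles of 64, with the upper-right tile excluded
# (a block-granular causal-style pattern).
q = np.random.randn(128, 64).astype(np.float32)
k = np.random.randn(128, 64).astype(np.float32)
v = np.random.randn(128, 64).astype(np.float32)
block_mask = np.array([[True, False],
                       [True, True]])
out = sdpa_with_block_mask_reference(q, k, v, block_mask)
print(out.shape)  # (128, 64)
```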
Sample benchmark output:

```
pyt_cudnn:: Median (fwd, bwd) Execution Times: 24.645 ms (1428 TFLOPS), 78.674 ms (1118 TFLOPS) (max difference vs. pyt_reference: 0.000000 from 10 iterations)
cudnn_fe:: Median (fwd, bwd) Execution Times: 24.543 ms (1434 TFLOPS), 73.210 ms (1201 TFLOPS) (max difference vs. pyt_reference: 0.000000 from 10 iterations)
cudnn_fe:: Median (fwd, bwd) Execution Times: 21.334 ms (1649 TFLOPS), 56.373 ms (1560 TFLOPS) (max difference vs. pyt_reference: 0.000000 from 10 iterations)
```
The cuDNN version used in the benchmark can be replaced by setting the `LD_LIBRARY_PATH` environment variable.
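As a sketch of that workflow, the snippet below launches the benchmark twice with different `LD_LIBRARY_PATH` values. It assumes hypothetical install locations for the two cuDNN builds and uses only the script path plus the `--sdpa_backend` and `--attn_mask` flags that appear in the diff below; any other arguments the script accepts are omitted.

```python
import os
import subprocess

SCRIPT = "benchmark/sdpa_benchmark_training/benchmark_single_sdpa.py"

# Hypothetical install locations for two cuDNN builds to compare;
# adjust to wherever the libraries actually live on your system.
cudnn_libs = [
    "/opt/cudnn-a/lib",
    "/opt/cudnn-b/lib",
]

for lib_dir in cudnn_libs:
    env = dict(os.environ)
    # Point the dynamic loader at the desired cuDNN build for this run.
    env["LD_LIBRARY_PATH"] = lib_dir + ":" + env.get("LD_LIBRARY_PATH", "")
    subprocess.run(
        [
            "python",
            SCRIPT,
            "--sdpa_backend", "cudnn_fe",  # backend name taken from the script's output above
            "--attn_mask", "no_mask",      # one of the choices added in the diff below
        ],
        env=env,
        check=True,
    )
```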
help="Attn mask to use. Can be 'padding_causal' or 'no_mask'. If padding_causal, is_causal must be set to false. Only works for cuDNN FE or PyTorch backends.",
68
-
choices=["padding_causal", "no_mask"],
67
+
help="Attn mask to use. Can be 'top_left', 'bottom_right', or 'no_mask'.",
68
+
choices=["top_left", "bottom_right", "no_mask"],
69
69
)
70
70
parser.add_argument(
71
71
"--sdpa_backend",
@@ -111,12 +111,6 @@
111
111
f"FP8 is only supported for cudnn_fe and flash_attention_3 backends"
112
112
)
113
113
114
-
ifargs.attn_mask=="padding_causal":
115
-
assertnotargs.is_causal, "Padding causal attn mask requires is_causal to be false"