v1.12.1 release
This release builds on the 1.12.0 release.
Bug fix
- Fixed an issue where d=256 was incorrectly marked as not supported on Hopper.
Minor Enhancements
- Addressed several comments from code review.
- Improved the CMake workflow. See PR 125.
Benchmark Results
- Published results of using the cuDNN backend for the default torch.sdpa op in comparison to other backends. See Llama-3.2-1B-Training for reference.
- Published results comparing sdpa() to other backends. See sdpa_benchmark_bf16_training.