Commits: 18 | Files Changed: 42 | +2,647 / -349 lines
Highlights
- Flexible Block Sizes for Attention: Decode and chunk-prefill attention kernels now support block sizes 16, 32, 64, 128, 192, and 256—enabling finer-grained KV-cache page configurations.
- PyTorch 2.11 Upgrade: Upgraded the project to PyTorch 2.11, keeping pace with upstream framework improvements.
- Host Memory Leak Fixes: Resolved two host memory leak issues in attention kernel launches and event handling.
- Chunk Prefill `softmax_lse` Return: The chunk prefill kernel now supports returning `softmax_lse` via partial template specialization, enabling features like merged attention states (see the sketch after this list).
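For context on the merged-attention-state use case, partial attention outputs computed over disjoint KV chunks can be combined using their softmax log-sum-exp values. A minimal host-side sketch of that merge (illustrative math only, not this project's kernel code; the helper name is hypothetical):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative host-side merge of two partial attention outputs (o1, o2),
// each computed over a disjoint slice of the KV cache, using their softmax
// log-sum-exp values (lse1, lse2). Hypothetical helper, not the kernel code.
void merge_attention_states(const std::vector<float>& o1, float lse1,
                            const std::vector<float>& o2, float lse2,
                            std::vector<float>& out, float& lse_out) {
  // Factor out the larger LSE for numerical stability.
  const float m = std::max(lse1, lse2);
  const float w1 = std::exp(lse1 - m);
  const float w2 = std::exp(lse2 - m);
  const float denom = w1 + w2;
  out.resize(o1.size());
  for (size_t i = 0; i < o1.size(); ++i)
    out[i] = (w1 * o1[i] + w2 * o2[i]) / denom;
  // Log-sum-exp of the merged softmax normalizer.
  lse_out = m + std::log(denom);
}
```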
New Features
Attention
- [Decode/Chunk Prefill] Enable 16/32/64×n block sizes (#308) — Extends paged-decode and chunk-prefill attention to support block sizes {16, 32, 64, 128, 192, 256}. Adds dedicated decode policy variants for `kv_tile=_16` and `kv_tile=_32`, and chunk-prefill policies with `TileShapeQK[1]=_16`. Routes all ≥64 block sizes through the `kv_tile=_64` policy to avoid a known cross-SG reduction bug in the `kv_tile=_128` path (see the dispatch sketch after this list).
- [Chunk Prefill] Enable return of `softmax_lse` with partial template specialization (#281) — Adds softmax log-sum-exp output support for chunk prefill, required for downstream merged attention state computation.
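A rough sketch of the block-size-to-policy routing described above (hypothetical names and structure; the actual dispatch lives in the kernel launch path):

```cpp
#include <stdexcept>

// Hypothetical illustration of the routing described above: map a KV-cache
// block (page) size to the decode kernel's kv_tile policy. Block sizes >= 64
// all share the kv_tile=64 policy, sidestepping the known cross-SG reduction
// issue in the kv_tile=128 path.
enum class KvTilePolicy { kTile16, kTile32, kTile64 };

inline KvTilePolicy select_kv_tile_policy(int block_size) {
  switch (block_size) {
    case 16:  return KvTilePolicy::kTile16;
    case 32:  return KvTilePolicy::kTile32;
    case 64:
    case 128:
    case 192:
    case 256: return KvTilePolicy::kTile64;  // >= 64 route through kv_tile=64
    default:  throw std::invalid_argument("unsupported KV-cache block size");
  }
}
```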
Sampling
- Add infinite check for `topk_topp_sampler_kernel` (#287) — Guards against infinite/NaN values in logits before top-k/top-p sampling, improving numerical robustness (a sketch of this kind of guard follows below).
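As an illustration of the kind of guard this adds (not the kernel source), non-finite logits can be clamped to finite values before the top-k/top-p pass:

```cpp
#include <cmath>
#include <limits>

// Illustrative guard (not the kernel source): clamp non-finite logits so that
// +inf cannot dominate the softmax and NaN cannot poison the cumulative
// top-p sum. NaN compares false with everything, so it falls through to lowest().
inline float sanitize_logit(float logit) {
  if (std::isfinite(logit)) return logit;
  return logit > 0.0f ? std::numeric_limits<float>::max()
                      : std::numeric_limits<float>::lowest();
}
```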
Bug Fixes
- Fix host memory leak in attention kernel launches (#298) — Resolved a leak caused by SYCL event objects not being properly released after attention kernel submissions.
- Remove `addEvent` to avoid potential host memory leak (#300) — Eliminated unnecessary event tracking that accumulated host memory over long-running serving sessions.
- Fix FMHA bug for interleaved KV cache (#258) — Corrected flash-MHA behavior when using an interleaved KV cache layout.
- Fix dynamic stride alignment check (#293) — Changed alignment validation from 32-byte to 16-byte alignment for dynamic stride in chunked prefill attention.
- Fix batch `moe_align_block_size` (#282) — Fixed random errors in the batched MoE block-size alignment kernel.
- Change `int` to `size_t` for `moe_align` kernel (#285) — Resolved potential integer overflow for large expert/token counts in MoE alignment (see the sketch after this list).
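The `int` → `size_t` change addresses a standard index-width overflow. A minimal sketch of the failure mode, under the assumption that the alignment arithmetic scales with tokens × top-k (illustrative only, not the kernel code):

```cpp
#include <cstddef>

// Illustrative only: with 32-bit indexing, the expanded token count
// (num_tokens * topk) and the padded capacity derived from it can exceed
// INT32_MAX for large batches and wrap around; 64-bit size_t keeps the
// arithmetic in range.
size_t padded_expert_capacity(size_t num_tokens, size_t topk,
                              size_t block_size) {
  size_t expanded = num_tokens * topk;  // would overflow a 32-bit int at scale
  return (expanded + block_size - 1) / block_size * block_size;
}
```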
Performance
- GEMM: Measure perf with event list (#286) — Switched GEMM benchmarking to use SYCL event lists for more accurate and reliable performance measurement (see the sketch after this list).
- Refactor vectorization (#295) — Introduced dynamic vector-size selection, replacing static vectorization for improved flexibility and performance across varying data shapes.
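For readers unfamiliar with event-based timing, the general pattern in SYCL looks roughly like this (a generic sketch, not the benchmark harness itself; it assumes a queue created with profiling enabled):

```cpp
#include <sycl/sycl.hpp>
#include <vector>

// Generic sketch: time kernel launches from device-side event timestamps
// rather than host-side timers. The queue must be constructed with
// sycl::property::queue::enable_profiling{} for the profiling queries to work.
double average_kernel_ms(sycl::queue& q, int iters) {
  std::vector<sycl::event> events;
  events.reserve(iters);
  for (int i = 0; i < iters; ++i) {
    events.push_back(q.submit([&](sycl::handler& h) {
      h.parallel_for(sycl::range<1>(1024), [=](sycl::id<1>) { /* kernel body */ });
    }));
  }
  q.wait();  // ensure all submissions have completed
  double total_ns = 0.0;
  for (auto& e : events) {
    auto t0 = e.get_profiling_info<sycl::info::event_profiling::command_start>();
    auto t1 = e.get_profiling_info<sycl::info::event_profiling::command_end>();
    total_ns += static_cast<double>(t1 - t0);
  }
  return total_ns / iters / 1.0e6;  // average milliseconds per launch
}
```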
Infrastructure & Build
- Upgrade to PyTorch 2.11 (#155) — Project-wide upgrade to PyTorch 2.11 with corresponding CI and build adjustments.
- Revert oneDNN to v3.11.2 (#310) — Rolled back oneDNN version to maintain stability.
- Benchmark GEMM kernel (#277) — Added dedicated GEMM kernel benchmarking infrastructure.
Testing
- Add Qwen3 on-demand test scope profiles (#301) — Added `qwen3_30b_a3b` and `qwen3_235b_a22b` model profiles to the on-demand test scope framework, covering MoE, attention, quantization, and activation kernels with model-specific shapes.
- Add mini scope for `test_gather_and_maybe_dequant_cache_mla` (#305) — Extended minimal test scope coverage for MLA cache gather/dequant.
- Update mini params of GDN (#290) — Tuned GDN test parameters for faster CI execution.
- Reduce test parameter sizes for faster execution (#302) — Trimmed test matrix sizes to reduce CI wall-clock time.
Contributors
Thanks to all 9 contributors for this release:
Kunshang Ji, Xinyu Chen, Qiming Zhang, Chaojun Zhang, Yizhou Wang, Yihua Xu, Yi Sheng, Zofia, Baodi
What's Changed
- benchmark gemm kernel by @xinyu-intel in #277
- fix batch moe align block size by @mayuyuace in #282
- Fix moe_align kernel bug by @mayuyuace in #285
- gemm: measure perf with event list by @xinyu-intel in #286
- Add infinite check for topk_topp_sampler_kernel by @mayuyuace in #287
- Fix host memory leak issue when launch the attn kernels by @ys950902 in #298
- Remove addEvent in the repo to avoid potential host memory leak by @ys950902 in #300
- Reduce test_cache.py test parameter sizes for faster execution. by @chaojun-zhang in #302
- [Refactor] refactor vectorization by @zufangzhu in #295
- [Test] add test scope for add qwen3 profile by @jikunshang in #301
- update mini params of gdn by @mayuyuace in #290
- Fix the fmha bug for interleaved_kv_cache by @yihuaxu in #258
- [Test]add mini scope for test_gather_and_maybe_dequant_cache_mla by @jikunshang in #305
- [ATTN] fix dynamic stride alignment check by @YizhouZ in #293
- [CHUNK_PREFILL] enable return softmax_lse with partially template by @YizhouZ in #281
- upgrade torch 2.11 by @jikunshang in #155
- [decode][attn] Enable 16/32/64xn block size for attention by @baodii in #308
- revert onednn to 3.11.2 by @jikunshang in #310
Full Changelog: v0.1.6...v0.1.7