
v0.1.7 release

@jikunshang released this 27 Apr 14:09 · c968ba9

Commits: 18 | Files Changed: 42 | +2,647 / -349 lines


Highlights

  • Flexible Block Sizes for Attention: Decode and chunk-prefill attention kernels now support block sizes 16, 32, 64, 128, 192, and 256—enabling finer-grained KV-cache page configurations.
  • PyTorch 2.11 Upgrade: Upgraded the project to PyTorch 2.11, keeping pace with upstream framework improvements.
  • Host Memory Leak Fixes: Resolved two host memory leak issues in attention kernel launches and event handling.
  • Chunk Prefill softmax_lse Return: Chunk prefill kernel now supports returning softmax_lse via partial template specialization, enabling features like merged attention states.

New Features

Attention

  • [Decode/Chunk Prefill] Enable 16/32/64×n block sizes (#308) — Extends paged-decode and chunk-prefill attention to support block sizes {16, 32, 64, 128, 192, 256}. Adds dedicated decode policy variants for kv_tile=_16 and kv_tile=_32, and chunk-prefill policies with TileShapeQK[1]=_16. Routes all ≥64 block sizes through the kv_tile=_64 policy to avoid a known cross-SG reduction bug in the kv_tile=_128 path; a routing sketch follows this list.
  • [Chunk Prefill] Enable return softmax_lse with partial template (#281) — Adds softmax log-sum-exp output support for chunk prefill, required for downstream merged attention state computation (see the merge sketch after this list).
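
The routing in #308 maps a KV-cache block size to the kv_tile policy that serves it. A minimal Python sketch of that mapping, assuming a hypothetical select_decode_policy helper and illustrative policy names (the real dispatch happens through C++ template specializations in the kernels):

```python
# Illustrative sketch only: the helper and policy names are hypothetical
# stand-ins for the C++ template policies selected by the kernels.
SUPPORTED_BLOCK_SIZES = (16, 32, 64, 128, 192, 256)

def select_decode_policy(block_size: int) -> str:
    """Map a KV-cache block size to the kv_tile decode policy that serves it."""
    if block_size not in SUPPORTED_BLOCK_SIZES:
        raise ValueError(f"unsupported block size: {block_size}")
    if block_size == 16:
        return "kv_tile_16"  # dedicated small-tile decode variant
    if block_size == 32:
        return "kv_tile_32"  # dedicated small-tile decode variant
    # Every block size >= 64 is routed through the kv_tile=_64 policy,
    # avoiding the known cross-SG reduction issue in the kv_tile=_128 path.
    return "kv_tile_64"
```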
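
For the softmax_lse output in #281, the typical downstream use is merging attention computed over separate KV chunks into a single result. A minimal PyTorch sketch of that merge with illustrative tensor shapes (this is the standard log-sum-exp merge, not the library's exact kernel):

```python
import torch

def merge_attention_states(o1, lse1, o2, lse2):
    """Merge partial attention outputs over two disjoint KV chunks.

    o1, o2:     [num_heads, head_dim] partial attention outputs
    lse1, lse2: [num_heads] softmax log-sum-exp returned for each chunk
    """
    max_lse = torch.maximum(lse1, lse2)
    w1 = torch.exp(lse1 - max_lse)           # per-head rescale factors
    w2 = torch.exp(lse2 - max_lse)
    denom = w1 + w2
    merged = (o1 * w1.unsqueeze(-1) + o2 * w2.unsqueeze(-1)) / denom.unsqueeze(-1)
    merged_lse = max_lse + torch.log(denom)  # lse of the combined softmax
    return merged, merged_lse
```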

Sampling

  • Add infinite check for topk_topp_sampler_kernel (#287) — Guards against infinite/NaN values in logits before top-k/top-p sampling, improving numerical robustness; a sketch of this kind of guard follows below.
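
A minimal PyTorch sketch of the kind of guard #287 adds, assuming non-finite logits are clamped to the dtype's finite range before sampling (the replacement policy here is illustrative; the actual check runs inside the sampling kernel):

```python
import torch

def sanitize_logits(logits: torch.Tensor) -> torch.Tensor:
    """Replace NaN/inf logits so top-k/top-p selection and the softmax stay finite."""
    finfo = torch.finfo(logits.dtype)
    # NaN and -inf become the most negative representable value (effectively
    # zero probability); +inf is clamped to the largest finite value.
    return torch.nan_to_num(logits, nan=finfo.min, posinf=finfo.max, neginf=finfo.min)
```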

Bug Fixes

  • Fix host memory leak in attention kernel launches (#298) — Resolved a leak caused by SYCL event objects not being properly released after attention kernel submissions.
  • Remove addEvent to avoid potential host memory leak (#300) — Eliminated unnecessary event tracking that accumulated host memory over long-running serving sessions.
  • Fix FMHA bug for interleaved KV cache (#258) — Corrected flash-MHA behavior when using an interleaved KV-cache layout.
  • Fix dynamic stride alignment check (#293) — Changed the dynamic-stride alignment check in chunked-prefill attention from 32-byte to 16-byte alignment.
  • Fix batch moe_align_block_size (#282) — Fixed intermittent errors in the batched MoE block-size alignment kernel.
  • Change int to size_t for moe_align kernel (#285) — Resolved potential integer overflow for large expert/token counts in MoE alignment.

Performance

  • GEMM: Measure perf with event list (#286) — Switched GEMM benchmarking to use SYCL event lists for more accurate and reliable performance measurement.
  • Refactor vectorization (#295) — Introduced dynamic vector-size selection, replacing static vectorization for improved flexibility and performance across varying data shapes; a sketch of the selection idea follows this list.
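
The idea behind the vectorization refactor in #295 is to choose the vector width per launch instead of fixing it at compile time. A minimal sketch of such a selection rule, where the candidate widths and the alignment criterion are assumptions rather than the kernels' exact logic:

```python
def pick_vector_size(num_elements: int, byte_offset: int, element_bytes: int) -> int:
    """Pick the widest vector access that the row length and alignment permit.

    Candidate widths (8/4/2/1 elements) and the alignment rule are illustrative;
    the kernels make an equivalent per-launch choice instead of a static one.
    """
    for vec in (8, 4, 2, 1):
        if num_elements % vec == 0 and byte_offset % (vec * element_bytes) == 0:
            return vec
    return 1
```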

Infrastructure & Build

  • Upgrade to PyTorch 2.11 (#155) — Project-wide upgrade to PyTorch 2.11 with corresponding CI and build adjustments.
  • Revert oneDNN to v3.11.2 (#310) — Rolled back oneDNN version to maintain stability.
  • Benchmark GEMM kernel (#277) — Added dedicated GEMM kernel benchmarking infrastructure.

Testing

  • Add Qwen3 on-demand test scope profiles (#301) — Added qwen3_30b_a3b and qwen3_235b_a22b model profiles to the on-demand test scope framework, covering MoE, attention, quantization, and activation kernels with model-specific shapes.
  • Add mini scope for test_gather_and_maybe_dequant_cache_mla (#305) — Extended minimal test scope coverage for MLA cache gather/dequant.
  • Update mini params of GDN (#290) — Tuned GDN test parameters for faster CI execution.
  • Reduce test parameter sizes for faster execution (#302) — Trimmed test matrix sizes to reduce CI wall-clock time.

Contributors

Thanks to all 9 contributors for this release:

Kunshang Ji, Xinyu Chen, Qiming Zhang, Chaojun Zhang, Yizhou Wang, Yihua Xu, Yi Sheng, Zofia, Baodi

Full Changelog: v0.1.6...v0.1.7