
v0.1.7 release

@jikunshang released this 27 Apr 14:09 · c968ba9

Commits: 18 | Files Changed: 42 | +2,647 / -349 lines


Highlights

  • Flexible Block Sizes for Attention: Decode and chunk-prefill attention kernels now support block sizes 16, 32, 64, 128, 192, and 256—enabling finer-grained KV-cache page configurations.
  • PyTorch 2.11 Upgrade: Upgraded the project to PyTorch 2.11, keeping pace with upstream framework improvements.
  • Host Memory Leak Fixes: Resolved two host memory leak issues in attention kernel launches and event handling.
  • Chunk Prefill softmax_lse Return: Chunk prefill kernel now supports returning softmax_lse via partial template specialization, enabling features like merged attention states.

New Features

Attention

  • [Decode/Chunk Prefill] Enable 16/32/64×n block sizes (#308) — Extends paged-decode and chunk-prefill attention to support block sizes {16, 32, 64, 128, 192, 256}. Adds dedicated decode policy variants for kv_tile=_16 and kv_tile=_32, and chunk-prefill policies with TileShapeQK[1]=_16. Routes all ≥64 block sizes through the kv_tile=_64 policy to avoid a known cross-SG reduction bug in the kv_tile=_128 path; a routing sketch follows this list.
  • [Chunk Prefill] Enable return softmax_lse with partial template (#281) — Adds softmax log-sum-exp output support for chunk prefill, required for downstream merged attention state computation (see the merge sketch after this list).
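
The routing in #308 maps a KV-cache block size to the kv_tile policy that serves it. A minimal Python sketch of that mapping, assuming a hypothetical select_decode_policy helper and illustrative policy names (the real dispatch happens through C++ template specializations in the kernels):

```python
# Illustrative sketch only: the helper and policy names are hypothetical
# stand-ins for the C++ template policies selected by the kernels.
SUPPORTED_BLOCK_SIZES = (16, 32, 64, 128, 192, 256)

def select_decode_policy(block_size: int) -> str:
    """Map a KV-cache block size to the kv_tile decode policy that serves it."""
    if block_size not in SUPPORTED_BLOCK_SIZES:
        raise ValueError(f"unsupported block size: {block_size}")
    if block_size == 16:
        return "kv_tile_16"  # dedicated small-tile decode variant
    if block_size == 32:
        return "kv_tile_32"  # dedicated small-tile decode variant
    # Every block size >= 64 is routed through the kv_tile=_64 policy,
    # avoiding the known cross-SG reduction issue in the kv_tile=_128 path.
    return "kv_tile_64"
```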
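
For the softmax_lse output in #281, the typical downstream use is merging attention computed over separate KV chunks into a single result. A minimal PyTorch sketch of that merge with illustrative tensor shapes (this is the standard log-sum-exp merge, not the library's exact kernel):

```python
import torch

def merge_attention_states(o1, lse1, o2, lse2):
    """Merge partial attention outputs over two disjoint KV chunks.

    o1, o2:     [num_heads, head_dim] partial attention outputs
    lse1, lse2: [num_heads] softmax log-sum-exp returned for each chunk
    """
    max_lse = torch.maximum(lse1, lse2)
    w1 = torch.exp(lse1 - max_lse)           # per-head rescale factors
    w2 = torch.exp(lse2 - max_lse)
    denom = w1 + w2
    merged = (o1 * w1.unsqueeze(-1) + o2 * w2.unsqueeze(-1)) / denom.unsqueeze(-1)
    merged_lse = max_lse + torch.log(denom)  # lse of the combined softmax
    return merged, merged_lse
```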

Sampling

  • Add infinite check for topk_topp_sampler_kernel (#287) — Guards against infinite/NaN values in logits before top-k/top-p sampling, improving numerical robustness; a sketch of this kind of guard follows below.
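
A minimal PyTorch sketch of the kind of guard #287 adds, assuming non-finite logits are clamped to the dtype's finite range before sampling (the replacement policy here is illustrative; the actual check runs inside the sampling kernel):

```python
import torch

def sanitize_logits(logits: torch.Tensor) -> torch.Tensor:
    """Replace NaN/inf logits so top-k/top-p selection and the softmax stay finite."""
    finfo = torch.finfo(logits.dtype)
    # NaN and -inf become the most negative representable value (effectively
    # zero probability); +inf is clamped to the largest finite value.
    return torch.nan_to_num(logits, nan=finfo.min, posinf=finfo.max, neginf=finfo.min)
```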

Bug Fixes

  • Fix host memory leak in attention kernel launches (#298) — Resolved a leak caused by SYCL event objects not being properly released after attention kernel submissions.
  • Remove addEvent to avoid potential host memory leak (#300) — Eliminated unnecessary event tracking that accumulated host memory over long-running serving sessions.
  • Fix FMHA bug for interleaved KV cache (#258) — Corrected flash-MHA behavior when using an interleaved KV-cache layout.
  • Fix dynamic stride alignment check (#293) — Changed the dynamic-stride alignment check in chunked-prefill attention from 32-byte to 16-byte alignment.
  • Fix batch moe_align_block_size (#282) — Fixed intermittent errors in the batched MoE block-size alignment kernel.
  • Change int to size_t for moe_align kernel (#285) — Resolved potential integer overflow for large expert/token counts in MoE alignment.

Performance

  • GEMM: Measure perf with event list (#286) — Switched GEMM benchmarking to use SYCL event lists for more accurate and reliable performance measurement.
  • Refactor vectorization (#295) — Introduced dynamic vector-size selection, replacing static vectorization for improved flexibility and performance across varying data shapes; a sketch of the selection idea follows this list.
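
The idea behind the vectorization refactor in #295 is to choose the vector width per launch instead of fixing it at compile time. A minimal sketch of such a selection rule, where the candidate widths and the alignment criterion are assumptions rather than the kernels' exact logic:

```python
def pick_vector_size(num_elements: int, byte_offset: int, element_bytes: int) -> int:
    """Pick the widest vector access that the row length and alignment permit.

    Candidate widths (8/4/2/1 elements) and the alignment rule are illustrative;
    the kernels make an equivalent per-launch choice instead of a static one.
    """
    for vec in (8, 4, 2, 1):
        if num_elements % vec == 0 and byte_offset % (vec * element_bytes) == 0:
            return vec
    return 1
```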

Infrastructure & Build

  • Upgrade to PyTorch 2.11 (#155) — Project-wide upgrade to PyTorch 2.11 with corresponding CI and build adjustments.
  • Revert oneDNN to v3.11.2 (#310) — Rolled back oneDNN version to maintain stability.
  • Benchmark GEMM kernel (#277) — Added dedicated GEMM kernel benchmarking infrastructure.

Testing

  • Add Qwen3 on-demand test scope profiles (#301) — Added qwen3_30b_a3b and qwen3_235b_a22b model profiles to the on-demand test scope framework, covering MoE, attention, quantization, and activation kernels with model-specific shapes.
  • Add mini scope for test_gather_and_maybe_dequant_cache_mla (#305) — Extended minimal test scope coverage for MLA cache gather/dequant.
  • Update mini params of GDN (#290) — Tuned GDN test parameters for faster CI execution.
  • Reduce test parameter sizes for faster execution (#302) — Trimmed test matrix sizes to reduce CI wall-clock time.

Contributors

Thanks to all 9 contributors for this release:

Kunshang Ji, Xinyu Chen, Qiming Zhang, Chaojun Zhang, Yizhou Wang, Yihua Xu, Yi Sheng, Zofia, Baodi

Full Changelog: v0.1.6...v0.1.7