
Releases: vllm-project/vllm-xpu-kernels

v0.1.6 release

20 Apr 05:38
e095a4a


Highlights

  • Fused Kernels for torch.compile: New fuse_norm_quant, fuse_act_quant, and fused_qk_norm_rope kernels enabling norm-quantization and activation-quantization fusion under torch.compile.
  • MXFP8 / MXFP4 oneDNN GEMM: Added microscaling FP8 and FP4 GEMM support via oneDNN, broadening low-precision inference capabilities.
  • Flash Attention head_dim=512: Extended flash-MHA support to head dimension 512.
  • Chunked Prefill dynamic stride: Added dynamic stride support for chunked prefill attention, improving flexibility for variable-length workloads.

New Features

Attention

  • [FMHA] Support head dimension 512 (#251) — Extends flash attention to models using 512-dim heads.
  • [Chunk Prefill] Add dynamic stride support (#187) — Enables dynamic stride in chunked prefill for variable-length input sequences.
  • [Decode Attention] Tune num_kv_splits for paged decode kernel (#257) — Improves decode-stage attention performance via better KV-split tuning.

Activation

  • Add fatrelu_and_mul (#259) — New FATReLU fused activation kernel.
  • Support relu2_no_mul (SYCL) for Nemotron-3-Nano-30B-A3B-bf16 (#232) — Squared-ReLU activation variant without the gating multiply.
  • Support swiglustep_and_mul for Step-3.5-Flash (#199) — New SwiGLU-Step fused activation variant.
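
As a rough illustration of the semantics (the actual kernel signature lives in the repo; the `_ref` name and this NumPy formulation are ours), `fatrelu_and_mul` splits the last dimension in half, applies a thresholded ReLU (FATReLU) to the gate half, and multiplies by the other half:

```python
import numpy as np

def fatrelu_and_mul_ref(x: np.ndarray, threshold: float = 0.0) -> np.ndarray:
    # Split the last dimension in half: FATReLU (values at or below the
    # threshold become zero) on the first half, multiplied elementwise
    # by the second half.
    d = x.shape[-1] // 2
    gate, up = x[..., :d], x[..., d:]
    return np.where(gate > threshold, gate, 0.0) * up
```

With `threshold=0.0` this reduces to the familiar ReLU-and-mul gating; the fused kernel avoids materializing the intermediate activation.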

Quantization & Low-Precision

  • [OneDNN] Add MXFP8 and MXFP4 GEMM (#235) — Microscaling FP8/FP4 GEMM via oneDNN for next-gen low-precision inference.
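
A minimal sketch of the microscaling idea behind MXFP8, following the OCP MX convention of 32-element blocks sharing one power-of-two scale (the function name is ours, and per-element rounding onto the fp8 grid is omitted for brevity):

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in fp8 e4m3

def mx_block_quantize(x: np.ndarray, block: int = 32):
    # Each block of 32 consecutive values shares one power-of-two
    # (E8M0-style) scale, chosen so the block maximum fits inside the
    # fp8 e4m3 range. Element rounding onto the fp8 grid is omitted.
    xb = x.reshape(-1, block)
    amax = np.abs(xb).max(axis=1, keepdims=True)
    scale = 2.0 ** np.ceil(np.log2(np.maximum(amax, 1e-30) / FP8_E4M3_MAX))
    q = np.clip(xb / scale, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, scale
```

Because scales are restricted to powers of two, dequantization is a cheap exponent shift, which is what makes MX formats attractive for GEMM inner loops.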

Fusion (torch.compile)

  • Add fuse_norm_quant, fuse_act_quant, and fused_qk_norm_rope kernels (#267) — Fused normalization+quantization and QK-norm+RoPE kernels, registered as custom ops compatible with torch.compile.
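
To show what the norm+quant fusion computes (the real op names and signatures are defined in #267; this unfused NumPy reference is only illustrative): RMSNorm over the hidden dimension followed by dynamic scaling into the fp8 range, which the fused kernel performs in a single pass:

```python
import numpy as np

def rms_norm_then_quant_ref(x, weight, eps=1e-6, fp8_max=448.0):
    # The fusion's math, written as two unfused steps: RMSNorm over the
    # last dimension, then dynamic per-tensor scaling of the normalized
    # activations into the fp8 e4m3 range.
    rms = np.sqrt((x * x).mean(axis=-1, keepdims=True) + eps)
    y = x / rms * weight
    scale = np.abs(y).max() / fp8_max
    return np.clip(y / scale, -fp8_max, fp8_max), scale
```

Registering the fused version as a custom op lets torch.compile treat it as a single node instead of breaking the graph at the norm/quant boundary.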

MoE (Mixture of Experts)

  • Add topk=10 for remap_hidden_states kernel (#273) — Extends remap kernel to support topk=10 routing.
  • Optimize MoE GEMM (#266) — Performance improvements for MoE grouped GEMM.

Cache / Memory

  • Add swap_blocks_batch op with batched async memcpy (#265) — New batched block-swap operation using asynchronous memory copy for improved KV-cache management.
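
Functionally, a batched block swap is equivalent to the following loop (names and shapes here are illustrative, not the op's API; the kernel replaces the host-side loop with one batch of asynchronous memcpys):

```python
import numpy as np

def swap_blocks_batch_ref(src_cache, dst_cache, block_mapping):
    # Copy each mapped KV-cache block from the source cache to the
    # destination cache. The real op issues these copies as a single
    # batch of async memcpys rather than a per-block host loop.
    for src_block, dst_block in block_mapping:
        dst_cache[dst_block] = src_cache[src_block]
```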

LoRA

  • Add mixed precision support for LoRA expand & shrink kernels (#230) — Enables mixed-precision (e.g., bf16/fp32) LoRA adapters.
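
The mixed-precision contract can be sketched as follows (a NumPy reference, not the kernel API): half-precision activations and adapter weights, an fp32 low-rank intermediate, and a cast back to the output dtype:

```python
import numpy as np

def lora_shrink_expand_ref(x, lora_a, lora_b, out, scaling=1.0):
    # Mixed-precision sketch: inputs may be half precision, but the
    # low-rank intermediate is accumulated in fp32 (shrink) before
    # being projected back and cast to the output dtype (expand).
    buf = x.astype(np.float32) @ lora_a.astype(np.float32).T    # shrink: [*, rank]
    upd = scaling * (buf @ lora_b.astype(np.float32).T)         # expand: [*, out_dim]
    out += upd.astype(out.dtype)
    return out
```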

Model Support

  • Support f32 ssm_state in GDN kernel for Qwen3.5 (#220) — Enables fp32 SSM state for Qwen3.5 Mamba-style models.
  • Follow upstream to change A_log dtype to fp32 for Qwen3.5 (#254).

Bug Fixes

  • Fix overflow of remap_hidden_states when the row count is very large (#269) — Resolved an integer overflow in large batch/sequence scenarios.
  • Fix XPU CPU-view tensor lifetime (#262) — Fixed use-after-free issue with CPU-view tensors on XPU.
  • Skip scales check (#256) — Fixed spurious validation failure in quantized GEMM path.
  • Skip GDN core_attn_out check for 8k length due to random numeric error (#264, #261).

Performance

  • Optimize MoE GEMM (#266) — Tuned grouped GEMM policies for better throughput.
  • Tune num_kv_splits for paged decode kernel (#257) — Improved decode attention latency.

Infrastructure & Build

  • Refactor CMake to enable selective kernel build (#260) — Allows building only a subset of kernels, reducing compile time and binary size.
  • Upgrade oneDNN to v3.11.2 (#248).
  • Use local LRU cache for oneDNN primitive caching (#275).
  • Show binary size in pre-CI (#268).
  • Update SCM version check and project Python version (#274).
  • Remove yapf (#272) — Dropped yapf formatter from the project.
  • Add psutil to pyproject.toml (#255).
  • Fix MoE benchmark (#279).

Testing

  • Refine test scope definition (#250) — Improved test profiling and scope control framework.
  • New tests: test_fused_norm_quant, test_fused_qk_norm_rope, test_fused_quant_activation, test_swiglustep_and_mul, test_fp4_gemm_onednn, test_cache (swap_blocks_batch), test_lora_ops (mixed precision).

Contributors

Thanks to all 11 contributors for this release:

Kunshang Ji, Xinyu Chen, Qiming Zhang, Qun Yang, Chaojun Zhang, Zofia, Zhefeng Qiao, Yejing Lai, Yizhou Wang, Liuzhenwei, Baodi


Full Changelog: v0.1.5...v0.1.6

v0.1.5 release

03 Apr 04:01
1683b76


v0.1.5 Release Notes

This release delivers major kernel and runtime updates for Intel XPU, with a focus on MLA path coverage, quantization support, MoE performance, and CI/build stability.

Highlights

  • Added MLA kernels:
    • merge_attn_states
    • gather_and_maybe_dequant_cache
  • Improved MLA decode flexibility with support for arbitrary KV cache strides in paged decode.
  • Added/extended quantization and cache kernels:
    • FP8 w8a16 GEMM
    • MXFP4 block quant kernel
    • indexer_k_quant_and_cache and cp_gather_indexer_k_quant_cache
  • Added new kernels/features:
    • SYCL topk_per_row
    • topk_topp sampler
    • EPLB enabling kernels
  • Performance and optimization updates:
    • MoE remap kernel optimization
    • Chunk prefill tuning
    • Vectorized act-and-mul kernels
  • Runtime and API improvements:
    • Customized memory allocator for vLLM sleep mode
    • Added mem_cpy Python API
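
For context on merge_attn_states: it implements the standard numerically stable combine of two partial attention outputs over disjoint key ranges, reweighting by their log-sum-exp terms. A NumPy reference (argument names and shapes are illustrative):

```python
import numpy as np

def merge_attn_states_ref(out_a, lse_a, out_b, lse_b):
    # Combine two partial softmax-attention outputs computed over
    # disjoint key ranges, using their log-sum-exp values as weights
    # (the usual flash-attention merge, stabilized by the shared max).
    m = np.maximum(lse_a, lse_b)
    wa = np.exp(lse_a - m)
    wb = np.exp(lse_b - m)
    return (out_a * wa[..., None] + out_b * wb[..., None]) / (wa + wb)[..., None]
```

This is the same combine step that paged/chunked MLA decode relies on when attention is split across KV partitions.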

Fixes

  • Bugfix: updated binding signature for rms_norm.
  • Decode attention: adjusted num_splits strategy to avoid accuracy issues.
  • Platform workaround: route XE3/XE3P to XE2 CUTLASS kernels.
  • CI/build fixes:
    • oneDNN version compatibility fix
    • manylinux builder pinning
    • improved job estimation to reduce OOM risk

Developer Experience

  • Added prebuilt wheel install path for faster development setup.
  • Added/updated tests for pinned-memory swap blocks and indexer_k_quant_and_cache.
  • Refreshed benchmark coverage for flash attention and fused MoE.
  • Upgraded SYCL-TLA dependency revision.

Potentially Breaking / Behavior Changes

  • Removed xpu_fused_moe weights handling; downstream integrations relying on previous behavior should verify compatibility.

Included PRs (since v0.1.4)

#64, #139, #163, #165, #174, #176, #182, #188, #191, #193, #194, #195, #198, #201, #203, #204, #207, #209, #210, #211, #213, #215, #216, #219, #226, #227, #228, #233, #239, #240, #245, #246


Full Changelog: v0.1.4...v0.1.5

v0.1.4 Release

20 Mar 04:56



Full Changelog: v0.1.3...v0.1.4

v0.1.3 release

04 Mar 01:59
9c8616f



Full Changelog: v0.1.2...v0.1.3

v0.1.2 release

11 Feb 00:04
e7dee22



Full Changelog: v0.1.1...v0.1.2

v0.1.1 release

03 Feb 04:55
b38d248


Pre-release

Several fixes on top of v0.1.0.


Full Changelog: v0.1.0...v0.1.1

v0.1.0

29 Jan 09:08
b505e23


Pre-release

We’re excited to announce the first release of vllm-xpu-kernels!
This release includes core kernels migrated and reimplemented from IPEX. Note that this is a pre-production release.
