## Highlights

- **Fused kernels for `torch.compile`**: New `fuse_norm_quant`, `fuse_act_quant`, and `fused_qk_norm_rope` kernels enable norm-quantization and activation-quantization fusion under `torch.compile`.
- **MXFP8 / MXFP4 oneDNN GEMM**: Added microscaling FP8 and FP4 GEMM support via oneDNN, broadening low-precision inference capabilities.
- **Flash Attention `head_dim=512`**: Extended flash-MHA support to head dimension 512.
- **Chunked prefill dynamic stride**: Added dynamic stride support for chunked prefill attention, improving flexibility for variable-length workloads.
## New Features

### Attention

- [FMHA] Support head dimension 512 (#251) — Extends flash attention to models using 512-dim heads.
- [Chunk Prefill] Add dynamic stride support (#187) — Enables dynamic stride in chunked prefill for variable-length input sequences.
- [Decode Attention] Tune `num_kv_splits` for the paged decode kernel (#257) — Improves decode-stage attention performance via better KV-split tuning.
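For readers unfamiliar with KV splitting: in decode, one query token attends over a long paged KV cache, so the kernel partitions the KV sequence across workgroups and reduces the partial results. The heuristic below is a hypothetical illustration of how a split count might be chosen (the actual tuning in #257 lives inside the kernel and may differ):

```python
import math

def choose_num_kv_splits(seq_len: int, block_size: int = 64,
                         max_splits: int = 16,
                         min_blocks_per_split: int = 4) -> int:
    """Hypothetical heuristic: more splits expose more parallelism
    for a single decode query, but each split should still process
    at least a few KV blocks to amortize the final reduction."""
    num_blocks = math.ceil(seq_len / block_size)
    splits = max(1, num_blocks // min_blocks_per_split)
    return min(splits, max_splits)
```

Short contexts get a single split (no reduction overhead), while long contexts saturate at `max_splits`.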
### Activation

- Add `fatrelu_and_mul` (#259) — New fused FATReLU activation kernel.
- Support `relu2_no_mul` (SYCL) for Nemotron-3-Nano-30B-A3B-bf16 (#232).
- Support `swiglustep_and_mul` for Step-3.5-Flash (#199) — New fused SwiGLU-Step activation variant.
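As a reference for what an `_and_mul` activation computes: the last dimension is split in half into a gate and an up-projection, the activation is applied to the gate, and the two halves are multiplied elementwise. The sketch below uses that common layout for FATReLU (a thresholded ReLU); the layout and default threshold are assumptions, not taken from this release's SYCL source:

```python
def fatrelu_and_mul(x, threshold=0.05):
    """Pure-Python reference for a gated FATReLU (sketch).
    First half of x is the gate, second half the up-projection;
    FATReLU zeroes gate values at or below `threshold`."""
    d = len(x) // 2
    gate, up = x[:d], x[d:]
    return [(g if g > threshold else 0.0) * u for g, u in zip(gate, up)]
```

The fused kernel computes the same thing in one pass instead of materializing the activated gate as a separate tensor.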
### Quantization & Low-Precision

- [OneDNN] Add MXFP8 and MXFP4 GEMM (#235) — Microscaling FP8/FP4 GEMM via oneDNN for next-generation low-precision inference.
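For context, "microscaling" (the OCP MX formats) means groups of 32 elements share one power-of-two scale, so each element only needs a narrow FP8/FP4 payload. A rough sketch of the MXFP8 (E4M3) quantization recipe, with element rounding to the exact FP8 bit grid elided:

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in OCP FP8 E4M3
E4M3_EMAX = 8         # exponent of the top E4M3 binade (448 = 1.75 * 2**8)

def mx_quantize_block(block):
    """One 32-element MX block: a shared power-of-two (E8M0-style)
    scale is derived from the block max, then elements are scaled and
    clamped to the E4M3 range. Sketch only: real kernels also round
    each element onto the FP8 mantissa grid."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0.0, [0.0] * len(block)
    scale = 2.0 ** (math.floor(math.log2(amax)) - E4M3_EMAX)
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in block]
    return scale, q

def mx_dequantize(scale, q):
    return [scale * v for v in q]
```

MXFP4 follows the same shape with an E2M1 element format and a correspondingly smaller element range.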
### Fusion (`torch.compile`)

- Add `fuse_norm_quant`, `fuse_act_quant`, and `fused_qk_norm_rope` kernels (#267) — Fused normalization+quantization and QK-norm+RoPE kernels, registered as custom ops compatible with `torch.compile`.
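To make the payoff of norm+quant fusion concrete, here is a pure-Python reference for the combined computation done in one pass. The choice of RMSNorm and symmetric int8 quantization is an assumption for illustration; the release does not spell out the kernels' exact norm or quant formats:

```python
import math

def fused_rmsnorm_quant(x, weight, eps=1e-6, qmax=127):
    """Sketch of a fused norm+quant: RMS-normalize, apply the
    per-channel weight, then symmetrically quantize to int8 in the
    same pass. Fusing avoids writing the normalized fp tensor to
    memory between the two ops."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    normed = [v / rms * w for v, w in zip(x, weight)]
    scale = max(abs(v) for v in normed) / qmax or 1.0
    q = [round(v / scale) for v in normed]
    return q, scale
```

Under `torch.compile`, registering the fused version as a custom op lets the compiler treat it as a single node instead of a norm kernel followed by a quant kernel.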
### MoE (Mixture of Experts)

- Add `topk=10` for the `remap_hidden_states` kernel (#273) — Extends the remap kernel to support top-10 routing.
- Optimize MoE GEMM (#266) — Performance improvements for MoE grouped GEMM.
### Cache / Memory

- Add `swap_blocks_batch` op with batched async memcpy (#265) — New batched block-swap operation using asynchronous memory copies for improved KV-cache management.
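One reason batching block swaps helps: many small (src, dst) KV-block copies can be grouped so the backend issues fewer, larger async memcpys. The coalescing below is a hypothetical illustration of that idea, not the op's actual implementation:

```python
def coalesce_block_pairs(pairs):
    """Merge (src, dst) block-copy pairs into contiguous runs.
    Adjacent pairs whose source and destination both advance by one
    block can be served by a single larger copy. Returns a list of
    (src_start, dst_start, num_blocks) runs."""
    runs = []
    for src, dst in sorted(pairs):
        if runs and runs[-1][0] + runs[-1][2] == src \
                and runs[-1][1] + runs[-1][2] == dst:
            runs[-1][2] += 1            # extend the contiguous run
        else:
            runs.append([src, dst, 1])  # start a new run
    return [tuple(r) for r in runs]
```

Even when runs cannot be merged, submitting all copies in one batched call amortizes per-copy launch overhead versus one op invocation per block.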
### LoRA
- Add mixed precision support for LoRA expand & shrink kernels (#230) — Enables mixed-precision (e.g., bf16/fp32) LoRA adapters.
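For reference, "shrink" and "expand" are the two halves of the LoRA update `y += scaling * (x @ A) @ B`, where A is d×r and B is r×k. The pure-Python sketch below shows only the math; "mixed precision" in the actual kernels refers to dtype combinations such as bf16 activations with higher-precision accumulation, which plain Python floats cannot model:

```python
def lora_apply(x, A, B, scaling=1.0):
    """Reference LoRA math (sketch): shrink x through the low-rank
    A (d x r), then expand through B (r x k), scaled by `scaling`."""
    r = len(A[0])
    # shrink: h = x @ A
    h = [sum(x[i] * A[i][j] for i in range(len(x))) for j in range(r)]
    # expand: y = scaling * (h @ B)
    return [scaling * sum(h[j] * B[j][k] for j in range(r))
            for k in range(len(B[0]))]
```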
### Model Support

- Support f32 `ssm_state` in the GDN kernel for Qwen3.5 (#220) — Enables fp32 SSM state for Qwen3.5 Mamba-style models.
- Follow upstream in changing the `A_log` dtype to fp32 for Qwen3.5 (#254).
## Bug Fixes

- Fix overflow of `remap_hidden_states` when the row count is huge (#269) — Resolved an integer overflow in large batch/sequence scenarios.
- Fix XPU CPU-view tensor lifetime (#262) — Fixed a use-after-free with CPU-view tensors on XPU.
- Skip scales check (#256) — Fixed a spurious validation failure in the quantized GEMM path.
- Skip the GDN `core_attn_out` check at 8k sequence length due to random numeric error (#264, #261).
## Performance

- Optimize MoE GEMM (#266) — Tuned grouped GEMM policies for better throughput.
- Tune `num_kv_splits` for the paged decode kernel (#257) — Improved decode attention latency.
## Infrastructure & Build
- Refactor CMake to enable selective kernel build (#260) — Allows building only a subset of kernels, reducing compile time and binary size.
- Upgrade oneDNN to v3.11.2 (#248).
- Use local LRU cache for oneDNN primitive caching (#275).
- Show binary size in pre-CI (#268).
- Update SCM version check and project Python version (#274).
- Remove yapf (#272) — Dropped yapf formatter from the project.
- Add `psutil` to `pyproject.toml` (#255).
- Fix MoE benchmark (#279).
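On the "local LRU" change: keeping a small, locally owned LRU of recently built primitives avoids contention on a single global table and rebuilds of hot primitives. A minimal sketch of such a cache (oneDNN's own primitive cache is a separate, separately configured mechanism):

```python
from collections import OrderedDict

class LocalLRUCache:
    """Minimal LRU cache keyed by primitive descriptor, evicting the
    least recently used entry once capacity is exceeded."""
    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self._d = OrderedDict()

    def get(self, key):
        if key not in self._d:
            return None
        self._d.move_to_end(key)          # mark as most recently used
        return self._d[key]

    def put(self, key, value):
        self._d[key] = value
        self._d.move_to_end(key)
        if len(self._d) > self.capacity:
            self._d.popitem(last=False)   # evict least recently used
```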
## Testing
- Refine test scope definition (#250) — Improved test profiling and scope control framework.
- New tests: `test_fused_norm_quant`, `test_fused_qk_norm_rope`, `test_fused_quant_activation`, `test_swiglustep_and_mul`, `test_fp4_gemm_onednn`, `test_cache` (swap_blocks_batch), `test_lora_ops` (mixed precision).
## Contributors
Thanks to all 11 contributors for this release:
Kunshang Ji, Xinyu Chen, Qiming Zhang, Qun Yang, Chaojun Zhang, Zofia, Zhefeng Qiao, Yejing Lai, Yizhou Wang, Liuzhenwei, Baodi
## What's Changed
- upgrade onednn to v3.11.2 by @zufangzhu in #248
- Support f32 ssm_state in GDN kernel for Qwen3.5 by @YangQun1 in #220
- Add mixed precision support for LoRA expand & shrink kernels by @chaojun-zhang in #230
- Support swiglustep and mul for Step-3.5-Flash by @Dboyqiao in #199
- [build]add psutil in pyproject.toml by @jikunshang in #255
- skip scales check by @mayuyuace in #256
- Support sycl impl relu2_no_mul for NVIDIA-Nemotron-3-Nano-30B-A3B-bf16 by @Dboyqiao in #232
- Change gdn attn A_log dtype to fp32 for qwen3.5 by @YangQun1 in #254
- [OneDNN] add mxfp8, mxfp4 onednn gemm by @zufangzhu in #235
- skip gdn core_attn_out check for f32 ssm_state+8k len due to random numeric error by @YangQun1 in #261
- skip gdn core_attn_out check for 8k len due to random numeric error by @YangQun1 in #264
- [Decode attn] tune num_kv_splits for page decode kernel by @baodii in #257
- [fmha] support head dim 512 by @xinyu-intel in #251
- refactor cmake to enable selective kernel build by @xinyu-intel in #260
- Optimize Moe GEMM by @mayuyuace in #266
- Fix overflow of remap_hidden_states when rows is huge by @mayuyuace in #269
- Add swap_blocks_batch op with batched async memcpy by @chaojun-zhang in #265
- [Test] refine test scope definition by @jikunshang in #250
- Fix use-after-free in get_xpu_view_from_cpu_tensor by @chaojun-zhang in #262
- [Fusion][Torch.compiler] Add fuse_norm_quant, fuse_act_quant and fused_qk_norm_rope kernel by @Yejing-Lai in #267
- [Build][Lint] remove yapf by @jikunshang in #272
- [OneDNN] use local lru by @zufangzhu in #275
- [Build] update scm version check and project python version by @jikunshang in #274
- [CHUNK_PREFILL] add dynamic_stride support by @YizhouZ in #187
- add fatrelu_and_mul by @zhenwei-intel in #259
- fix moe benchmark by @xinyu-intel in #279
- show binary size in pre-ci by @xinyu-intel in #268
- remap hidden status topk 10 by @mayuyuace in #273
## New Contributors
**Full Changelog**: v0.1.5...v0.1.6