Releases: vllm-project/vllm-xpu-kernels
v0.1.6 release
Highlights
- Fused Kernels for torch.compile: New `fuse_norm_quant`, `fuse_act_quant`, and `fused_qk_norm_rope` kernels enabling norm-quantization and activation-quantization fusion under `torch.compile`.
- MXFP8 / MXFP4 oneDNN GEMM: Added microscaling FP8 and FP4 GEMM support via oneDNN, broadening low-precision inference capabilities.
- Flash Attention head_dim=512: Extended flash-MHA support to head dimension 512.
- Chunked Prefill dynamic stride: Added dynamic stride support for chunk prefill attention, improving flexibility for variable-length workloads.
New Features
Attention
- [FMHA] Support head dimension 512 (#251) — Extends flash attention to models using 512-dim heads.
- [Chunk Prefill] Add dynamic stride support (#187) — Enables dynamic stride in chunked prefill for variable-length input sequences.
- [Decode Attention] Tune `num_kv_splits` for paged decode kernel (#257) — Improves decode-stage attention performance via better KV-split tuning.
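The KV-split tuning in #257 follows the general flash-decoding pattern: the KV cache is divided into `num_kv_splits` chunks, each chunk produces a partial attention output plus its softmax statistics, and the partials are merged with a log-sum-exp rescale. A minimal pure-Python sketch of that merge (illustrative only — names and structure are not the kernel's actual API):

```python
import math

def attend(q, keys, vals):
    """Single-query attention over one KV chunk.
    Returns (output, max_logit, sum_exp) so chunks can be merged later."""
    logits = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    s = sum(weights)
    dim = len(vals[0])
    out = [sum(w * v[d] for w, v in zip(weights, vals)) / s for d in range(dim)]
    return out, m, s

def split_decode(q, keys, vals, num_kv_splits):
    """Run attention per KV split, then merge partials with
    log-sum-exp rescaling; the result matches unsplit attention."""
    n = len(keys)
    step = math.ceil(n / num_kv_splits)
    partials = [attend(q, keys[i:i + step], vals[i:i + step])
                for i in range(0, n, step)]
    m_all = max(m for _, m, _ in partials)
    s_all = sum(s * math.exp(m - m_all) for _, m, s in partials)
    dim = len(vals[0])
    return [sum(o[d] * s * math.exp(m - m_all) for o, m, s in partials) / s_all
            for d in range(dim)]
```

More splits expose more parallelism for short-batch decode at the cost of an extra merge pass, which is why the split count is worth tuning per shape.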
Activation
- Add `fatrelu_and_mul` (#259) — New FATReLU fused activation kernel.
- Support `relu2_no_mul` (SYCL) for Nemotron-3-Nano-30B-A3B-bf16 (#232).
- Support `swiglustep_and_mul` for Step-3.5-Flash (#199) — New SwiGLU-Step fused activation variant.
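These kernels share the gated "act-and-mul" pattern used throughout vLLM: the last dimension of the input is split in half, an activation is applied to the first half, and the result is multiplied elementwise by the second half. A pure-Python sketch for the FATReLU variant (illustrative; FATReLU here means a thresholded ReLU, and the function names are not the kernel's actual signatures):

```python
def fatrelu(x, threshold):
    """FATReLU: pass values strictly above the threshold, zero the rest."""
    return x if x > threshold else 0.0

def fatrelu_and_mul(x, threshold=0.0):
    """Gated activation: split the vector in half, activate the first
    half, then multiply elementwise by the ungated second half."""
    d = len(x) // 2
    return [fatrelu(x[i], threshold) * x[d + i] for i in range(d)]
```

Fusing the activation with the gating multiply halves the number of passes over the hidden states compared with running them as two separate ops.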
Quantization & Low-Precision
- [OneDNN] Add MXFP8 and MXFP4 GEMM (#235) — Microscaling FP8/FP4 GEMM via oneDNN for next-gen low-precision inference.
Fusion (torch.compile)
- Add `fuse_norm_quant`, `fuse_act_quant`, and `fused_qk_norm_rope` kernels (#267) — Fused normalization+quantization and QK-norm+RoPE kernels, registered as custom ops compatible with `torch.compile`.
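The motivation for this kind of fusion: unfused, the normalized tensor is written out to memory and re-read by the quantization kernel; fused, both steps happen in a single pass over the data. A numerics-only sketch of RMSNorm followed by per-tensor symmetric int8 quantization (illustrative — not the actual kernel signature or scaling scheme):

```python
import math

def rms_norm_quant(x, weight, eps=1e-6):
    """Fused RMSNorm + per-tensor symmetric int8 quantization.
    A fused kernel computes both in one pass over x, avoiding a
    round trip to memory for the normalized intermediate."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    normed = [v * inv_rms * w for v, w in zip(x, weight)]
    scale = max(abs(v) for v in normed) / 127.0
    scale = scale if scale > 0 else 1.0
    quant = [max(-128, min(127, round(v / scale))) for v in normed]
    return quant, scale
```

Registering the fused op as a `torch.compile`-compatible custom op lets the compiler pattern-match the norm-then-quantize sequence and substitute the single kernel.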
MoE (Mixture of Experts)
- Add `topk=10` for `remap_hidden_states` kernel (#273) — Extends remap kernel to support topk=10 routing.
- Optimize MoE GEMM (#266) — Performance improvements for MoE grouped GEMM.
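For context on the remap step: after top-k routing, each token's hidden state is gathered once per selected expert and grouped contiguously by expert, so each expert's grouped GEMM operates on a dense batch. A pure-Python sketch of the index remapping (illustrative; the actual kernel works on device tensors and supports up to topk=10 after #273):

```python
def topk_route(scores, k):
    """Pick the k highest-scoring experts for each token."""
    return [sorted(range(len(row)), key=row.__getitem__, reverse=True)[:k]
            for row in scores]

def remap_for_experts(hidden, topk_ids, num_experts):
    """Gather each token's hidden state once per selected expert,
    grouped contiguously by expert id for dense grouped GEMM."""
    remapped, row_to_token = [], []
    for e in range(num_experts):
        for t, experts in enumerate(topk_ids):
            if e in experts:
                remapped.append(hidden[t])
                row_to_token.append(t)
    return remapped, row_to_token
```

The `row_to_token` mapping is what a scatter/reduce step uses afterward to combine each expert's outputs back into per-token results.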
Cache / Memory
- Add `swap_blocks_batch` op with batched async memcpy (#265) — New batched block-swap operation using asynchronous memory copy for improved KV-cache management.
LoRA
- Add mixed precision support for LoRA expand & shrink kernels (#230) — Enables mixed-precision (e.g., bf16/fp32) LoRA adapters.
Model Support
- Support f32 `ssm_state` in GDN kernel for Qwen3.5 (#220) — Enables fp32 SSM state for Qwen3.5 Mamba-style models.
- Follow upstream to change `A_log` dtype to fp32 for Qwen3.5 (#254).
Bug Fixes
- Fix overflow of `remap_hidden_states` when the row count is very large (#269) — Resolved integer overflow for large batch/sequence scenarios.
- Fix XPU CPU-view tensor lifetime (#262) — Fixed use-after-free issue with CPU-view tensors on XPU.
- Skip scales check (#256) — Fixed spurious validation failure in quantized GEMM path.
- Skip GDN `core_attn_out` check for 8k sequence length due to random numeric error (#264, #261).
Performance
- Optimize MoE GEMM (#266) — Tuned grouped GEMM policies for better throughput.
- Tune `num_kv_splits` for paged decode kernel (#257) — Improved decode attention latency.
Infrastructure & Build
- Refactor CMake to enable selective kernel build (#260) — Allows building only a subset of kernels, reducing compile time and binary size.
- Upgrade oneDNN to v3.11.2 (#248).
- Use local LRU cache for oneDNN primitive caching (#275).
- Show binary size in pre-CI (#268).
- Update SCM version check and project Python version (#274).
- Remove yapf (#272) — Dropped yapf formatter from the project.
- Add `psutil` to `pyproject.toml` (#255).
- Fix MoE benchmark (#279).
Testing
- Refine test scope definition (#250) — Improved test profiling and scope control framework.
- New tests: `test_fused_norm_quant`, `test_fused_qk_norm_rope`, `test_fused_quant_activation`, `test_swiglustep_and_mul`, `test_fp4_gemm_onednn`, `test_cache` (swap_blocks_batch), `test_lora_ops` (mixed precision).
Contributors
Thanks to all 11 contributors for this release:
Kunshang Ji, Xinyu Chen, Qiming Zhang, Qun Yang, Chaojun Zhang, Zofia, Zhefeng Qiao, Yejing Lai, Yizhou Wang, Liuzhenwei, Baodi
What's Changed
- upgrade onednn to v3.11.2 by @zufangzhu in #248
- Support f32 ssm_state in GDN kernel for Qwen3.5 by @YangQun1 in #220
- Add mixed precision support for LoRA expand & shrink kernels by @chaojun-zhang in #230
- Support swiglustep and mul for Step-3.5-Flash by @Dboyqiao in #199
- [build]add psutil in pyproject.toml by @jikunshang in #255
- skip scales check by @mayuyuace in #256
- Support sycl impl relu2_no_mul for NVIDIA-Nemotron-3-Nano-30B-A3B-bf16 by @Dboyqiao in #232
- Change gdn attn A_log dtype to fp32 for qwen3.5 by @YangQun1 in #254
- [OneDNN] add mxfp8, mxfp4 onednn gemm by @zufangzhu in #235
- skip gdn core_attn_out check for f32 ssm_state+8k len due to random numeric error by @YangQun1 in #261
- skip gdn core_attn_out check for 8k len due to random numeric error by @YangQun1 in #264
- [Decode attn] tune num_kv_splits for page decode kernel by @baodii in #257
- [fmha] support head dim 512 by @xinyu-intel in #251
- refactor cmake to enable selective kernel build by @xinyu-intel in #260
- Optimize Moe GEMM by @mayuyuace in #266
- Fix overflow of remap_hidden_states when rows is huge by @mayuyuace in #269
- Add swap_blocks_batch op with batched async memcpy by @chaojun-zhang in #265
- [Test] refine test scope definition by @jikunshang in #250
- Fix use-after-free in get_xpu_view_from_cpu_tensor by @chaojun-zhang in #262
- [Fusion][Torch.compiler] Add fuse_norm_quant, fuse_act_quant and fused_qk_norm_rope kernel by @Yejing-Lai in #267
- [Build][Lint] remove yapf by @jikunshang in #272
- [OneDNN] use local lru by @zufangzhu in #275
- [Build] update scm version check and project python version by @jikunshang in #274
- [CHUNK_PREFILL] add dynamic_stride support by @YizhouZ in #187
- add fatrelu_and_mul by @zhenwei-intel in #259
- fix moe benchmark by @xinyu-intel in #279
- show binary size in pre-ci by @xinyu-intel in #268
- remap hidden status topk 10 by @mayuyuace in #273
New Contributors
Full Changelog: v0.1.5...v0.1.6
v0.1.5 release
v0.1.5 Release Notes
This release delivers major kernel and runtime updates for Intel XPU, with a focus on MLA path coverage, quantization support, MoE performance, and CI/build stability.
Highlights
- Added MLA kernels: `merge_attn_states` and `gather_and_maybe_dequant_cache`
- Improved MLA decode flexibility with support for arbitrary KV cache strides in paged decode.
- Added/extended quantization and cache kernels:
  - FP8 `w8a16` GEMM
  - MXFP4 block quant kernel
  - `indexer_k_quant_and_cache` and `cp_gather_indexer_k_quant_cache`
- Added new kernels/features:
  - SYCL `topk_per_row` kernel
  - `topk_topp` sampler
  - EPLB enabling kernels
- Performance and optimization updates:
  - MoE remap kernel optimization
  - Chunk prefill tuning
  - Vectorized act-and-mul kernels
- Runtime and API improvements:
  - Customized memory allocator for vLLM sleep mode
  - Added `mem_cpy` Python API
Fixes
- Bugfix: updated binding signature for `rms_norm`.
- Decode attention: adjusted `num_splits` strategy to avoid accuracy issues.
- Platform workaround: route XE3/XE3P to XE2 CUTLASS kernels.
- CI/build fixes:
  - oneDNN version compatibility fix
  - manylinux builder pinning
  - improved job estimation to reduce OOM risk
Developer Experience
- Added prebuilt wheel install path for faster development setup.
- Added/updated tests for pinned-memory swap blocks and `indexer_k_quant_and_cache`.
- Refreshed benchmark coverage for flash attention and fused MoE.
- Upgraded SYCL-TLA dependency revision.
Potentially Breaking / Behavior Changes
- Removed `xpu_fused_moe` weights handling; downstream integrations relying on the previous behavior should verify compatibility.
Included PRs (since v0.1.4)
#64, #139, #163, #165, #174, #176, #182, #188, #191, #193, #194, #195, #198, #201, #203, #204, #207, #209, #210, #211, #213, #215, #216, #219, #226, #227, #228, #233, #239, #240, #245, #246
What's Changed
- Support indexer_k_quant_and_cache by @LLee233 in #193
- Add MXFP4 block quant kernel by @Yejing-Lai in #194
- Add Sycl topk per row kernel by @wuxun-zhang in #191
- add eplb enabling kernels by @mayuyuace in #182
- [CI/CD] build wheel in manylinux container by @jikunshang in #174
- add mem_cpy python API by @yma11 in #195
- customized memory allocator for vllm sleep mode by @yma11 in #139
- pin pytorch/manylinux2_28-builder version instead of main by @jikunshang in #203
- [Decode Attn] Change strategy of num_splits to avoid acc issue by @baodii in #204
- Vectorize act-and-mul kernels for speedup by @Liangliang-Ma in #207
- Add test case for indexer_k_quant_and_cache by @LLee233 in #201
- Add pre build wheel install for better development experience by @jikunshang in #188
- format dtype in the test cases by @xinyu-intel in #213
- Add pinned memory test for swap_blocks to verify h2d/d2h transfer with a pinned memory host tensor. by @chaojun-zhang in #198
- [PA] check single element scale by @xinyu-intel in #211
- upgrade sycl-tla to cd76379 by @xinyu-intel in #215
- Refresh readme.md by @rogerxfeng8 in #209
- Fix CI onednn version update by @Yejing-Lai in #226
- [OneDNN] Add w8a16 per channel gemm by @Yejing-Lai in #227
- update vllm kernel benchmark scripts by @1pikachu in #176
- remove xpu_fused_moe weights handling by @mayuyuace in #163
- [Build]estimated compile parallel jobs to avoid OOM by @jikunshang in #219
- [CI]ignore more case on bmg to speed up ci by @jikunshang in #233
- [CHUNK_PREFILL] perf tuning by @YizhouZ in #216
- WA: route XE3/XE3P platforms to XE2 cutlass kernels by @baodii in #240
- support topk topp sampler by @mayuyuace in #228
- Support arbitrary KV cache strides in paged_decode for MLA by @baodii in #165
- Support cp_gather_indexer_k_quant_cache by @LLee233 in #210
- [MLA]add gather_and_maybe_dequant_cache kernel by @jikunshang in #239
- [BUGFIX] modify binding signature of rms_norm kernel by @jikunshang in #246
- Optimize remap kernel of moe by @mayuyuace in #245
- [MLA] add `merge_attn_states` sycl kernel by @jikunshang in #64
New Contributors
- @LLee233 made their first contribution in #193
- @wuxun-zhang made their first contribution in #191
- @yma11 made their first contribution in #195
- @1pikachu made their first contribution in #176
Full Changelog: v0.1.4...v0.1.5
v0.1.4 Release
What's Changed
- Add fp8 block quant miniscope by @Yejing-Lai in #175
- [CI] Enable ccache in ci build by @jikunshang in #179
- fix is_causal impact on decode kernel by @baodii in #181
- Add sliding window support for paged decode kernel by @baodii in #168
- Add -Werror by @xinyu-intel in #183
- [Kernel] use different options for diff kernels by @zufangzhu in #186
- Implement swap_blocks kernel with H2D/D2H/D2D support for kv cache offloading by @chaojun-zhang in #157
- Support FP8 KV cache in paged_decode kernel by @baodii in #166
- support qwen3.5 input layout by @mayuyuace in #190
- [Decode Attn] Change strategy of num_splits to avoid acc issue by @baodii in #204
Full Changelog: v0.1.3...v0.1.4
v0.1.3 release
What's Changed
- add arch python interface by @jikunshang in #132
- [CI] switch to uv in docker & ci by @jikunshang in #158
- [fmha] align the interface for fp8 kv scale by @xinyu-intel in #150
- Support topk_sigmoid kernel for MoE by @jerrychenhf in #148
- layernorm support uncontiguous by @zufangzhu in #131
- check float fp8 scale by @xinyu-intel in #164
- chunk gdn attention by @mayuyuace in #156
- [Kernel] refactor cache kernel by @zufangzhu in #169
- Tune attention perf to align with IPEX attention functions by @baodii in #162
New Contributors
- @jerrychenhf made their first contribution in #148
Full Changelog: v0.1.2...v0.1.3
v0.1.2 release
What's Changed
- MoE: Optimize and fix moe_align_block_size & moe_lora_align_block_size kernels by @chaojun-zhang in #133
- [CI] add bmg g31 and update docker file by @jikunshang in #144
- [CI] disable time consuming ci and update seed. by @jikunshang in #145
- Add fp8 mxfp8 block quant kernel by @Yejing-Lai in #138
- [sycl-tla] remove unnecessary headers by @xinyu-intel in #129
- init value before atomic in reduction kernel by @xinyu-intel in #149
- [OneDNN] update onednn to 3.11 by @zufangzhu in #143
- [Quant] update fp8 quant kernel by @zufangzhu in #147
New Contributors
- @Yejing-Lai made their first contribution in #138
Full Changelog: v0.1.1...v0.1.2
v0.1.1 release
Several fixes on top of v0.1.0.
What's Changed
- [CHUNK_PREFILL] fp8kv cache by @YizhouZ in #128
- [LayerNorm] remove condition by @zufangzhu in #130
- Reduction of the MINI profiler UT execution time by @chaojun-zhang in #134
- add bias for topk softmax by @mayuyuace in #135
- add mini test scope for paged decode attn by @baodii in #136
- add output for fused_moe interface by @jikunshang in #137
Full Changelog: v0.1.0...v0.1.1
v0.1.0
We’re excited to announce the first release of vllm-xpu-kernels!
This release includes migrated and reimplemented core kernels from IPEX. Please note that it is a pre-production release.
What's Changed
- setup build/lint system. add rms_norm as first op to verify functionality by @jikunshang in #1
- refactor files by ops in test folder by @jikunshang in #4
- readme update by @rogerxfeng8 in #5
- [CI] Enable CI by @jikunshang in #2
- use torch-xpu 2.8 link by @jikunshang in #6
- Add cache ops by @zufangzhu in #3
- Add rotary_embedding kernel by @Liangliang-Ma in #7
- Add silu_and_mul kernel by @Liangliang-Ma in #8
- add fp8 quantization kernels by @baodii in #12
- use function instead of lambda by @zufangzhu in #14
- silu_and_mul/rope use functor instead of lambda by @Liangliang-Ma in #15
- Add activation: gelu_fast/gelu_new/gelu_quick by @zhenwei-intel in #16
- feat(moe_sum): add moe_sum by @dbyoung18 in #17
- refactor the structure by @zufangzhu in #20
- use functor for `rms_norm` kernels by @jikunshang in #18
- add mul_and_silu/gelu_and_mul by @zhenwei-intel in #23
- swigluoai and mul by @mayuyuace in #19
- fix(build): Fix __assert_fail conflict between SYCL and PyTorch by @chaojun-zhang in #26
- feat(mla_cache): add concat_and_cache_mla & gather_cache by @dbyoung18 in #25
- add onednn w8a16 gemm by @zufangzhu in #24
- update onednn extension by @zufangzhu in #29
- update ci script by @jikunshang in #32
- add xpu op grouped topk by @mayuyuace in #27
- grouped topk from IPEX by @mayuyuace in #33
- feat(deepseek_rope): add deepseek_scaling_rope by @dbyoung18 in #34
- add ipex and model config for rmsmorm op by @wendyliu235 in #28
- feat(lora) Add lora bgmv_shrink & bgmv_expand kernels by @chaojun-zhang in #31
- reduce the kernel test coverage for simulator by @chaojun-zhang in #36
- Add chunk_prefill with cutlass backend by @YizhouZ in #38
- topk softmax by @mayuyuace in #39
- [CUTLASS][chunk_prefill] add prefetch and refine barrier scope by @YizhouZ in #44
- [UT][CHUNK_PREFILL]add ut mini scope for fmha by @YizhouZ in #46
- Grouped gemm cutlass by @Liangliang-Ma in #22
- [UT][GroupedGemm] Add grouped gemm pytest mini scope by @Liangliang-Ma in #47
- [FusedMoE] support TP by stream align with device by @Liangliang-Ma in #50
- [OneDNN] Zufang/onednn w4a16 int4 by @zufangzhu in #49
- [CHUNK_PREFILL] fix tiling shape by @YizhouZ in #51
- Reduction of the MINI profiler UT execution time by @chaojun-zhang in #52
- fix install build file lack by @jikunshang in #45
- add op moe_align_block_size & batched_moe_align_block_size by @mayuyuace in #54
- [GPTQ] add stride setting by @zufangzhu in #57
- [CHUNK_PREFILL] enable sink and local attn by @YizhouZ in #58
- [FusedMoE] input reorder with sycl kernels by @Liangliang-Ma in #56
- Zufang/onednn fp8 gemm by @zufangzhu in #55
- [FusedMoE] Add bias for grouped gemm by @Liangliang-Ma in #62
- [GENERAL] add device index in vllmGetQueue by @YizhouZ in #61
- [Fix] use numel rather than dim, fix w8a8 acc by @zufangzhu in #63
- [Kernel benchmark] enable ipex path for reshape and cache by @DiweiSun in #41
- feat(moe_lora): Add MOE Lora Sum Kernel by @chaojun-zhang in #60
- [FusedMoE] fix an if-else in MoE with bias by @Liangliang-Ma in #65
- [FusedMoe] support fp16 grouped_gemm by @mayuyuace in #59
- [CHUNK_PREFILL] add policy 192 and check conditions by @YizhouZ in #68
- [FusedMoE] fix topk>1 acc issue and support different activation by @Liangliang-Ma in #66
- update clang format, for better indent by @jikunshang in #72
- [OneDNN] add primitive extension and cache locally by @zufangzhu in #73
- [UT] refine miniscope ut shape by @zufangzhu in #75
- [CI] install custom umd by @jikunshang in #77
- [FusedMoE] remove host experts token count to avoid blocking by @Liangliang-Ma in #78
- [CI]fix CI by @jikunshang in #79
- add MINI_PYTEST_PARAMS to test_moe_align_block_size by @mayuyuace in #81
- [CHUNK_PREFILL] kernel refactor using new api by @YizhouZ in #76
- [FusedMoE] refactor to new cutlass api by @Liangliang-Ma in #82
- upgrade torch-xpu 2.9 by @jikunshang in #70
- Support get xpu view from cpu tensor by @chaojun-zhang in #80
- [CHUNK_PREFILL] new api refactor phase 2 by @YizhouZ in #83
- [BUILD] refine CMake, enable AOT by @jikunshang in #89
- [cutlass] support fp8/int4/mxfp4 weights grouped gemm by @mayuyuace in #88
- [CHUNK_PREFILL] new api refactor phase3 by @YizhouZ in #90
- Fix lora accuracy and oom issues by @chaojun-zhang in #91
- [FusedMoE] tune bf16/fp16 grouped gemm grid shape and subgroup shape for decoding perf by @Liangliang-Ma in #96
- apply fused moe fp8/int4/fp4 by @mayuyuace in #98
- [OneDNN] Zufang/wint4aint8 by @zufangzhu in #93
- add contiguous inside rmsnorm kernel by @jikunshang in #95
- Add moe_gather for fused moe by @mayuyuace in #101
- Skip UVA tests for the mini pytest profiler. by @chaojun-zhang in #100
- fix bug of moe_gather by @mayuyuace in #102
- Refine topk softmax for different platform by @mayuyuace in #104
- [CI] clean up docker image in ci by @jikunshang in #99
- rename `fused_moe` to `fused_moe_prelogue` by @jikunshang in #105
- [CI]update docker file by @jikunshang in #106
- Reorg sycl-tla kernel code structure by @jikunshang in #103
- Skip fp8 quant large shape test for mini test profiler by @chaojun-zhang in #108
- change grouped gemm kernel fp8 scales to float by @mayuyuace in #110
- add force xe default kernel env var by @jikunshang in #111
- [CI]update pipeline by @jikunshang...