Releases: vllm-project/vllm-xpu-kernels
v0.1.6 release
Highlights
- Fused Kernels for torch.compile: New `fuse_norm_quant`, `fuse_act_quant`, and `fused_qk_norm_rope` kernels enabling norm-quantization and activation-quantization fusion under `torch.compile`.
- MXFP8 / MXFP4 oneDNN GEMM: Added microscaling FP8 and FP4 GEMM support via oneDNN, broadening low-precision inference capabilities.
- Flash Attention head_dim=512: Extended flash-MHA support to head dimension 512.
- Chunked Prefill dynamic stride: Added dynamic stride support for chunk prefill attention, improving flexibility for variable-length workloads.
New Features
Attention
- [FMHA] Support head dimension 512 (#251) — Extends flash attention to models using 512-dim heads.
- [Chunk Prefill] Add dynamic stride support (#187) — Enables dynamic stride in chunked prefill for variable-length input sequences.
- [Decode Attention] Tune `num_kv_splits` for paged decode kernel (#257) — Improves decode-stage attention performance via better KV-split tuning.
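The KV-split tuning in #257 follows the general flash-decoding pattern: the KV cache is divided into `num_kv_splits` chunks, each chunk produces a partial attention output plus its softmax statistics, and the partials are merged with a log-sum-exp rescale. A minimal pure-Python sketch of that merge (illustrative only — names and structure are not the kernel's actual API):

```python
import math

def attend(q, keys, vals):
    """Single-query attention over one KV chunk.
    Returns (output, max_logit, sum_exp) so chunks can be merged later."""
    logits = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(logits)
    weights = [math.exp(l - m) for l in logits]
    s = sum(weights)
    dim = len(vals[0])
    out = [sum(w * v[d] for w, v in zip(weights, vals)) / s for d in range(dim)]
    return out, m, s

def split_decode(q, keys, vals, num_kv_splits):
    """Run attention per KV split, then merge partials with
    log-sum-exp rescaling; the result matches unsplit attention."""
    n = len(keys)
    step = math.ceil(n / num_kv_splits)
    partials = [attend(q, keys[i:i + step], vals[i:i + step])
                for i in range(0, n, step)]
    m_all = max(m for _, m, _ in partials)
    s_all = sum(s * math.exp(m - m_all) for _, m, s in partials)
    dim = len(vals[0])
    return [sum(o[d] * s * math.exp(m - m_all) for o, m, s in partials) / s_all
            for d in range(dim)]
```

More splits expose more parallelism for short-batch decode at the cost of an extra merge pass, which is why the split count is worth tuning per shape.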
Activation
- Add `fatrelu_and_mul` (#259) — New FATReLU fused activation kernel.
- Support `relu2_no_mul` (SYCL) for Nemotron-3-Nano-30B-A3B-bf16 (#232).
- Support `swiglustep_and_mul` for Step-3.5-Flash (#199) — New SwiGLU-Step fused activation variant.
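These kernels share the gated "act-and-mul" pattern used throughout vLLM: the last dimension of the input is split in half, an activation is applied to the first half, and the result is multiplied elementwise by the second half. A pure-Python sketch for the FATReLU variant (illustrative; FATReLU here means a thresholded ReLU, and the function names are not the kernel's actual signatures):

```python
def fatrelu(x, threshold):
    """FATReLU: pass values strictly above the threshold, zero the rest."""
    return x if x > threshold else 0.0

def fatrelu_and_mul(x, threshold=0.0):
    """Gated activation: split the vector in half, activate the first
    half, then multiply elementwise by the ungated second half."""
    d = len(x) // 2
    return [fatrelu(x[i], threshold) * x[d + i] for i in range(d)]
```

Fusing the activation with the gating multiply halves the number of passes over the hidden states compared with running them as two separate ops.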
Quantization & Low-Precision
- [OneDNN] Add MXFP8 and MXFP4 GEMM (#235) — Microscaling FP8/FP4 GEMM via oneDNN for next-gen low-precision inference.
Fusion (torch.compile)
- Add `fuse_norm_quant`, `fuse_act_quant`, and `fused_qk_norm_rope` kernels (#267) — Fused normalization+quantization and QK-norm+RoPE kernels, registered as custom ops compatible with `torch.compile`.
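The motivation for this kind of fusion: unfused, the normalized tensor is written out to memory and re-read by the quantization kernel; fused, both steps happen in a single pass over the data. A numerics-only sketch of RMSNorm followed by per-tensor symmetric int8 quantization (illustrative — not the actual kernel signature or scaling scheme):

```python
import math

def rms_norm_quant(x, weight, eps=1e-6):
    """Fused RMSNorm + per-tensor symmetric int8 quantization.
    A fused kernel computes both in one pass over x, avoiding a
    round trip to memory for the normalized intermediate."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    normed = [v * inv_rms * w for v, w in zip(x, weight)]
    scale = max(abs(v) for v in normed) / 127.0
    scale = scale if scale > 0 else 1.0
    quant = [max(-128, min(127, round(v / scale))) for v in normed]
    return quant, scale
```

Registering the fused op as a `torch.compile`-compatible custom op lets the compiler pattern-match the norm-then-quantize sequence and substitute the single kernel.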
MoE (Mixture of Experts)
- Add `topk=10` for `remap_hidden_states` kernel (#273) — Extends remap kernel to support topk=10 routing.
- Optimize MoE GEMM (#266) — Performance improvements for MoE grouped GEMM.
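For context on the remap step: after top-k routing, each token's hidden state is gathered once per selected expert and grouped contiguously by expert, so each expert's grouped GEMM operates on a dense batch. A pure-Python sketch of the index remapping (illustrative; the actual kernel works on device tensors and supports up to topk=10 after #273):

```python
def topk_route(scores, k):
    """Pick the k highest-scoring experts for each token."""
    return [sorted(range(len(row)), key=row.__getitem__, reverse=True)[:k]
            for row in scores]

def remap_for_experts(hidden, topk_ids, num_experts):
    """Gather each token's hidden state once per selected expert,
    grouped contiguously by expert id for dense grouped GEMM."""
    remapped, row_to_token = [], []
    for e in range(num_experts):
        for t, experts in enumerate(topk_ids):
            if e in experts:
                remapped.append(hidden[t])
                row_to_token.append(t)
    return remapped, row_to_token
```

The `row_to_token` mapping is what a scatter/reduce step uses afterward to combine each expert's outputs back into per-token results.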
Cache / Memory
- Add `swap_blocks_batch` op with batched async memcpy (#265) — New batched block-swap operation using asynchronous memory copy for improved KV-cache management.
LoRA
- Add mixed precision support for LoRA expand & shrink kernels (#230) — Enables mixed-precision (e.g., bf16/fp32) LoRA adapters.
Model Support
- Support f32 `ssm_state` in GDN kernel for Qwen3.5 (#220) — Enables fp32 SSM state for Qwen3.5 Mamba-style models.
- Follow upstream to change `A_log` dtype to fp32 for Qwen3.5 (#254).
Bug Fixes
- Fix overflow of `remap_hidden_states` when the row count is very large (#269) — Resolved integer overflow for large batch/sequence scenarios.
- Fix XPU CPU-view tensor lifetime (#262) — Fixed use-after-free issue with CPU-view tensors on XPU.
- Skip scales check (#256) — Fixed spurious validation failure in quantized GEMM path.
- Skip GDN `core_attn_out` check for 8k sequence length due to random numeric error (#264, #261).
Performance
- Optimize MoE GEMM (#266) — Tuned grouped GEMM policies for better throughput.
- Tune `num_kv_splits` for paged decode kernel (#257) — Improved decode attention latency.
Infrastructure & Build
- Refactor CMake to enable selective kernel build (#260) — Allows building only a subset of kernels, reducing compile time and binary size.
- Upgrade oneDNN to v3.11.2 (#248).
- Use local LRU cache for oneDNN primitive caching (#275).
- Show binary size in pre-CI (#268).
- Update SCM version check and project Python version (#274).
- Remove yapf (#272) — Dropped yapf formatter from the project.
- Add `psutil` to `pyproject.toml` (#255).
- Fix MoE benchmark (#279).
Testing
- Refine test scope definition (#250) — Improved test profiling and scope control framework.
- New tests: `test_fused_norm_quant`, `test_fused_qk_norm_rope`, `test_fused_quant_activation`, `test_swiglustep_and_mul`, `test_fp4_gemm_onednn`, `test_cache` (swap_blocks_batch), `test_lora_ops` (mixed precision).
Contributors
Thanks to all 11 contributors for this release:
Kunshang Ji, Xinyu Chen, Qiming Zhang, Qun Yang, Chaojun Zhang, Zofia, Zhefeng Qiao, Yejing Lai, Yizhou Wang, Liuzhenwei, Baodi
What's Changed
- upgrade onednn to v3.11.2 by @zufangzhu in #248
- Support f32 ssm_state in GDN kernel for Qwen3.5 by @YangQun1 in #220
- Add mixed precision support for LoRA expand & shrink kernels by @chaojun-zhang in #230
- Support swiglustep and mul for Step-3.5-Flash by @Dboyqiao in #199
- [build]add psutil in pyproject.toml by @jikunshang in #255
- skip scales check by @mayuyuace in #256
- Support sycl impl relu2_no_mul for NVIDIA-Nemotron-3-Nano-30B-A3B-bf16 by @Dboyqiao in #232
- Change gdn attn A_log dtype to fp32 for qwen3.5 by @YangQun1 in #254
- [OneDNN] add mxfp8, mxfp4 onednn gemm by @zufangzhu in #235
- skip gdn core_attn_out check for f32 ssm_state+8k len due to random numeric error by @YangQun1 in #261
- skip gdn core_attn_out check for 8k len due to random numeric error by @YangQun1 in #264
- [Decode attn] tune num_kv_splits for page decode kernel by @baodii in #257
- [fmha] support head dim 512 by @xinyu-intel in #251
- refactor cmake to enable selective kernel build by @xinyu-intel in #260
- Optimize Moe GEMM by @mayuyuace in #266
- Fix overflow of remap_hidden_states when rows is huge by @mayuyuace in #269
- Add swap_blocks_batch op with batched async memcpy by @chaojun-zhang in #265
- [Test] refine test scope definition by @jikunshang in #250
- Fix use-after-free in get_xpu_view_from_cpu_tensor by @chaojun-zhang in #262
- [Fusion][Torch.compiler] Add fuse_norm_quant, fuse_act_quant and fused_qk_norm_rope kernel by @Yejing-Lai in #267
- [Build][Lint] remove yapf by @jikunshang in #272
- [OneDNN] use local lru by @zufangzhu in #275
- [Build] update scm version check and project python version by @jikunshang in #274
- [CHUNK_PREFILL] add dynamic_stride support by @YizhouZ in #187
- add fatrelu_and_mul by @zhenwei-intel in #259
- fix moe benchmark by @xinyu-intel in #279
- show binary size in pre-ci by @xinyu-intel in #268
- remap hidden status topk 10 by @mayuyuace in #273
New Contributors
Full Changelog: v0.1.5...v0.1.6
v0.1.5 release
v0.1.5 Release Notes
This release delivers major kernel and runtime updates for Intel XPU, with a focus on MLA path coverage, quantization support, MoE performance, and CI/build stability.
Highlights
- Added MLA kernels: `merge_attn_states` and `gather_and_maybe_dequant_cache`
- Improved MLA decode flexibility with support for arbitrary KV cache strides in paged decode.
- Added/extended quantization and cache kernels:
  - FP8 `w8a16` GEMM
  - MXFP4 block quant kernel
  - `indexer_k_quant_and_cache` and `cp_gather_indexer_k_quant_cache`
- Added new kernels/features:
  - SYCL `topk_per_row` kernel
  - `topk_topp` sampler
  - EPLB enabling kernels
- Performance and optimization updates:
  - MoE remap kernel optimization
  - Chunk prefill tuning
  - Vectorized act-and-mul kernels
- Runtime and API improvements:
  - Customized memory allocator for vLLM sleep mode
  - Added `mem_cpy` Python API
Fixes
- Bugfix: updated binding signature for `rms_norm`.
- Decode attention: adjusted `num_splits` strategy to avoid accuracy issues.
- Platform workaround: route XE3/XE3P to XE2 CUTLASS kernels.
- CI/build fixes:
  - oneDNN version compatibility fix
  - manylinux builder pinning
  - improved job estimation to reduce OOM risk
Developer Experience
- Added prebuilt wheel install path for faster development setup.
- Added/updated tests for pinned-memory swap blocks and `indexer_k_quant_and_cache`.
- Refreshed benchmark coverage for flash attention and fused MoE.
- Upgraded SYCL-TLA dependency revision.
Potentially Breaking / Behavior Changes
- Removed `xpu_fused_moe` weights handling; downstream integrations relying on the previous behavior should verify compatibility.
Included PRs (since v0.1.4)
#64, #139, #163, #165, #174, #176, #182, #188, #191, #193, #194, #195, #198, #201, #203, #204, #207, #209, #210, #211, #213, #215, #216, #219, #226, #227, #228, #233, #239, #240, #245, #246
What's Changed
- Support indexer_k_quant_and_cache by @LLee233 in #193
- Add MXFP4 block quant kernel by @Yejing-Lai in #194
- Add Sycl topk per row kernel by @wuxun-zhang in #191
- add eplb enabling kernels by @mayuyuace in #182
- [CI/CD] build wheel in manylinux container by @jikunshang in #174
- add mem_cpy python API by @yma11 in #195
- customized memory allocator for vllm sleep mode by @yma11 in #139
- pin pytorch/manylinux2_28-builder version instead of main by @jikunshang in #203
- [Decode Attn] Change strategy of num_splits to avoid acc issue by @baodii in #204
- Vectorize act-and-mul kernels for speedup by @Liangliang-Ma in #207
- Add test case for indexer_k_quant_and_cache by @LLee233 in #201
- Add pre build wheel install for better development experience by @jikunshang in #188
- format dtype in the test cases by @xinyu-intel in #213
- Add pinned memory test for swap_blocks to verify h2d/d2h transfer with a pinned memory host tensor. by @chaojun-zhang in #198
- [PA] check single element scale by @xinyu-intel in #211
- upgrade sycl-tla to cd76379 by @xinyu-intel in #215
- Refresh readme.md by @rogerxfeng8 in #209
- Fix CI onednn version update by @Yejing-Lai in #226
- [OneDNN] Add w8a16 per channel gemm by @Yejing-Lai in #227
- update vllm kernel benchmark scripts by @1pikachu in #176
- remove xpu_fused_moe weights handling by @mayuyuace in #163
- [Build]estimated compile parallel jobs to avoid OOM by @jikunshang in #219
- [CI]ignore more case on bmg to speed up ci by @jikunshang in #233
- [CHUNK_PREFILL] perf tuning by @YizhouZ in #216
- WA: route XE3/XE3P platforms to XE2 cutlass kernels by @baodii in #240
- support topk topp sampler by @mayuyuace in #228
- Support arbitrary KV cache strides in paged_decode for MLA by @baodii in #165
- Support cp_gather_indexer_k_quant_cache by @LLee233 in #210
- [MLA]add gather_and_maybe_dequant_cache kernel by @jikunshang in #239
- [BUGFIX] modify binding signature of rms_norm kernel by @jikunshang in #246
- Optimize remap kernel of moe by @mayuyuace in #245
- [MLA] add `merge_attn_states` sycl kernel by @jikunshang in #64
New Contributors
- @LLee233 made their first contribution in #193
- @wuxun-zhang made their first contribution in #191
- @yma11 made their first contribution in #195
- @1pikachu made their first contribution in #176
Full Changelog: v0.1.4...v0.1.5
v0.1.4 Release
What's Changed
- Add fp8 block quant miniscope by @Yejing-Lai in #175
- [CI] Enable ccache in ci build by @jikunshang in #179
- fix is_causal impact on decode kernel by @baodii in #181
- Add sliding window support for paged decode kernel by @baodii in #168
- Add -Werror by @xinyu-intel in #183
- [Kernel] use different options for diff kernels by @zufangzhu in #186
- Implement swap_blocks kernel with H2D/D2H/D2D support for kv cache offloading by @chaojun-zhang in #157
- Support FP8 KV cache in paged_decode kernel by @baodii in #166
- support qwen3.5 input layout by @mayuyuace in #190
- [Decode Attn] Change strategy of num_splits to avoid acc issue by @baodii in #204
Full Changelog: v0.1.3...v0.1.4
v0.1.3 release
What's Changed
- add arch python interface by @jikunshang in #132
- [CI] switch to uv in docker & ci by @jikunshang in #158
- [fmha] align the interface for fp8 kv scale by @xinyu-intel in #150
- Support topk_sigmoid kernel for MoE by @jerrychenhf in #148
- layernorm support uncontiguous by @zufangzhu in #131
- check float fp8 scale by @xinyu-intel in #164
- chunk gdn attention by @mayuyuace in #156
- [Kernel] refactor cache kernel by @zufangzhu in #169
- Tune attention perf to align with IPEX attention functions by @baodii in #162
New Contributors
- @jerrychenhf made their first contribution in #148
Full Changelog: v0.1.2...v0.1.3
v0.1.2 release
What's Changed
- MoE: Optimize and fix moe_align_block_size & moe_lora_align_block_size kernels by @chaojun-zhang in #133
- [CI] add bmg g31 and update docker file by @jikunshang in #144
- [CI] disable time consuming ci and update seed. by @jikunshang in #145
- Add fp8 mxfp8 block quant kernel by @Yejing-Lai in #138
- [sycl-tla] remove unnecessary headers by @xinyu-intel in #129
- init value before atomic in reduction kernel by @xinyu-intel in #149
- [OneDNN] update onednn to 3.11 by @zufangzhu in #143
- [Quant] update fp8 quant kernel by @zufangzhu in #147
New Contributors
- @Yejing-Lai made their first contribution in #138
Full Changelog: v0.1.1...v0.1.2
v0.1.1 release
Several fixes on top of v0.1.0.
What's Changed
- [CHUNK_PREFILL] fp8kv cache by @YizhouZ in #128
- [LayerNorm] remove condition by @zufangzhu in #130
- Reduction of the MINI profiler UT execution time by @chaojun-zhang in #134
- add bias for topk softmax by @mayuyuace in #135
- add mini test scope for paged decode attn by @baodii in #136
- add output for fused_moe interface by @jikunshang in #137
Full Changelog: v0.1.0...v0.1.1
v0.1.0
We’re excited to announce the first release of vllm-xpu-kernels!
This release includes migrated and reimplemented core kernels from IPEX. Please note that it is a pre-production release.
What's Changed
- setup build/lint system. add rms_norm as first op to verify functionality by @jikunshang in #1
- refactor files by ops in test folder by @jikunshang in #4
- readme update by @rogerxfeng8 in #5
- [CI] Enable CI by @jikunshang in #2
- use torch-xpu 2.8 link by @jikunshang in #6
- Add cache ops by @zufangzhu in #3
- Add rotary_embedding kernel by @Liangliang-Ma in #7
- Add silu_and_mul kernel by @Liangliang-Ma in #8
- add fp8 quantization kernels by @baodii in #12
- use function instead of lambda by @zufangzhu in #14
- silu_and_mul/rope use functor instead of lambda by @Liangliang-Ma in #15
- Add activation: gelu_fast/gelu_new/gelu_quick by @zhenwei-intel in #16
- feat(moe_sum): add moe_sum by @dbyoung18 in #17
- refactor the structure by @zufangzhu in #20
- use functor for `rms_norm` kernels by @jikunshang in #18
- add mul_and_silu/gelu_and_mul by @zhenwei-intel in #23
- swigluoai and mul by @mayuyuace in #19
- fix(build): Fix __assert_fail conflict between SYCL and PyTorch by @chaojun-zhang in #26
- feat(mla_cache): add concat_and_cache_mla & gather_cache by @dbyoung18 in #25
- add onednn w8a16 gemm by @zufangzhu in #24
- update onednn extension by @zufangzhu in #29
- update ci script by @jikunshang in #32
- add xpu op grouped topk by @mayuyuace in #27
- grouped topk from IPEX by @mayuyuace in #33
- feat(deepseek_rope): add deepseek_scaling_rope by @dbyoung18 in #34
- add ipex and model config for rmsmorm op by @wendyliu235 in #28
- feat(lora) Add lora bgmv_shrink & bgmv_expand kernels by @chaojun-zhang in #31
- reduce the kernel test coverage for simulator by @chaojun-zhang in #36
- Add chunk_prefill with cutlass backend by @YizhouZ in #38
- topk softmax by @mayuyuace in #39
- [CUTLASS][chunk_prefill] add prefetch and refine barrier scope by @YizhouZ in #44
- [UT][CHUNK_PREFILL]add ut mini scope for fmha by @YizhouZ in #46
- Grouped gemm cutlass by @Liangliang-Ma in #22
- [UT][GroupedGemm] Add grouped gemm pytest mini scope by @Liangliang-Ma in #47
- [FusedMoE] support TP by stream align with device by @Liangliang-Ma in #50
- [OneDNN] Zufang/onednn w4a16 int4 by @zufangzhu in #49
- [CHUNK_PREFILL] fix tiling shape by @YizhouZ in #51
- Reduction of the MINI profiler UT execution time by @chaojun-zhang in #52
- fix install build file lack by @jikunshang in #45
- add op moe_align_block_size & batched_moe_align_block_size by @mayuyuace in #54
- [GPTQ] add stride setting by @zufangzhu in #57
- [CHUNK_PREFILL] enable sink and local attn by @YizhouZ in #58
- [FusedMoE] input reorder with sycl kernels by @Liangliang-Ma in #56
- Zufang/onednn fp8 gemm by @zufangzhu in #55
- [FusedMoE] Add bias for grouped gemm by @Liangliang-Ma in #62
- [GENERAL] add device index in vllmGetQueue by @YizhouZ in #61
- [Fix] use numel rather than dim, fix w8a8 acc by @zufangzhu in #63
- [Kernel benchmark] enable ipex path for reshape and cache by @DiweiSun in #41
- feat(moe_lora): Add MOE Lora Sum Kernel by @chaojun-zhang in #60
- [FusedMoE] fix an if-else in MoE with bias by @Liangliang-Ma in #65
- [FusedMoe] support fp16 grouped_gemm by @mayuyuace in #59
- [CHUNK_PREFILL] add policy 192 and check conditions by @YizhouZ in #68
- [FusedMoE] fix topk>1 acc issue and support different activation by @Liangliang-Ma in #66
- update clang format, for better indent by @jikunshang in #72
- [OneDNN] add primitive extension and cache locally by @zufangzhu in #73
- [UT] refine miniscope ut shape by @zufangzhu in #75
- [CI] install custom umd by @jikunshang in #77
- [FusedMoE] remove host experts token count to avoid blocking by @Liangliang-Ma in #78
- [CI]fix CI by @jikunshang in #79
- add MINI_PYTEST_PARAMS to test_moe_align_block_size by @mayuyuace in #81
- [CHUNK_PREFILL] kernel refactor using new api by @YizhouZ in #76
- [FusedMoE] refactor to new cutlass api by @Liangliang-Ma in #82
- upgrade torch-xpu 2.9 by @jikunshang in #70
- Support get xpu view from cpu tensor by @chaojun-zhang in #80
- [CHUNK_PREFILL] new api refactor phase 2 by @YizhouZ in #83
- [BUILD] refine CMake, enable AOT by @jikunshang in #89
- [cutlass] support fp8/int4/mxfp4 weights grouped gemm by @mayuyuace in #88
- [CHUNK_PREFILL] new api refactor phase3 by @YizhouZ in #90
- Fix lora accuracy and oom issues by @chaojun-zhang in #91
- [FusedMoE] tune bf16/fp16 grouped gemm grid shape and subgroup shape for decoding perf by @Liangliang-Ma in #96
- apply fused moe fp8/int4/fp4 by @mayuyuace in #98
- [OneDNN] Zufang/wint4aint8 by @zufangzhu in #93
- add contiguous inside rmsnorm kernel by @jikunshang in #95
- Add moe_gather for fused moe by @mayuyuace in #101
- Skip UVA tests for the mini pytest profiler. by @chaojun-zhang in #100
- fix bug of moe_gather by @mayuyuace in #102
- Refine topk softmax for different platform by @mayuyuace in #104
- [CI] clean up docker image in ci by @jikunshang in #99
- rename `fused_moe` to `fused_moe_prelogue` by @jikunshang in #105
- [CI]update docker file by @jikunshang in #106
- Reorg sycl-tla kernel code structure by @jikunshang in #103
- Skip fp8 quant large shape test for mini test profiler by @chaojun-zhang in #108
- change grouped gemm kernel fp8 scales to float by @mayuyuace in #110
- add force xe default kernel env var by @jikunshang in #111
- [CI]update pipeline by @jikunshang...