v0.1.0
Pre-release
We’re excited to announce the first release of vllm-xpu-kernels!
This release includes core kernels migrated and reimplemented from IPEX. Please note that this is a pre-production release.
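As a quick orientation, here is a minimal sketch of exercising one of the released kernels (rms_norm) on an Intel GPU. The module name `vllm_xpu_kernels._C`, the `torch.ops._C` namespace, and the `rms_norm(out, input, weight, epsilon)` signature follow vLLM's usual custom-op convention and are assumptions, not confirmed by these notes:

```python
# Hypothetical usage sketch; op namespace and signature are assumptions.
import torch
import vllm_xpu_kernels._C  # assumed module name; importing registers the custom ops

hidden = 4096
x = torch.randn(8, hidden, dtype=torch.float16, device="xpu")
weight = torch.ones(hidden, dtype=torch.float16, device="xpu")
out = torch.empty_like(x)

# Assumed vLLM-style signature: rms_norm(out, input, weight, epsilon)
torch.ops._C.rms_norm(out, x, weight, 1e-6)
print(out.shape)
```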
What's Changed
- setup build/lint system. add rms_norm as first op to verify functionality by @jikunshang in #1
- refactor files by ops in test folder by @jikunshang in #4
- readme update by @rogerxfeng8 in #5
- [CI] Enable CI by @jikunshang in #2
- use torch-xpu 2.8 link by @jikunshang in #6
- Add cache ops by @zufangzhu in #3
- Add rotary_embedding kernel by @Liangliang-Ma in #7
- Add silu_and_mul kernel by @Liangliang-Ma in #8
- add fp8 quantization kernels by @baodii in #12
- use function instead of lambda by @zufangzhu in #14
- silu_and_mul/rope use functor instead of lambda by @Liangliang-Ma in #15
- Add activation: gelu_fast/gelu_new/gelu_quick by @zhenwei-intel in #16
- feat(moe_sum): add moe_sum by @dbyoung18 in #17
- refactor the structure by @zufangzhu in #20
- use functor for `rms_norm` kernels by @jikunshang in #18
- add mul_and_silu/gelu_and_mul by @zhenwei-intel in #23
- swigluoai and mul by @mayuyuace in #19
- fix(build): Fix __assert_fail conflict between SYCL and PyTorch by @chaojun-zhang in #26
- feat(mla_cache): add concat_and_cache_mla & gather_cache by @dbyoung18 in #25
- add onednn w8a16 gemm by @zufangzhu in #24
- update onednn extension by @zufangzhu in #29
- update ci script by @jikunshang in #32
- add xpu op grouped topk by @mayuyuace in #27
- grouped topk from IPEX by @mayuyuace in #33
- feat(deepseek_rope): add deepseek_scaling_rope by @dbyoung18 in #34
- add ipex and model config for rms_norm op by @wendyliu235 in #28
- feat(lora) Add lora bgmv_shrink & bgmv_expand kernels by @chaojun-zhang in #31
- reduce the kernel test coverage for simulator by @chaojun-zhang in #36
- Add chunk_prefill with cutlass backend by @YizhouZ in #38
- topk softmax by @mayuyuace in #39
- [CUTLASS][chunk_prefill] add prefetch and refine barrier scope by @YizhouZ in #44
- [UT][CHUNK_PREFILL]add ut mini scope for fmha by @YizhouZ in #46
- Grouped gemm cutlass by @Liangliang-Ma in #22
- [UT][GroupedGemm] Add grouped gemm pytest mini scope by @Liangliang-Ma in #47
- [FusedMoE] support TP by stream align with device by @Liangliang-Ma in #50
- [OneDNN] Zufang/onednn w4a16 int4 by @zufangzhu in #49
- [CHUNK_PREFILL] fix tiling shape by @YizhouZ in #51
- Reduction of the MINI profiler UT execution time by @chaojun-zhang in #52
- fix install build file lack by @jikunshang in #45
- add op moe_align_block_size & batched_moe_align_block_size by @mayuyuace in #54
- [GPTQ] add stride setting by @zufangzhu in #57
- [CHUNK_PREFILL] enable sink and local attn by @YizhouZ in #58
- [FusedMoE] input reorder with sycl kernels by @Liangliang-Ma in #56
- Zufang/onednn fp8 gemm by @zufangzhu in #55
- [FusedMoE] Add bias for grouped gemm by @Liangliang-Ma in #62
- [GENERAL] add device index in vllmGetQueue by @YizhouZ in #61
- [Fix] use numel rather than dim, fix w8a8 acc by @zufangzhu in #63
- [Kernel benchmark] enable ipex path for reshape and cache by @DiweiSun in #41
- feat(moe_lora): Add MOE Lora Sum Kernel by @chaojun-zhang in #60
- [FusedMoE] fix an if-else in MoE with bias by @Liangliang-Ma in #65
- [FusedMoe] support fp16 grouped_gemm by @mayuyuace in #59
- [CHUNK_PREFILL] add policy 192 and check conditions by @YizhouZ in #68
- [FusedMoE] fix topk>1 acc issue and support different activation by @Liangliang-Ma in #66
- update clang format, for better indent by @jikunshang in #72
- [OneDNN] add primitive extension and cache locally by @zufangzhu in #73
- [UT] refine miniscope ut shape by @zufangzhu in #75
- [CI] install custom umd by @jikunshang in #77
- [FusedMoE] remove host experts token count to avoid blocking by @Liangliang-Ma in #78
- [CI]fix CI by @jikunshang in #79
- add MINI_PYTEST_PARAMS to test_moe_align_block_size by @mayuyuace in #81
- [CHUNK_PREFILL] kernel refactor using new api by @YizhouZ in #76
- [FusedMoE] refactor to new cutlass api by @Liangliang-Ma in #82
- upgrade torch-xpu 2.9 by @jikunshang in #70
- Support get xpu view from cpu tensor by @chaojun-zhang in #80
- [CHUNK_PREFILL] new api refactor phase 2 by @YizhouZ in #83
- [BUILD] refine CMake, enable AOT by @jikunshang in #89
- [cutlass] support fp8/int4/mxfp4 weights grouped gemm by @mayuyuace in #88
- [CHUNK_PREFILL] new api refactor phase3 by @YizhouZ in #90
- Fix lora accuracy and oom issues by @chaojun-zhang in #91
- [FusedMoE] tune bf16/fp16 grouped gemm grid shape and subgroup shape for decoding perf by @Liangliang-Ma in #96
- apply fused moe fp8/int4/fp4 by @mayuyuace in #98
- [OneDNN] Zufang/wint4aint8 by @zufangzhu in #93
- add contiguous inside rmsnorm kernel by @jikunshang in #95
- Add moe_gather for fused moe by @mayuyuace in #101
- Skip UVA tests for the mini pytest profiler. by @chaojun-zhang in #100
- fix bug of moe_gather by @mayuyuace in #102
- Refine topk softmax for different platform by @mayuyuace in #104
- [CI] clean up docker image in ci by @jikunshang in #99
- rename `fused_moe` to `fused_moe_prelogue` by @jikunshang in #105
- [CI]update docker file by @jikunshang in #106
- Reorg sycl-tla kernel code structure by @jikunshang in #103
- Skip fp8 quant large shape test for mini test profiler by @chaojun-zhang in #108
- change grouped gemm kernel fp8 scales to float by @mayuyuace in #110
- add force xe default kernel env var by @jikunshang in #111
- [CI]update pipeline by @jikunshang in #107
- add quantization into wheel package by @jikunshang in #112
- [OneDNN] add vllm own dnnl stream and engine by @zufangzhu in #113
- enable xpu_fused_moe ep by @mayuyuace in #114
- [CHUNK_PREFILL] add seq_k support by @YizhouZ in #115
- [build] Refactor sycl-tla kernel into dynamic library by @jikunshang in #116
- add paged decode split kernels for VLLM by @baodii in #123
- [FusedMoE] input reorder kernel support low precision inputs with scales by @Liangliang-Ma in #124
- support head_group_q more than 8 less than 16 by @baodii in #125
- Add `weak_ref_tensor` support for graph capture by @jikunshang in #121
- [OneDNN] gpu only build by @xinyu-intel in #120
- [build]split template build by @jikunshang in #126
- clean unused utils, add build version by @jikunshang in #122
- [lint]format cmake by @jikunshang in #127
- upgrade to torch 2.10 & oneapi 2025.3 by @jikunshang in #118
New Contributors
- @jikunshang made their first contribution in #1
- @rogerxfeng8 made their first contribution in #5
- @Liangliang-Ma made their first contribution in #7
- @baodii made their first contribution in #12
- @zhenwei-intel made their first contribution in #16
- @dbyoung18 made their first contribution in #17
- @mayuyuace made their first contribution in #19
- @chaojun-zhang made their first contribution in #26
- @wendyliu235 made their first contribution in #28
- @DiweiSun made their first contribution in #41
- @xinyu-intel made their first contribution in #120
Full Changelog: https://github.com/vllm-project/vllm-xpu-kernels/commits/v0.1.0