v0.1.0
Pre-release
We’re excited to announce the first release of vllm-xpu-kernels!
This release includes core kernels migrated and reimplemented from IPEX. Please note that this is a pre-production release.
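As a quick orientation, here is a minimal sketch of exercising one of the released kernels (rms_norm) on an Intel GPU. The module name `vllm_xpu_kernels._C`, the `torch.ops._C` namespace, and the `rms_norm(out, input, weight, epsilon)` signature follow vLLM's usual custom-op convention and are assumptions, not confirmed by these notes:

```python
# Hypothetical usage sketch; op namespace and signature are assumptions.
import torch
import vllm_xpu_kernels._C  # assumed module name; importing registers the custom ops

hidden = 4096
x = torch.randn(8, hidden, dtype=torch.float16, device="xpu")
weight = torch.ones(hidden, dtype=torch.float16, device="xpu")
out = torch.empty_like(x)

# Assumed vLLM-style signature: rms_norm(out, input, weight, epsilon)
torch.ops._C.rms_norm(out, x, weight, 1e-6)
print(out.shape)
```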
What's Changed
- setup build/lint system. add rms_norm as first op to verify functionality by @jikunshang in #1
- refactor files by ops in test folder by @jikunshang in #4
- readme update by @rogerxfeng8 in #5
- [CI] Enable CI by @jikunshang in #2
- use torch-xpu 2.8 link by @jikunshang in #6
- Add cache ops by @zufangzhu in #3
- Add rotary_embedding kernel by @Liangliang-Ma in #7
- Add silu_and_mul kernel by @Liangliang-Ma in #8
- add fp8 quantization kernels by @baodii in #12
- use function instead of lambda by @zufangzhu in #14
- silu_and_mul/rope use functor instead of lambda by @Liangliang-Ma in #15
- Add activation: gelu_fast/gelu_new/gelu_quick by @zhenwei-intel in #16
- feat(moe_sum): add moe_sum by @dbyoung18 in #17
- refactor the structure by @zufangzhu in #20
- use functor for `rms_norm` kernels by @jikunshang in #18
- add mul_and_silu/gelu_and_mul by @zhenwei-intel in #23
- swigluoai and mul by @mayuyuace in #19
- fix(build): Fix __assert_fail conflict between SYCL and PyTorch by @chaojun-zhang in #26
- feat(mla_cache): add concat_and_cache_mla & gather_cache by @dbyoung18 in #25
- add onednn w8a16 gemm by @zufangzhu in #24
- update onednn extension by @zufangzhu in #29
- update ci script by @jikunshang in #32
- add xpu op grouped topk by @mayuyuace in #27
- grouped topk from IPEX by @mayuyuace in #33
- feat(deepseek_rope): add deepseek_scaling_rope by @dbyoung18 in #34
- add ipex and model config for rms_norm op by @wendyliu235 in #28
- feat(lora) Add lora bgmv_shrink & bgmv_expand kernels by @chaojun-zhang in #31
- reduce the kernel test coverage for simulator by @chaojun-zhang in #36
- Add chunk_prefill with cutlass backend by @YizhouZ in #38
- topk softmax by @mayuyuace in #39
- [CUTLASS][chunk_prefill] add prefetch and refine barrier scope by @YizhouZ in #44
- [UT][CHUNK_PREFILL]add ut mini scope for fmha by @YizhouZ in #46
- Grouped gemm cutlass by @Liangliang-Ma in #22
- [UT][GroupedGemm] Add grouped gemm pytest mini scope by @Liangliang-Ma in #47
- [FusedMoE] support TP by stream align with device by @Liangliang-Ma in #50
- [OneDNN] Zufang/onednn w4a16 int4 by @zufangzhu in #49
- [CHUNK_PREFILL] fix tiling shape by @YizhouZ in #51
- Reduction of the MINI profiler UT execution time by @chaojun-zhang in #52
- fix install build file lack by @jikunshang in #45
- add op moe_align_block_size & batched_moe_align_block_size by @mayuyuace in #54
- [GPTQ] add stride setting by @zufangzhu in #57
- [CHUNK_PREFILL] enable sink and local attn by @YizhouZ in #58
- [FusedMoE] input reorder with sycl kernels by @Liangliang-Ma in #56
- Zufang/onednn fp8 gemm by @zufangzhu in #55
- [FusedMoE] Add bias for grouped gemm by @Liangliang-Ma in #62
- [GENERAL] add device index in vllmGetQueue by @YizhouZ in #61
- [Fix] use numel rather than dim, fix w8a8 acc by @zufangzhu in #63
- [Kernel benchmark] enable ipex path for reshape and cache by @DiweiSun in #41
- feat(moe_lora): Add MOE Lora Sum Kernel by @chaojun-zhang in #60
- [FusedMoE] fix an if-else in MoE with bias by @Liangliang-Ma in #65
- [FusedMoe] support fp16 grouped_gemm by @mayuyuace in #59
- [CHUNK_PREFILL] add policy 192 and check conditions by @YizhouZ in #68
- [FusedMoE] fix topk>1 acc issue and support different activation by @Liangliang-Ma in #66
- update clang format, for better indent by @jikunshang in #72
- [OneDNN] add primitive extension and cache locally by @zufangzhu in #73
- [UT] refine miniscope ut shape by @zufangzhu in #75
- [CI] install custom umd by @jikunshang in #77
- [FusedMoE] remove host experts token count to avoid blocking by @Liangliang-Ma in #78
- [CI]fix CI by @jikunshang in #79
- add MINI_PYTEST_PARAMS to test_moe_align_block_size by @mayuyuace in #81
- [CHUNK_PREFILL] kernel refactor using new api by @YizhouZ in #76
- [FusedMoE] refactor to new cutlass api by @Liangliang-Ma in #82
- upgrade torch-xpu 2.9 by @jikunshang in #70
- Support get xpu view from cpu tensor by @chaojun-zhang in #80
- [CHUNK_PREFILL] new api refactor phase 2 by @YizhouZ in #83
- [BUILD] refine CMake, enable AOT by @jikunshang in #89
- [cutlass] support fp8/int4/mxfp4 weights grouped gemm by @mayuyuace in #88
- [CHUNK_PREFILL] new api refactor phase3 by @YizhouZ in #90
- Fix lora accuracy and oom issues by @chaojun-zhang in #91
- [FusedMoE] tune bf16/fp16 grouped gemm grid shape and subgroup shape for decoding perf by @Liangliang-Ma in #96
- apply fused moe fp8/int4/fp4 by @mayuyuace in #98
- [OneDNN] Zufang/wint4aint8 by @zufangzhu in #93
- add contiguous inside rmsnorm kernel by @jikunshang in #95
- Add moe_gather for fused moe by @mayuyuace in #101
- Skip UVA tests for the mini pytest profiler. by @chaojun-zhang in #100
- fix bug of moe_gather by @mayuyuace in #102
- Refine topk softmax for different platform by @mayuyuace in #104
- [CI] clean up docker image in ci by @jikunshang in #99
- rename `fused_moe` to `fused_moe_prelogue` by @jikunshang in #105
- [CI]update docker file by @jikunshang in #106
- Reorg sycl-tla kernel code structure by @jikunshang in #103
- Skip fp8 quant large shape test for mini test profiler by @chaojun-zhang in #108
- change grouped gemm kernel fp8 scales to float by @mayuyuace in #110
- add force xe default kernel env var by @jikunshang in #111
- [CI]update pipeline by @jikunshang in #107
- add quantization into wheel package by @jikunshang in #112
- [OneDNN] add vllm own dnnl stream and engine by @zufangzhu in #113
- enable xpu_fused_moe ep by @mayuyuace in #114
- [CHUNK_PREFILL] add seq_k support by @YizhouZ in #115
- [build] Refactor sycl-tla kernel into dynamic library by @jikunshang in #116
- add paged decode split kernels for VLLM by @baodii in #123
- [FusedMoE] input reorder kernel support low precision inputs with scales by @Liangliang-Ma in #124
- support head_group_q more than 8 less than 16 by @baodii in #125
- Add `weak_ref_tensor` support for graph capture by @jikunshang in #121
- [OneDNN] gpu only build by @xinyu-intel in #120
- [build]split template build by @jikunshang in #126
- clean unused utils, add build version by @jikunshang in #122
- [lint]format cmake by @jikunshang in #127
- upgrade to torch 2.10 & oneapi 2025.3 by @jikunshang in #118
New Contributors
- @jikunshang made their first contribution in #1
- @rogerxfeng8 made their first contribution in #5
- @Liangliang-Ma made their first contribution in #7
- @baodii made their first contribution in #12
- @zhenwei-intel made their first contribution in #16
- @dbyoung18 made their first contribution in #17
- @mayuyuace made their first contribution in #19
- @chaojun-zhang made their first contribution in #26
- @wendyliu235 made their first contribution in #28
- @DiweiSun made their first contribution in #41
- @xinyu-intel made their first contribution in #120
Full Changelog: https://github.com/vllm-project/vllm-xpu-kernels/commits/v0.1.0