## Highlights

- **Fused kernels for `torch.compile`**: New `fuse_norm_quant`, `fuse_act_quant`, and `fused_qk_norm_rope` kernels enable norm-quantization and activation-quantization fusion under `torch.compile`.
- **MXFP8 / MXFP4 oneDNN GEMM**: Added microscaling FP8 and FP4 GEMM support via oneDNN, broadening low-precision inference capabilities.
- **Flash Attention `head_dim=512`**: Extended flash-MHA support to head dimension 512.
- **Chunked prefill dynamic stride**: Added dynamic stride support for chunked prefill attention, improving flexibility for variable-length workloads.
## New Features

### Attention

- [FMHA] Support head dimension 512 (#251) — Extends flash attention to models using 512-dim heads.
- [Chunk Prefill] Add dynamic stride support (#187) — Enables dynamic stride in chunked prefill for variable-length input sequences.
- [Decode Attention] Tune `num_kv_splits` for the paged decode kernel (#257) — Improves decode-stage attention performance via better KV-split tuning.
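For readers unfamiliar with KV splitting: in decode, one query token attends over a long paged KV cache, so the kernel partitions the KV sequence across workgroups and reduces the partial results. The heuristic below is a hypothetical illustration of how a split count might be chosen (the actual tuning in #257 lives inside the kernel and may differ):

```python
import math

def choose_num_kv_splits(seq_len: int, block_size: int = 64,
                         max_splits: int = 16,
                         min_blocks_per_split: int = 4) -> int:
    """Hypothetical heuristic: more splits expose more parallelism
    for a single decode query, but each split should still process
    at least a few KV blocks to amortize the final reduction."""
    num_blocks = math.ceil(seq_len / block_size)
    splits = max(1, num_blocks // min_blocks_per_split)
    return min(splits, max_splits)
```

Short contexts get a single split (no reduction overhead), while long contexts saturate at `max_splits`.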
### Activation

- Add `fatrelu_and_mul` (#259) — New fused FATReLU activation kernel.
- Support `relu2_no_mul` (SYCL) for Nemotron-3-Nano-30B-A3B-bf16 (#232).
- Support `swiglustep_and_mul` for Step-3.5-Flash (#199) — New fused SwiGLU-Step activation variant.
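As a reference for what an `_and_mul` activation computes: the last dimension is split in half into a gate and an up-projection, the activation is applied to the gate, and the two halves are multiplied elementwise. The sketch below uses that common layout for FATReLU (a thresholded ReLU); the layout and default threshold are assumptions, not taken from this release's SYCL source:

```python
def fatrelu_and_mul(x, threshold=0.05):
    """Pure-Python reference for a gated FATReLU (sketch).
    First half of x is the gate, second half the up-projection;
    FATReLU zeroes gate values at or below `threshold`."""
    d = len(x) // 2
    gate, up = x[:d], x[d:]
    return [(g if g > threshold else 0.0) * u for g, u in zip(gate, up)]
```

The fused kernel computes the same thing in one pass instead of materializing the activated gate as a separate tensor.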
### Quantization & Low-Precision

- [OneDNN] Add MXFP8 and MXFP4 GEMM (#235) — Microscaling FP8/FP4 GEMM via oneDNN for next-generation low-precision inference.
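For context, "microscaling" (the OCP MX formats) means groups of 32 elements share one power-of-two scale, so each element only needs a narrow FP8/FP4 payload. A rough sketch of the MXFP8 (E4M3) quantization recipe, with element rounding to the exact FP8 bit grid elided:

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in OCP FP8 E4M3
E4M3_EMAX = 8         # exponent of the top E4M3 binade (448 = 1.75 * 2**8)

def mx_quantize_block(block):
    """One 32-element MX block: a shared power-of-two (E8M0-style)
    scale is derived from the block max, then elements are scaled and
    clamped to the E4M3 range. Sketch only: real kernels also round
    each element onto the FP8 mantissa grid."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return 0.0, [0.0] * len(block)
    scale = 2.0 ** (math.floor(math.log2(amax)) - E4M3_EMAX)
    q = [max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, v / scale)) for v in block]
    return scale, q

def mx_dequantize(scale, q):
    return [scale * v for v in q]
```

MXFP4 follows the same shape with an E2M1 element format and a correspondingly smaller element range.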
### Fusion (`torch.compile`)

- Add `fuse_norm_quant`, `fuse_act_quant`, and `fused_qk_norm_rope` kernels (#267) — Fused normalization+quantization and QK-norm+RoPE kernels, registered as custom ops compatible with `torch.compile`.
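To make the payoff of norm+quant fusion concrete, here is a pure-Python reference for the combined computation done in one pass. The choice of RMSNorm and symmetric int8 quantization is an assumption for illustration; the release does not spell out the kernels' exact norm or quant formats:

```python
import math

def fused_rmsnorm_quant(x, weight, eps=1e-6, qmax=127):
    """Sketch of a fused norm+quant: RMS-normalize, apply the
    per-channel weight, then symmetrically quantize to int8 in the
    same pass. Fusing avoids writing the normalized fp tensor to
    memory between the two ops."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    normed = [v / rms * w for v, w in zip(x, weight)]
    scale = max(abs(v) for v in normed) / qmax or 1.0
    q = [round(v / scale) for v in normed]
    return q, scale
```

Under `torch.compile`, registering the fused version as a custom op lets the compiler treat it as a single node instead of a norm kernel followed by a quant kernel.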
### MoE (Mixture of Experts)

- Add `topk=10` for the `remap_hidden_states` kernel (#273) — Extends the remap kernel to support top-10 routing.
- Optimize MoE GEMM (#266) — Performance improvements for MoE grouped GEMM.
### Cache / Memory

- Add `swap_blocks_batch` op with batched async memcpy (#265) — New batched block-swap operation using asynchronous memory copies for improved KV-cache management.
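One reason batching block swaps helps: many small (src, dst) KV-block copies can be grouped so the backend issues fewer, larger async memcpys. The coalescing below is a hypothetical illustration of that idea, not the op's actual implementation:

```python
def coalesce_block_pairs(pairs):
    """Merge (src, dst) block-copy pairs into contiguous runs.
    Adjacent pairs whose source and destination both advance by one
    block can be served by a single larger copy. Returns a list of
    (src_start, dst_start, num_blocks) runs."""
    runs = []
    for src, dst in sorted(pairs):
        if runs and runs[-1][0] + runs[-1][2] == src \
                and runs[-1][1] + runs[-1][2] == dst:
            runs[-1][2] += 1            # extend the contiguous run
        else:
            runs.append([src, dst, 1])  # start a new run
    return [tuple(r) for r in runs]
```

Even when runs cannot be merged, submitting all copies in one batched call amortizes per-copy launch overhead versus one op invocation per block.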
### LoRA
- Add mixed precision support for LoRA expand & shrink kernels (#230) — Enables mixed-precision (e.g., bf16/fp32) LoRA adapters.
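For reference, "shrink" and "expand" are the two halves of the LoRA update `y += scaling * (x @ A) @ B`, where A is d×r and B is r×k. The pure-Python sketch below shows only the math; "mixed precision" in the actual kernels refers to dtype combinations such as bf16 activations with higher-precision accumulation, which plain Python floats cannot model:

```python
def lora_apply(x, A, B, scaling=1.0):
    """Reference LoRA math (sketch): shrink x through the low-rank
    A (d x r), then expand through B (r x k), scaled by `scaling`."""
    r = len(A[0])
    # shrink: h = x @ A
    h = [sum(x[i] * A[i][j] for i in range(len(x))) for j in range(r)]
    # expand: y = scaling * (h @ B)
    return [scaling * sum(h[j] * B[j][k] for j in range(r))
            for k in range(len(B[0]))]
```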
### Model Support

- Support f32 `ssm_state` in the GDN kernel for Qwen3.5 (#220) — Enables fp32 SSM state for Qwen3.5 Mamba-style models.
- Follow upstream in changing the `A_log` dtype to fp32 for Qwen3.5 (#254).
## Bug Fixes

- Fix overflow of `remap_hidden_states` when the row count is huge (#269) — Resolved an integer overflow in large batch/sequence scenarios.
- Fix XPU CPU-view tensor lifetime (#262) — Fixed a use-after-free with CPU-view tensors on XPU.
- Skip scales check (#256) — Fixed a spurious validation failure in the quantized GEMM path.
- Skip the GDN `core_attn_out` check at 8k sequence length due to random numeric error (#264, #261).
## Performance

- Optimize MoE GEMM (#266) — Tuned grouped GEMM policies for better throughput.
- Tune `num_kv_splits` for the paged decode kernel (#257) — Improved decode attention latency.
## Infrastructure & Build
- Refactor CMake to enable selective kernel build (#260) — Allows building only a subset of kernels, reducing compile time and binary size.
- Upgrade oneDNN to v3.11.2 (#248).
- Use local LRU cache for oneDNN primitive caching (#275).
- Show binary size in pre-CI (#268).
- Update SCM version check and project Python version (#274).
- Remove yapf (#272) — Dropped yapf formatter from the project.
- Add `psutil` to `pyproject.toml` (#255).
- Fix MoE benchmark (#279).
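On the "local LRU" change: keeping a small, locally owned LRU of recently built primitives avoids contention on a single global table and rebuilds of hot primitives. A minimal sketch of such a cache (oneDNN's own primitive cache is a separate, separately configured mechanism):

```python
from collections import OrderedDict

class LocalLRUCache:
    """Minimal LRU cache keyed by primitive descriptor, evicting the
    least recently used entry once capacity is exceeded."""
    def __init__(self, capacity: int = 8):
        self.capacity = capacity
        self._d = OrderedDict()

    def get(self, key):
        if key not in self._d:
            return None
        self._d.move_to_end(key)          # mark as most recently used
        return self._d[key]

    def put(self, key, value):
        self._d[key] = value
        self._d.move_to_end(key)
        if len(self._d) > self.capacity:
            self._d.popitem(last=False)   # evict least recently used
```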
## Testing
- Refine test scope definition (#250) — Improved test profiling and scope control framework.
- New tests: `test_fused_norm_quant`, `test_fused_qk_norm_rope`, `test_fused_quant_activation`, `test_swiglustep_and_mul`, `test_fp4_gemm_onednn`, `test_cache` (swap_blocks_batch), `test_lora_ops` (mixed precision).
## Contributors
Thanks to all 11 contributors for this release:
Kunshang Ji, Xinyu Chen, Qiming Zhang, Qun Yang, Chaojun Zhang, Zofia, Zhefeng Qiao, Yejing Lai, Yizhou Wang, Liuzhenwei, Baodi
## What's Changed
- upgrade onednn to v3.11.2 by @zufangzhu in #248
- Support f32 ssm_state in GDN kernel for Qwen3.5 by @YangQun1 in #220
- Add mixed precision support for LoRA expand & shrink kernels by @chaojun-zhang in #230
- Support swiglustep and mul for Step-3.5-Flash by @Dboyqiao in #199
- [build]add psutil in pyproject.toml by @jikunshang in #255
- skip scales check by @mayuyuace in #256
- Support sycl impl relu2_no_mul for NVIDIA-Nemotron-3-Nano-30B-A3B-bf16 by @Dboyqiao in #232
- Change gdn attn A_log dtype to fp32 for qwen3.5 by @YangQun1 in #254
- [OneDNN] add mxfp8, mxfp4 onednn gemm by @zufangzhu in #235
- skip gdn core_attn_out check for f32 ssm_state+8k len due to random numeric error by @YangQun1 in #261
- skip gdn core_attn_out check for 8k len due to random numeric error by @YangQun1 in #264
- [Decode attn] tune num_kv_splits for page decode kernel by @baodii in #257
- [fmha] support head dim 512 by @xinyu-intel in #251
- refactor cmake to enable selective kernel build by @xinyu-intel in #260
- Optimize Moe GEMM by @mayuyuace in #266
- Fix overflow of remap_hidden_states when rows is huge by @mayuyuace in #269
- Add swap_blocks_batch op with batched async memcpy by @chaojun-zhang in #265
- [Test] refine test scope definition by @jikunshang in #250
- Fix use-after-free in get_xpu_view_from_cpu_tensor by @chaojun-zhang in #262
- [Fusion][Torch.compiler] Add fuse_norm_quant, fuse_act_quant and fused_qk_norm_rope kernel by @Yejing-Lai in #267
- [Build][Lint] remove yapf by @jikunshang in #272
- [OneDNN] use local lru by @zufangzhu in #275
- [Build] update scm version check and project python version by @jikunshang in #274
- [CHUNK_PREFILL] add dynamic_stride support by @YizhouZ in #187
- add fatrelu_and_mul by @zhenwei-intel in #259
- fix moe benchmark by @xinyu-intel in #279
- show binary size in pre-ci by @xinyu-intel in #268
- remap hidden status topk 10 by @mayuyuace in #273
## New Contributors
**Full Changelog**: v0.1.5...v0.1.6