
v0.1.6 release

@jikunshang released this 20 Apr 05:38
e095a4a

Highlights

  • Fused Kernels for torch.compile: New fuse_norm_quant, fuse_act_quant, and fused_qk_norm_rope kernels enabling norm-quantization and activation-quantization fusion under torch.compile.
  • MXFP8 / MXFP4 oneDNN GEMM: Added microscaling FP8 and FP4 GEMM support via oneDNN, broadening low-precision inference capabilities.
  • Flash Attention head_dim=512: Extended flash-MHA support to head dimension 512.
  • Chunked Prefill dynamic stride: Added dynamic stride support for chunked prefill attention, improving flexibility for variable-length workloads.

New Features

Attention

  • [FMHA] Support head dimension 512 (#251) — Extends flash attention to models using 512-dim heads.
  • [Chunk Prefill] Add dynamic stride support (#187) — Enables dynamic stride in chunked prefill for variable-length input sequences.
  • [Decode Attention] Tune num_kv_splits for paged decode kernel (#257) — Improves decode-stage attention performance via better KV-split tuning.
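Splitting decode attention across the KV cache is the idea behind `num_kv_splits`: each split computes a partial softmax-weighted sum along with its max logit and exp-sum, and the partials are merged exactly by rescaling. The sketch below is an illustrative pure-Python model of that combine step, not the kernel's code; all names are hypothetical.

```python
import math

def partial_attention(q, keys, vals):
    """Attention over one KV split; returns (out, max_logit, sum_exp)."""
    logits = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(logits)
    e = [math.exp(l - m) for l in logits]
    s = sum(e)
    out = [sum(ei * v[d] for ei, v in zip(e, vals)) / s
           for d in range(len(vals[0]))]
    return out, m, s

def combine_splits(parts):
    """Merge per-split partials with softmax rescaling (flash-decoding style)."""
    m_all = max(m for _, m, _ in parts)
    s_all = sum(s * math.exp(m - m_all) for _, m, s in parts)
    dim = len(parts[0][0])
    out = [0.0] * dim
    for o, m, s in parts:
        w = s * math.exp(m - m_all) / s_all  # this split's softmax mass
        for d in range(dim):
            out[d] += w * o[d]
    return out
```

Because the combine is exact, tuning `num_kv_splits` is purely a scheduling decision: more splits expose more parallelism at the cost of one extra reduction pass.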

Activation

  • Add fatrelu_and_mul (#259) — New FATReLU fused activation kernel.
  • Support relu2_no_mul (SYCL) for Nemotron-3-Nano-30B-A3B-bf16 (#232).
  • Support swiglustep_and_mul for Step-3.5-Flash (#199) — New SwiGLU-Step fused activation variant.
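Activation kernels named `*_and_mul` typically implement the gated-MLP pattern: split the last dimension in half, apply the activation to the gate half, and multiply elementwise by the up half. FATReLU is a ReLU that also zeroes values below a positive threshold. A minimal pure-Python sketch of that pattern (hypothetical signature; the real kernel operates on tensors):

```python
def fatrelu(x, threshold=0.0):
    # FATReLU: pass x through unchanged above the threshold, zero below it.
    return x if x > threshold else 0.0

def fatrelu_and_mul(row, threshold=0.0):
    """act_and_mul pattern: split [2*d] into gate/up halves, out = act(gate) * up."""
    d = len(row) // 2
    return [fatrelu(row[i], threshold) * row[d + i] for i in range(d)]
```

With `threshold=0.0` this degenerates to plain `relu_and_mul`; the other variants in this release swap in different activations under the same split-and-multiply structure.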

Quantization & Low-Precision

  • [OneDNN] Add MXFP8 and MXFP4 GEMM (#235) — Microscaling FP8/FP4 GEMM via oneDNN for next-gen low-precision inference.
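In the OCP microscaling (MX) formats, every 32-element block shares a single power-of-two scale (E8M0) and stores its elements in a narrow type, FP4 E2M1 in the case of MXFP4. The oneDNN GEMM consumes these formats directly; the pure-Python sketch below only illustrates the number format itself, using short blocks for readability.

```python
import math

# Non-negative magnitudes representable in FP4 E2M1 (the MXFP4 element type).
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_mxfp4_block(block):
    """One MX block: a shared power-of-two scale plus FP4 elements.

    Scale exponent follows the OCP MX spec: floor(log2(amax)) - emax,
    where emax = 2 for E2M1 (largest magnitude 6.0 = 1.5 * 2**2).
    """
    amax = max(abs(v) for v in block)
    exp = math.floor(math.log2(amax)) - 2 if amax > 0 else 0
    scale = 2.0 ** exp

    def to_fp4(v):
        mag = min(abs(v) / scale, 6.0)                   # clip to FP4 range
        q = min(E2M1_GRID, key=lambda g: abs(g - mag))   # round to nearest
        return math.copysign(q, v)

    return scale, [to_fp4(v) for v in block]

def dequantize_block(scale, elems):
    return [scale * e for e in elems]
```

The per-block scale is what distinguishes MX from plain per-tensor FP8/FP4 quantization: outliers only distort the 32 elements that share their block.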

Fusion (torch.compile)

  • Add fuse_norm_quant, fuse_act_quant, and fused_qk_norm_rope kernels (#267) — Fused normalization+quantization and QK-norm+RoPE kernels, registered as custom ops compatible with torch.compile.
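What `fused_qk_norm_rope` fuses is easy to state in scalar form: per-head RMSNorm on the query and key vectors, followed by rotary position embedding, done in one launch instead of three. A pure-Python reference of the math (illustrative only; the names and the adjacent-pair RoPE convention are assumptions, real kernels may rotate half-dimensions instead):

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: x / sqrt(mean(x^2) + eps), elementwise-scaled by weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]

def rope(x, pos, base=10000.0):
    """Rotary embedding: rotate consecutive pairs by position-dependent angles."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def qk_norm_rope(q, k, q_weight, k_weight, pos):
    """What a fused kernel computes in one pass over q and k."""
    return rope(rms_norm(q, q_weight), pos), rope(rms_norm(k, k_weight), pos)
```

Fusing these avoids writing the normalized q/k back to memory between the two ops, which is the whole point of registering them as custom ops that torch.compile can call as a unit.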

MoE (Mixture of Experts)

  • Add topk=10 for remap_hidden_states kernel (#273) — Extends the remap kernel to support top-k = 10 expert routing.
  • Optimize MoE GEMM (#266) — Performance improvements for MoE grouped GEMM.
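`remap_hidden_states` is the gather that makes grouped GEMM possible: each token's hidden state is duplicated once per routed expert and the rows are reordered so every expert's tokens are contiguous. A conceptual pure-Python model (hypothetical signature; the real kernel does this on-device):

```python
def remap_hidden_states(hidden, topk_ids, num_experts):
    """Gather token rows into expert-contiguous order for grouped GEMM.

    Each token appears once per expert it is routed to (top-k duplication).
    Returns the remapped rows and the token index behind each output row.
    """
    order = []  # source token index per output row, grouped by expert
    for e in range(num_experts):
        for t, experts in enumerate(topk_ids):
            if e in experts:
                order.append(t)
    return [hidden[t] for t in order], order
```

The output row count is `num_tokens * topk`, which is why index arithmetic here overflows 32-bit integers at large batch sizes (see the #269 fix below).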

Cache / Memory

  • Add swap_blocks_batch op with batched async memcpy (#265) — New batched block-swap operation using asynchronous memory copy for improved KV-cache management.
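A block swap copies fixed-size KV-cache blocks between pools according to a (source, destination) mapping; batching lets all copies be issued as one group of async memcpys instead of one call per block. A pure-Python model of the semantics only (flat lists stand in for device tensors; names are hypothetical):

```python
def swap_blocks_batch(src_cache, dst_cache, block_mapping, block_size):
    """Copy each mapped block from src_cache into dst_cache.

    In the real op these copies are issued as one batch of asynchronous
    memcpys; here we only model the resulting data movement.
    """
    for src_block, dst_block in block_mapping:
        s = src_block * block_size
        d = dst_block * block_size
        dst_cache[d:d + block_size] = src_cache[s:s + block_size]
```

Batching matters because swap traffic is many small, independent copies: launching them together amortizes per-copy submission overhead.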

LoRA

  • Add mixed precision support for LoRA expand & shrink kernels (#230) — Enables mixed-precision (e.g., bf16/fp32) LoRA adapters.
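A LoRA update decomposes into a shrink (project the hidden state down to rank r via A) and an expand (project back up via B, scaled). Mixed-precision support means the kernels can take lower-precision activations (e.g. bf16) while accumulating in fp32. A scalar sketch of the two projections (hypothetical names; the real ops are batched over tokens and adapters):

```python
def matvec(mat, vec):
    # Accumulate each dot product in full precision; this models the
    # fp32 accumulation used with bf16 inputs in a mixed-precision kernel.
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]

def lora_shrink(x, A):
    """Project hidden state x [d] down to rank r via A [r, d]."""
    return matvec(A, x)

def lora_expand(y, B, scaling):
    """Project rank-r activation y back to [d_out] via B [d_out, r], scaled."""
    return [scaling * v for v in matvec(B, y)]
```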

Model Support

  • Support f32 ssm_state in GDN kernel for Qwen3.5 (#220) — Enables fp32 SSM state for Qwen3.5 Mamba-style models.
  • Follow upstream to change A_log dtype to fp32 for Qwen3.5 (#254).

Bug Fixes

  • Fix overflow of remap_hidden_states when rows is huge (#269) — Resolved integer overflow for large batch/sequence scenarios.
  • Fix XPU CPU-view tensor lifetime (#262) — Fixed use-after-free issue with CPU-view tensors on XPU.
  • Skip scales check (#256) — Fixed spurious validation failure in quantized GEMM path.
  • Skip GDN core_attn_out check for 8k length due to random numeric error (#264, #261).

Performance

  • Optimize MoE GEMM (#266) — Tuned grouped GEMM policies for better throughput.
  • Tune num_kv_splits for paged decode kernel (#257) — Improved decode attention latency.

Infrastructure & Build

  • Refactor CMake to enable selective kernel build (#260) — Allows building only a subset of kernels, reducing compile time and binary size.
  • Upgrade oneDNN to v3.11.2 (#248).
  • Use local LRU cache for oneDNN primitive caching (#275).
  • Show binary size in pre-CI (#268).
  • Update SCM version check and project Python version (#274).
  • Remove yapf (#272) — Dropped yapf formatter from the project.
  • Add psutil to pyproject.toml (#255).
  • Fix MoE benchmark (#279).

Testing

  • Refine test scope definition (#250) — Improved test profiling and scope control framework.
  • New tests: test_fused_norm_quant, test_fused_qk_norm_rope, test_fused_quant_activation, test_swiglustep_and_mul, test_fp4_gemm_onednn, test_cache (swap_blocks_batch), test_lora_ops (mixed precision).

Contributors

Thanks to all 11 contributors for this release:

Kunshang Ji, Xinyu Chen, Qiming Zhang, Qun Yang, Chaojun Zhang, Zofia, Zhefeng Qiao, Yejing Lai, Yizhou Wang, Liuzhenwei, Baodi


Full Changelog: v0.1.5...v0.1.6