# Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

## [Unreleased]

### Added
- Flash Attention v2 kernels for Ampere GPUs (SM_80/SM_86)
- Flash Attention v3 kernels for Hopper GPUs (SM_90)
- DeepSeek Multi-Head Latent Attention (MLA) decode kernel
- DeepSeek Sparse Attention (DSA) decode kernel
- DeepSeek Native Sparse Attention (NSA) kernels (forward, top-k, sliding window, mean pooling)
- Multi-Head Attention (MHA) and Group Query Attention (GQA) forward/backward/decode ops
- Paged KV-cache support for MHA/GQA decode
- MatMul (GEMM/GEMV) kernel with auto-tuning support
- Grouped GEMM kernel
- 1D C2C FFT kernels (radix and LUT variants)
- Manifold-Constrained Hyper-Connection (MHC) pre/post kernels
- FP8 quantization and lightning indexer kernels
- Top-k selector kernel
- 2-layer hierarchical API: Kernel → Op
- Auto-tuning infrastructure for kernel parameter search
- Test framework (`TestBase`, `FixtureBase`) and benchmark framework (`BenchmarkBase`, `BenchmarkReport`)
- CI with pre-commit linting, packaging, and GPU-based test runs