Releases: ARM-software/kleidiai
v1.25.0
v1.24.0
- New SME micro-kernels
- Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSI4CXP RHS with F32 output.
- Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSI4CXP RHS with F32 output.
v1.23.0
- New SME2 micro-kernels
- Add SME2 elastic GEMM micro-kernels.
- The micro-kernel consists of a primary micro-kernel with an 8 vscale * 8 vscale (2VLx2VL) output block and other micro-kernels with different output blocks to handle the edges.
- Data type: FP32.
- Instruction: SME2 MOPA.
- New naming rule and API design are introduced with the elastic GEMM micro-kernel.
- Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSI8CXP RHS with F16 output.
- Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSI8CXP RHS with F16 output.
- Extended the following kernels to support variable block length
- kai_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon
- kai_lhs_quant_pack_qsi8d32p_f32_neon
- kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa
- kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot
- Documentation
- Added overview of micro-kernels
- Fixes
- Update the kai_matmul_clamp_f32_qai8dxp1vlx8_qsi4cxp4vlx8_1vlx4vl_sme2_mopa kernel to multiply the zero-points and row-sums as integers instead of float to improve accuracy.
- Implement clamping in kernels where it was missing to match their naming
- kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4x4_1x4_neon_dotprod
- kai_matmul_clamp_f32_qsi8d32p4x4_qsi4c32p4x4_16x4_neon_dotprod
- kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa
- kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot
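Note on the zero-point fix in this release: the sketch below illustrates one way a float multiplication of a zero-point and a row-sum can lose accuracy compared with an integer multiplication. It is not the kernel's code. The helper names `zp_term_float` and `zp_term_int` are hypothetical, and the operand values are chosen only so that the product exceeds 2^24, the largest range in which a 32-bit float represents every integer exactly.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not a KleidiAI function): zero-point correction term
 * computed in float. The product is rounded to float's 24-bit significand. */
static float zp_term_float(int32_t zero_point, int32_t row_sum) {
    /* volatile forces the product to be stored as a 32-bit float. */
    volatile float product = (float)zero_point * (float)row_sum;
    return product;
}

/* Hypothetical helper: the same term computed exactly in integers.
 * The product must fit in int32_t, which the assertion checks. */
static int32_t zp_term_int(int32_t zero_point, int32_t row_sum) {
    int64_t wide = (int64_t)zero_point * (int64_t)row_sum;
    assert(wide >= INT32_MIN && wide <= INT32_MAX);
    return (int32_t)wide;
}
```

With `zero_point = 101` and `row_sum = 200001`, the exact product is 20200101. That is an odd number above 2^24, so the float path cannot represent it and returns a neighbouring value, while the integer path returns it exactly.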
v1.22.0
- New SME2 micro-kernels
- Matrix Multiplication (MxN) Micro-Kernels of F16 LHS and QSI4C32P RHS with F32 input and output.
- New Advanced SIMD packing kernel for F16PMRX2 LHS
- Optimizations
- Optimize kai_rhs_pack_nxk_qsu2cxp4vlx4_qsu2cx_neon and kai_lhs_quant_pack_qai8dxp_f32 further for the block depth (kr/sr = 4) with Advanced SIMD
- Fixes
- Minor fix to the packing parameters (kr and sr) in matmul_clamp_f32_qai8dxp_qsu2cxp kernels.
- Fix the vectorized round off instruction in kai_lhs_quant_pack_qai8dxp_f32 for the block depth (kr/sr = 8) to be consistent with the scalar implementation.
v1.21.0
- New SME2 micro-kernels
- Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSU2CXP RHS with F32 output.
- Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSU2CXP RHS with F32 output.
- New Packing kernel for QSU2CXP RHS.
- New SVE micro-kernels
- Indirect matrix multiplication (MxN) Micro-Kernel of F32 input and output.
- Added benchmarking for SVE indirect matrix multiplication (imatmul)
- Add Bazel 9 support
- Fixes
- Addressed undefined behavior affecting polymorphic handling in test framework
- Fixed GTest random seed to prevent flaky tests
v1.20.0
- Added code to support compilation with exceptions disabled
- Extended the benchmarking suite to support depthwise convolution benchmarking
- Added kai_dwconv_clamp_f32_f32_f32p1vlx1b_3x3_s1_4xc_sme2_mla to benchmarking
- Fixes
- Fixed clamping not being applied in matmul_clamp_f32_bf16p1x4_bf16p12x4b_1x36_neon_dot
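Note on the clamping fix: a kernel with `clamp` in its name is expected to limit every output value to a caller-supplied range. The sketch below shows that operation on a row-major F32 tile. It is a minimal illustration under assumptions: `clamp_f32_tile` and its parameters are hypothetical and are not part of the KleidiAI API, and the stride is counted in elements rather than bytes.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not a KleidiAI function): clamp every element of a
 * row-major m x n F32 tile to [clamp_min, clamp_max] in place.
 * row_stride is the distance between rows, in elements. */
static void clamp_f32_tile(float *dst, size_t m, size_t n, size_t row_stride,
                           float clamp_min, float clamp_max) {
    assert(clamp_min <= clamp_max);
    for (size_t i = 0; i < m; ++i) {
        float *row = dst + i * row_stride;
        for (size_t j = 0; j < n; ++j) {
            float v = row[j];
            if (v < clamp_min) v = clamp_min; /* apply lower bound */
            if (v > clamp_max) v = clamp_max; /* apply upper bound */
            row[j] = v;
        }
    }
}
```

Passing the smallest and largest finite float values as bounds leaves finite outputs unchanged, which is a common way to request "no clamping" from an interface of this shape.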
v1.19.0
- Added new unit test framework
- New SME2 micro-kernels
- Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSI4C32P RHS with F32 input and output.
- Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSI4C32P RHS with F32 input and output.
- Add clang-cl compiler macros for kai_matmul_clamp_f32_f32_f32p8x1biasf32_6x8x4_neon_mla_asm micro-kernel
- Fixes
- Fix incorrect API in BF16 kernel interface
v1.18.0
- Fixes
- Add Null Bias support for rhs_pack_kxn_x16p32x1b_x16_x16_neon.
- Updated description of matmul file name from m_step x n_step to m_block x n_block
- Clamp after scaling in matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme_mopa.
- New SVE micro-kernels
- Matrix Multiplication (MxN) Micro-Kernels with F32 input and output.
- Update the example matmul_clamp_f32_qsi8d32p_qsi4c32p to demonstrate how a micro-kernel can be used in a multithreaded environment.
- Documentation
- Update documentation to use markdown syntax
- Added FAQ
- Added a section that describes the directory structure and a section that describes the different example implementations
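Note on the multithreaded example mentioned above: one plausible way to share a matmul across threads is to give each thread a disjoint range of output rows that starts at a multiple of the micro-kernel's m_step. The sketch below computes such a range. It is an assumption about the general technique, not a copy of the shipped example, and `row_range` and `thread_row_range` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical type: a contiguous range of output rows. */
typedef struct {
    size_t start; /* first row handled by the thread */
    size_t count; /* number of rows handled by the thread (may be 0) */
} row_range;

/* Hypothetical helper (not a KleidiAI function): rows assigned to
 * thread_idx when m rows are split across num_threads threads.
 * Every range starts at a multiple of m_step; only the final
 * non-empty range may cover a partial step. */
static row_range thread_row_range(size_t m, size_t m_step,
                                  size_t num_threads, size_t thread_idx) {
    assert(m_step > 0 && num_threads > 0 && thread_idx < num_threads);
    /* Number of m_step-sized row blocks, rounded up. */
    size_t num_steps = (m + m_step - 1) / m_step;
    /* Blocks per thread, rounded up so that all blocks are covered. */
    size_t steps_per_thread = (num_steps + num_threads - 1) / num_threads;
    size_t start = thread_idx * steps_per_thread * m_step;
    row_range r = {0, 0};
    if (start < m) {
        size_t end = start + steps_per_thread * m_step;
        r.start = start;
        r.count = (end < m ? end : m) - start;
    }
    return r;
}
```

For m = 100, m_step = 16 and three threads, the ranges are rows 0..47, 48..95 and 96..99. Each thread would then call the micro-kernel on its own range, using the start row to compute the offsets into the packed LHS and the destination.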
v1.17.0
- Fixes
- Add Null Bias support for rhs_pack_kxn_x32p16x1b_x32_x32_neon.
- Fixed the following micro-kernels, which reported an incorrect m_step value:
- kai_lhs_quant_pack_qai8dxp_bf16_neon
- kai_lhs_quant_pack_qai8dxp_f16_neon
- kai_lhs_quant_pack_qai8dxp_f32
- kai_lhs_quant_pack_qsi8d32p4x8sb_f32_neon
- kai_lhs_quant_pack_qsi8d32p_f32
- kai_lhs_quant_pack_qsi8d32p_f32_neon
- kai_lhs_quant_pack_qsi8d32pscalef32_f16_neon
- kai_lhs_quant_pack_qsi8d32pscalef32_f32_neon
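Note on the null-bias fix in this release: the sketch below shows the general idea of accepting a NULL bias pointer in a packing routine by writing zeros where the bias values would go. It assumes that a null bias is meant to behave like an all-zero bias. `pack_bias_f32` is a hypothetical helper and does not reflect the actual packed layout used by the library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not the KleidiAI packing routine): copy n bias
 * values into the packed buffer, or write zeros when bias is NULL. */
static void pack_bias_f32(float *packed_bias, const float *bias, size_t n) {
    assert(packed_bias != NULL);
    for (size_t j = 0; j < n; ++j) {
        packed_bias[j] = (bias != NULL) ? bias[j] : 0.0f;
    }
}
```

Handling the NULL case inside the packing step means callers without a bias do not need to allocate and fill a zero vector themselves.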
v1.16.0
- Extended the benchmarking framework to support multiple operators.
- Initial support for matrix multiplication (matmul) & indirect matrix multiplication (imatmul)
- Added all imatmul and matmul micro-kernels to the benchmark suite
- Fixes:
- All SME and SME2 micro-kernels now commit ZA lazy save buffer when building with SME support.
- Fixed incorrect handling of zero point and scale in two packing kernels, which caused incorrect de-quantisation in certain cases:
- kai_rhs_pack_nxk_qai4c32ps1s0nrx4_qau4c32s0s1_f32_f32_f32_neon
- kai_rhs_pack_nxk_qai4c32ps1s0nrx4_qau4c32s1s0_f32_f32_f32_neon
- New SVE micro-kernels (256-bit vector length specific):
- Matrix multiplication (MxN) Micro-kernels of QSI8DX LHS and QSI4CX RHS with F32 input and output.
- Matrix multiplication (1xN) Micro-kernels of QSI8DX LHS and QSI4CX RHS with F32 input and output.