Releases: ARM-software/kleidiai
v1.26.0
- New SME micro-kernels
- Added an x8 matmul pack micro-kernel family with 4vsx4 blocked layout, without packing bias.
- Added SME RHS depthwise packing kernel for FP16
- New SME2 micro-kernels
- Added SME2 depthwise indirect micro-kernel for FP16.
- kai_matmul_i32_u8p4vsx4_u8p4vsx4_i32_i32_8vsx8vs_sme2_mopa.
- kai_matmul_clamp_f32_u8p4vsx4_u8p4vsx4_i32_i32_f32_f32_8vsx8vs_sme2_mopa.
- Extended the following micro-kernels to support variable block length
- kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4x4_1x4_neon_dotprod
- kai_matmul_clamp_f32_qsi8d32p1x8_qsi4c32p4x8_1x4x32_neon_dotprod
- kai_matmul_clamp_f32_qsi8d32p4x4_qsi4c32p4x4_16x4_neon_dotprod
- kai_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_8x4x32_neon_i8mm
- kai_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm
- kai_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0
- kai_lhs_quant_pack_qsi8d32p4x8sb_f32_neon
- Fixes
- Added ZA lazy save to kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme_dot
- Fix QAI8/QSI8CXP matmul test failures by constraining generated qsi32 bias values to preserve int32 accumulator headroom.
- Fix a clamping issue in matmul_clamp_qai8_qai8p_qai8p_test.cpp
- Fix traditional matmul and imatmul packed offset helpers to use packing panel boundaries.
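The bias-range fix listed above comes down to simple headroom arithmetic. The sketch below is not the test code from the repository; it is a minimal illustration that assumes signed 8-bit operands in [-128, 127] and ignores any zero-point correction terms. The helper name `max_safe_bias_magnitude` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of KleidiAI: largest bias magnitude that
 * cannot overflow an int32 accumulator after k int8 x int8 products. */
static int32_t max_safe_bias_magnitude(size_t k) {
    /* Worst-case magnitude of a single product: (-128) * (-128) = 16384. */
    const int64_t worst_product = 128 * 128;
    const int64_t worst_sum = (int64_t)k * worst_product;

    /* If this fails, the products alone can already overflow int32. */
    assert(worst_sum <= INT32_MAX);

    return (int32_t)(INT32_MAX - worst_sum);
}
```

A test generator could then draw bias values from `[-max_safe_bias_magnitude(k), max_safe_bias_magnitude(k)]`, so that bias plus the dot product always fits in 32 bits under the stated assumptions.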
- New Advanced SIMD micro-kernels
- Matrix Multiplication MxN and 1xN Micro-Kernels of QAI8DXP LHS and QSU2CXP RHS with F32 output, optimized for FEAT_DotProd, along with RHS packing kernel.
- Documentation
- Contribution policy updates as part of third party contribution enablement
- Added coding standard and conventions
- New Transposed-B RHS packing micro-kernel versions of kai_rhs_pack_kxn_x32p16x1b_x32_x32_neon and kai_rhs_pack_kxn_x16p32x1b_x16_x16_neon:
- kai_rhs_pack_nxk_x16p32x1bx16_x16_x16_neon
- kai_rhs_pack_nxk_x32p16x1bx32_x32_x32_neon
- New SME2 FP32 GEMV micro-kernel with 4vsx1 RHS format
- New SME2 static Int8 GEMM/GEMV kernels and the RHS packing kernel.
- kai_matmul_clamp_qai8_qai8p4vsx4_qsi8cxp4vsx4bi32sf32_8vsx8vs_sme2_mopa
- kai_matmul_clamp_qai8_qai8_qsi8cxp4vsx4bi32sf32_1x32vs_sme2_dot
- kai_matmul_pack_rhs_kxn_qsi8cxp4vsx4bi32sf32_qsi8_i32_f32_sme
v1.25.0
- Optimizations
- Optimize rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon for Int4 GEMM/GEMV kernels
- Fixes
- Fix RHS tail handling in kai_matmul_clamp_f32_qai8dxp1vlx4_qsi4cxp4vlx4_1vlx4vl_sme_mopa.
v1.24.0
- New SME micro-kernels
- Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSI4CXP RHS with F32 output.
- Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSI4CXP RHS with F32 output.
v1.23.0
- New SME2 micro-kernels
- Add SME2 elastic GEMM micro-kernels.
- The micro-kernel consists of a primary micro-kernel with 8 vscale * 8 vscale (2VLx2VL) output block and other micro-kernels with different output block to handle the edges.
- Data type: FP32.
- Instruction: SME2 MOPA.
- New naming rule and API design are introduced with the elastic GEMM micro-kernel.
- Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSI8CXP RHS with F16 output.
- Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSI8CXP RHS with F16 output.
- Extended the following kernels to support variable block length
- kai_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon
- kai_lhs_quant_pack_qsi8d32p_f32_neon
- kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa
- kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot
- Documentation
- Added overview of micro-kernels
- Fixes
- Update the kai_matmul_clamp_f32_qai8dxp1vlx8_qsi4cxp4vlx8_1vlx4vl_sme2_mopa kernel to multiply the zero-points and row-sums as integers instead of float to improve accuracy.
- Implement clamping in kernels where it was missing to match their naming
- kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4x4_1x4_neon_dotprod
- kai_matmul_clamp_f32_qsi8d32p4x4_qsi4c32p4x4_16x4_neon_dotprod
- kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa
- kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot
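The zero-point and row-sum fix in this release can be motivated with a small numeric sketch. These notes do not show the exact expression used in the kernel, so the code below only illustrates the general effect: a 32-bit float has a 24-bit significand, so an integer product above 2^24 may not be representable exactly. Both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Exact: the zero-point times row-sum term computed in integer arithmetic. */
static int64_t zp_times_rowsum_int(int32_t zero_point, int32_t row_sum) {
    return (int64_t)zero_point * (int64_t)row_sum;
}

/* Potentially lossy: the same term computed in single-precision float. */
static float zp_times_rowsum_float(int32_t zero_point, int32_t row_sum) {
    /* Keep the inputs themselves exactly representable as float. */
    assert(row_sum > -(1 << 24) && row_sum < (1 << 24));

    const float product = (float)zero_point * (float)row_sum;
    return product;
}
```

For example, with a zero-point of 127 and a row-sum of 200001 the integer product is 25400127, an odd value above 2^24 that a float cannot hold exactly, so the float path returns a slightly different number before any scaling is applied.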
v1.22.0
- New SME2 micro-kernels
- Matrix Multiplication (MxN) Micro-Kernels of F16 LHS and QSI4C32P RHS with F32 input and output.
- New Advanced SIMD packing kernel for F16PMRX2 LHS
- Optimizations
- Optimize kai_rhs_pack_nxk_qsu2cxp4vlx4_qsu2cx_neon and kai_lhs_quant_pack_qai8dxp_f32 further for the block depth (kr/sr = 4) with Advanced SIMD
- Fixes
- Minor fix to the packing parameters (kr and sr) in matmul_clamp_f32_qai8dxp_qsu2cxp kernels.
- Fix the vectorized rounding instruction in kai_lhs_quant_pack_qai8dxp_f32 for the block depth (kr/sr = 8) to be consistent with the scalar implementation.
v1.21.0
- New SME2 micro-kernels
- Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSU2CXP RHS with F32 output.
- Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSU2CXP RHS with F32 output.
- New Packing kernel for QSU2CXP RHS.
- New SVE micro-kernels
- Indirect matrix multiplication (MxN) Micro-Kernel of F32 input and output.
- Added benchmarking for SVE indirect matrix multiplication (imatmul)
- Add Bazel 9 support
- Fixes
- Addressed undefined behavior affecting polymorphic handling in test framework
- Fixed GTest random seed to prevent flaky tests
v1.20.0
- Added code to support compilation with exceptions disabled
- Extended the benchmarking suite to support depthwise convolution benchmarking
- Added kai_dwconv_clamp_f32_f32_f32p1vlx1b_3x3_s1_4xc_sme2_mla to benchmarking
- Fixes
- Fixed clamping not being applied in matmul_clamp_f32_bf16p1x4_bf16p12x4b_1x36_neon_dot
v1.19.0
- Added new unit test framework
- New SME2 micro-kernels
- Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSI4C32P RHS with F32 input and output.
- Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSI4C32P RHS with F32 input and output.
- Add clang-cl compiler macros for kai_matmul_clamp_f32_f32_f32p8x1biasf32_6x8x4_neon_mla_asm micro-kernel
- Fixes
- Fix incorrect API in BF16 kernel interface
v1.18.0
- Fixes
- Add Null Bias support for rhs_pack_kxn_x16p32x1b_x16_x16_neon.
- Updated description of matmul file name from m_step x n_step to m_block x n_block
- Clamp after scaling in matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme_mopa.
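Why the order of clamping and scaling matters can be shown with a scalar sketch. This is not the kernel code, and these notes do not describe the previous behavior in detail; the functions below are hypothetical and assume a single int32 accumulator that is converted to F32 with one scale factor.

```c
#include <assert.h>
#include <stdint.h>

static float clamp_f32(float value, float lo, float hi) {
    assert(lo <= hi);
    return value < lo ? lo : (value > hi ? hi : value);
}

/* Intended order: dequantize first, then clamp the final F32 value. */
static float scale_then_clamp(int32_t acc, float scale, float lo, float hi) {
    return clamp_f32((float)acc * scale, lo, hi);
}

/* For contrast: clamping before scaling bounds the wrong quantity,
 * so the result can end up outside [lo, hi]. */
static float clamp_then_scale(int32_t acc, float scale, float lo, float hi) {
    return clamp_f32((float)acc, lo, hi) * scale;
}
```

With an accumulator of 8, a scale of 2 and bounds [0, 6], scaling first gives 16, which is clamped to 6; clamping first gives 6, which is then scaled to 12 and violates the requested upper bound.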
- New SVE micro-kernels
- Matrix Multiplication (MxN) Micro-Kernels with F32 input and output.
- Update the example matmul_clamp_f32_qsi8d32p_qsi4c32p to demonstrate how a micro-kernel can be used in a multithreaded environment.
- Documentation
- Update documentation to use markdown syntax
- Added FAQ
- Added a section that describes the directory structure and a section that describes the different example implementations
v1.17.0
- Fixes
- Add Null Bias support for rhs_pack_kxn_x32p16x1b_x32_x32_neon.
- Fix incorrect m_step value reported by the following micro-kernels:
- kai_lhs_quant_pack_qai8dxp_bf16_neon
- kai_lhs_quant_pack_qai8dxp_f16_neon
- kai_lhs_quant_pack_qai8dxp_f32
- kai_lhs_quant_pack_qsi8d32p4x8sb_f32_neon
- kai_lhs_quant_pack_qsi8d32p_f32
- kai_lhs_quant_pack_qsi8d32p_f32_neon
- kai_lhs_quant_pack_qsi8d32pscalef32_f16_neon
- kai_lhs_quant_pack_qsi8d32pscalef32_f32_neon
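A correct m_step matters because callers typically use it as the row granularity when splitting work, for example across threads. The sketch below is a hypothetical helper, not a KleidiAI API; it assumes that each thread must start on a row index that is a multiple of m_step.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t m_start; /* first row handled by the thread */
    size_t m_count; /* number of rows handled by the thread */
} row_range_t;

/* Hypothetical helper: rows assigned to thread `tid` out of `n_threads`,
 * with every start index a multiple of m_step. */
static row_range_t rows_for_thread(size_t m, size_t m_step, size_t n_threads, size_t tid) {
    assert(m_step > 0 && n_threads > 0 && tid < n_threads);

    const size_t n_steps = (m + m_step - 1) / m_step; /* ceil(m / m_step) */
    const size_t steps_per_thread = (n_steps + n_threads - 1) / n_threads;

    size_t start = tid * steps_per_thread * m_step;
    if (start > m) {
        start = m; /* this thread has no work */
    }
    size_t end = start + steps_per_thread * m_step;
    if (end > m) {
        end = m; /* the last range may be a partial step */
    }

    const row_range_t range = { start, end - start };
    return range;
}
```

With m = 10 and m_step = 4, two threads receive rows [0, 8) and [8, 10). If a micro-kernel reported a wrong m_step, the start indices computed this way would no longer line up with the blocks the kernel actually processes.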