Releases: ARM-software/kleidiai

v1.26.0

12 Jun 13:11

  • New SME micro-kernels
    • Added an x8 matmul pack micro-kernel family with 4vsx4 blocked layout, without packing bias.
    • Added SME RHS depthwise packing kernel for FP16
  • New SME2 micro-kernels
    • Added SME2 depthwise indirect micro-kernel for FP16.
    • kai_matmul_i32_u8p4vsx4_u8p4vsx4_i32_i32_8vsx8vs_sme2_mopa.
    • kai_matmul_clamp_f32_u8p4vsx4_u8p4vsx4_i32_i32_f32_f32_8vsx8vs_sme2_mopa.
  • Extended the following micro-kernels to support variable block length
    • kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4x4_1x4_neon_dotprod
    • kai_matmul_clamp_f32_qsi8d32p1x8_qsi4c32p4x8_1x4x32_neon_dotprod
    • kai_matmul_clamp_f32_qsi8d32p4x4_qsi4c32p4x4_16x4_neon_dotprod
    • kai_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_8x4x32_neon_i8mm
    • kai_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm
    • kai_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0
    • kai_lhs_quant_pack_qsi8d32p4x8sb_f32_neon
  • Fixes
    • Added ZA lazy save to kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme_dot
    • Fix QAI8/QSI8CXP matmul test failures by constraining generated qsi32 bias values to preserve int32 accumulator headroom.
    • Fix a clamping issue in matmul_clamp_qai8_qai8p_qai8p_test.cpp
    • Fix traditional matmul and imatmul packed offset helpers to use packing panel boundaries.
  • New Advanced SIMD micro-kernels
    • Matrix Multiplication MxN and 1xN Micro-Kernels of QAI8DXP LHS and QSU2CXP RHS with F32 output, optimized for FEAT_DotProd, along with RHS packing kernel.
  • Documentation
    • Updated the contribution policy to enable third-party contributions
    • Added coding standard and conventions
  • New Transposed-B RHS packing micro-kernel versions of kai_rhs_pack_kxn_x32p16x1b_x32_x32_neon and kai_rhs_pack_kxn_x16p32x1b_x16_x16_neon:
    • kai_rhs_pack_nxk_x16p32x1bx16_x16_x16_neon
    • kai_rhs_pack_nxk_x32p16x1bx32_x32_x32_neon
  • New SME2 FP32 GEMV micro-kernel with 4vsx1 RHS format
  • New SME2 static Int8 GEMM/GEMV kernels and the RHS packing kernel.
    • kai_matmul_clamp_qai8_qai8p4vsx4_qsi8cxp4vsx4bi32sf32_8vsx8vs_sme2_mopa
    • kai_matmul_clamp_qai8_qai8_qsi8cxp4vsx4bi32sf32_1x32vs_sme2_dot
    • kai_matmul_pack_rhs_kxn_qsi8cxp4vsx4bi32sf32_qsi8_i32_f32_sme
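The QAI8/QSI8CXP test fix above mentions keeping int32 accumulator headroom when generating bias values. A minimal sketch of that idea follows, assuming both operands are signed 8-bit values and the reduction depth is `k`. The helper names `max_safe_bias` and `constrain_bias` are hypothetical and are not taken from the KleidiAI test suite.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, not part of KleidiAI: largest bias magnitude that
 * cannot overflow an int32 accumulator for a depth-k int8 dot product. */
int32_t max_safe_bias(int32_t k) {
    /* Worst-case |lhs * rhs| for signed 8-bit operands is 128 * 128. */
    const int64_t max_abs_product = 128 * 128;
    const int64_t max_abs_sum = (int64_t)k * max_abs_product;
    assert(k >= 0 && max_abs_sum <= INT32_MAX);
    return (int32_t)(INT32_MAX - max_abs_sum);
}

/* Constrain a generated bias value to the safe range [-limit, limit]. */
int32_t constrain_bias(int32_t bias, int32_t k) {
    const int32_t limit = max_safe_bias(k);
    if (bias > limit) return limit;
    if (bias < -limit) return -limit;
    return bias;
}
```

With `k = 1024`, the worst-case sum of products is 1024 * 16384 = 16777216, so any bias with magnitude up to 2130706431 leaves the accumulator within int32 range under these assumptions.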

v1.25.0

11 May 08:24

  • Optimizations
    • Optimize rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon for Int4 GEMM/GEMV kernels
  • Fixes
    • Fix RHS tail handling in kai_matmul_clamp_f32_qai8dxp1vlx4_qsi4cxp4vlx4_1vlx4vl_sme_mopa.

v1.24.0

23 Apr 09:12

  • New SME micro-kernels
    • Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSI4CXP RHS with F32 output.
    • Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSI4CXP RHS with F32 output.

v1.23.0

30 Mar 05:42

  • New SME2 micro-kernels
    • Add SME2 elastic GEMM micro-kernels.
      • The family consists of a primary micro-kernel with an 8 vscale x 8 vscale (2VLx2VL)
        output block, plus additional micro-kernels with other output block sizes to handle the edges.
      • Data type: FP32.
      • Instruction: SME2 MOPA.
      • A new naming rule and API design are introduced with the elastic GEMM micro-kernels.
    • Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSI8CXP RHS with F16 output.
    • Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSI8CXP RHS with F16 output.
  • Extended the following kernels to support variable block length
    • kai_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon
    • kai_lhs_quant_pack_qsi8d32p_f32_neon
    • kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa
    • kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot
  • Documentation
    • Added overview of micro-kernels
  • Fixes
    • Update the kai_matmul_clamp_f32_qai8dxp1vlx8_qsi4cxp4vlx8_1vlx4vl_sme2_mopa kernel to multiply the zero-points and row-sums as integers rather than as floats, to improve accuracy.
    • Implement clamping in kernels where it was missing to match their naming
      • kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4x4_1x4_neon_dotprod
      • kai_matmul_clamp_f32_qsi8d32p4x4_qsi4c32p4x4_16x4_neon_dotprod
      • kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa
      • kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot
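The zero-point fix above moves a multiplication from float to integer arithmetic. The sketch below illustrates why that can matter in general: float32 has a 24-bit significand, so products above 2^24 may be rounded, while an int32 product is exact as long as it fits. The function names and input values are illustrative assumptions and do not reproduce the kernel's actual code or data ranges.

```c
#include <stdint.h>

/* Illustrative only: zero-point correction term computed in int32.
 * Exact provided the product fits in int32 (caller's responsibility). */
int32_t zp_term_int(int32_t zero_point, int32_t row_sum) {
    return zero_point * row_sum;
}

/* Same term computed in float32, which carries only 24 significand bits. */
float zp_term_float(int32_t zero_point, int32_t row_sum) {
    return (float)zero_point * (float)row_sum;
}
```

For example, 127 * 132105 = 16777335 is odd and larger than 2^24, so it has no exact float32 representation. The integer path returns it unchanged, while the float path has to round to a neighbouring representable value.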

v1.22.0

30 Mar 05:42

  • New SME2 micro-kernels
    • Matrix Multiplication (MxN) Micro-Kernels of F16 LHS and QSI4C32P RHS with F32 input and output.
  • New Advanced SIMD packing kernel for F16PMRX2 LHS
  • Optimizations
    • Optimize kai_rhs_pack_nxk_qsu2cxp4vlx4_qsu2cx_neon and kai_lhs_quant_pack_qai8dxp_f32 further for the block depth (kr/sr = 4) with Advanced SIMD
  • Fixes
    • Minor fix to the packing parameters (kr and sr) in matmul_clamp_f32_qai8dxp_qsu2cxp kernels.
    • Fix the vectorized rounding instruction in kai_lhs_quant_pack_qai8dxp_f32 for the block depth (kr/sr = 8) to be consistent with the scalar implementation.

v1.21.0

30 Mar 05:42

  • New SME2 micro-kernels
    • Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSU2CXP RHS with F32 output.
    • Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSU2CXP RHS with F32 output.
  • New Packing kernel for QSU2CXP RHS.
  • New SVE micro-kernels
    • Indirect matrix multiplication (MxN) Micro-Kernel of F32 input and output.
  • Added benchmarking for SVE indirect matrix multiplication (imatmul)
  • Add Bazel 9 support
  • Fixes
    • Addressed undefined behavior affecting polymorphic handling in test framework
    • Fixed GTest random seed to prevent flaky tests

v1.20.0

30 Mar 05:42

  • Added support for compiling with exceptions disabled
  • Extended the benchmarking suite to support depthwise convolution benchmarking
    • Added kai_dwconv_clamp_f32_f32_f32p1vlx1b_3x3_s1_4xc_sme2_mla to benchmarking
  • Fixes
    • Fixed clamping not being applied in matmul_clamp_f32_bf16p1x4_bf16p12x4b_1x36_neon_dot

v1.19.0

30 Mar 05:42

  • Added new unit test framework
  • New SME2 micro-kernels
    • Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSI4C32P RHS with F32 input and output.
    • Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSI4C32P RHS with F32 input and output.
  • Add clang-cl compiler macros for kai_matmul_clamp_f32_f32_f32p8x1biasf32_6x8x4_neon_mla_asm micro-kernel
  • Fixes
    • Fix incorrect API in BF16 kernel interface

v1.18.0

30 Mar 05:41

  • Fixes
    • Add Null Bias support for rhs_pack_kxn_x16p32x1b_x16_x16_neon.
    • Updated description of matmul file name from m_step x n_step to m_block x n_block
    • Clamp after scaling in matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme_mopa.
  • New SVE micro-kernels
    • Matrix Multiplication (MxN) Micro-Kernels with F32 input and output.
  • Update the example matmul_clamp_f32_qsi8d32p_qsi4c32p to demonstrate how a micro-kernel can be used in a multithreaded environment.
  • Documentation
    • Update documentation to use markdown syntax
    • Added FAQ
    • Added a section that describes the directory structure and a section that describes the different example implementations
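The "Clamp after scaling" fix above concerns the order of two output steps. As a rough sketch of why the order matters, the functions below apply the same clamp bounds either to the scaled result or to the raw accumulator. All names here are hypothetical and the code is not taken from the micro-kernel.

```c
#include <stdint.h>

/* Illustrative only: clamp a value into [lo, hi]. */
float clamp_f32(float x, float lo, float hi) {
    return x < lo ? lo : (x > hi ? hi : x);
}

/* Clamp applied to the final, scaled output (the order named in the fix). */
float scale_then_clamp(int32_t acc, float scale, float lo, float hi) {
    return clamp_f32((float)acc * scale, lo, hi);
}

/* Clamp applied to the raw accumulator before scaling, for comparison. */
float clamp_then_scale(int32_t acc, float scale, float lo, float hi) {
    return clamp_f32((float)acc, lo, hi) * scale;
}
```

With an accumulator of 100, a scale of 0.5 and bounds [0, 60], clamping after scaling gives 50, whereas clamping first gives 30, so the bounds would no longer describe the output range the caller asked for.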

v1.17.0

30 Mar 05:41

  • Fixes
    • Add Null Bias support for rhs_pack_kxn_x32p16x1b_x32_x32_neon.
    • Fixed the incorrect m_step value reported by the following micro-kernels:
      • kai_lhs_quant_pack_qai8dxp_bf16_neon
      • kai_lhs_quant_pack_qai8dxp_f16_neon
      • kai_lhs_quant_pack_qai8dxp_f32
      • kai_lhs_quant_pack_qsi8d32p4x8sb_f32_neon
      • kai_lhs_quant_pack_qsi8d32p_f32
      • kai_lhs_quant_pack_qsi8d32p_f32_neon
      • kai_lhs_quant_pack_qsi8d32pscalef32_f16_neon
      • kai_lhs_quant_pack_qsi8d32pscalef32_f32_neon
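The null-bias entry in the fixes above means the RHS packing routine accepts a missing bias pointer. One common way to support this, sketched here under the assumption that a null bias is treated as all zeros, is to write zeros into the bias slots of the packed buffer. `pack_bias_or_zero` is a hypothetical helper and is not the KleidiAI packing API.

```c
#include <stddef.h>

/* Hypothetical helper, not the KleidiAI API: copy n bias values into the
 * packed buffer, writing zeros when no bias is supplied. */
void pack_bias_or_zero(float* dst, const float* bias, size_t n) {
    for (size_t i = 0; i < n; ++i) {
        dst[i] = (bias != NULL) ? bias[i] : 0.0f;
    }
}
```

Writing explicit zeros keeps the packed layout identical in both cases, so the matmul micro-kernel that consumes the buffer does not need a separate no-bias code path.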