Releases: ARM-software/kleidiai

v1.25.0

11 May 08:24

  • Optimizations
    • Optimize rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon for Int4 GEMM/GEMV kernels
  • Fixes
    • Fix RHS tail handling in kai_matmul_clamp_f32_qai8dxp1vlx4_qsi4cxp4vlx4_1vlx4vl_sme_mopa.

v1.24.0

23 Apr 09:12

  • New SME micro-kernels
    • Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSI4CXP RHS with F32 output.
    • Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSI4CXP RHS with F32 output.

v1.23.0

30 Mar 05:42

  • New SME2 micro-kernels
    • Add SME2 elastic GEMM micro-kernels.
      • The micro-kernel consists of a primary micro-kernel with an 8 vscale x 8 vscale (2VLx2VL)
        output block, plus further micro-kernels with different output block sizes to handle the edges.
      • Data type: FP32.
      • Instruction: SME2 MOPA.
      • New naming rule and API design are introduced with the elastic GEMM micro-kernel.
    • Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSI8CXP RHS with F16 output.
    • Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSI8CXP RHS with F16 output.
  • Extended the following kernels to support variable block length
    • kai_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon
    • kai_lhs_quant_pack_qsi8d32p_f32_neon
    • kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa
    • kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot
  • Documentation
    • Added overview of micro-kernels
  • Fixes
    • Update the kai_matmul_clamp_f32_qai8dxp1vlx8_qsi4cxp4vlx8_1vlx4vl_sme2_mopa kernel to multiply the zero-points and row-sums as integers rather than as floats, improving accuracy.
    • Implement clamping in kernels where it was missing to match their naming
      • kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4x4_1x4_neon_dotprod
      • kai_matmul_clamp_f32_qsi8d32p4x4_qsi4c32p4x4_16x4_neon_dotprod
      • kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa
      • kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot
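The accuracy benefit of multiplying zero-points and row-sums as integers, as in the fix listed above, can be illustrated with a minimal sketch. This is not the kernel's code: the function names and the input values are hypothetical, chosen only so that the product exceeds the range in which a 32-bit float represents every integer exactly (2^24).

```c
#include <stdint.h>

/* Hypothetical helper: the product is formed in 32-bit integer
 * arithmetic, so it is exact as long as it fits in int32_t. */
static int32_t zp_correction_int(int32_t row_sum, int32_t zero_point) {
    return row_sum * zero_point;
}

/* Hypothetical helper: both operands are converted to float first.
 * A float has a 24-bit significand, so products above 2^24 may be
 * rounded to a neighbouring representable value. The volatile store
 * forces the result to be rounded to float precision. */
static float zp_correction_float(int32_t row_sum, int32_t zero_point) {
    volatile float product = (float)row_sum * (float)zero_point;
    return product;
}
```

With the illustrative inputs `row_sum = 200001` and `zero_point = 101`, the integer path returns the exact product 20200101, while the float path cannot represent that odd value and returns a rounded neighbour.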

v1.22.0

30 Mar 05:42

  • New SME2 micro-kernels
    • Matrix Multiplication (MxN) Micro-Kernels of F16 LHS and QSI4C32P RHS with F32 input and output.
  • New Advanced SIMD packing kernel for F16PMRX2 LHS
  • Optimizations
    • Optimize kai_rhs_pack_nxk_qsu2cxp4vlx4_qsu2cx_neon and kai_lhs_quant_pack_qai8dxp_f32 further for the block depth (kr/sr = 4) with Advanced SIMD
  • Fixes
    • Minor fix to the packing parameters (kr and sr) in matmul_clamp_f32_qai8dxp_qsu2cxp kernels.
    • Fix the vectorized rounding instruction in kai_lhs_quant_pack_qai8dxp_f32 for the block depth (kr/sr = 8) so that it is consistent with the scalar implementation.

v1.21.0

30 Mar 05:42

  • New SME2 micro-kernels
    • Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSU2CXP RHS with F32 output.
    • Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSU2CXP RHS with F32 output.
  • New Packing kernel for QSU2CXP RHS.
  • New SVE micro-kernels
    • Indirect matrix multiplication (MxN) Micro-Kernel of F32 input and output.
  • Added benchmarking for SVE indirect matrix multiplication (imatmul)
  • Add Bazel 9 support
  • Fixes
    • Addressed undefined behavior affecting polymorphic handling in test framework
    • Fixed GTest random seed to prevent flaky tests

v1.20.0

30 Mar 05:42

  • Added code to support compilation with exceptions disabled
  • Extended the benchmarking suite to support depthwise convolution benchmarking
    • Added kai_dwconv_clamp_f32_f32_f32p1vlx1b_3x3_s1_4xc_sme2_mla to benchmarking
  • Fixes
    • Fixed clamping not being applied in matmul_clamp_f32_bf16p1x4_bf16p12x4b_1x36_neon_dot

v1.19.0

30 Mar 05:42

  • Added new unit test framework
  • New SME2 micro-kernels
    • Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSI4C32P RHS with F32 input and output.
    • Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSI4C32P RHS with F32 input and output.
  • Add clang-cl compiler macros for kai_matmul_clamp_f32_f32_f32p8x1biasf32_6x8x4_neon_mla_asm micro-kernel
  • Fixes
    • Fix incorrect API in BF16 kernel interface

v1.18.0

30 Mar 05:41

  • Fixes
    • Add Null Bias support for rhs_pack_kxn_x16p32x1b_x16_x16_neon.
    • Updated description of matmul file name from m_step x n_step to m_block x n_block
    • Clamp after scaling in matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme_mopa.
  • New SVE micro-kernels
    • Matrix Multiplication (MxN) Micro-Kernels with F32 input and output.
  • Update the example matmul_clamp_f32_qsi8d32p_qsi4c32p to demonstrate how a micro-kernel can be used in a multithreaded environment.
  • Documentation
    • Update documentation to use markdown syntax
    • Added FAQ
    • Added a section that describes the directory structure and a section that describes the different example implementations
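The "Clamp after scaling" fix in this release concerns the order of two steps on the output path. The sketch below only illustrates why that order matters; it does not reproduce the kernel, and the function names, the single per-output scale, and the values are assumptions made for the example.

```c
#include <stdint.h>

/* Clamp v into the closed interval [lo, hi]. */
static float clamp_f32(float v, float lo, float hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Clamping the raw integer accumulator and scaling afterwards applies
 * the bounds in the wrong domain: the scaled result is no longer
 * guaranteed to lie within [lo, hi] as the caller intended. */
static float clamp_then_scale(int32_t acc, float scale, float lo, float hi) {
    return clamp_f32((float)acc, lo, hi) * scale;
}

/* Scaling to the F32 output domain first and clamping last applies
 * the bounds to the value that is actually written out. */
static float scale_then_clamp(int32_t acc, float scale, float lo, float hi) {
    return clamp_f32((float)acc * scale, lo, hi);
}
```

For example, with `acc = 100`, `scale = 0.5f` and bounds `[0, 6]`, `scale_then_clamp` yields 6.0 (the upper bound), whereas `clamp_then_scale` yields 3.0, which is not what a caller requesting an upper bound of 6 on the output would expect.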

v1.17.0

30 Mar 05:41

  • Fixes
    • Add Null Bias support for rhs_pack_kxn_x32p16x1b_x32_x32_neon.
    • Fixed incorrect m_step value reported by the following micro-kernels:
      • kai_lhs_quant_pack_qai8dxp_bf16_neon
      • kai_lhs_quant_pack_qai8dxp_f16_neon
      • kai_lhs_quant_pack_qai8dxp_f32
      • kai_lhs_quant_pack_qsi8d32p4x8sb_f32_neon
      • kai_lhs_quant_pack_qsi8d32p_f32
      • kai_lhs_quant_pack_qsi8d32p_f32_neon
      • kai_lhs_quant_pack_qsi8d32pscalef32_f16_neon
      • kai_lhs_quant_pack_qsi8d32pscalef32_f32_neon

v1.16.0

30 Mar 05:41

  • Extended the benchmarking framework to support multiple operators.
    • Initial support for matrix multiplication (matmul) & indirect matrix multiplication (imatmul)
    • Added all imatmul and matmul micro-kernels to the benchmark suite
  • Fixes:
    • All SME and SME2 micro-kernels now commit ZA lazy save buffer when building with SME support.
    • Fixed incorrect handling of zero point and scale in two packing kernels, which caused incorrect de-quantisation in certain cases:
      • kai_rhs_pack_nxk_qai4c32ps1s0nrx4_qau4c32s0s1_f32_f32_f32_neon
      • kai_rhs_pack_nxk_qai4c32ps1s0nrx4_qau4c32s1s0_f32_f32_f32_neon
  • New SVE micro-kernels (256-bit Vector length specific):
    • Matrix multiplication (MxN) Micro-kernels of QSI8DX LHS and QSI4CX RHS with F32 input and output.
    • Matrix multiplication (1xN) Micro-kernels of QSI8DX LHS and QSI4CX RHS with F32 input and output.
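For context on the v1.16.0 packing fix above, which involved the zero point and scale: asymmetric quantization commonly recovers a real value as scale * (q - zero_point), so a mishandled zero point or scale leads directly to wrong de-quantised values. The following is a minimal sketch of that general formula, not the packing kernels' implementation; the function name and the unsigned 4-bit storage are assumptions for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: de-quantise one unsigned 4-bit value (stored in
 * the low nibble of q) using the scale and zero point of its block. */
static float dequantize_u4(uint8_t q, float scale, int32_t zero_point) {
    int32_t centred = (int32_t)(q & 0x0F) - zero_point;
    return scale * (float)centred;
}
```

For instance, with `scale = 0.25f` and `zero_point = 8`, the stored value 11 maps to 0.75 and the stored value 8 maps to 0.0.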