Releases: ARM-software/kleidiai

v1.16.0

30 Mar 05:41

  • Extended the benchmarking framework to support multiple operators.
    • Initial support for matrix multiplication (matmul) & indirect matrix multiplication (imatmul)
    • Added all imatmul and matmul micro-kernels to the benchmark suite
  • Fixes:
    • All SME and SME2 micro-kernels now commit the ZA lazy save buffer when building with SME support.
    • Fixed incorrect handling of zero point and scale in two packing kernels, which caused incorrect de-quantisation in certain cases:
      • kai_rhs_pack_nxk_qai4c32ps1s0nrx4_qau4c32s0s1_f32_f32_f32_neon
      • kai_rhs_pack_nxk_qai4c32ps1s0nrx4_qau4c32s1s0_f32_f32_f32_neon
  • New SVE micro-kernels (256-bit vector length specific):
    • Matrix multiplication (MxN) Micro-kernels of QSI8DX LHS and QSI4CX RHS with F32 input and output.
    • Matrix multiplication (1xN) Micro-kernels of QSI8DX LHS and QSI4CX RHS with F32 input and output.
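
To illustrate why the zero point and scale fix in this release affects results, here is a minimal sketch of asymmetric 4-bit de-quantisation. It is not the KleidiAI packing code: the helper name dequant_u4_block, the two-values-per-byte layout with the low nibble first, and the single scale and zero point per block are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of the KleidiAI API.
 * De-quantises `count` unsigned 4-bit values stored two per byte
 * (assumed layout: low nibble first) with an asymmetric scheme:
 *     real = scale * (q - zero_point)
 */
static void dequant_u4_block(const uint8_t* packed, size_t count,
                             float scale, int32_t zero_point, float* out) {
    assert(packed != NULL && out != NULL);
    for (size_t i = 0; i < count; ++i) {
        const uint8_t byte = packed[i / 2];
        /* Even indices take the low nibble, odd indices the high nibble. */
        const uint8_t q = (i % 2 == 0) ? (uint8_t)(byte & 0x0F) : (uint8_t)(byte >> 4);
        out[i] = scale * (float)((int32_t)q - zero_point);
    }
}
```

If a packing step stores the wrong zero point or scale for a block, every value in that block is shifted or stretched by the same error, which is consistent with the incorrect de-quantisation described above.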

v1.15.1

30 Mar 05:41

  • Fixes:
    • Added missing checks for bf16 support for quantised matmuls with bf16 input/output.

v1.15.0

30 Mar 05:41

  • New SME micro-kernels:
    • Matrix multiplication (MxN) Micro-kernels of QAI8DX LHS and QSI8CX RHS with F32 input and output.
    • Matrix multiplication (1xN) Micro-kernels of QAI8DX LHS and QSI8CX RHS with F32 input and output.
  • Wider compiler compatibility for the following kernels:
    • kai_matmul_clamp_f16_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa
    • kai_matmul_clamp_f16_qsi8d32p1x4_qai4c32p4vlx4_1x4vl_sme2_dot
    • kai_matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
    • kai_matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla
    • kai_matmul_clamp_f32_f32p2vlx1_f32p2vlx1biasf32_sme2_mopa
    • kai_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa
    • kai_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme_mopa
    • kai_matmul_clamp_f32_qai8dxp1x4_qsi8cxp4vlx4_1x4vl_sme2_dot
    • kai_matmul_clamp_f32_qai8dxp1x4_qsi8cxp4vlx4_1x4vl_sme_dot
    • kai_matmul_clamp_f32_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa
    • kai_matmul_clamp_f32_qsi8d32p1x4_qai4c32p4vlx4_1x4vl_sme2_dot
    • kai_matmul_clamp_qai8_qai8p2vlx4_qsi8cxpsb2vlx4_2vlx2vl_sme2_mopa

v1.14.0

30 Mar 05:41

  • New SME micro-kernels:
    • Indirect matrix multiplication (MxN) of QAI8 input and output.
    • Indirect matrix multiplication (MxN) of F16 input and output.
    • Indirect matrix multiplication (MxN) of F32 input and output.
    • Matrix multiplication (MxN) of QAI8 LHS and RHS with QAI8 output.
    • Depthwise Convolution RHS F32 Packing kernel.
  • New SME2 micro-kernels:
    • Depthwise Convolution (3x3) Planar kernel of F32 LHS and Packed F32 RHS with F32 output using MLA.
  • Convert SME2 matmul micro-kernels to pure assembly, and add MSVC support.
    • Affects: kai_matmul_clamp_f32_bf16p2vlx2_bf16p2vlx2_2vlx2vl_sme2_mopa
  • Optimizations:
    • Packing functions kai_rhs_pack_nxk_qai4c32ps1s0nrx4_qau4c32s1s0_f32_f32_f32_neon and kai_rhs_pack_nxk_qai4c32ps1s0nrx4_qau4c32s0s1_f32_f32_f32_neon have been further optimized.
    • Packing function kai_lhs_quant_pack_qai8dxp_f16_neon has been further optimized.
  • New Advanced SIMD micro-kernels:
    • Wider 6x32 block size variants of FP16 Matrix Multiplication, including a variant optimized for the Arm® Cortex®-A55 processor.
    • Wider 6x16 block size variants of FP32 Matrix Multiplication, including a variant optimized for the Arm® Cortex®-A55 processor.
  • Fixes:
    • Fix out-of-bounds read of intermediate values in kai_matmul_clamp_f16_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa micro-kernel
    • Fix out-of-bounds write in kai_matmul_clamp_f16_f16_f16p2vlx2b_1x8vl_sme_mla
    • Fix out-of-bounds read in kai_matmul_clamp_qai8_qai8_qsi8cxp2vlx4sb_1x16vl_sme2_dot

v1.13.0

30 Mar 05:40

  • Improve performance of lhs_quant_pack_qsi8d32p_f32 with an Advanced SIMD reimplementation, lhs_quant_pack_qsi8d32p4x8sb_f32_neon.
  • New SME2 micro-kernels:
    • Matrix multiplication (1xN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F16 output, optimized for FEAT_SME2.
    • Matrix multiplication (MxN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F16 output, optimized for FEAT_SME2.
    • Matrix multiplication (MxN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F32 output, optimized for FEAT_SME2.
    • Matrix multiplication (1xN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F32 output, optimized for FEAT_SME2.

v1.12.0

30 Mar 05:40

  • New Advanced SIMD micro-kernels:
    • Matrix multiplication (MxN) Micro-kernels of QAI8DX LHS and QSI4CX RHS with BF16 output, optimized for FEAT_I8MM.
    • Matrix multiplication (1xN) Micro-kernels of QAI8DX LHS and QSI4CX RHS with BF16 output, optimized for FEAT_DotProd.
    • Matrix multiplication (MxN) Micro-kernels of QAI8DX LHS and QSI4C32 RHS with BF16 output, optimized for FEAT_I8MM.
    • Matrix multiplication (1xN) Micro-kernels of QAI8DX LHS and QSI4C32 RHS with BF16 output, optimized for FEAT_DotProd.
  • New SME micro-kernels:
    • Matrix multiplication (1xN) of F32 LHS and RHS with F32 output, using instructions compatible with FEAT_SME.
    • Matrix multiplication (1xN) of F16 LHS and RHS with F16 output, using instructions compatible with FEAT_SME.
  • Convert SME transposed RHS packing micro-kernels to pure assembly:
    • kai_rhs_pack_nxk_f32p2vlx1biasf32_f32_f32_sme
    • kai_rhs_pack_nxk_x16p2vlx2b_x16_x16_sme
  • Include more micro-kernels in MSVC build:
    • kai_matmul_clamp_f32_f32_f32p8x1biasf32_6x8x4_neon_mla
    • kai_lhs_quant_pack_qsi8d32p_f32_neon
    • kai_rhs_pack_kxn_qsi8cxp_qsi8cx_neon
    • kai_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon
    • kai_rhs_pack_nxk_qsi4cxps1s0_qsu4cxs1s0_neon
    • kai_rhs_pack_nxk_qsi8cxp_qsi8cx_neon
  • Fixes:
    • Update kai_kernel_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa to improve accuracy
    • Convert common SME/SME2 code into assembly file kai_common_sme_asm.S
  • Documentation
    • Added ONNX Runtime MLAS library integration example.

v1.11.0

30 Mar 05:40

  • New Advanced SIMD micro-kernels:
    • Optimized version of kai_rhs_pack_nxk_qsi4c32p_qsu4c32s1s0 kernel for block depth of 4 bytes (kai_rhs_pack_nxk_qsi4c32pnrx4_qsu4c32s1s0_neon)
  • Improve performance of kai_rhs_pack_nxk_qsi4c32pnrx8_qsu4c32s1s0_neon

v1.10.0

30 Mar 05:40

  • Convert SME and SME2 imatmul micro-kernels to use pure assembly, and add MSVC support. Affects:
    • kai_imatmul_clamp_f16_f16p2vlx2_f16p2vlx2_2vlx2vl_sme2_mopa
    • kai_imatmul_clamp_f32_f32p2vlx1_f32p2vlx1b_2vlx2vl_sme2_mopa
    • kai_imatmul_clamp_qai8_qai8p2vlx4_qsi8cxpsb2vlx4_2vlx2vl_sme2_mopa
    • kai_lhs_imatmul_pack_x16p2vlx2_x16p_sme
    • kai_lhs_imatmul_pack_x32p2vlx1_x32p_sme
    • kai_lhs_imatmul_pack_x8p2vlx4_x8p_sme
    • kai_rhs_imatmul_pack_kxn_qsi8cxp2vlx4sb_qs8cx_f32_i32_sme
    • kai_rhs_imatmul_pack_kxn_x16p2vlx2b_x16_x16_sme
    • kai_rhs_imatmul_pack_kxn_x32p2vlx1b_x32_x32_sme
  • Convert SME and SME2 matmul micro-kernels to pure assembly, and add MSVC support. Affects:
    • kai_lhs_pack_f32p2vlx1_f32_sme
    • kai_lhs_pack_x16p2vlx2_x16_sme
    • kai_lhs_pack_x8p2vlx4_x8_sme
    • kai_matmul_clamp_f16_f16_f16p2vlx2b_1x16vl_sme2_dot
    • kai_matmul_clamp_f16_f16p2vlx2_f16p2vlx2_2vlx2vl_sme2_mopa
    • kai_matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
    • kai_matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla
    • kai_matmul_clamp_f32_f32p2vlx1_f32p2vlx1biasf32_sme2_mopa
    • kai_matmul_clamp_qai8_qai8_qsi8cxp2vlx4sb_1x16vl_sme2_dot
    • kai_matmul_clamp_qai8_qai8p2vlx4_qsi8cxpsb2vlx4_2vlx2vl_sme2_mopa
    • kai_rhs_pack_kxn_f32p16vlx1b_f32_f32_sme
    • kai_rhs_pack_kxn_f32p2vlx1biasf32_f32_f32_sme
    • kai_rhs_pack_kxn_qsi8cxp2vlx4sb_qs8cx_f32_i32_sme
    • kai_rhs_pack_kxn_x16p2vlx2b_x16_x16_sme
  • New Advanced SIMD micro-kernels:
    • Matrix multiplication (MxN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F32 output, optimized for FEAT_DotProd.
    • Matrix multiplication (MxN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F16 output, optimized for FEAT_DotProd.
    • Matrix multiplication (1xN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F32 output, optimized for FEAT_DotProd.
    • Matrix multiplication (1xN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F16 output, optimized for FEAT_DotProd.
    • Optimized version of kai_rhs_pack_nxk_qsi4c32p_qsu4c32s1s0 kernel for block depth of 8 bytes (kai_rhs_pack_nxk_qsi4c32pnrx8_qsu4c32s1s0_neon)
  • New SME micro-kernels:
    • Added GEMM F16 and F32 kernels using SME1 MOPA instruction, block size 2VLx2VL.
  • Added Convolution example using SME2 Indirect Matmul Kernels
  • Fixes:
    • Fix issue where kai_get_m_step() returns the incorrect value for the following kernels:
      • matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
      • matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla
    • Fix issue with negative values handling in kai_rhs_pack_nxk_qsi4cxps1s0_qsu4cxs1s0_neon
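
The m-step fix in this release matters because a caller will usually use the step value to split the output rows into tiles, so a wrong step leads to wrong tile boundaries. The sketch below shows that arithmetic only. The helper names num_row_tiles and rows_in_tile are hypothetical, and the code assumes the step has already been obtained from the kernel's m-step getter; it does not call any KleidiAI function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: number of row tiles needed to cover `m` rows
 * when each kernel call processes up to `m_step` rows. */
static size_t num_row_tiles(size_t m, size_t m_step) {
    assert(m_step > 0);
    return (m + m_step - 1) / m_step; /* ceiling division */
}

/* Hypothetical helper: rows covered by tile number `tile`.
 * Every tile is full except possibly the last one. */
static size_t rows_in_tile(size_t m, size_t m_step, size_t tile) {
    assert(m_step > 0);
    const size_t start = tile * m_step;
    assert(start < m);
    const size_t remaining = m - start;
    return remaining < m_step ? remaining : m_step;
}
```

For example, with m = 10 and a step of 4, this scheme yields three tiles of 4, 4 and 2 rows. If the reported step were larger than what the kernel actually processes per call, some rows would never be computed.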

v1.9.0

30 Mar 05:39

  • Extend support for signed 4-bit integer inputs in kai_rhs_pack_nxk_qsi4cxps1s0_qsu4cxs1s0_neon.
  • Add imatmul documentation
  • Better out-of-bounds access detection support in testing framework.
  • New SME2 micro-kernels:
    • Matrix multiplication (1xN) of QAI8DX LHS and QSI8CX RHS to produce F32 output.
    • Matrix multiplication (MxN) of QAI8DX LHS and QSI8CX RHS to produce F32 output.
  • Fixes:
    • Address segmentation faults in benchmarking tool.
    • Fix clamping issues for FP16 and BF16 in testing framework.

v1.8.0

30 Mar 05:39

  • New Advanced SIMD micro-kernels:
    • Matrix multiplication (MxN) Micro-kernels of QAI8DX LHS and QSI8CX RHS with F16 output, optimized for FEAT_I8MM and FEAT_DotProd.
    • Matrix multiplication (1xN) Micro-kernels of QAI8DX LHS and QSI8CX RHS with F16 output, optimized for FEAT_DotProd.
  • New SME micro-kernels:
    • Indirect matrix multiplication (MxN) of F16 input and output.
      • Packing kernels for LHS and RHS
    • Indirect matrix multiplication (MxN) of F32 input and output.
      • Packing kernels for LHS and RHS
  • New SME2 micro-kernels:
    • Indirect matrix multiplication (MxN) of F16 input and output.
      • Matrix multiplication of packed indirect LHS and packed RHS
    • Indirect matrix multiplication (MxN) of F32 input and output.
      • Matrix multiplication of packed indirect LHS and packed RHS
  • Disable link time optimization for microkernel library
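
For readers new to the indirect matrix multiplication (imatmul) entries above, the following scalar reference sketches the general idea: the LHS is addressed through a table of pointers to chunks of data instead of being one contiguous matrix, which is convenient for convolution-style access patterns. This is a conceptual sketch only. The function name imatmul_ref_f32, the parameter names, and the pointer-table layout are assumptions for illustration and do not reflect the packed formats used by the actual micro-kernels.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scalar reference, not part of the KleidiAI API.
 * lhs_ptrs holds m * k_chunk_count pointers; entry [row * k_chunk_count + c]
 * points to k_chunk_len floats forming chunk c of that output row's LHS.
 * rhs is a row-major (k_chunk_count * k_chunk_len) x n matrix.
 * dst is a row-major m x n matrix. */
static void imatmul_ref_f32(size_t m, size_t n,
                            size_t k_chunk_count, size_t k_chunk_len,
                            const float* const* lhs_ptrs,
                            const float* rhs, float* dst) {
    assert(lhs_ptrs != NULL && rhs != NULL && dst != NULL);
    for (size_t row = 0; row < m; ++row) {
        for (size_t col = 0; col < n; ++col) {
            float acc = 0.0f;
            for (size_t c = 0; c < k_chunk_count; ++c) {
                /* Indirection: fetch this chunk through the pointer table. */
                const float* chunk = lhs_ptrs[row * k_chunk_count + c];
                for (size_t i = 0; i < k_chunk_len; ++i) {
                    const size_t k = c * k_chunk_len + i;
                    acc += chunk[i] * rhs[k * n + col];
                }
            }
            dst[row * n + col] = acc;
        }
    }
}
```

Because several table entries may point at the same chunk, or at a shared buffer of zeros for padding, the input does not need to be copied into a dense matrix before the multiplication.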