Releases · ARM-software/kleidiai

30 Mar 05:41

EmilOhlssonARM

v1.16.0

84796ec

Extended the benchmarking framework to support multiple operators.
- Initial support for matrix multiplication (matmul) & indirect matrix multiplication (imatmul)
- Added all imatmul and matmul micro-kernels to the benchmark suite
Fixes:
- All SME and SME2 micro-kernels now commit the ZA lazy save buffer when building with SME support.
- Fixed incorrect handling of zero point and scale in two packing kernels, which caused incorrect de-quantisation in certain cases:
  - kai_rhs_pack_nxk_qai4c32ps1s0nrx4_qau4c32s0s1_f32_f32_f32_neon
  - kai_rhs_pack_nxk_qai4c32ps1s0nrx4_qau4c32s1s0_f32_f32_f32_neon
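To illustrate why the zero point and scale matter for de-quantisation, the following is a minimal Python sketch of block-wise asymmetric de-quantisation. It is not the KleidiAI implementation: the function name, argument layout, and the convention real = (q - zero_point) * scale are assumptions made for the example, and the packed data layout used by the kernels above is not modelled.

```python
import numpy as np


def dequantize_blockwise(q, scales, zero_points, block_len=32):
    """De-quantise values that share one scale and zero point per block.

    Hypothetical helper for illustration only.
    q:           (n, k) integer array of quantised values (e.g. 0..15 for 4-bit)
    scales:      (n, k // block_len) per-block scales
    zero_points: (n, k // block_len) per-block zero points
    """
    n, k = q.shape
    if k % block_len != 0:
        raise ValueError("k must be a multiple of block_len")
    blocks = q.reshape(n, k // block_len, block_len).astype(np.float32)
    zp = np.asarray(zero_points, dtype=np.float32)[:, :, None]
    sc = np.asarray(scales, dtype=np.float32)[:, :, None]
    # Assumed convention: real = (q - zero_point) * scale
    return ((blocks - zp) * sc).reshape(n, k)


# One row, two blocks of two values each.
q = np.array([[0, 8, 15, 8]])
scales = np.array([[0.5, 2.0]], dtype=np.float32)
zero_points = np.array([[8, 8]])
out = dequantize_blockwise(q, scales, zero_points, block_len=2)
```

If a packing routine stores the wrong zero point or scale for a block, every value in that block is shifted or stretched by the same error, which is consistent with the fix only mattering in certain cases.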
New SVE micro-kernels (256-bit vector length specific):
- Matrix multiplication (MxN) Micro-kernels of QSI8DX LHS and QSI4CX RHS with F32 input and output.
- Matrix multiplication (1xN) Micro-kernels of QSI8DX LHS and QSI4CX RHS with F32 input and output.

30 Mar 05:41

EmilOhlssonARM

v1.15.1

bb1057f

Fixes:
- Added missing checks for bf16 support for quantised matmuls with bf16 input/output.

30 Mar 05:41

EmilOhlssonARM

v1.15.0

d7770c8

New SME micro-kernels:
- Matrix multiplication (MxN) Micro-kernels of QAI8DX LHS and QSI8CX RHS with F32 input and output.
- Matrix multiplication (1xN) Micro-kernels of QAI8DX LHS and QSI8CX RHS with F32 input and output.
Wider compiler compatibility for the following kernels:
- kai_matmul_clamp_f16_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa
- kai_matmul_clamp_f16_qsi8d32p1x4_qai4c32p4vlx4_1x4vl_sme2_dot
- kai_matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
- kai_matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla
- kai_matmul_clamp_f32_f32p2vlx1_f32p2vlx1biasf32_sme2_mopa
- kai_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa
- kai_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme_mopa
- kai_matmul_clamp_f32_qai8dxp1x4_qsi8cxp4vlx4_1x4vl_sme2_dot
- kai_matmul_clamp_f32_qai8dxp1x4_qsi8cxp4vlx4_1x4vl_sme_dot
- kai_matmul_clamp_f32_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa
- kai_matmul_clamp_f32_qsi8d32p1x4_qai4c32p4vlx4_1x4vl_sme2_dot
- kai_matmul_clamp_qai8_qai8p2vlx4_qsi8cxpsb2vlx4_2vlx2vl_sme2_mopa

30 Mar 05:41

EmilOhlssonARM

v1.14.0

bd2e6ae

New SME micro-kernels:
- Indirect matrix multiplication (MxN) of QAI8 input and output.
- Indirect matrix multiplication (MxN) of F16 input and output.
- Indirect matrix multiplication (MxN) of F32 input and output.
- Matrix multiplication (MxN) of QAI8 LHS and RHS with QAI8 output.
- Depthwise Convolution RHS F32 Packing kernel.
New SME2 micro-kernels:
- Depthwise Convolution (3x3) Planar kernel of F32 LHS and Packed F32 RHS with F32 output using MLA.
Convert SME2 matmul micro-kernels to pure assembly, and add MSVC support.
- Affects: kai_matmul_clamp_f32_bf16p2vlx2_bf16p2vlx2_2vlx2vl_sme2_mopa
Optimizations:
- Packing functions kai_rhs_pack_nxk_qai4c32ps1s0nrx4_qau4c32s1s0_f32_f32_f32_neon and kai_rhs_pack_nxk_qai4c32ps1s0nrx4_qau4c32s0s1_f32_f32_f32_neon have been further optimized.
- Packing function kai_lhs_quant_pack_qai8dxp_f16_neon has been further optimized.
New Advanced SIMD micro-kernels:
- Wider 6x32 block size variants of FP16 Matrix Multiplication, including a variant optimized for the Arm® Cortex®-A55 processor.
- Wider 6x16 block size variants of FP32 Matrix Multiplication, including a variant optimized for the Arm® Cortex®-A55 processor.
Fixes:
- Fix out-of-bounds read of intermediate values in kai_matmul_clamp_f16_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa micro-kernel
- Fix out-of-bounds write in kai_matmul_clamp_f16_f16_f16p2vlx2b_1x8vl_sme_mla
- Fix out-of-bounds read in kai_matmul_clamp_qai8_qai8_qsi8cxp2vlx4sb_1x16vl_sme2_dot

30 Mar 05:40

EmilOhlssonARM

v1.13.0

3e5da90

Improve performance of lhs_quant_pack_qsi8d32p_f32 with an Advanced SIMD reimplementation, lhs_quant_pack_qsi8d32p4x8sb_f32_neon.
New SME2 micro-kernels:
- Matrix multiplication (1xN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F16 output, optimized for FEAT_SME2.
- Matrix multiplication (MxN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F16 output, optimized for FEAT_SME2.
- Matrix multiplication (MxN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F32 output, optimized for FEAT_SME2.
- Matrix multiplication (1xN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F32 output, optimized for FEAT_SME2.

30 Mar 05:40

EmilOhlssonARM

v1.12.0

8ca2267

New Advanced SIMD micro-kernels:
- Matrix multiplication (MxN) Micro-kernels of QAI8DX LHS and QSI4CX RHS with BF16 output, optimized for FEAT_I8MM.
- Matrix multiplication (1xN) Micro-kernels of QAI8DX LHS and QSI4CX RHS with BF16 output, optimized for FEAT_DotProd.
- Matrix multiplication (MxN) Micro-kernels of QAI8DX LHS and QSI4C32 RHS with BF16 output, optimized for FEAT_I8MM.
- Matrix multiplication (1xN) Micro-kernels of QAI8DX LHS and QSI4C32 RHS with BF16 output, optimized for FEAT_DotProd.
New SME micro-kernels:
- Matrix multiplication (1xN) of F32 LHS and RHS with F32 output, using instructions compatible with FEAT_SME.
- Matrix multiplication (1xN) of F16 LHS and RHS with F16 output, using instructions compatible with FEAT_SME.
Convert SME transposed RHS packing micro-kernels to pure assembly:
- kai_rhs_pack_nxk_f32p2vlx1biasf32_f32_f32_sme
- kai_rhs_pack_nxk_x16p2vlx2b_x16_x16_sme
Include more micro-kernels in MSVC build:
- kai_matmul_clamp_f32_f32_f32p8x1biasf32_6x8x4_neon_mla
- kai_lhs_quant_pack_qsi8d32p_f32_neon
- kai_rhs_pack_kxn_qsi8cxp_qsi8cx_neon
- kai_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon
- kai_rhs_pack_nxk_qsi4cxps1s0_qsu4cxs1s0_neon
- kai_rhs_pack_nxk_qsi8cxp_qsi8cx_neon
Fixes:
- Update kai_kernel_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa to improve accuracy
- Convert common SME/SME2 code into assembly file kai_common_sme_asm.S
Documentation:
- Added ONNX Runtime MLAS library integration example.

30 Mar 05:40

EmilOhlssonARM

v1.11.0

f362d32

New Advanced SIMD micro-kernels:
- Optimized version of kai_rhs_pack_nxk_qsi4c32p_qsu4c32s1s0 kernel for block depth of 4 bytes (kai_rhs_pack_nxk_qsi4c32pnrx4_qsu4c32s1s0_neon)
Improve performance of kai_rhs_pack_nxk_qsi4c32pnrx8_qsu4c32s1s0_neon

30 Mar 05:40

EmilOhlssonARM

v1.10.0

184e45c

Convert SME and SME2 imatmul micro-kernels to use pure assembly, and add MSVC support. Affects:
- kai_imatmul_clamp_f16_f16p2vlx2_f16p2vlx2_2vlx2vl_sme2_mopa
- kai_imatmul_clamp_f32_f32p2vlx1_f32p2vlx1b_2vlx2vl_sme2_mopa
- kai_imatmul_clamp_qai8_qai8p2vlx4_qsi8cxpsb2vlx4_2vlx2vl_sme2_mopa
- kai_lhs_imatmul_pack_x16p2vlx2_x16p_sme
- kai_lhs_imatmul_pack_x32p2vlx1_x32p_sme
- kai_lhs_imatmul_pack_x8p2vlx4_x8p_sme
- kai_rhs_imatmul_pack_kxn_qsi8cxp2vlx4sb_qs8cx_f32_i32_sme
- kai_rhs_imatmul_pack_kxn_x16p2vlx2b_x16_x16_sme
- kai_rhs_imatmul_pack_kxn_x32p2vlx1b_x32_x32_sme
Convert SME and SME2 matmul micro-kernels to pure assembly, and add MSVC support. Affects:
- kai_lhs_pack_f32p2vlx1_f32_sme
- kai_lhs_pack_x16p2vlx2_x16_sme
- kai_lhs_pack_x8p2vlx4_x8_sme
- kai_matmul_clamp_f16_f16_f16p2vlx2b_1x16vl_sme2_dot
- kai_matmul_clamp_f16_f16p2vlx2_f16p2vlx2_2vlx2vl_sme2_mopa
- kai_matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
- kai_matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla
- kai_matmul_clamp_f32_f32p2vlx1_f32p2vlx1biasf32_sme2_mopa
- kai_matmul_clamp_qai8_qai8_qsi8cxp2vlx4sb_1x16vl_sme2_dot
- kai_matmul_clamp_qai8_qai8p2vlx4_qsi8cxpsb2vlx4_2vlx2vl_sme2_mopa
- kai_rhs_pack_kxn_f32p16vlx1b_f32_f32_sme
- kai_rhs_pack_kxn_f32p2vlx1biasf32_f32_f32_sme
- kai_rhs_pack_kxn_qsi8cxp2vlx4sb_qs8cx_f32_i32_sme
- kai_rhs_pack_kxn_x16p2vlx2b_x16_x16_sme
New Advanced SIMD micro-kernels:
- Matrix multiplication (MxN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F32 output, optimized for FEAT_DotProd.
- Matrix multiplication (MxN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F16 output, optimized for FEAT_DotProd.
- Matrix multiplication (1xN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F32 output, optimized for FEAT_DotProd.
- Matrix multiplication (1xN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F16 output, optimized for FEAT_DotProd.
- Optimized version of kai_rhs_pack_nxk_qsi4c32p_qsu4c32s1s0 kernel for block depth of 8 bytes (kai_rhs_pack_nxk_qsi4c32pnrx8_qsu4c32s1s0_neon)
New SME micro-kernels:
- Added GEMM F16 and F32 kernels using the SME1 MOPA instruction, with block size 2VLx2VL.
Added Convolution example using SME2 Indirect Matmul Kernels
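For readers unfamiliar with how an indirect matmul can express a convolution, here is a minimal Python sketch of the general idea. It is a conceptual illustration under stated assumptions, not the KleidiAI example or API: the names imatmul, indirection, and rows are hypothetical, row indices stand in for the row pointers a real kernel would use, and packing is omitted.

```python
import numpy as np


def imatmul(indirection, rows, rhs):
    """Multiply an indirectly addressed LHS by rhs.

    Hypothetical helper for illustration only.
    indirection: (m, taps) indices into rows; one entry per output row and kernel tap
    rows:        (num_rows, c) input rows, including a zero row used for padding
    rhs:         (taps * c, n) weights, one (c, n) slice per tap
    """
    m, taps = indirection.shape
    c = rows.shape[1]
    out = np.zeros((m, rhs.shape[1]), dtype=np.float32)
    for t in range(taps):
        gathered = rows[indirection[:, t]]          # (m, c) rows for this tap
        out += gathered @ rhs[t * c:(t + 1) * c]    # accumulate this tap's product
    return out


# 1D convolution, kernel size 3, "same" padding, one input channel.
x = np.array([1, 2, 3, 4], dtype=np.float32)
rows = np.concatenate([x, [0.0]]).reshape(-1, 1).astype(np.float32)  # index 4 = zero row
pad = 4
indirection = np.array([[pad, 0, 1], [0, 1, 2], [1, 2, 3], [2, 3, pad]])
rhs = np.array([[1.0], [10.0], [100.0]], dtype=np.float32)

result = imatmul(indirection, rows, rhs)

# Reference: materialise the im2col matrix explicitly and do a plain matmul.
reference = rows[indirection].reshape(4, 3) @ rhs
```

The point of the indirection table is that the im2col matrix never has to be materialised: padding is handled by pointing at a shared zero row, and overlapping windows reuse the same input rows.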
Fixes:
- Fix issue where kai_get_m_step() returns the incorrect value for kernels:
  - matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
  - matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla
- Fix issue with negative values handling in kai_rhs_pack_nxk_qsi4cxps1s0_qsu4cxs1s0_neon
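As background on why the value reported by a step query such as kai_get_m_step() matters, a caller may use it to split the M dimension into tiles. The sketch below is a generic Python illustration of that tiling, not KleidiAI code: tile_rows and its arguments are hypothetical names, and the release note does not state which value was returned before the fix.

```python
def tile_rows(m, m_step):
    """Split m output rows into (start, count) tiles of at most m_step rows.

    Hypothetical helper for illustration only.
    """
    if m_step <= 0:
        raise ValueError("m_step must be positive")
    return [(start, min(m_step, m - start)) for start in range(0, m, m_step)]


tiles = tile_rows(5, 2)
covered = sum(count for _, count in tiles)
```

Because the tiles are derived entirely from the step, a step that does not match what the kernel processes per call would leave the caller's partitioning out of sync with the kernel.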

30 Mar 05:39

EmilOhlssonARM

v1.9.0

2470882

Extend support for signed 4-bit integer inputs in kai_rhs_pack_nxk_qsi4cxps1s0_qsu4cxs1s0_neon.
Add imatmul documentation.
Better out-of-bounds access detection support in testing framework.
New SME2 micro-kernels:
- Matrix multiplication (1xN) of QAI8DX LHS and QSI8CX RHS to produce F32 output.
- Matrix multiplication (MxN) of QAI8DX LHS and QSI8CX RHS to produce F32 output.
Fixes:
- Address segmentation faults in benchmarking tool.
- Fix clamping issues for FP16 and BF16 in testing framework.

30 Mar 05:39

EmilOhlssonARM

v1.8.0

cca02c2

New Advanced SIMD micro-kernels:
- Matrix multiplication (MxN) Micro-kernels of QAI8DX LHS and QSI8CX RHS with F16 output, optimized for FEAT_I8MM and FEAT_DotProd.
- Matrix multiplication (1xN) Micro-kernels of QAI8DX LHS and QSI8CX RHS with F16 output, optimized for FEAT_DotProd.
New SME micro-kernels:
- Indirect matrix multiplication (MxN) of F16 input and output.
  - Packing kernels for LHS and RHS
- Indirect matrix multiplication (MxN) of F32 input and output.
  - Packing kernels for LHS and RHS
New SME2 micro-kernels:
- Indirect matrix multiplication (MxN) of F16 input and output.
  - Matrix multiplication of packed indirect LHS and packed RHS
- Indirect matrix multiplication (MxN) of F32 input and output.
  - Matrix multiplication of packed indirect LHS and packed RHS
Disable link-time optimization for the micro-kernel library.
