Releases: ARM-software/kleidiai
v1.16.0
- Extended the benchmarking framework to support multiple operators.
- Initial support for matrix multiplication (matmul) & indirect matrix multiplication (imatmul)
- Added all imatmul and matmul micro-kernels to the benchmark suite
- Fixes:
- All SME and SME2 micro-kernels now commit ZA lazy save buffer when building with SME support.
- Fixed incorrect handling of zero point and scale in two packing kernels, which caused incorrect de-quantisation in certain cases:
- kai_rhs_pack_nxk_qai4c32ps1s0nrx4_qau4c32s0s1_f32_f32_f32_neon
- kai_rhs_pack_nxk_qai4c32ps1s0nrx4_qau4c32s1s0_f32_f32_f32_neon
- NEW SVE micro-kernels (256-bit Vector length specific):
- Matrix multiplication (MxN) Micro-kernels of QSI8DX LHS and QSI4CX RHS with F32 input and output.
- Matrix multiplication (1xN) Micro-kernels of QSI8DX LHS and QSI4CX RHS with F32 input and output.
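To illustrate why the zero point and scale fix listed under v1.16.0 matters, here is a minimal sketch of asymmetric 4-bit de-quantisation for a single block. It assumes the common convention `x = (q - zero_point) * scale` with unsigned 4-bit values in the range 0 to 15; the function name and the sample values are hypothetical and do not reflect KleidiAI's API or its packed data layout.

```python
def dequantize_block(q_values, scale, zero_point):
    """De-quantise one block of unsigned 4-bit values.

    Assumed convention (not taken from the release notes):
        x = (q - zero_point) * scale
    """
    for q in q_values:
        if not 0 <= q <= 15:
            raise ValueError(f"value {q} does not fit in 4 bits")
    return [(q - zero_point) * scale for q in q_values]


# Hypothetical block: three quantised values sharing one scale and zero point.
block = [0, 5, 15]
correct = dequantize_block(block, scale=0.2, zero_point=5)

# If the zero point is mishandled (here: off by one), every value in the
# block is shifted by exactly one quantisation step.
shifted = dequantize_block(block, scale=0.2, zero_point=4)
errors = [abs(a - b) for a, b in zip(correct, shifted)]
```

Because the scale and zero point are shared by every value in a block, a mistake in either one affects the whole block rather than a single element.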
v1.15.1
- Fixes
- Added missing checks for bf16 support for quantised matmuls with bf16 input/output.
v1.15.0
- New SME micro-kernels:
- Matrix multiplication (MxN) Micro-kernels of QAI8DX LHS and QSI8CX RHS with F32 input and output.
- Matrix multiplication (1xN) Micro-kernels of QAI8DX LHS and QSI8CX RHS with F32 input and output.
- Wider compiler compatibility for the following kernels:
- kai_matmul_clamp_f16_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa
- kai_matmul_clamp_f16_qsi8d32p1x4_qai4c32p4vlx4_1x4vl_sme2_dot
- kai_matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
- kai_matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla
- kai_matmul_clamp_f32_f32p2vlx1_f32p2vlx1biasf32_sme2_mopa
- kai_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa
- kai_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme_mopa
- kai_matmul_clamp_f32_qai8dxp1x4_qsi8cxp4vlx4_1x4vl_sme2_dot
- kai_matmul_clamp_f32_qai8dxp1x4_qsi8cxp4vlx4_1x4vl_sme_dot
- kai_matmul_clamp_f32_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa
- kai_matmul_clamp_f32_qsi8d32p1x4_qai4c32p4vlx4_1x4vl_sme2_dot
- kai_matmul_clamp_qai8_qai8p2vlx4_qsi8cxpsb2vlx4_2vlx2vl_sme2_mopa
v1.14.0
- New SME micro-kernels:
- Indirect matrix multiplication (MxN) of QAI8 input and output.
- Indirect matrix multiplication (MxN) of F16 input and output.
- Indirect matrix multiplication (MxN) of F32 input and output.
- Matrix multiplication (MxN) of QAI8 LHS and RHS with QAI8 output.
- Depthwise Convolution RHS F32 Packing kernel.
- New SME2 micro-kernels:
- Depthwise Convolution (3x3) Planar kernel of F32 LHS and Packed F32 RHS with F32 output using MLA.
- Convert SME2 matmul micro-kernels to pure assembly, and add MSVC support.
- Affects: kai_matmul_clamp_f32_bf16p2vlx2_bf16p2vlx2_2vlx2vl_sme2_mopa
- Optimizations:
- Packing functions kai_rhs_pack_nxk_qai4c32ps1s0nrx4_qau4c32s1s0_f32_f32_f32_neon and kai_rhs_pack_nxk_qai4c32ps1s0nrx4_qau4c32s0s1_f32_f32_f32_neon have been further optimized.
- Packing function kai_lhs_quant_pack_qai8dxp_f16_neon has been further optimized.
- New Advanced SIMD micro-kernels:
- Wider 6x32 block size variants of FP16 Matrix Multiplication, including a variant optimized for the Arm® Cortex®-A55 processor.
- Wider 6x16 block size variants of FP32 Matrix Multiplication, including a variant optimized for the Arm® Cortex®-A55 processor.
- Fixes:
- Fix out-of-bounds read of intermediate values in kai_matmul_clamp_f16_qsi8d32p1vlx4_qai4c32p4vlx4_1vlx4vl_sme2_mopa micro-kernel
- Fix out-of-bounds write in kai_matmul_clamp_f16_f16_f16p2vlx2b_1x8vl_sme_mla
- Fix out-of-bounds read in kai_matmul_clamp_qai8_qai8_qsi8cxp2vlx4sb_1x16vl_sme2_dot
v1.13.0
- Improve performance of lhs_quant_pack_qsi8d32p_f32 using Advanced SIMD reimplemented as lhs_quant_pack_qsi8d32p4x8sb_f32_neon.
- New SME2 micro-kernels:
- Matrix multiplication (1xN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F16 output, optimized for FEAT_SME2.
- Matrix multiplication (MxN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F16 output, optimized for FEAT_SME2.
- Matrix multiplication (MxN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F32 output, optimized for FEAT_SME2.
- Matrix multiplication (1xN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F32 output, optimized for FEAT_SME2.
v1.12.0
- New Advanced SIMD micro-kernels:
- Matrix multiplication (MxN) Micro-kernels of QAI8DX LHS and QSI4CX RHS with BF16 output, optimized for FEAT_I8MM.
- Matrix multiplication (1xN) Micro-kernels of QAI8DX LHS and QSI4CX RHS with BF16 output, optimized for FEAT_DotProd.
- Matrix multiplication (MxN) Micro-kernels of QAI8DX LHS and QSI4C32 RHS with BF16 output, optimized for FEAT_I8MM.
- Matrix multiplication (1xN) Micro-kernels of QAI8DX LHS and QSI4C32 RHS with BF16 output, optimized for FEAT_DotProd.
- New SME micro-kernels:
- Matrix multiplication (1xN) of F32 LHS and RHS with F32 output, using instructions compatible with FEAT_SME.
- Matrix multiplication (1xN) of F16 LHS and RHS with F16 output, using instructions compatible with FEAT_SME.
- Convert SME transposed RHS packing micro-kernels to pure assembly:
- kai_rhs_pack_nxk_f32p2vlx1biasf32_f32_f32_sme
- kai_rhs_pack_nxk_x16p2vlx2b_x16_x16_sme
- Include more micro-kernels in MSVC build:
- kai_matmul_clamp_f32_f32_f32p8x1biasf32_6x8x4_neon_mla
- kai_lhs_quant_pack_qsi8d32p_f32_neon
- kai_rhs_pack_kxn_qsi8cxp_qsi8cx_neon
- kai_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon
- kai_rhs_pack_nxk_qsi4cxps1s0_qsu4cxs1s0_neon
- kai_rhs_pack_nxk_qsi8cxp_qsi8cx_neon
- Fixes
- Update kai_kernel_matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme2_mopa to improve accuracy
- Convert common SME/SME2 code into assembly file kai_common_sme_asm.S
- Documentation
- Added ONNX Runtime MLAS library integration example.
v1.11.0
- New Advanced SIMD micro-kernels:
- Optimized version of kai_rhs_pack_nxk_qsi4c32p_qsu4c32s1s0 kernel for block depth of 4 bytes (kai_rhs_pack_nxk_qsi4c32pnrx4_qsu4c32s1s0_neon)
- Improve performance of kai_rhs_pack_nxk_qsi4c32pnrx8_qsu4c32s1s0_neon
v1.10.0
- Convert SME and SME2 imatmul micro-kernels to use pure assembly, and add MSVC support. Affects:
- kai_imatmul_clamp_f16_f16p2vlx2_f16p2vlx2_2vlx2vl_sme2_mopa
- kai_imatmul_clamp_f32_f32p2vlx1_f32p2vlx1b_2vlx2vl_sme2_mopa
- kai_imatmul_clamp_qai8_qai8p2vlx4_qsi8cxpsb2vlx4_2vlx2vl_sme2_mopa
- kai_lhs_imatmul_pack_x16p2vlx2_x16p_sme
- kai_lhs_imatmul_pack_x32p2vlx1_x32p_sme
- kai_lhs_imatmul_pack_x8p2vlx4_x8p_sme
- kai_rhs_imatmul_pack_kxn_qsi8cxp2vlx4sb_qs8cx_f32_i32_sme
- kai_rhs_imatmul_pack_kxn_x16p2vlx2b_x16_x16_sme
- kai_rhs_imatmul_pack_kxn_x32p2vlx1b_x32_x32_sme
- Convert SME and SME2 matmul micro-kernels to pure assembly, and add MSVC support. Affects:
- kai_lhs_pack_f32p2vlx1_f32_sme
- kai_lhs_pack_x16p2vlx2_x16_sme
- kai_lhs_pack_x8p2vlx4_x8_sme
- kai_matmul_clamp_f16_f16_f16p2vlx2b_1x16vl_sme2_dot
- kai_matmul_clamp_f16_f16p2vlx2_f16p2vlx2_2vlx2vl_sme2_mopa
- kai_matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
- kai_matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla
- kai_matmul_clamp_f32_f32p2vlx1_f32p2vlx1biasf32_sme2_mopa
- kai_matmul_clamp_qai8_qai8_qsi8cxp2vlx4sb_1x16vl_sme2_dot
- kai_matmul_clamp_qai8_qai8p2vlx4_qsi8cxpsb2vlx4_2vlx2vl_sme2_mopa
- kai_rhs_pack_kxn_f32p16vlx1b_f32_f32_sme
- kai_rhs_pack_kxn_f32p2vlx1biasf32_f32_f32_sme
- kai_rhs_pack_kxn_qsi8cxp2vlx4sb_qs8cx_f32_i32_sme
- kai_rhs_pack_kxn_x16p2vlx2b_x16_x16_sme
- New Advanced SIMD micro-kernels:
- Matrix multiplication (MxN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F32 output, optimized for FEAT_DotProd.
- Matrix multiplication (MxN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F16 output, optimized for FEAT_DotProd.
- Matrix multiplication (1xN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F32 output, optimized for FEAT_DotProd.
- Matrix multiplication (1xN) Micro-kernels of QSI8D32 LHS and QAI4C32 RHS with F16 output, optimized for FEAT_DotProd.
- Optimized version of kai_rhs_pack_nxk_qsi4c32p_qsu4c32s1s0 kernel for block depth of 8 bytes (kai_rhs_pack_nxk_qsi4c32pnrx8_qsu4c32s1s0_neon)
- New SME micro-kernels:
- Added GEMM F16 and F32 kernels using SME1 MOPA instruction, block size 2VLx2VL.
- Added Convolution example using SME2 Indirect Matmul Kernels
- Fixes:
- Fix issue where kai_get_m_step() returns the incorrect value for kernels
- matmul_clamp_f32_f32_f32p16vlx1b_1x16vl_sme2_mla
- matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla
- Fix issue with negative values handling in kai_rhs_pack_nxk_qsi4cxps1s0_qsu4cxs1s0_neon
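The `kai_get_m_step()` fix in v1.10.0 is easier to appreciate with a sketch of how a caller might use a step value to tile the M dimension. The helpers below are hypothetical stand-ins written for illustration only; they are not KleidiAI functions, and the release note does not say in which direction the reported value was wrong.

```python
def row_tiles(m, m_step):
    """Yield (row_start, row_count) tiles that cover m rows in steps of m_step."""
    if m_step <= 0:
        raise ValueError("m_step must be positive")
    for start in range(0, m, m_step):
        yield start, min(m_step, m - start)


def rows_written(m, reported_step, rows_per_call):
    """Rows produced when the caller advances by reported_step but each
    kernel call only produces rows_per_call rows (illustrative mismatch)."""
    written = set()
    for start in range(0, m, reported_step):
        written.update(range(start, min(start + rows_per_call, m)))
    return written


# With a consistent step, the tiles cover every row exactly once.
tiles = list(row_tiles(10, 4))

# If the reported step is larger than what one call produces, rows are skipped.
partial = rows_written(10, reported_step=4, rows_per_call=1)
```

A caller that sizes its loop from the reported step depends on that value matching what the kernel processes per call, which is why an incorrect step can surface as missing or overlapping output rows.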
v1.9.0
- Extend support for signed 4-bit integer inputs in kai_rhs_pack_nxk_qsi4cxps1s0_qsu4cxs1s0_neon.
- Add imatmul documentation
- Better out-of-bounds access detection support in testing framework.
- New SME2 micro-kernels:
- Matrix multiplication (1xN) of QAI8DX LHS and QSI8CX RHS to produce F32 output.
- Matrix multiplication (MxN) of QAI8DX LHS and QSI8CX RHS to produce F32 output.
- Fixes:
- Address segmentation faults in benchmarking tool.
- Fix clamping issues for FP16 and BF16 in testing framework.
v1.8.0
- New Advanced SIMD micro-kernels:
- Matrix multiplication (MxN) Micro-kernels of QAI8DX LHS and QSI8CX RHS with F16 output, optimized for FEAT_I8MM and FEAT_DotProd.
- Matrix multiplication (1xN) Micro-kernels of QAI8DX LHS and QSI8CX RHS with F16 output, optimized for FEAT_DotProd.
- New SME micro-kernels:
- Indirect matrix multiplication (MxN) of F16 input and output.
- Packing kernels for LHS and RHS
- Indirect matrix multiplication (MxN) of F32 input and output.
- Packing kernels for LHS and RHS
- New SME2 micro-kernels:
- Indirect matrix multiplication (MxN) of F16 input and output.
- Matrix multiplication of packed indirect LHS and packed RHS
- Indirect matrix multiplication (MxN) of F32 input and output.
- Matrix multiplication of packed indirect LHS and packed RHS
- Disable link time optimization for microkernel library
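As a rough illustration of the indirect matrix multiplication (imatmul) kernels mentioned in these notes, the sketch below assumes that "indirect" means the LHS rows are gathered through a table of row references rather than read from one contiguous matrix, which lets a convolution reuse input rows without copying them. All names and the data layout are hypothetical; real KleidiAI kernels operate on packed buffers and are not shown here.

```python
def conv1d_indirection(inp, kernel_width, pad):
    """Build an indirection table for a 1D convolution.

    inp is a list of rows (one row of channel values per position).
    Entry [o][kw] references the input row used by output position o at
    kernel tap kw, or a shared zero row where the tap falls in the padding.
    """
    zero_row = [0.0] * len(inp[0])
    out_len = len(inp) + 2 * pad - kernel_width + 1
    table = []
    for o in range(out_len):
        rows = []
        for kw in range(kernel_width):
            i = o + kw - pad
            rows.append(inp[i] if 0 <= i < len(inp) else zero_row)
        table.append(rows)
    return table


def indirect_matmul(table, rhs):
    """Multiply gathered LHS rows by rhs, where rhs[kw] is a (channels x n) matrix."""
    n = len(rhs[0][0])
    out = []
    for rows in table:
        acc = [0.0] * n
        for kw, row in enumerate(rows):
            for c, x in enumerate(row):
                for j in range(n):
                    acc[j] += x * rhs[kw][c][j]
        out.append(acc)
    return out


# Three input positions, one channel, kernel width 3, padding 1, one output channel.
inp = [[1.0], [2.0], [3.0]]
weights = [[[1.0]], [[10.0]], [[100.0]]]
result = indirect_matmul(conv1d_indirection(inp, kernel_width=3, pad=1), weights)
```

The table stores references to the existing input rows, so overlapping windows share data instead of materialising an im2col copy.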