Releases · ARM-software/kleidiai

11 May 08:24

james-gross-arm

v1.25.0

85862cd

v1.25.0 Latest

Optimizations
- Optimize rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon for Int4 GEMM/GEMV kernels
Fixes
- Fix RHS tail handling in kai_matmul_clamp_f32_qai8dxp1vlx4_qsi4cxp4vlx4_1vlx4vl_sme_mopa.

23 Apr 09:12

EmilOhlssonARM

v1.24.0

0b2ee51

v1.24.0

New SME micro-kernels
- Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSI4CXP RHS with F32 output.
- Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSI4CXP RHS with F32 output.

30 Mar 05:42

EmilOhlssonARM

v1.23.0

47b7761

v1.23.0

New SME2 micro-kernels
- Add SME2 elastic GEMM micro-kernels.
  - The micro-kernel consists of a primary micro-kernel with an 8 vscale x 8 vscale (2VLx2VL)
    output block, plus further micro-kernels with different output block sizes to handle the edges.
  - Data type: FP32.
  - Instruction: SME2 MOPA.
  - New naming rule and API design are introduced with the elastic GEMM micro-kernel.
- Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSI8CXP RHS with F16 output.
- Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSI8CXP RHS with F16 output.
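As an illustration only (this is not KleidiAI code), the primary/edge split described for the elastic GEMM micro-kernels can be sketched as a tiling plan. It assumes that vscale is the vector length divided by 128 bits, so one vector holds 4 * vscale FP32 values and 2VL spans 8 * vscale output elements. The function name `elastic_tile_plan` and the choice to model every edge as a truncated tile are hypothetical; the release note does not state which output block shapes the edge micro-kernels actually use.

```python
def elastic_tile_plan(m, n, vscale):
    """Cover an m x n FP32 output with 2VLx2VL primary tiles plus edge tiles.

    Returns a list of (kind, row, col, rows, cols) tuples.
    """
    tile = 8 * vscale  # 2 vectors * (4 * vscale) FP32 lanes per vector
    plan = []
    for row in range(0, m, tile):
        rows = min(tile, m - row)
        for col in range(0, n, tile):
            cols = min(tile, n - col)
            # A full-size block goes to the primary kernel; anything
            # smaller is treated as an edge block in this sketch.
            kind = "primary" if rows == tile and cols == tile else "edge"
            plan.append((kind, row, col, rows, cols))
    return plan


plan = elastic_tile_plan(20, 16, vscale=1)
primary = [t for t in plan if t[0] == "primary"]
edge = [t for t in plan if t[0] == "edge"]
```

With a 128-bit vector length (vscale = 1) the primary block is 8x8, so a 20x16 output is covered by four primary blocks and two 4x8 edge blocks.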
Extended the following kernels to support variable block length
- kai_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon
- kai_lhs_quant_pack_qsi8d32p_f32_neon
- kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa
- kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot
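To show what a variable block length means for per-block quantization, here is a minimal sketch, assuming symmetric signed 8-bit quantization with one scale per block. The function `quantize_blocks`, the scale rule (largest magnitude divided by 127), and the data layout are illustrative assumptions, not the packing format used by the kernels listed above.

```python
def quantize_blocks(values, block_len):
    """Quantize values to int8 in blocks of block_len, one scale per block.

    Returns a list of (scale, quantized_values) tuples.
    """
    if block_len <= 0 or len(values) % block_len != 0:
        raise ValueError("length must be a positive multiple of block_len")
    blocks = []
    for start in range(0, len(values), block_len):
        block = values[start:start + block_len]
        max_abs = max(abs(v) for v in block)
        # Use a scale of 1.0 for an all-zero block to avoid dividing by zero.
        scale = max_abs / 127.0 if max_abs > 0 else 1.0
        quantized = [max(-127, min(127, round(v / scale))) for v in block]
        blocks.append((scale, quantized))
    return blocks


row = [1.0, -2.0, 0.5, 4.0]
two_per_block = quantize_blocks(row, 2)   # two scales
four_per_block = quantize_blocks(row, 4)  # one scale for the whole row
```

A shorter block length gives each scale fewer values to cover, which usually reduces quantization error at the cost of storing more scales.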
Documentation
- Added overview of micro-kernels
Fixes
- Update the kai_matmul_clamp_f32_qai8dxp1vlx8_qsi4cxp4vlx8_1vlx4vl_sme2_mopa kernel to multiply the zero-points and row-sums as integers instead of as floats, to improve accuracy.
- Implement clamping in kernels where it was missing to match their naming
  - kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4x4_1x4_neon_dotprod
  - kai_matmul_clamp_f32_qsi8d32p4x4_qsi4c32p4x4_16x4_neon_dotprod
  - kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa
  - kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot
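The zero-point fix in this release relies on the identity sum((a - z) * b) = sum(a * b) - z * sum(b), where z is the LHS zero-point and sum(b) is the RHS row-sum. The sketch below checks that identity in integer arithmetic and then shows one way a float32 computation can lose the last bit. It is an illustration under assumptions: the helper names are hypothetical, float32 is simulated with the `struct` module, and the magnitudes are chosen to make the rounding visible rather than taken from the kernel.

```python
import struct


def to_f32(x):
    """Round a number to the nearest IEEE-754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", float(x)))[0]


def dot_reference(lhs_q, rhs_q, lhs_zp):
    """Subtract the zero-point from every LHS element, all in integers."""
    return sum((a - lhs_zp) * b for a, b in zip(lhs_q, rhs_q))


def dot_int_correction(lhs_q, rhs_q, lhs_zp):
    """Raw dot product minus zero_point * row_sum, all in integers."""
    raw = sum(a * b for a, b in zip(lhs_q, rhs_q))
    row_sum = sum(rhs_q)
    return raw - lhs_zp * row_sum


def f32_correction(raw, lhs_zp, row_sum):
    """Apply the same correction after converting each term to float32."""
    product = to_f32(to_f32(lhs_zp) * to_f32(row_sum))
    return to_f32(to_f32(raw) - product)


lhs_q, rhs_q, lhs_zp = [12, -7, 100, 3], [-8, 7, 1, -3], 5
exact = dot_int_correction(lhs_q, rhs_q, lhs_zp)

# Values above 2**24 are no longer exactly representable in float32.
raw, zp, row_sum = 2**24 + 1, 128, 131072
int_result = raw - zp * row_sum
f32_result = f32_correction(raw, zp, row_sum)
```

In the second case the integer path returns 1, while the simulated float32 path rounds the accumulator to 2**24 first and returns 0.0.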

30 Mar 05:42

EmilOhlssonARM

v1.22.0

98a6df7

v1.22.0

New SME2 micro-kernels
- Matrix Multiplication (MxN) Micro-Kernels of F16 LHS and QSI4C32P RHS with F32 input and output.
New Advanced SIMD packing kernel for F16PMRX2 LHS
Optimizations
- Optimize kai_rhs_pack_nxk_qsu2cxp4vlx4_qsu2cx_neon and kai_lhs_quant_pack_qai8dxp_f32 further for the block depth (kr/sr = 4) with Advanced SIMD
Fixes
- Minor fix to the packing parameters (kr and sr) in matmul_clamp_f32_qai8dxp_qsu2cxp kernels.
- Fix the vectorized rounding instruction in kai_lhs_quant_pack_qai8dxp_f32 for the block depth (kr/sr = 8) to be consistent with the scalar implementation.

30 Mar 05:42

EmilOhlssonARM

v1.21.0

ff361c3

v1.21.0

New SME2 micro-kernels
- Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSU2CXP RHS with F32 output.
- Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSU2CXP RHS with F32 output.
New Packing kernel for QSU2CXP RHS.
New SVE micro-kernels
- Indirect matrix multiplication (MxN) Micro-Kernel of F32 input and output.
Added benchmarking for SVE indirect matrix multiplication (imatmul)
Add Bazel 9 support
Fixes
- Addressed undefined behavior affecting polymorphic handling in test framework
- Fixed GTest random seed to prevent flaky tests

30 Mar 05:42

EmilOhlssonARM

v1.20.0

4d2725c

v1.20.0

Added support for compiling with exceptions disabled
Extended the benchmarking suite to support depthwise convolution benchmarking
- Added kai_dwconv_clamp_f32_f32_f32p1vlx1b_3x3_s1_4xc_sme2_mla to benchmarking
Fixes
- Fixed clamping not being applied in matmul_clamp_f32_bf16p1x4_bf16p12x4b_1x36_neon_dot

30 Mar 05:42

EmilOhlssonARM

v1.19.0

7d82645

v1.19.0

Added new unit test framework
New SME2 micro-kernels
- Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSI4C32P RHS with F32 input and output.
- Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSI4C32P RHS with F32 input and output.
Add clang-cl compiler macros for kai_matmul_clamp_f32_f32_f32p8x1biasf32_6x8x4_neon_mla_asm micro-kernel
Fixes
- Fix incorrect API in BF16 kernel interface

30 Mar 05:41

EmilOhlssonARM

v1.18.0

eb1e608

v1.18.0

Fixes
- Add Null Bias support for rhs_pack_kxn_x16p32x1b_x16_x16_neon.
- Updated description of matmul file name from m_step x n_step to m_block x n_block
- Clamp after scaling in matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme_mopa.
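The order of clamping and scaling changes the result, which is why the last fix in this list matters. A minimal sketch, assuming the clamp bounds are expressed in output (F32) units and a single scalar scale; the function names and values are illustrative and are not taken from the kernel:

```python
def clamp(value, lo, hi):
    """Limit value to the closed range [lo, hi]."""
    return max(lo, min(hi, value))


def scale_then_clamp(acc, scale, lo, hi):
    """Clamp the final, scaled output (the order named in the fix)."""
    return clamp(acc * scale, lo, hi)


def clamp_then_scale(acc, scale, lo, hi):
    """Clamp the raw accumulator first; the bounds then apply to the wrong quantity."""
    return clamp(acc, lo, hi) * scale


acc, scale, lo, hi = 100, 0.5, 0.0, 40.0
after = scale_then_clamp(acc, scale, lo, hi)    # 40.0
before = clamp_then_scale(acc, scale, lo, hi)   # 20.0
```

Here the scaled value 50.0 exceeds the upper bound, so clamping after scaling returns 40.0, whereas clamping first returns 20.0 and never reaches the requested bound.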
New SVE micro-kernels
- Matrix Multiplication (MxN) Micro-Kernels with F32 input and output.
Update the example matmul_clamp_f32_qsi8d32p_qsi4c32p to demonstrate how a micro-kernel can be used in a multithreaded environment.
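A rough sketch of the multithreading idea, not the code from the example: the output is split into disjoint row ranges aligned to a step size, and each thread computes only its own range, so no locking is needed. Splitting along rows in multiples of `m_step` is an assumption here, `matmul_rows` is a pure-Python stand-in for a micro-kernel call, and all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def row_ranges(m, m_step, num_threads):
    """Split [0, m) into contiguous ranges whose starts are multiples of m_step."""
    steps = -(-m // m_step)                # ceil(m / m_step)
    per_thread = -(-steps // num_threads)  # steps handled by each thread
    ranges = []
    start = 0
    while start < m:
        end = min(m, start + per_thread * m_step)
        ranges.append((start, end))
        start = end
    return ranges


def matmul_rows(lhs, rhs, out, start, end):
    """Stand-in for a micro-kernel call: fill rows start..end-1 of out only."""
    k, n = len(rhs), len(rhs[0])
    for i in range(start, end):
        out[i] = [sum(lhs[i][p] * rhs[p][j] for p in range(k)) for j in range(n)]


def threaded_matmul(lhs, rhs, m_step=4, num_threads=2):
    """Run matmul_rows on disjoint row ranges, one task per range."""
    m = len(lhs)
    out = [None] * m
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [pool.submit(matmul_rows, lhs, rhs, out, s, e)
                   for s, e in row_ranges(m, m_step, num_threads)]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return out


lhs = [[1, 2], [3, 4], [5, 6], [7, 8], [9, 10]]
identity = [[1, 0], [0, 1]]
result = threaded_matmul(lhs, identity)
```

Because each task writes a different set of output rows, the threads share the read-only inputs and never touch the same output element.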
Documentation
- Update documentation to use markdown syntax
- Added FAQ
- Added a section that describes the directory structure and a section that describes the different example implementations

30 Mar 05:41

EmilOhlssonARM

v1.17.0

94d6cc4

v1.17.0

Fixes
- Add Null Bias support for rhs_pack_kxn_x32p16x1b_x32_x32_neon.
- Fixed the incorrect m_step value reported by the following micro-kernels:
  - kai_lhs_quant_pack_qai8dxp_bf16_neon
  - kai_lhs_quant_pack_qai8dxp_f16_neon
  - kai_lhs_quant_pack_qai8dxp_f32
  - kai_lhs_quant_pack_qsi8d32p4x8sb_f32_neon
  - kai_lhs_quant_pack_qsi8d32p_f32
  - kai_lhs_quant_pack_qsi8d32p_f32_neon
  - kai_lhs_quant_pack_qsi8d32pscalef32_f16_neon
  - kai_lhs_quant_pack_qsi8d32pscalef32_f32_neon

30 Mar 05:41

EmilOhlssonARM

v1.16.0

84796ec

v1.16.0

Extended the benchmarking framework to support multiple operators.
- Initial support for matrix multiplication (matmul) & indirect matrix multiplication (imatmul)
- Added all imatmul and matmul micro-kernels to the benchmark suite
Fixes:
- All SME and SME2 micro-kernels now commit the ZA lazy save buffer when building with SME support.
- Fixed incorrect handling of zero point and scale in two packing kernels, which caused incorrect de-quantisation in certain cases:
  - kai_rhs_pack_nxk_qai4c32ps1s0nrx4_qau4c32s0s1_f32_f32_f32_neon
  - kai_rhs_pack_nxk_qai4c32ps1s0nrx4_qau4c32s1s0_f32_f32_f32_neon
New SVE micro-kernels (256-bit vector length specific):
- Matrix multiplication (MxN) Micro-kernels of QSI8DX LHS and QSI4CX RHS with F32 input and output.
- Matrix multiplication (1xN) Micro-kernels of QSI8DX LHS and QSI4CX RHS with F32 input and output.
