Releases · ARM-software/kleidiai

12 Jun 13:11

SuhailMunshi

v1.26.0

dc50c2e

v1.26.0 Latest


New SME micro-kernels
- Added an x8 matmul pack micro-kernel family with 4vsx4 blocked layout, without packing bias.
- Added an SME RHS depthwise packing kernel for FP16.
New SME2 micro-kernels
- Added SME2 depthwise indirect micro-kernel for FP16.
- kai_matmul_i32_u8p4vsx4_u8p4vsx4_i32_i32_8vsx8vs_sme2_mopa.
- kai_matmul_clamp_f32_u8p4vsx4_u8p4vsx4_i32_i32_f32_f32_8vsx8vs_sme2_mopa.
Extended the following micro-kernels to support variable block length
- kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4x4_1x4_neon_dotprod
- kai_matmul_clamp_f32_qsi8d32p1x8_qsi4c32p4x8_1x4x32_neon_dotprod
- kai_matmul_clamp_f32_qsi8d32p4x4_qsi4c32p4x4_16x4_neon_dotprod
- kai_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_8x4x32_neon_i8mm
- kai_matmul_clamp_f32_qsi8d32p4x8_qsi4c32p4x8_16x4_neon_i8mm
- kai_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0
- kai_lhs_quant_pack_qsi8d32p4x8sb_f32_neon
Fixes
- Added ZA lazy save to kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme_dot
- Fix QAI8/QSI8CXP matmul test failures by constraining generated qsi32 bias values to preserve int32 accumulator headroom.
- Fix a clamping issue in matmul_clamp_qai8_qai8p_qai8p_test.cpp
- Fix traditional matmul and imatmul packed offset helpers to use packing panel boundaries.
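The bias-headroom fix above can be illustrated with a minimal sketch. It is not the test suite's actual generator: the helper names are hypothetical, and it assumes raw int8 x int8 products (worst case 128 * 128 per product) accumulated over a depth of `k`. The idea is that a generated bias must leave enough room in the int32 accumulator for the worst-case sum of products.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of KleidiAI: largest bias magnitude that
 * still leaves room for k int8 x int8 products in an int32 accumulator. */
static int32_t max_safe_bias_magnitude(size_t k) {
    const int64_t max_product = 128 * 128; /* |(-128) * (-128)| */
    const int64_t worst_case_sum = (int64_t)k * max_product;
    assert(worst_case_sum < INT32_MAX); /* k must be small enough to fit at all */
    return (int32_t)(INT32_MAX - worst_case_sum);
}

/* Clamp a randomly generated bias into the safe range for depth k. */
static int32_t constrain_bias(int32_t bias, size_t k) {
    const int32_t limit = max_safe_bias_magnitude(k);
    if (bias > limit) return limit;
    if (bias < -limit) return -limit;
    return bias;
}
```

With `k = 1024`, the worst-case sum is 16777216, so any bias is constrained to a magnitude of at most `INT32_MAX - 16777216`.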
New Advanced SIMD micro-kernels
- Matrix Multiplication MxN and 1xN Micro-Kernels of QAI8DXP LHS and QSU2CXP RHS with F32 output, optimized for FEAT_DotProd, along with the corresponding RHS packing kernel.
Documentation
- Updated the contribution policy as part of enabling third-party contributions
- Added coding standard and conventions
New Transposed-B RHS packing micro-kernel versions of kai_rhs_pack_kxn_x32p16x1b_x32_x32_neon and kai_rhs_pack_kxn_x16p32x1b_x16_x16_neon:
- kai_rhs_pack_nxk_x16p32x1bx16_x16_x16_neon
- kai_rhs_pack_nxk_x32p16x1bx32_x32_x32_neon
New SME2 FP32 GEMV micro-kernel with 4vsx1 RHS format
New SME2 static Int8 GEMM/GEMV kernels and the RHS packing kernel.
- kai_matmul_clamp_qai8_qai8p4vsx4_qsi8cxp4vsx4bi32sf32_8vsx8vs_sme2_mopa
- kai_matmul_clamp_qai8_qai8_qsi8cxp4vsx4bi32sf32_1x32vs_sme2_dot
- kai_matmul_pack_rhs_kxn_qsi8cxp4vsx4bi32sf32_qsi8_i32_f32_sme


11 May 08:24

james-gross-arm

v1.25.0

85862cd

v1.25.0

Optimizations
- Optimize rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon for Int4 GEMM/GEMV kernels
Fixes
- Fix RHS tail handling in kai_matmul_clamp_f32_qai8dxp1vlx4_qsi4cxp4vlx4_1vlx4vl_sme_mopa.


23 Apr 09:12

EmilOhlssonARM

v1.24.0

0b2ee51

v1.24.0

New SME micro-kernels
- Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSI4CXP RHS with F32 output.
- Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSI4CXP RHS with F32 output.


30 Mar 05:42

EmilOhlssonARM

v1.23.0

47b7761

v1.23.0

New SME2 micro-kernels
- Add SME2 elastic GEMM micro-kernels.
  - The micro-kernel family consists of a primary micro-kernel with an 8 vscale * 8 vscale (2VLx2VL)
    output block, plus additional micro-kernels with different output block sizes to handle the edges.
  - Data type: FP32.
  - Instruction: SME2 MOPA.
  - New naming rule and API design are introduced with the elastic GEMM micro-kernel.
- Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSI8CXP RHS with F16 output.
- Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSI8CXP RHS with F16 output.
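The split between a primary micro-kernel and edge micro-kernels described for the elastic GEMM above can be sketched as a tile count. This is only an illustration under assumptions: `count_tiles` and `tile_counts` are hypothetical names, not KleidiAI API, and the block sizes are passed in as plain integers because the real values depend on the runtime vector length.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t full_tiles; /* tiles fully covered by the primary micro-kernel */
    size_t edge_tiles; /* partial tiles left for the edge micro-kernels */
} tile_counts;

/* Hypothetical helper: split an m x n output into block_m x block_n tiles. */
static tile_counts count_tiles(size_t m, size_t n, size_t block_m, size_t block_n) {
    assert(block_m > 0 && block_n > 0);
    const size_t tiles_m = (m + block_m - 1) / block_m; /* round up */
    const size_t tiles_n = (n + block_n - 1) / block_n; /* round up */
    tile_counts c;
    c.full_tiles = (m / block_m) * (n / block_n);
    c.edge_tiles = tiles_m * tiles_n - c.full_tiles;
    return c;
}
```

For a 20 x 20 output with 8 x 8 blocks, four tiles are full and five are partial, which is the part of the output that the edge micro-kernels would have to cover.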
Extended the following kernels to support variable block length
- kai_rhs_pack_nxk_qsi4c32ps1s0scalef16_qsu4c32s16s0_neon
- kai_lhs_quant_pack_qsi8d32p_f32_neon
- kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa
- kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot
Documentation
- Added overview of micro-kernels
Fixes
- Update the kai_matmul_clamp_f32_qai8dxp1vlx8_qsi4cxp4vlx8_1vlx4vl_sme2_mopa kernel to multiply the zero-points and row-sums as integers instead of float to improve accuracy.
- Implement clamping in kernels where it was missing to match their naming
  - kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4x4_1x4_neon_dotprod
  - kai_matmul_clamp_f32_qsi8d32p4x4_qsi4c32p4x4_16x4_neon_dotprod
  - kai_matmul_clamp_f32_qsi8d32p1vlx4_qsi4c32p4vlx4_1vlx4vl_sme2_mopa
  - kai_matmul_clamp_f32_qsi8d32p1x4_qsi4c32p4vlx4_1x4vl_sme2_sdot
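The accuracy fix for the zero-point and row-sum multiplication can be motivated with a small sketch. It is not the kernel's code (the function names are hypothetical); it only shows why an integer product is exact where a single-precision product may not be: binary32 has a 24-bit significand, so products above 2^24 are rounded.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: exact integer product of zero-point and row-sum. */
static int32_t correction_as_int(int32_t zero_point, int32_t row_sum) {
    const int64_t product = (int64_t)zero_point * (int64_t)row_sum;
    assert(product >= INT32_MIN && product <= INT32_MAX);
    return (int32_t)product;
}

/* Same product computed in single precision; large results are rounded. */
static float correction_as_float(int32_t zero_point, int32_t row_sum) {
    return (float)zero_point * (float)row_sum;
}
```

For example, 127 * 132105 = 16777335 is an odd number above 2^24, so it cannot be represented exactly in binary32, while the integer path returns it unchanged.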


30 Mar 05:42

EmilOhlssonARM

v1.22.0

98a6df7

v1.22.0

New SME2 micro-kernels
- Matrix Multiplication (MxN) Micro-Kernels of F16 LHS and QSI4C32P RHS with F32 input and output.
New Advanced SIMD packing kernel for F16PMRX2 LHS
Optimizations
- Optimize kai_rhs_pack_nxk_qsu2cxp4vlx4_qsu2cx_neon and kai_lhs_quant_pack_qai8dxp_f32 further for the block depth (kr/sr = 4) with Advanced SIMD
Fixes
- Minor fix to the packing parameters (kr and sr) in matmul_clamp_f32_qai8dxp_qsu2cxp kernels.
- Fix the vectorized rounding instruction in kai_lhs_quant_pack_qai8dxp_f32 for the block depth (kr/sr = 8) to be consistent with the scalar implementation.


30 Mar 05:42

EmilOhlssonARM

v1.21.0

ff361c3

v1.21.0

New SME2 micro-kernels
- Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSU2CXP RHS with F32 output.
- Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSU2CXP RHS with F32 output.
New Packing kernel for QSU2CXP RHS.
New SVE micro-kernels
- Indirect matrix multiplication (MxN) Micro-Kernel of F32 input and output.
Added benchmarking for SVE indirect matrix multiplication (imatmul)
Add Bazel 9 support
Fixes
- Addressed undefined behavior affecting polymorphic handling in the test framework
- Fixed GTest random seed to prevent flaky tests


30 Mar 05:42

EmilOhlssonARM

v1.20.0

4d2725c

v1.20.0

Added support for compiling with exceptions disabled
Extended the benchmarking suite to support depthwise convolution benchmarking
- Added kai_dwconv_clamp_f32_f32_f32p1vlx1b_3x3_s1_4xc_sme2_mla to benchmarking
Fixes
- Fixed clamping not being applied in matmul_clamp_f32_bf16p1x4_bf16p12x4b_1x36_neon_dot


30 Mar 05:42

EmilOhlssonARM

v1.19.0

7d82645

v1.19.0

Added new unit test framework
New SME2 micro-kernels
- Matrix Multiplication (1xN) Micro-Kernel of QAI8DXP LHS and QSI4C32P RHS with F32 input and output.
- Matrix Multiplication (MxN) Micro-Kernel of QAI8DXP LHS and QSI4C32P RHS with F32 input and output.
Add clang-cl compiler macros for kai_matmul_clamp_f32_f32_f32p8x1biasf32_6x8x4_neon_mla_asm micro-kernel
Fixes
- Fix incorrect API in BF16 kernel interface


30 Mar 05:41

EmilOhlssonARM

v1.18.0

eb1e608

v1.18.0

Fixes
- Add Null Bias support for rhs_pack_kxn_x16p32x1b_x16_x16_neon.
- Updated description of matmul file name from m_step x n_step to m_block x n_block
- Clamp after scaling in matmul_clamp_f32_qai8dxp1vlx4_qsi8cxp4vlx4_1vlx4vl_sme_mopa.
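Why the order of clamping and scaling matters can be seen in a short sketch. The functions below are hypothetical and scalar, not the SME kernel itself, and the "clamp first" variant is included only as an assumed contrast case: the release note does not describe the previous behaviour in detail.

```c
#include <assert.h>
#include <stdint.h>

static float clamp_f32(float v, float lo, float hi) {
    assert(lo <= hi);
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Scale the integer accumulator first, then clamp the final F32 value. */
static float scale_then_clamp(int32_t acc, float scale, float lo, float hi) {
    return clamp_f32((float)acc * scale, lo, hi);
}

/* Contrast case (assumption): clamping the unscaled value gives a result
 * that no longer respects the requested output range semantics. */
static float clamp_then_scale(int32_t acc, float scale, float lo, float hi) {
    return clamp_f32((float)acc, lo, hi) * scale;
}
```

With an accumulator of 100, a scale of 0.5, and a clamp range of [0, 6], scaling first yields 6.0, whereas clamping first yields 3.0.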
New SVE micro-kernels
- Matrix Multiplication (MxN) Micro-Kernels with F32 input and output.
Update the example matmul_clamp_f32_qsi8d32p_qsi4c32p to demonstrate how a micro-kernel can be used in a multithreaded environment.
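One common way to use a micro-kernel from several threads is to give each thread a disjoint range of output columns aligned to the kernel's n_step. The sketch below shows only that partitioning; it is an assumption about a reasonable strategy, not a description of what the updated example does, and `thread_columns` and `col_range` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t start; /* first output column handled by the thread */
    size_t end;   /* one past the last column handled by the thread */
} col_range;

/* Hypothetical helper: split n columns across threads in multiples of n_step. */
static col_range thread_columns(size_t n, size_t n_step, size_t num_threads, size_t thread_id) {
    assert(n_step > 0 && num_threads > 0 && thread_id < num_threads);
    const size_t steps = (n + n_step - 1) / n_step; /* round up */
    const size_t steps_per_thread = (steps + num_threads - 1) / num_threads;
    size_t start = thread_id * steps_per_thread * n_step;
    if (start > n) start = n; /* surplus threads get an empty range */
    size_t end = start + steps_per_thread * n_step;
    if (end > n) end = n; /* the last range may be a partial step */
    col_range r = {start, end};
    return r;
}
```

For n = 100, n_step = 8, and four threads, the ranges are [0, 32), [32, 64), [64, 96), and [96, 100). Each thread would then call the micro-kernel on its own range, so no two threads write the same output columns.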
Documentation
- Update documentation to use markdown syntax
- Added FAQ
- Added a section that describes the directory structure and a section that describes the different example implementations


30 Mar 05:41

EmilOhlssonARM

v1.17.0

94d6cc4

v1.17.0

Fixes
- Add Null Bias support for rhs_pack_kxn_x32p16x1b_x32_x32_neon.
- Fixed the incorrect m_step value reported by the following micro-kernels:
  - kai_lhs_quant_pack_qai8dxp_bf16_neon
  - kai_lhs_quant_pack_qai8dxp_f16_neon
  - kai_lhs_quant_pack_qai8dxp_f32
  - kai_lhs_quant_pack_qsi8d32p4x8sb_f32_neon
  - kai_lhs_quant_pack_qsi8d32p_f32
  - kai_lhs_quant_pack_qsi8d32p_f32_neon
  - kai_lhs_quant_pack_qsi8d32pscalef32_f16_neon
  - kai_lhs_quant_pack_qsi8d32pscalef32_f32_neon
