Adding support for SME1 GEMM FP32 kernel #7831

vgundlur · 2025-02-18T09:04:14Z

Adds support for SME1 for GEMM FP32 Kernel

google-cla · 2025-02-18T09:04:18Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

vgundlur · 2025-02-24T07:25:28Z

Could someone please help with the next steps for this PR?

dsharlet · 2025-03-27T20:56:21Z

We have this SME2 kernel already: https://github.com/google/XNNPACK/blob/master/src/pf32-gemm/pf32-gemm-32x32-minmax-neonsme2.c

If the only difference is multi-vector load/store instructions, we'd rather avoid having two almost identical kernels coming from two very different sources with different support arrangements.

Can you please look into figuring out a way to reconcile these two codepaths? Maybe send your kernel as a PR to KleidiAI, and then we can use it the way we pull in the above kernel?

fbarchard · 2025-04-02T02:30:21Z

This is just a wrapper?
xnn_pf32_gemm_minmax_ukernel_32x32__neonsme that calls xnn_pf32_gemm_minmax__asm_aarch64_neonsme?

// Wraps the xnn_pf32_gemm_minmax__asm_aarch64_neonsme
// GEMM microkernel with a name that is compatible with our tooling.
void xnn_pf32_gemm_minmax_ukernel_32x32__neonsme(
    size_t m, size_t n, size_t k, const void* lhs_packed,
    const void* rhs_packed, float* dst, size_t dst_stride_row,
    size_t dst_stride_col,
    union xnn_f32_minmax_params
        minmax_params[XNN_RESTRICT XNN_MIN_ELEMENTS(1)]) {
  xnn_pf32_gemm_minmax__asm_aarch64_neonsme(
      lhs_packed, rhs_packed, dst, (k / sizeof(float)),
      &minmax_params->scalar.max, &minmax_params->scalar.min, m, n, NULL, 0,
      dst_stride_row);
}

I suspect xnn_pf32_gemm_minmax__asm_aarch64_neonsme requires KleidiAI so this will fail to build with kleidi disabled.

vgundlur · 2025-04-04T06:09:07Z

Hi @fbarchard ,

Yes, xnn_pf32_gemm_minmax_ukernel_32x32__neonsme is a wrapper that calls xnn_pf32_gemm_minmax__asm_aarch64_neonsme. Also, xnn_pf32_gemm_minmax__asm_aarch64_neonsme does not require KleidiAI, as it is available in source form in src/pf32-gemm/gen/pf32-gemm-32x32-minmax-asm-aarch64-neonsme.S.

However, we did see a few build failures when KleidiAI was disabled. Fixes for these failures have been added.

dsharlet · 2025-06-03T07:55:58Z

@vgundlur can you please address the feedback I raised in this comment?

We have this SME2 kernel already: https://github.com/google/XNNPACK/blob/master/src/pf32-gemm/pf32-gemm-32x32-minmax-neonsme2.c

If the only difference is multi-vector load/store instructions, we'd rather avoid having two almost identical kernels coming from two very different sources with different support arrangements.

Can you please look into figuring out a way to reconcile these two codepaths? Maybe send your kernel as a PR to KleidiAI, and then we can use it the way we pull in the above kernel?

AFAICT these kernels are near identical, what makes them different? Why should we have both of these kernels? And if we really need both, can we do it in a way that doesn't copy paste a large block of assembly source code? (It might look like XNNPACK has a lot of copy/pasted kernels, but these are generated from the same source code wherever possible.)

vgundlur · 2025-06-04T07:47:33Z

@dsharlet Thank you for commenting on the PR, and sorry for the delay in responding to your query.
We agree that the two kernels are identical except for some fixes for the unsupported instructions.

We need both versions of these kernels to cover platforms that support SME version 1 as well as those that support version 2. KleidiAI supports SME2, and we are pushing for SME1 support.
We did try pushing directly to KleidiAI, but there are contribution restrictions on its kernels, so we were unable to push to that repo. Please see the details here: https://gitlab.arm.com/kleidi/kleidiai/-/blob/main/CONTRIBUTING.md

We will try including the original KleidiAI implementation in our file and replacing the unsupported instructions with a macro that expands to supported instructions. We hope this will be acceptable; if not, please suggest an alternative approach.

vgundlur · 2025-06-25T09:11:20Z

@dsharlet, Arm has pushed an SME1 GEMM kernel to KleidiAI, and we have pulled that GEMM kernel in and updated our PR. Please review the change and share your comments.

gonnet · 2025-06-25T10:49:09Z

cmake/DownloadKleidiAI.cmake

@@ -18,8 +18,8 @@ ENDIF()
 # LINT.IfChange
 INCLUDE(ExternalProject)
 ExternalProject_Add(kleidiai
-  URL https://github.com/ARM-software/kleidiai/archive/247088200c679f30b1b4a680bd12fee18457a100.zip
-  URL_HASH SHA256=ad04cc186b12810ecde9d75911c76a0113d3c055773c700377de302eef6c4419
+  URL https://github.com/ARM-software/kleidiai/archive/c80d18838af1ecf68f011625db103de497b5a840.zip


Also update the hashes in the WORKSPACE and MODULE files for the bazel builds?

gonnet · 2025-06-25T10:51:25Z

src/configs/gemm-config.c

@@ -323,6 +323,22 @@ static void init_pf32_gemm_config(void) {
      pf32_gemm_config.nr = nr;
    #endif  // XNN_ENABLE_ARM_SME2
  }
+  if(XNN_ENABLE_ARM_SME && (hardware_config->arch_flags & xnn_arch_arm_sme) && !(hardware_config->arch_flags & xnn_arch_arm_sme2)){


Why are you explicitly checking for !(hardware_config->arch_flags & xnn_arch_arm_sme2)?

Would it not make more sense to make this an else if statement instead?
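The else-if structure suggested above can be sketched as follows. This is a minimal illustration, not XNNPACK's actual code: the flag values, the `kernel_choice` enum, and `pick_pf32_gemm_kernel` are hypothetical names introduced only for this example. Because the SME2 branch is tested first, the SME branch never needs to re-check that SME2 is absent.

```c
#include <assert.h>
#include <stdint.h>

// Hypothetical flag values for illustration; not XNNPACK's real constants.
enum { HYP_ARCH_SME = 1, HYP_ARCH_SME2 = 2 };

// Hypothetical result type naming which kernel family would be selected.
enum kernel_choice { KERNEL_NONE, KERNEL_SME, KERNEL_SME2 };

// Prefer SME2 when present; fall back to SME only if SME2 was not chosen.
static enum kernel_choice pick_pf32_gemm_kernel(uint32_t arch_flags) {
  if (arch_flags & HYP_ARCH_SME2) {
    return KERNEL_SME2;
  } else if (arch_flags & HYP_ARCH_SME) {
    // Reached only when the SME2 bit is clear, so no explicit !SME2 test.
    return KERNEL_SME;
  }
  return KERNEL_NONE;
}
```

With this shape, a device reporting both bits gets the SME2 kernel and a device reporting only SME gets the SME kernel, which is the same outcome as the explicit negated check.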

gonnet · 2025-06-25T10:53:20Z

src/configs/pack-lh-config.c

  const struct xnn_hardware_config* hardware_config = xnn_init_hardware_config();
  assert(hardware_config != NULL);
-  if ((hardware_config->arch_flags & xnn_arch_arm_sme2)) {
+  if ((hardware_config->arch_flags & xnn_arch_arm_sme2) || (hardware_config->arch_flags & xnn_arch_arm_sme)) {
    x32_pack_lh_config.pack_lh_fn = (xnn_pack_lh_ukernel_fn) xnn_x32_pack_lh_ukernel__neonsme2;


If these packing kernels only require sme (and not sme2), then they should be renamed so that they are automatically added to the list of neonsme_microkernels.

Also, do we need to test for sme or sme2? Doesn't the availability of sme2 imply the availability of sme?

The microkernel xnn_x32_pack_lh_ukernel__neonsme2 internally invokes kai_get_lhs_packed_size_lhs_pack_f32p2vlx1_f32_sme, which is based on SME1 and not SME2. So, as suggested, the kernel has been renamed from sme2 to sme.
Regarding the second point, the availability of sme2 guarantees the availability of sme, but the reverse is not true. Hence, the sme2 check has been replaced with an sme check.
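The implication discussed here (SME2 implies SME, but not the reverse) can be illustrated with a small sketch. It assumes the SME bit is also set whenever SME2 is reported; whether XNNPACK's hardware detection does this is not confirmed by the thread, so the sketch makes it explicit. All names and flag values are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical flag values for illustration; not XNNPACK's real constants.
enum { HYP_FLAG_SME = 1, HYP_FLAG_SME2 = 2 };

// Assumption: any device with SME2 also supports SME, so set the SME bit
// whenever the SME2 bit is present.
static uint32_t normalize_arch_flags(uint32_t flags) {
  if (flags & HYP_FLAG_SME2) {
    flags |= HYP_FLAG_SME;
  }
  return flags;
}

// A kernel that only needs SME can then be gated on the SME bit alone.
static bool can_use_sme_pack_lh(uint32_t flags) {
  return (normalize_arch_flags(flags) & HYP_FLAG_SME) != 0;
}
```

Under that assumption, a single SME test covers both SME-only and SME2 devices, so the `sme2 || sme` condition is unnecessary.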

gonnet · 2025-06-25T10:57:03Z

src/operators/pack-lh.c

@@ -112,7 +112,7 @@ enum xnn_status reshape_pack_lh(xnn_operator_t pack_lh_op, size_t num_groups,
    return xnn_status_success;
  }

-  const uint32_t mr_packed = batch_size == 1          ? 1
+  const uint32_t mr_packed = batch_size == 1          ? (gemm_config->arch == xnn_arch_arm_sme ? gemm_config->mr : 1)


I'm guessing this is because there is no neonsme kernel for mr=1?

Yes, you are right
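A minimal sketch of the single-row case confirmed above, under stated assumptions: `mr_packed_for_single_row` and its `has_mr1_kernel` parameter are hypothetical, and only the batch_size == 1 branch is modelled because the rest of the expression is not shown in the diff. When an mr=1 GEMM kernel exists, one row is packed as-is; otherwise the packing must match the tile height of the only available kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helper: choose the packed row count for a single-row input.
// has_mr1_kernel says whether a dedicated mr=1 GEMM kernel is available;
// mr is the tile height of the general GEMM kernel (for example, 32).
static uint32_t mr_packed_for_single_row(bool has_mr1_kernel, uint32_t mr) {
  assert(mr >= 1);
  // Without an mr=1 kernel, pack to the full tile height so the packed
  // layout matches what the only available kernel expects.
  return has_mr1_kernel ? 1 : mr;
}
```

Expressing the decision as "is an mr=1 kernel available" rather than as an architecture comparison keeps the condition tied to the actual reason for the special case.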

gonnet · 2025-06-25T10:57:50Z

src/pf32-gemm/pf32-gemm-32x32-minmax-neonsme.c

Please run clang-format on this (and all other) file.

gonnet · 2025-06-25T10:59:20Z

src/x32-pack-lh/x32-packlh-neonsme2.c

-  } else {
-    kai_run_lhs_pack_f32p2vlx1_f32_sme(m, k, mr_packed, kr, sr, m_idx_start,
-                                       lhs, lhs_stride, lhs_packed);
+  const struct xnn_hardware_config* hardware_config = xnn_init_hardware_config();


This is kind of ugly. Would it be easier to add a 1x32 microkernel as well and not have to do this?

This would require another pf32 GEMM microkernel for mr = 1 (just like xnn_pf32_gemm_minmax_ukernel_1x32__neonsme2). But Arm has not added an SME1 variant of the kernel kai_run_matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla (which is invoked in xnn_pf32_gemm_minmax_ukernel_1x32__neonsme2) to KleidiAI.

gonnet · 2025-06-25T11:00:13Z

test/vbinary-microkernel-tester.cc

@@ -63,7 +63,7 @@ void VBinaryMicrokernelTester::Test(xnn_f16_vbinary_ukernel_fn vbinary,

    // Verify results.
    for (size_t i = 0; i < batch_size(); i++) {
-      if (std::isnan(y_ref[i])) {
+      if (/*std::isnan(y_ref[i])*/1) {


Why is this change necessary?

This is not required, so it has been removed.

gonnet · 2025-06-25T11:00:22Z

test/vbinary-microkernel-tester.cc

@@ -110,7 +110,7 @@ void VBinaryMicrokernelTester::Test(xnn_f32_vbinary_ukernel_fn vbinary,

    // Verify results.
    for (size_t i = 0; i < batch_size(); i++) {
-      if (std::isnan(y_ref[i])) {
+      if (/*std::isnan(y_ref[i])*/1) {


Why is this change necessary?

This is not required, so it has been removed.

vgundlur force-pushed the sme1_gemm_support branch 2 times, most recently from 28667b8 to 1ddfdb3 Compare April 4, 2025 05:25

vgundlur force-pushed the sme1_gemm_support branch from 910d272 to 28b5308 Compare April 4, 2025 06:22

vgundlur closed this Jun 25, 2025

vgundlur force-pushed the sme1_gemm_support branch from 28b5308 to 90de09c Compare June 25, 2025 08:57

vgundlur reopened this Jun 25, 2025

gonnet self-requested a review June 25, 2025 10:46

gonnet requested changes Jun 25, 2025

vgundlur force-pushed the sme1_gemm_support branch from 44e77a4 to 8f2a4c4 Compare June 26, 2025 03:25

Added PF32 SME1 GEMM support

cc254df

vgundlur force-pushed the sme1_gemm_support branch from 8f2a4c4 to cc254df Compare June 26, 2025 09:00

Merge branch 'master' into sme1_gemm_support

bc44b48

vgundlur requested a review from gonnet June 27, 2025 01:11
