Skip to content

Adding support for SME1 GEMM FP32 kernel #7831

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

vgundlur
Copy link

Adds support for SME1 for GEMM FP32 Kernel

Copy link

google-cla bot commented Feb 18, 2025

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@vgundlur
Copy link
Author

could someone please help on the next steps for this PR?

@dsharlet
Copy link
Collaborator

We have this SME2 kernel already: https://github.com/google/XNNPACK/blob/master/src/pf32-gemm/pf32-gemm-32x32-minmax-neonsme2.c

If the only difference is multi-vector load/store instructions, we'd rather avoid having two almost identical kernels coming from two very different sources with different support arrangements.

Can you please look into figuring out a way to reconcile these two codepaths? Maybe send your kernel as a PR to KleidiAI, and then we can use it the way we pull in the above kernel?

@fbarchard
Copy link
Collaborator

This is just a wrapper?
xnn_pf32_gemm_minmax_ukernel_32x32__neonsme that calls xnn_pf32_gemm_minmax__asm_aarch64_neonsme?

// Wraps the xnn_pf32_gemm_minmax__asm_aarch64_neonsme
// GEMM microkernel with a name that is compatible with our tooling.
void xnn_pf32_gemm_minmax_ukernel_32x32__neonsme(
size_t m, size_t n, size_t k, const void* lhs_packed,
const void* rhs_packed, float* dst, size_t dst_stride_row,
size_t dst_stride_col,
union xnn_f32_minmax_params
minmax_params[XNN_RESTRICT XNN_MIN_ELEMENTS(1)]) {

xnn_pf32_gemm_minmax__asm_aarch64_neonsme(lhs_packed, rhs_packed, dst, (k/sizeof(float)), &minmax_params->scalar.max,
                &minmax_params->scalar.min, m, n, NULL, 0, dst_stride_row);    

}

I suspect xnn_pf32_gemm_minmax__asm_aarch64_neonsme requires KleidiAI so this will fail to build with kleidi disabled.

@vgundlur vgundlur force-pushed the sme1_gemm_support branch 2 times, most recently from 28667b8 to 1ddfdb3 Compare April 4, 2025 05:25
@vgundlur
Copy link
Author

vgundlur commented Apr 4, 2025

Hi @fbarchard ,

Yes, xnn_pf32_gemm_minmax_ukernel_32x32__neonsme is a wrapper that calls xnn_pf32_gemm_minmax__asm_aarch64_neonsme. Also, xnn_pf32_gemm_minmax__asm_aarch64_neonsme does not require kleidiAI as it is available in source form within src/pf32-gemm/gen/pf32-gemm-32x32-minmax-asm-aarch64-neonsme.S.

However, we saw few build failures when KleidiAI was disabled. Fixes for these failures are added.

@vgundlur vgundlur force-pushed the sme1_gemm_support branch from 910d272 to 28b5308 Compare April 4, 2025 06:22
@dsharlet
Copy link
Collaborator

dsharlet commented Jun 3, 2025

@vgundlur can you please address the feedback I raised in this comment?

We have this SME2 kernel already: https://github.com/google/XNNPACK/blob/master/src/pf32-gemm/pf32-gemm-32x32-minmax-neonsme2.c

If the only difference is multi-vector load/store instructions, we'd rather avoid having two almost identical kernels coming from two very different sources with different support arrangements.

Can you please look into figuring out a way to reconcile these two codepaths? Maybe send your kernel as a PR to KleidiAI, and then we can use it the way we pull in the above kernel?

AFAICT these kernels are near identical, what makes them different? Why should we have both of these kernels? And if we really need both, can we do it in a way that doesn't copy paste a large block of assembly source code? (It might look like XNNPACK has a lot of copy/pasted kernels, but these are generated from the same source code wherever possible.)

@vgundlur
Copy link
Author

vgundlur commented Jun 4, 2025

@dsharlet Thank you for commenting on the PR and sorry for the delay in responding your Query.
Agree on the comment that both the kernels are identical except for some fixes for the unsupported instructions.

We need both versions of these kernels for platforms supporting SME version 1 and Version 2. Kleidiai supports SME2 and we are pushing for SME1 support.
Actually, we have tried pushing directly to KleidiAI but there are contribution restrictions to the Kernels in KleidiAI and hence we were unable to push to that repo, Please see details here: https://gitlab.arm.com/kleidi/kleidiai/-/blob/main/CONTRIBUTING.md

We will try including the Original Kaleidiai implementation into our file and replace the unsupported instructions with a macro that modifies to supported instructions, Hope this will be good, if not Please suggest any other alternate approaches.

@vgundlur vgundlur closed this Jun 25, 2025
@vgundlur vgundlur force-pushed the sme1_gemm_support branch from 28b5308 to 90de09c Compare June 25, 2025 08:57
@vgundlur vgundlur reopened this Jun 25, 2025
@vgundlur
Copy link
Author

@dsharlet, arm has pushed sme1 GEMM kernel to Kleidiai and we have pulled the GEMM kernel from it and updated our PR. Please help review the change and share your comments.

@gonnet gonnet self-requested a review June 25, 2025 10:46
@@ -18,8 +18,8 @@ ENDIF()
# LINT.IfChange
INCLUDE(ExternalProject)
ExternalProject_Add(kleidiai
URL https://github.com/ARM-software/kleidiai/archive/247088200c679f30b1b4a680bd12fee18457a100.zip
URL_HASH SHA256=ad04cc186b12810ecde9d75911c76a0113d3c055773c700377de302eef6c4419
URL https://github.com/ARM-software/kleidiai/archive/c80d18838af1ecf68f011625db103de497b5a840.zip
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also update the hashes in the WORKSPACE and MODULE files for the bazel builds?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@@ -323,6 +323,22 @@ static void init_pf32_gemm_config(void) {
pf32_gemm_config.nr = nr;
#endif // XNN_ENABLE_ARM_SME2
}
if(XNN_ENABLE_ARM_SME && (hardware_config->arch_flags & xnn_arch_arm_sme) && !(hardware_config->arch_flags & xnn_arch_arm_sme2)){
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you explicitly checking for !(hardware_config->arch_flags & xnn_arch_arm_sme2)?

Would it not make more sense to make this an else if statement instead?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

const struct xnn_hardware_config* hardware_config = xnn_init_hardware_config();
assert(hardware_config != NULL);
if ((hardware_config->arch_flags & xnn_arch_arm_sme2)) {
if ((hardware_config->arch_flags & xnn_arch_arm_sme2) || (hardware_config->arch_flags & xnn_arch_arm_sme)) {
x32_pack_lh_config.pack_lh_fn = (xnn_pack_lh_ukernel_fn) xnn_x32_pack_lh_ukernel__neonsme2;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these packing kernels only require sme (and not sme2), then they should be renamed so that they are automatically added to the list of neonsme_microkernels.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, do we need to test for sme or sme2? Doesn't the availability of sme2 imply the availability of sme?

Copy link
Author

@vgundlur vgundlur Jun 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The microkernel xnn_x32_pack_lh_ukernel__neonsme2 internally invokes kai_get_lhs_packed_size_lhs_pack_f32p2vlx1_f32_sme wich is based on SME1 and not SME2. So, as mentioned, the change has been brought in by renaming sme2 to sme.
Regarding the second point, availability of sme2 guarantees the availability of sme. But the reverse is not true. Hence, another sme2 check has been replaced with sme.

@@ -112,7 +112,7 @@ enum xnn_status reshape_pack_lh(xnn_operator_t pack_lh_op, size_t num_groups,
return xnn_status_success;
}

const uint32_t mr_packed = batch_size == 1 ? 1
const uint32_t mr_packed = batch_size == 1 ? (gemm_config->arch == xnn_arch_arm_sme ? gemm_config->mr : 1)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm guessing this is because there is no neonsme kernel for mr=1?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, you are right

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please run clang-format on this (and all other) file.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

} else {
kai_run_lhs_pack_f32p2vlx1_f32_sme(m, k, mr_packed, kr, sr, m_idx_start,
lhs, lhs_stride, lhs_packed);
const struct xnn_hardware_config* hardware_config = xnn_init_hardware_config();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is kind of ugly. Would it be easier to add a 1x32 microkernel as well and not have to do this?

Copy link
Author

@vgundlur vgundlur Jun 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would require another pf32 gemm microkernel for mr = 1 (just like xnn_pf32_gemm_minmax_ukernel_1x32__neonsme2). But arm has not added support for an SME1 variant for the kernel kai_run_matmul_clamp_f32_f32_f32p2vlx1b_1x16vl_sme2_mla (which is invoked in xnn_pf32_gemm_minmax_ukernel_1x32__neonsme2) in kleidiai.

@@ -63,7 +63,7 @@ void VBinaryMicrokernelTester::Test(xnn_f16_vbinary_ukernel_fn vbinary,

// Verify results.
for (size_t i = 0; i < batch_size(); i++) {
if (std::isnan(y_ref[i])) {
if (/*std::isnan(y_ref[i])*/1) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change necessary?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not required, hence removed

@@ -110,7 +110,7 @@ void VBinaryMicrokernelTester::Test(xnn_f32_vbinary_ukernel_fn vbinary,

// Verify results.
for (size_t i = 0; i < batch_size(); i++) {
if (std::isnan(y_ref[i])) {
if (/*std::isnan(y_ref[i])*/1) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this change necessary?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not required, hence removed

@vgundlur vgundlur force-pushed the sme1_gemm_support branch from 44e77a4 to 8f2a4c4 Compare June 26, 2025 03:25
@vgundlur vgundlur force-pushed the sme1_gemm_support branch from 8f2a4c4 to cc254df Compare June 26, 2025 09:00
@vgundlur vgundlur requested a review from gonnet June 27, 2025 01:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants