SME2 igemm path causes segfault on macOS ARM64 (Apple M4) due to corrupted a_offset in igemm_context #9898

@notaJiminLee

Environment

  • macOS 15.6, Apple M4 (ARM64)
  • ExecuTorch v1.0.1 / v1.1.0 (XNNPACK commit 3131afe)
  • KleidiAI v1.11.0 (default) and v1.23.0 (latest) — both reproduce
  • XNNPACK_ENABLE_ARM_SME=ON, XNNPACK_ENABLE_ARM_SME2=ON

Summary

When XNNPACK_ENABLE_ARM_SME2=ON on macOS ARM64 (Apple M4), running quantized convolution models (e.g., ResNeXt50 INT8) causes a segfault in kai_run_lhs_imatmul_pack_x8p2vlx4_x8p_sme, called from
compute_batch_inline_packed_igemm. FP32 models work fine. Setting XNNPACK_ENABLE_ARM_SME2=OFF resolves the issue.

This is consistent with XNNPACK's own default of SME2=OFF with the comment: "Only enable this by default once we're able to test all SME2 kernels continuously."

Steps to Reproduce

  1. Build on macOS ARM64 (Apple M4) with -DXNNPACK_ENABLE_ARM_SME2=ON
  2. Run a quantized (INT8) convolution model with grouped convolutions (e.g., ResNeXt50)
  3. Segfault occurs after "Method loaded", during the first inference

lldb Analysis

Debugged the crash with lldb. The igemm_context fields at crash time:

context->a_offset = 0xffffa0013f63a730 ← PAC-corrupted (INVALID)
context->indirect_a = 0x00000001422c0000 ← valid
context->workspace = 0x000000014099e000 ← valid
context->zero = 0x00006000013018c0 ← valid
context->ga_stride = 0x4 ← valid
context->kc = 4 ← valid
context->ks = 9 ← valid
context->mr_packed = 32 ← valid

Only a_offset is corrupted (the 0xffffa001... pattern suggests a pointer authentication failure). This value is passed as lhs_ptr_offset to the KleidiAI packing kernel, where it feeds the pointer
arithmetic in[y] += lhs_ptr_offset, leading to an invalid memory access in ld1b {za0h.b[w12, 0]}, p3/z, [x27, x22].
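
To make the arithmetic above easier to check, here is a minimal sketch in C that replays the in[y] += lhs_ptr_offset step with the values quoted in this report. It is not KleidiAI code: apply_lhs_offset, looks_like_user_address and print_adjusted_entries are hypothetical helpers, the "upper 16 bits are zero" test is only a heuristic for a plausible user-space address, and skipping the zero (padding) pointer is an assumption about how the kernel treats padding rows. Because unsigned 64-bit addition wraps modulo 2^64, adding the reported a_offset to the first indirect_a entry gives 0x000000014099b600; the report does not establish whether that matches the address that actually faulted.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Heuristic only (an assumption, not a platform guarantee): treat a value as
 * a plausible user-space address when its upper 16 bits are all zero. */
int looks_like_user_address(uint64_t value) {
  return (value >> 48) == 0;
}

/* Hypothetical stand-in for the `in[y] += lhs_ptr_offset` step.
 * Assumption: entries equal to the padding ("zero") pointer are left as-is.
 * uint64_t addition wraps modulo 2^64, so an offset with its high bits set
 * acts like a negative displacement. */
uint64_t apply_lhs_offset(uint64_t entry, uint64_t zero,
                          uint64_t lhs_ptr_offset) {
  return entry == zero ? entry : entry + lhs_ptr_offset;
}

/* Prints the adjusted address for the two indirect_a entries quoted above. */
void print_adjusted_entries(void) {
  const uint64_t a_offset = UINT64_C(0xffffa0013f63a730); /* context->a_offset */
  const uint64_t zero = UINT64_C(0x00006000013018c0);     /* context->zero */
  const uint64_t entries[2] = {UINT64_C(0x0000600001360ed0),
                               UINT64_C(0x0000600001360f50)};
  for (int i = 0; i < 2; i++) {
    const uint64_t adjusted = apply_lhs_offset(entries[i], zero, a_offset);
    printf("0x%016" PRIx64 " + a_offset = 0x%016" PRIx64 " (%s)\n",
           entries[i], adjusted,
           looks_like_user_address(adjusted) ? "plausible" : "implausible");
  }
}
```

Calling print_adjusted_entries() from a small main is enough to reproduce the sums. Comparing them against the effective address of the faulting ld1b (x27 + x22 in lldb) would help confirm or rule out a_offset as the source of the bad access.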

The indirect_a pointer array itself contains valid heap addresses:
0x1422f6a20: 0x0000600001360ed0 0x0000600001360f50 (valid)

All 4 threads crash simultaneously at the same address, suggesting the corruption is at the shared context level.

Stack Trace

#0 ld1b {za0h.b[w12, 0]}, p3/z, [x27, x22] (SME asm kernel)
#1 kai_run_lhs_imatmul_pack_x8p2vlx4_x8p_sme +292 (KleidiAI packing)
#2 compute_batch_inline_packed_igemm +224 (operator-run.c:888)
#3 xnn_compute_grouped_batch_inline_packed_igemm (operator-run.c:926)
#4 thread_parallelize_3d_tile_1d_dynamic_with_thread (portable-api.c:1717)

Code Path

The SME2 igemm path is only activated when XNN_ENABLE_ARM_SME2 is true, in gemm-config.c:399:

if (XNN_ENABLE_ARM_SME2 && (hardware_config->arch_flags & xnn_arch_arm_sme2)) {
    pqs8_qc8w_gemm_config.minmax.igemm[XNN_MR_TO_INDEX(mr)] =
        xnn_init_hmp_packed_igemm_ukernel(
            xnn_pqs8_qc8w_igemm_minmax_fp32_ukernel_32x32c4__neonsme2);
}

This explains why FP32 models are unaffected: they don't use this quantized igemm path.

Workaround

Build with -DXNNPACK_ENABLE_ARM_SME2=OFF. SME (v1) kernels remain active and functional (benchmarked at ~11 ms for quantized ResNeXt50 on M4).
