Environment
- macOS 15.6, Apple M4 (ARM64)
- ExecuTorch v1.0.1 / v1.1.0 (XNNPACK commit 3131afe)
- KleidiAI v1.11.0 (default) and v1.23.0 (latest) — both reproduce
- Build flags: XNNPACK_ENABLE_ARM_SME=ON, XNNPACK_ENABLE_ARM_SME2=ON
Summary
When XNNPACK is built with XNNPACK_ENABLE_ARM_SME2=ON on macOS ARM64 (Apple M4), running quantized convolution models (e.g., ResNeXt50 INT8) segfaults in kai_run_lhs_imatmul_pack_x8p2vlx4_x8p_sme, called from compute_batch_inline_packed_igemm. FP32 models work fine. Building with XNNPACK_ENABLE_ARM_SME2=OFF resolves the issue.
This is consistent with XNNPACK's own default of XNNPACK_ENABLE_ARM_SME2=OFF, which carries the comment: "Only enable this by default once we're able to test all SME2 kernels continuously."
Steps to Reproduce
- Build on macOS ARM64 (Apple M4) with -DXNNPACK_ENABLE_ARM_SME2=ON
- Run a quantized (INT8) convolution model with grouped convolutions (e.g., ResNeXt50)
- Segfault occurs after "Method loaded", during the first inference
lldb Analysis
Debugged the crash with lldb. The igemm_context fields at crash time:
context->a_offset = 0xffffa0013f63a730 ← PAC-corrupted (INVALID)
context->indirect_a = 0x00000001422c0000 ← valid
context->workspace = 0x000000014099e000 ← valid
context->zero = 0x00006000013018c0 ← valid
context->ga_stride = 0x4 ← valid
context->kc = 4 ← valid
context->ks = 9 ← valid
context->mr_packed = 32 ← valid
Only a_offset looks invalid: the 0xffffa001... pattern suggests a pointer authentication (PAC) failure. This value is passed as lhs_ptr_offset to the KleidiAI packing kernel, where it is added to each indirection pointer (in[y] += lhs_ptr_offset). The resulting address then faults in ld1b {za0h.b[w12, 0]}, p3/z, [x27, x22].
The indirect_a pointer array itself contains valid heap addresses:
0x1422f6a20: 0x0000600001360ed0 0x0000600001360f50 (valid)
All 4 threads crash simultaneously at the same address, suggesting the corruption is at the shared context level.
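To make the in[y] += lhs_ptr_offset step concrete, the following minimal sketch redoes that addition with unsigned 64-bit wraparound, which is how pointer-plus-offset arithmetic behaves on ARM64. The helper names (apply_offset, as_signed_offset) are hypothetical and are not part of XNNPACK or KleidiAI. The sketch only shows which address the load would be directed to; it does not establish whether that address is mapped or how a_offset obtained its value.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: add a byte offset to a pointer value with 64-bit
   wraparound, mirroring what `in[y] += lhs_ptr_offset` does to one
   indirection entry. */
static uint64_t apply_offset(uint64_t ptr, uint64_t offset) {
  return ptr + offset; /* unsigned arithmetic wraps modulo 2^64 */
}

/* Hypothetical helper: reinterpret the same 64 bits as a signed
   two's-complement value, to see the offset as a signed distance. */
static int64_t as_signed_offset(uint64_t offset) {
  int64_t value;
  memcpy(&value, &offset, sizeof value); /* bit-for-bit copy, no conversion */
  return value;
}
```

With the values from the dump above, apply_offset(0x0000600001360ed0, 0xffffa0013f63a730) evaluates to 0x000000014099b600, and as_signed_offset(0xffffa0013f63a730) evaluates to -0x5ffec09c58d0. Both results can be compared against the workspace and indirect_a addresses listed earlier.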
Stack Trace
#0 ld1b {za0h.b[w12, 0]}, p3/z, [x27, x22] (SME asm kernel)
#1 kai_run_lhs_imatmul_pack_x8p2vlx4_x8p_sme +292 (KleidiAI packing)
#2 compute_batch_inline_packed_igemm +224 (operator-run.c:888)
#3 xnn_compute_grouped_batch_inline_packed_igemm (operator-run.c:926)
#4 thread_parallelize_3d_tile_1d_dynamic_with_thread (portable-api.c:1717)
Code Path
The SME2 igemm path is only activated when XNN_ENABLE_ARM_SME2 is true, in gemm-config.c:399:
  if (XNN_ENABLE_ARM_SME2 && (hardware_config->arch_flags & xnn_arch_arm_sme2)) {
    pqs8_qc8w_gemm_config.minmax.igemm[XNN_MR_TO_INDEX(mr)] =
        xnn_init_hmp_packed_igemm_ukernel(
            xnn_pqs8_qc8w_igemm_minmax_fp32_ukernel_32x32c4__neonsme2);
  }
This explains why FP32 models are unaffected — they don't use this quantized igemm path.
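The condition above combines a build-time switch with a runtime CPU feature check. A minimal sketch of that pattern follows, using hypothetical names throughout: the sketch_ identifiers are illustrative stand-ins, not XNNPACK symbols, and the build flag is modeled as a function parameter so both settings can be exercised. It shows why turning the build flag off bypasses the SME2 kernel even on hardware that reports SME2.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: stand-ins for runtime architecture flags. */
enum sketch_arch_flags {
  SKETCH_ARCH_ARM_SME = 1u << 0,
  SKETCH_ARCH_ARM_SME2 = 1u << 1
};

/* Assumption: stand-ins for the kernel that ends up registered. */
enum sketch_kernel {
  SKETCH_KERNEL_DEFAULT,
  SKETCH_KERNEL_SME2
};

/* Select the quantized igemm kernel. The SME2 kernel is chosen only when it
   was compiled in AND the CPU reports SME2 at runtime. */
static enum sketch_kernel sketch_select_qs8_igemm(bool sme2_compiled_in,
                                                  uint32_t arch_flags) {
  if (sme2_compiled_in && (arch_flags & SKETCH_ARCH_ARM_SME2)) {
    return SKETCH_KERNEL_SME2;
  }
  return SKETCH_KERNEL_DEFAULT;
}
```

In this sketch, passing sme2_compiled_in = false corresponds to building with -DXNNPACK_ENABLE_ARM_SME2=OFF: the function returns the default kernel regardless of what the hardware reports.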
Workaround
Build with -DXNNPACK_ENABLE_ARM_SME2=OFF. SME (v1) kernels remain active and functional (benchmarked at ~11 ms for quantized ResNeXt50 on M4).