SME2 igemm path causes segfault on macOS ARM64 (Apple M4) due to corrupted a_offset in igemm_context

## Environment
- macOS 15.6, Apple M4 (ARM64)
- ExecuTorch v1.0.1 / v1.1.0 (XNNPACK commit 3131afead790c5c69a9aa12273dfc40399789ad7)
- KleidiAI v1.11.0 (default) and v1.23.0 (latest) — both reproduce
- `XNNPACK_ENABLE_ARM_SME=ON`, `XNNPACK_ENABLE_ARM_SME2=ON`

## Summary

When XNNPACK is built with `XNNPACK_ENABLE_ARM_SME2=ON` on macOS ARM64 (Apple M4), running quantized convolution models (e.g., ResNeXt50 INT8) segfaults in `kai_run_lhs_imatmul_pack_x8p2vlx4_x8p_sme`, called from `compute_batch_inline_packed_igemm`. FP32 models work fine. Setting `XNNPACK_ENABLE_ARM_SME2=OFF` resolves the issue.

This is consistent with XNNPACK's own default of `SME2=OFF` with the comment: *"Only enable this by default once we're able to test all SME2 kernels continuously."*

## Steps to Reproduce

1. Build on macOS ARM64 (Apple M4) with `-DXNNPACK_ENABLE_ARM_SME2=ON`
2. Run a quantized (INT8) convolution model with grouped convolutions (e.g., ResNeXt50)
3. Segfault occurs after "Method loaded", during the first inference

## lldb Analysis

Debugged the crash with lldb. The `igemm_context` fields at crash time:

    context->a_offset    = 0xffffa0013f63a730  ← PAC-corrupted (INVALID)
    context->indirect_a  = 0x00000001422c0000  ← valid
    context->workspace   = 0x000000014099e000  ← valid
    context->zero        = 0x00006000013018c0  ← valid
    context->ga_stride   = 0x4                 ← valid
    context->kc          = 4                   ← valid
    context->ks          = 9                   ← valid
    context->mr_packed   = 32                  ← valid

Only `a_offset` looks corrupted: the `0xffffa001...` pattern is consistent with a pointer authentication (PAC) failure. This value is passed as `lhs_ptr_offset` to the KleidiAI packing kernel, where it is added to each row pointer (`in[y] += lhs_ptr_offset`). The resulting address is invalid, and the load `ld1b {za0h.b[w12, 0]}, p3/z, [x27, x22]` faults.
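
One way to screen a dump like the one above is to test whether any bits above the virtual-address range are set. The sketch below is a minimal C illustration, assuming 48-bit user-space addresses; `DEMO_VA_BITS`, `demo_has_high_bits`, and `demo_require_plain` are hypothetical names and are not part of XNNPACK or KleidiAI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption of this sketch: user-space addresses fit in the low 48 bits,
 * so a plain pointer value has bits 63..48 clear. */
#define DEMO_VA_BITS 48

/* Hypothetical helper: true if any bit at or above DEMO_VA_BITS is set. */
static bool demo_has_high_bits(uint64_t value) {
  return (value >> DEMO_VA_BITS) != 0;
}

/* Hypothetical debug guard: fails an assertion (unless NDEBUG is defined)
 * when the value has bits set above the assumed address range. */
static void demo_require_plain(uint64_t value) {
  assert(!demo_has_high_bits(value));
}
```

Applied to the values in this report, the check flags `a_offset` (`0xffffa0013f63a730`) and passes `indirect_a`, `workspace`, and `zero`. An offset, unlike a pointer, can have its high bits set legitimately when it encodes a negative difference in two's complement, so a hit here marks a value for closer inspection rather than proving corruption.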

The `indirect_a` pointer array itself contains valid heap addresses:

    0x1422f6a20: 0x0000600001360ed0 0x0000600001360f50  (valid)

All 4 threads crash simultaneously at the same address, suggesting the corruption is at the shared context level.
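
To make the pointer arithmetic in this section concrete, the sketch below models `in[y] += lhs_ptr_offset` as unsigned 64-bit addition and applies it to one `indirect_a` entry and the `a_offset` value quoted in this report. The helper names are hypothetical; this is a model of the arithmetic only, not the KleidiAI implementation.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper, not an XNNPACK or KleidiAI function. Models
 * `in[y] += lhs_ptr_offset` on a 64-bit target as unsigned addition,
 * which wraps modulo 2^64. */
static uint64_t demo_apply_offset(uint64_t entry, uint64_t lhs_ptr_offset) {
  return entry + lhs_ptr_offset;
}

/* Values copied from the lldb output in this report. */
static const uint64_t kReportedEntry = UINT64_C(0x0000600001360ed0);
static const uint64_t kReportedAOffset = UINT64_C(0xffffa0013f63a730);

/* Prints and returns the address this model yields for the reported entry. */
static uint64_t demo_print_reported_sum(void) {
  const uint64_t sum = demo_apply_offset(kReportedEntry, kReportedAOffset);
  /* The offset has its top bits set, so the addition wraps around. */
  assert(sum < kReportedEntry);
  printf("effective address: 0x%016" PRIx64 "\n", sum);
  return sum;
}
```

With the reported values the sum is `0x000000014099b600`, which is close to the reported `workspace` address (`0x000000014099e000`). This arithmetic alone cannot show whether that address was mapped at crash time, or whether `a_offset` held the value XNNPACK intended.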

## Stack Trace

    #0  ld1b {za0h.b[w12, 0]}, p3/z, [x27, x22]          (SME asm kernel)
    #1  kai_run_lhs_imatmul_pack_x8p2vlx4_x8p_sme +292   (KleidiAI packing)
    #2  compute_batch_inline_packed_igemm +224           (operator-run.c:888)
    #3  xnn_compute_grouped_batch_inline_packed_igemm    (operator-run.c:926)
    #4  thread_parallelize_3d_tile_1d_dynamic_with_thread (portable-api.c:1717)

## Code Path

The SME2 igemm path is only activated when `XNN_ENABLE_ARM_SME2` is true, in `gemm-config.c:399`:

```c
if (XNN_ENABLE_ARM_SME2 && (hardware_config->arch_flags & xnn_arch_arm_sme2)) {
    pqs8_qc8w_gemm_config.minmax.igemm[XNN_MR_TO_INDEX(mr)] =
        xnn_init_hmp_packed_igemm_ukernel(
            xnn_pqs8_qc8w_igemm_minmax_fp32_ukernel_32x32c4__neonsme2);
}
```

This explains why FP32 models are unaffected: they don't use this quantized igemm path.

## Workaround

Build with `-DXNNPACK_ENABLE_ARM_SME2=OFF`. SME (v1) kernels remain active and functional (benchmarked at ~11 ms for ResNeXt50 quantized on M4).
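
The gate in the Code Path snippet combines a compile-time switch with a runtime hardware flag, which is why turning the build option off is enough to avoid the crash even on hardware that reports SME2. The sketch below illustrates that shape in isolation; the flag values and function names are hypothetical stand-ins, not XNNPACK definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bits for this sketch; the real values live in XNNPACK. */
enum {
  demo_arch_arm_sme = 1,
  demo_arch_arm_sme2 = 2,
};

/* Mirrors the shape of the quoted condition: the SME2 igemm kernel is
 * chosen only if the build enabled it AND the hardware reports SME2. */
static bool demo_selects_sme2_igemm(bool build_enables_sme2,
                                    uint32_t arch_flags) {
  return build_enables_sme2 && (arch_flags & demo_arch_arm_sme2) != 0;
}

/* Debug guard for the workaround: with the build flag off, the SME2 path
 * must never be selected, whatever the hardware reports. */
static void demo_check_workaround(uint32_t arch_flags) {
  assert(!demo_selects_sme2_igemm(false, arch_flags));
}
```

Calling `demo_selects_sme2_igemm(false, demo_arch_arm_sme | demo_arch_arm_sme2)` returns `false`, matching the observation that an `SME2=OFF` build keeps running on the M4 while SME (v1) kernels stay available.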
