opencl: add Adreno xmem attention fast path by happyyzy · Pull Request #1461 · ggml-org/ggml

happyyzy · 2026-04-19T12:19:47Z

Summary

add an Adreno-only OpenCL xmem attention route behind GGML_OPENCL_ADRENO_XMEM_ATTN
add the xmem QK/PV kernels plus exact split-source variants used to preserve the fast Adreno compiler path
route compatible ggml_flash_attn_ext cases to this path
add test-opencl-adreno-attn as a focused correctness/perf reproducer

Scope

This PR is intentionally narrow. It only touches the ggml-opencl attention path:

3 new kernel files
1 runtime routing/integration file
1 benchmark test target

It does not change quantization, GEMM, CLML, or non-Adreno code paths.

Supported cases

Adreno OpenCL only
q: f32, k/v: f16 or f32, out: f32
GQA
causal and noncausal
prefill and decode
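
For readers less familiar with the terms in this list, here is a minimal pure-Python reference of what GQA and causal masking compute. It is a sketch, not the xmem kernel from this PR. The function name `gqa_attention`, the nested-list layout, the 1/sqrt(dq) scale, and the causal alignment (query i may see keys up to i + nkv - nq) are all assumptions made for illustration.

```python
import math

def gqa_attention(q, k, v, causal=False):
    """Naive attention reference (hypothetical helper, not part of ggml).

    q: [n_head][nq][dq], k: [n_head_kv][nkv][dq], v: [n_head_kv][nkv][dv]
    Returns out: [n_head][nq][dv].
    """
    n_head, n_head_kv = len(q), len(k)
    # Assumption: query heads are split evenly across KV heads.
    assert n_head % n_head_kv == 0
    group = n_head // n_head_kv
    nq, nkv = len(q[0]), len(k[0])
    dq, dv = len(q[0][0]), len(v[0][0])
    scale = 1.0 / math.sqrt(dq)
    # Assumption: with a causal mask the last query lines up with the last key.
    offset = nkv - nq
    assert not causal or offset >= 0

    out = []
    for h in range(n_head):
        # GQA: several query heads share one KV head.
        kh, vh = k[h // group], v[h // group]
        rows = []
        for i in range(nq):
            limit = min(nkv, i + offset + 1) if causal else nkv
            scores = [scale * sum(a * b for a, b in zip(q[h][i], kh[j]))
                      for j in range(limit)]
            # Numerically stable softmax over the visible keys.
            m = max(scores)
            w = [math.exp(s - m) for s in scores]
            z = sum(w)
            rows.append([sum(w[j] * vh[j][c] for j in range(limit)) / z
                         for c in range(dv)])
        out.append(rows)
    return out
```

With `n_head=8` and `n_head_kv=2`, as in the validation cases, this mapping sends query heads 0 to 3 to KV head 0 and heads 4 to 7 to KV head 1.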

Notes

The dedicated QK/PV split sources and their early compile order are intentional. On current Adreno drivers, compiling these xmem kernels later can produce a materially slower device binary even with identical source and build options.

Validation

Build validation on this branch:

cmake -S . -B build-adreno-xmem -DGGML_OPENCL=ON -DGGML_OPENCL_EMBED_KERNELS=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-adreno-xmem --target ggml-opencl test-opencl-adreno-attn -j
./build-adreno-xmem/bin/test-opencl-adreno-attn --help

Correctness examples on Adreno 830:

noncausal + GQA: dq=128 dv=128 nq=256 nkv=512 n_head=8 n_head_kv=2 -> mae=6.7302e-05, max_abs=8.9369e-04, cos=0.999993535
causal + GQA: dq=128 dv=128 nq=256 nkv=256 n_head=8 n_head_kv=2 -> mae=7.8839e-05, max_abs=9.92954e-04, cos=0.999999313
decode: dq=128 dv=128 nq=1 nkv=512 n_head=8 n_head_kv=2 -> mae=7.4717e-05, max_abs=4.90941e-04, cos=0.999992445
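
The three figures reported per case can be computed as sketched below. This assumes `mae` is the mean absolute error, `max_abs` the largest elementwise absolute difference, and `cos` the cosine similarity over the flattened output, each measured against some reference result. The PR text does not spell out these definitions or which reference was used, and `error_metrics` is a hypothetical helper name.

```python
import math

def error_metrics(out, ref):
    """Compare two flattened tensors given as equal-length lists of floats.

    Returns (mae, max_abs, cos) under the assumed definitions:
    mean absolute error, max absolute error, cosine similarity.
    """
    assert len(out) == len(ref) and len(out) > 0
    abs_err = [abs(a - b) for a, b in zip(out, ref)]
    mae = sum(abs_err) / len(abs_err)
    max_abs = max(abs_err)
    dot = sum(a * b for a, b in zip(out, ref))
    norm = math.sqrt(sum(a * a for a in out)) * math.sqrt(sum(b * b for b in ref))
    # Cosine similarity is undefined when either vector is all zeros.
    cos = dot / norm if norm > 0.0 else float("nan")
    return mae, max_abs, cos
```

A cosine similarity close to 1 together with a small `max_abs` indicates that the outputs agree both in direction and elementwise, which is why the two are reported side by side.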

Adreno 830 performance examples:

H=30, L=4224, D=128, noncausal: xmem 188.606 ms / 1.453 TOPS; FA was killed by the system; nofuse ran out of memory (OOM)
H=30, L=2048, D=128, noncausal: xmem 49.852 ms / 1.292 TOPS; FA 2015.568 ms / 0.03196 TOPS; nofuse 148.097 ms / 0.4350 TOPS
H=30, L=2048, D=128, causal prefill: xmem 46.069 ms / 1.398 TOPS; FA 2150.036 ms / 0.02996 TOPS; nofuse 2155.344 ms / 0.02989 TOPS
H=1, L=16384, D=512, noncausal: xmem 272.191 ms / 2.020 TOPS; FA failed with a map::at error; nofuse 1084.935 ms / 0.5067 TOPS
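
The TOPS figures above are consistent with counting 4·H·L²·D operations per attention pass (2·L²·D multiply-adds for QK plus 2·L²·D for PV, per head) and dividing by the measured time. This formula is inferred from the reported numbers rather than stated in the PR, and `attention_tops` is a hypothetical helper name. The causal row matches the same full count, so the mask does not appear to be discounted.

```python
def attention_tops(n_head, seq_len, head_dim, time_ms):
    """Throughput in tera-operations per second for one attention pass.

    Assumed operation count: 4 * H * L^2 * D (QK and PV, multiply and add
    counted separately), with nq == nkv == seq_len and dq == dv == head_dim.
    """
    ops = 4 * n_head * seq_len * seq_len * head_dim
    return ops / (time_ms * 1e-3) / 1e12

# Inputs taken from the rows above; compare the results with the reported TOPS.
for h, l, d, ms in [(30, 4224, 128, 188.606),
                    (30, 2048, 128, 49.852),
                    (30, 2048, 128, 46.069),
                    (1, 16384, 512, 272.191)]:
    print(f"H={h} L={l} D={d}: {attention_tops(h, l, d, ms):.3f} TOPS")
```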

Decode is not the target of this path; on nq=1 cases it remains slower than the existing routes:

nq=1 nkv=512 dq=dv=128 n_head=8 n_head_kv=2: xmem 6.562 ms; FA 0.491 ms; nofuse 0.282 ms
nq=1 nkv=16384 dq=dv=128 n_head=8 n_head_kv=2: xmem 12.350 ms; FA 8.276 ms; nofuse 3.535 ms

opencl: add Adreno xmem attention fast path

4817ca6

happyyzy wants to merge 1 commit into ggml-org:master from happyyzy:adreno-xmem-attn-clean
