opencl: add Adreno xmem attention fast path #1461

Open: happyyzy wants to merge 1 commit into ggml-org:master from happyyzy:adreno-xmem-attn-clean


Summary

  • add an Adreno-only OpenCL xmem attention route behind GGML_OPENCL_ADRENO_XMEM_ATTN
  • add the xmem QK/PV kernels plus exact split-source variants used to preserve the fast Adreno compiler path
  • route compatible ggml_flash_attn_ext cases to this path
  • add test-opencl-adreno-attn as a focused correctness/perf reproducer

Scope

This PR is intentionally narrow. It only touches the ggml-opencl attention path:

  • 3 new kernel files
  • 1 runtime routing/integration file
  • 1 benchmark test target

It does not change quantization, GEMM, CLML, or non-Adreno code paths.

Supported cases

  • Adreno OpenCL only
  • q: f32, k/v: f16 or f32, out: f32
  • grouped-query attention (GQA)
  • causal and noncausal
  • prefill and decode
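
The supported cases above can be read as a single eligibility check. The following is only a sketch of that reading: `AttnCase` and `xmem_eligible` are hypothetical names that do not come from the PR, and the assumptions that k and v types are checked independently and that GQA requires `n_head` to be a multiple of `n_head_kv` are mine, not the author's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttnCase:
    """Hypothetical description of one flash-attention call (not a ggml type)."""
    is_adreno: bool
    q_type: str
    k_type: str
    v_type: str
    out_type: str
    n_head: int
    n_head_kv: int


def xmem_eligible(c: AttnCase) -> bool:
    """Return True when the case matches the 'Supported cases' list."""
    kv_types = ("f16", "f32")
    return (
        c.is_adreno                      # Adreno OpenCL only
        and c.q_type == "f32"            # q: f32
        and c.k_type in kv_types         # k: f16 or f32
        and c.v_type in kv_types         # v: f16 or f32 (assumed independent of k)
        and c.out_type == "f32"          # out: f32
        and c.n_head_kv > 0
        and c.n_head % c.n_head_kv == 0  # assumed GQA constraint: whole head groups
    )


case = AttnCase(True, "f32", "f16", "f16", "f32", n_head=8, n_head_kv=2)
print(xmem_eligible(case))
```

Causal versus noncausal and prefill versus decode do not appear in the check because the list marks all four as supported.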

Notes

The dedicated QK/PV split sources and their early compile order are intentional. On current Adreno drivers, compiling these xmem kernels later can produce a materially slower device binary even with identical source and build options.

Validation

Build validation on this branch:

  • cmake -S . -B build-adreno-xmem -DGGML_OPENCL=ON -DGGML_OPENCL_EMBED_KERNELS=ON -DCMAKE_BUILD_TYPE=Release
  • cmake --build build-adreno-xmem --target ggml-opencl test-opencl-adreno-attn -j
  • ./build-adreno-xmem/bin/test-opencl-adreno-attn --help

Correctness examples on Adreno 830:

  • noncausal + GQA: dq=128 dv=128 nq=256 nkv=512 n_head=8 n_head_kv=2 -> mae=6.7302e-05, max_abs=8.9369e-04, cos=0.999993535
  • causal + GQA: dq=128 dv=128 nq=256 nkv=256 n_head=8 n_head_kv=2 -> mae=7.8839e-05, max_abs=9.92954e-04, cos=0.999999313
  • decode: dq=128 dv=128 nq=1 nkv=512 n_head=8 n_head_kv=2 -> mae=7.4717e-05, max_abs=4.90941e-04, cos=0.999992445
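
For readers who want to interpret the `mae`, `max_abs`, and `cos` figures, here is a minimal pure-Python sketch of a reference attention with GQA and an optional causal mask, plus the three metrics. It is not the code used by `test-opencl-adreno-attn`; the function names, the `1/sqrt(d)` scale, and the end-aligned causal mask (which assumes `nkv >= nq`) are assumptions made for illustration.

```python
import math
import random


def attention_ref(q, k, v, n_head, n_head_kv, causal):
    """q: [n_head][nq][d], k: [n_head_kv][nkv][d], v: [n_head_kv][nkv][dv] (nested lists)."""
    group = n_head // n_head_kv  # number of query heads that share one KV head (GQA)
    out = []
    for h in range(n_head):
        kh, vh = k[h // group], v[h // group]
        nq, nkv, d = len(q[h]), len(kh), len(kh[0])
        scale = 1.0 / math.sqrt(d)
        rows = []
        for i, qi in enumerate(q[h]):
            # Assumption: causal queries are aligned to the end of the KV sequence.
            last = i + (nkv - nq) if causal else nkv - 1
            scores = [scale * sum(a * b for a, b in zip(qi, kh[j])) for j in range(last + 1)]
            m = max(scores)                          # subtract the max for a stable softmax
            w = [math.exp(s - m) for s in scores]
            z = sum(w)
            rows.append([sum(w[j] * vh[j][c] for j in range(last + 1)) / z
                         for c in range(len(vh[0]))])
        out.append(rows)
    return out


def flatten(t):
    """Flatten a [head][row][col] nested list into one list of floats."""
    return [x for head in t for row in head for x in row]


def error_metrics(ref, got):
    """Return (mean absolute error, max absolute error, cosine similarity)."""
    diffs = [abs(a - b) for a, b in zip(ref, got)]
    dot = sum(a * b for a, b in zip(ref, got))
    norm_ref = math.sqrt(sum(a * a for a in ref))
    norm_got = math.sqrt(sum(b * b for b in got))
    return sum(diffs) / len(diffs), max(diffs), dot / (norm_ref * norm_got)


# Tiny usage example: compare the reference against a slightly perturbed copy.
rng = random.Random(0)
rand = lambda heads, n, d: [[[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
                            for _ in range(heads)]
q, k, v = rand(4, 3, 8), rand(2, 5, 8), rand(2, 5, 8)
ref = flatten(attention_ref(q, k, v, n_head=4, n_head_kv=2, causal=True))
got = [x + rng.uniform(-1e-4, 1e-4) for x in ref]  # stand-in for a device result
print("mae=%.4e max_abs=%.4e cos=%.9f" % error_metrics(ref, got))
```

Read this way, `mae` and `max_abs` show absolute drift, while `cos` shows whether the output direction is preserved.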

Adreno 830 performance examples (xmem is the new path; FA and nofuse are the existing routes):

  • H=30, L=4224, D=128, noncausal: xmem 188.606 ms / 1.453 TOPS; FA killed by system; nofuse OOM
  • H=30, L=2048, D=128, noncausal: xmem 49.852 ms / 1.292 TOPS; FA 2015.568 ms / 0.03196 TOPS; nofuse 148.097 ms / 0.4350 TOPS
  • H=30, L=2048, D=128, causal prefill: xmem 46.069 ms / 1.398 TOPS; FA 2150.036 ms / 0.02996 TOPS; nofuse 2155.344 ms / 0.02989 TOPS
  • H=1, L=16384, D=512, noncausal: xmem 272.191 ms / 2.020 TOPS; nofuse 1084.935 ms / 0.5067 TOPS; FA failed with a map::at error
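
The PR does not say how the TOPS figures are derived. They are consistent with counting 4·H·L²·D operations per run (QK and PV, with each multiply and each add counted separately) and dividing by the elapsed time. The sketch below reproduces the xmem numbers under that inferred formula, assuming nq = nkv = L and dq = dv = D; the function names are illustrative only.

```python
def attention_ops(n_head: int, seq_len: int, head_dim: int) -> int:
    """Operation count under the inferred formula: 2*L*L*D for QK plus 2*L*L*D for PV, per head."""
    return 4 * n_head * seq_len * seq_len * head_dim


def tops(n_head: int, seq_len: int, head_dim: int, elapsed_ms: float) -> float:
    """Tera-operations per second for one attention run of the given duration."""
    return attention_ops(n_head, seq_len, head_dim) / (elapsed_ms * 1e-3) / 1e12


# (H, L, D, xmem time in ms, TOPS reported in the PR)
cases = [
    (30, 4224, 128, 188.606, 1.453),
    (30, 2048, 128, 49.852, 1.292),
    (30, 2048, 128, 46.069, 1.398),   # causal prefill
    (1, 16384, 512, 272.191, 2.020),
]
for h, l, d, ms, reported in cases:
    print(f"H={h} L={l} D={d}: computed {tops(h, l, d, ms):.3f} TOPS, reported {reported}")
```

The causal prefill figure matches only when the full L² count is used, so the reported causal TOPS do not appear to discount the masked half of the score matrix.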

Decode is not the target of this path and remains slower than the existing routes on small nq=1 cases:

  • nq=1 nkv=512 dq=dv=128 n_head=8 n_head_kv=2: xmem 6.562 ms; FA 0.491 ms; nofuse 0.282 ms
  • nq=1 nkv=16384 dq=dv=128 n_head=8 n_head_kv=2: xmem 12.350 ms; FA 8.276 ms; nofuse 3.535 ms
