"Running llama.cpp with Adreno 830 GPU acceleration on Termux (Snapdragon 8 Elite)" #23736

paragon83114 · 2026-05-26T17:24:53Z

I got llama.cpp's OpenCL backend running on Qualcomm's Adreno 830 GPU inside Termux, on a Xiaomi Pad 8 Pro (Snapdragon 8 Elite). As far as I know, this is the first documented setup for this GPU.

Why this matters

llama.cpp has OpenCL support, but on Android the Adreno OpenCL driver lives in /vendor/lib64/, which is invisible to Termux's linker. Making it work requires:

An ICD entry pointing to /vendor/lib64/libOpenCL_adreno.so
LD_LIBRARY_PATH=/vendor/lib64 at runtime for the driver's dependencies (libcutils.so, libvndksupport.so)
Building with GGML_OPENCL_USE_ADRENO_KERNELS=ON
Using f16 KV cache (not q8_0, which crashes OpenCL's SET_ROWS)
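
The first two requirements above can be sketched as a small shell helper. This is a minimal sketch, not the repo's setup.sh: the function names, the file name adreno.icd, and the idea that the ICD loader reads *.icd files from a vendors directory such as $PREFIX/etc/OpenCL/vendors are assumptions. Only the driver path and the LD_LIBRARY_PATH value come from the post.

```shell
# Sketch: register the Adreno OpenCL driver with an ICD loader.
# Assumption: the loader reads *.icd files from the directory passed as $1.

ADRENO_DRIVER="/vendor/lib64/libOpenCL_adreno.so"   # driver path from the post

register_adreno_icd() {
    local vendors_dir="$1"
    mkdir -p "$vendors_dir" || return 1
    # An .icd file is a plain-text file holding the driver library path.
    printf '%s\n' "$ADRENO_DRIVER" > "$vendors_dir/adreno.icd"
}

# Prints a library path with /vendor/lib64 prepended, so the driver's own
# dependencies (libcutils.so, libvndksupport.so) resolve at runtime.
adreno_ld_path() {
    printf '%s\n' "/vendor/lib64${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
}
```

On the device this would be called as register_adreno_icd "$PREFIX/etc/OpenCL/vendors" once, and each llama.cpp binary would then be started with LD_LIBRARY_PATH="$(adreno_ld_path)" in front of it.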

Benchmark results

Qwen2.5-Coder-1.5B-Instruct Q8_0 on Adreno 830:

Config	pp512 (t/s)	tg128 (t/s)
GPU ngl=99, f16 KV, 6t	579	21.3
CPU-only 6t, q8_0 KV	24.17	7.95

That's a 24x prefill speedup and 2.7x token generation speedup compared to CPU-only on this model.

Qwen2.5-Coder-7B Q4_K_M (for comparison):

Config	pp512 (t/s)	tg128 (t/s)
GPU ngl=99, f16 KV	89.19	6.60
CPU-only 6t, q8_0 KV	24.17	7.95

For the 7B model, GPU gives 3.7x prefill speedup but ~17% slower tg due to CPU-GPU sync overhead per token.
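
The speedup figures quoted above follow directly from the table rows. A small sketch of the arithmetic, using the numbers exactly as they appear in the tables (the helper names ratio and slowdown_pct are made up for illustration):

```shell
# ratio A B -> A / B, rounded to one decimal place.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# slowdown_pct A B -> how much slower A is than B, in whole percent.
slowdown_pct() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", (1 - a / b) * 100 }'
}

ratio 579 24.17          # 1.5B prefill: GPU vs CPU
ratio 21.3 7.95          # 1.5B generation: GPU vs CPU
ratio 89.19 24.17        # 7B prefill: GPU vs CPU
slowdown_pct 6.60 7.95   # 7B generation: GPU vs CPU
```

This prints 24.0, 2.7, 3.7 and 17, matching the 24x, 2.7x, 3.7x and ~17% figures in the text.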

Key findings

6 threads pinned to performance cores (-C 0x3f) is optimal — 8 threads causes 55% tg regression on Snapdragon 8 Elite's asymmetric Oryon cores
f16 KV cache is required with OpenCL — q8_0 triggers SET_ROWS crash (issue #21501)
Flash attention is disabled when -ngl > 0 on OpenCL
--mlock causes OOM with GPU offload on 12 GB RAM
ARM arch flags armv8.6-a+dotprod+fp16+i8mm + KleidiAI provide meaningful CPU-side speedups

Full setup

Everything is packaged with ready-to-use scripts:

GitHub: paragon83114/llama-adreno

git clone https://github.com/paragon83114/llama-adreno.git
cd llama-adreno
bash setup.sh    # installs deps, builds llama.cpp with Adreno OpenCL, downloads model
bash server.sh   # starts llama-server with GPU+CPU hybrid config

The repo includes server.sh, chat.sh, download-model.sh, KV cache management scripts, and detailed documentation in the README.

Looking forward

PR #23501 (Qualcomm) adds flash attention, K-split, and q4_0 KV for Adreno — should significantly improve tg performance once merged.

Tested on: Xiaomi Pad 8 Pro, Snapdragon 8 Elite (SM8750P), 12 GB RAM, Adreno 830, Android 15, Termux (F-Droid).

paragon83114 · 2026-05-27T07:45:59Z

Update (May 27, 2026): After further benchmarking and optimization, here are several corrections and improvements worth sharing.

Threading correction: 4t > 6t

The original post claimed 6 threads (-C 0x3f) was optimal. Systematic benchmarking across 2t, 4t, and 6t shows 4 threads pinned to cores 0-3 (-C 0xf) is the sweet spot:

Config	pp512 (t/s)	tg128 (t/s)
2t (cores 0-1)	571 ± 4	31.1 ± 0.1
4t (cores 0-3)	562 ± 12	31.4 ± 0.0
6t (cores 0-5)	561 ± 13	30.5 ± 1.5

4 threads avoid the cross-cluster latency variance seen with 6t (note the ±1.5 sd on tg128) while keeping enough cores for KV cache ops alongside the GPU.
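
The affinity masks used here (0xf for cores 0-3, 0x3f for cores 0-5) are simply the lowest N bits set. The sketch below derives the mask from a thread count and prints a llama-bench command line for each configuration in the table. The exact invocation used for these measurements is not given in the post, so the flag combination, the helper names, and the model.gguf placeholder are assumptions. In particular, --cpu-mask is assumed to be llama-bench's counterpart of the -C option mentioned above.

```shell
# cpu_mask N -> hex affinity mask selecting cores 0..N-1.
cpu_mask() {
    printf '0x%x\n' $(( (1 << $1) - 1 ))
}

# bench_cmd MODEL THREADS -> prints a llama-bench command line for one
# thread count (pp512 and tg128, 5 repetitions, f16 KV cache, full offload).
bench_cmd() {
    local model="$1" threads="$2"
    printf 'llama-bench -m %s -ngl 99 -t %s --cpu-mask %s -ctk f16 -ctv f16 -p 512 -n 128 -r 5\n' \
        "$model" "$threads" "$(cpu_mask "$threads")"
}

# Print the three configurations compared in the table.
for t in 2 4 6; do
    bench_cmd model.gguf "$t"
done
```

Printing the commands instead of running them makes it easy to check the masks (0x3, 0xf, 0x3f) before starting a long benchmark run.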

XMEM_GEMM — +5% prefill, +6% generation

Setting GGML_OPENCL_ADRENO_XMEM_GEMM=1 enables Adreno-specific F16xF16 GEMM with temporary weight prepacking:

Config	pp2048 (t/s)	tg256 (t/s)
w/o XMEM_GEMM	460 ± 3	26.6 ± 0.1
w/ XMEM_GEMM	482 ± 5	30.1 ± 1.4

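Small but consistent prefill improvement at both pp512 and pp2048. Generation improves at tg256, while tg128 stays within about 2% of the baseline (see the extended benchmark suite below). No significant downside observed.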

Extended benchmark suite

More comprehensive measurements with standard deviation (5 runs each):

Config	pp512 (t/s)	pp2048 (t/s)	tg128 (t/s)	tg256 (t/s)
GPU 4t	562 ± 12	460 ± 3	31.4 ± 0.0	26.6 ± 0.1
GPU 4t + XMEM	570 ± 7	482 ± 5	30.7 ± 0.2	30.1 ± 1.4

CPU comparison for reference: pp512 = 24.17 t/s, tg128 = 7.95 t/s (6t, q8_0 KV, flash attn).

Prompt cache & warm boot

For AI assistant workloads (e.g., opencode), I added automated KV cache persistence:

Prompt cache: --cache-reuse 256 + --cache-ram 1024 achieves 98% hit rate on conversations with repeated system prompts
Warm boot: KV slot auto-saves to disk on server shutdown and restores on startup — first request goes from 130s (cold) → 20.6s (warm)
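
One way a warm boot like this could be wired up is through llama-server's slot save/restore endpoint, which is available when the server is started with --slot-save-path. The following is a sketch under that assumption, not the repo's actual scripts: the helper names, the base URL, the slot id and the file name are all hypothetical.

```shell
# slot_url BASE SLOT ACTION -> URL of llama-server's slot endpoint.
slot_url() {
    printf '%s/slots/%s?action=%s\n' "$1" "$2" "$3"
}

# slot_request BASE SLOT ACTION FILE -> save or restore one slot's KV cache.
# FILE is resolved relative to the directory given to --slot-save-path.
slot_request() {
    curl -sf -X POST "$(slot_url "$1" "$2" "$3")" \
        -H 'Content-Type: application/json' \
        -d "{\"filename\": \"$4\"}"
}

# Hypothetical warm-boot flow (base URL and file name are assumptions):
#   slot_request http://127.0.0.1:8080 0 save    assistant.bin   # before shutdown
#   slot_request http://127.0.0.1:8080 0 restore assistant.bin   # after startup
```

A wrapper script could call the save step from a shutdown trap and the restore step once the server reports it is ready.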

Server config for AI assistants

server.sh now includes these flags, tuned for chat/assistant UX:

--ctx-size 16384 --batch-size 2048 --ubatch-size 512
--cont-batching --cache-idle-slots --cache-ram 1024
--cache-reuse 256 --kv-unified --keep -1
--slot-save-path ./cache --poll 100 --timeout 600
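
For orientation, here is a sketch of how these flags might combine with the GPU settings from earlier in the thread (full offload, 4 pinned threads, f16 KV cache). The helper name and the way the pieces are combined are assumptions. The actual server.sh in the repo may differ.

```shell
# server_args MODEL -> prints one llama-server argument per line.
server_args() {
    printf '%s\n' \
        -m "$1" -ngl 99 -t 4 -C 0xf -ctk f16 -ctv f16 \
        --ctx-size 16384 --batch-size 2048 --ubatch-size 512 \
        --cont-batching --cache-idle-slots --cache-ram 1024 \
        --cache-reuse 256 --kv-unified --keep -1 \
        --slot-save-path ./cache --poll 100 --timeout 600
}

# Hypothetical usage (bash; model path is a placeholder):
#   mapfile -t args < <(server_args model.gguf)
#   LD_LIBRARY_PATH=/vendor/lib64 GGML_OPENCL_ADRENO_XMEM_GEMM=1 \
#       llama-server "${args[@]}"
```

Emitting one argument per line keeps the list easy to diff when experimenting with individual flags.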

Build optimizations

LLAMA_BUILD_UI=OFF and LLAMA_OPENSSL=OFF reduce binary size and link-time dependencies. ccache is recommended for faster rebuilds during iteration. Full cmake flags and updated scripts at github.com/paragon83114/llama-adreno.
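
A configure step that gathers the options mentioned in this thread might look like the sketch below. It is not the repo's setup.sh: the helper name, the use of GGML_CPU_ARM_ARCH and GGML_CPU_KLEIDIAI to express the ARM arch flags and KleidiAI, and the Release build type are assumptions.

```shell
# adreno_cmake_flags -> prints one -D option per line.
adreno_cmake_flags() {
    printf '%s\n' \
        -DGGML_OPENCL=ON \
        -DGGML_OPENCL_USE_ADRENO_KERNELS=ON \
        -DGGML_CPU_KLEIDIAI=ON \
        -DGGML_CPU_ARM_ARCH=armv8.6-a+dotprod+fp16+i8mm \
        -DLLAMA_BUILD_UI=OFF \
        -DLLAMA_OPENSSL=OFF \
        -DCMAKE_BUILD_TYPE=Release
}

# Hypothetical usage from a llama.cpp checkout:
#   cmake -B build $(adreno_cmake_flags)
#   cmake --build build -j
```

Keeping the flags in one function makes it simple to toggle a single option and rebuild, which is where ccache pays off.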


Vasili-Sk · 2026-06-01T04:27:12Z

Big boost for Android with GPU. It is somewhat usable now, as long as the context is not too big.
