"Running llama.cpp with Adreno 830 GPU acceleration on Termux (Snapdragon 8 Elite)" #23736
-
Update (May 27, 2026): After further benchmarking and optimization, several corrections and improvements are worth sharing.

Threading correction: 4t > 6t
The original post claimed 6 threads (…
4 threads avoid the cross-cluster latency variance seen with 6t (note the ±1.5 sd on tg128) while keeping enough cores for KV cache ops alongside the GPU.

XMEM_GEMM: +5% prefill, +6% generation
Setting …
Small but consistent improvement across all batch sizes. No downside observed.

Extended benchmark suite
More comprehensive measurements with standard deviation (5 runs each):
CPU comparison for reference: pp512 = 24.17 t/s, tg128 = 7.95 t/s (6t, q8_0 KV, flash attn).

Prompt cache & warm boot
For AI assistant workloads (e.g., opencode), added automated KV cache persistence:

Server config for AI assistants
The …

Build optimizations
-
Big boost for Android with the GPU; it is somewhat usable now, as long as the context is not too big.
-
I got llama.cpp's OpenCL backend running on Qualcomm's Adreno 830 GPU inside Termux, on a Xiaomi Pad 8 Pro (Snapdragon 8 Elite). This is the first documented setup for this GPU.
Why this matters
llama.cpp has OpenCL support, but on Android the Adreno OpenCL driver lives in `/vendor/lib64/`, which is invisible to Termux's linker. Making it work requires:
- `/vendor/lib64/libOpenCL_adreno.so`
- `LD_LIBRARY_PATH=/vendor/lib64` at runtime for the driver's dependencies (`libcutils.so`, `libvndksupport.so`)
- `GGML_OPENCL_USE_ADRENO_KERNELS=ON`
- `f16` KV cache (not `q8_0`, which crashes OpenCL's `SET_ROWS`)

Benchmark results
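Before running any benchmark, the runtime requirements listed above can be checked with a small pre-flight script. This is a minimal sketch, not the script from the repo: the helper names `check_adreno_driver` and `prepend_lib_path` are hypothetical, and prepending `/vendor/lib64` (rather than replacing the whole variable) is an assumption about how the path might be combined with an existing `LD_LIBRARY_PATH`.

```shell
#!/usr/bin/env sh
# Pre-flight sketch (hypothetical helpers, not taken from the repo).

# check_adreno_driver DIR: succeeds if DIR contains libOpenCL_adreno.so.
check_adreno_driver() {
    dir="${1:-/vendor/lib64}"
    [ -f "$dir/libOpenCL_adreno.so" ]
}

# prepend_lib_path DIR: print an LD_LIBRARY_PATH value with DIR first,
# keeping whatever was already set (assumption: prepending is wanted).
prepend_lib_path() {
    dir="$1"
    if [ -n "${LD_LIBRARY_PATH:-}" ]; then
        printf '%s:%s\n' "$dir" "$LD_LIBRARY_PATH"
    else
        printf '%s\n' "$dir"
    fi
}

if check_adreno_driver /vendor/lib64; then
    LD_LIBRARY_PATH="$(prepend_lib_path /vendor/lib64)"
    export LD_LIBRARY_PATH
    echo "Adreno OpenCL driver found; LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
else
    echo "libOpenCL_adreno.so not found in /vendor/lib64" >&2
fi
```

Sourcing this in the same shell that later starts llama.cpp keeps the exported variable visible to the process that loads the driver.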
Qwen2.5-Coder-1.5B-Instruct Q8_0 on Adreno 830:
That's a 24x prefill speedup and 2.7x token generation speedup compared to CPU-only on this model.
Qwen2.5-Coder-7B Q4_K_M (for comparison):
For the 7B model, GPU gives 3.7x prefill speedup but ~17% slower tg due to CPU-GPU sync overhead per token.
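For readers reproducing these comparisons, figures such as "3.7x" and "~17% slower" come from simple ratios of tokens-per-second values. The sketch below shows that arithmetic; the function names are hypothetical and the sample inputs are placeholders chosen for illustration, not measurements from this post.

```shell
# speedup NEW OLD: print NEW/OLD as a factor with two decimals.
speedup() {
    awk -v new="$1" -v old="$2" 'BEGIN { printf "%.2fx\n", new / old }'
}

# percent_change NEW OLD: signed percentage change going from OLD to NEW.
percent_change() {
    awk -v new="$1" -v old="$2" 'BEGIN { printf "%+.1f%%\n", (new - old) / old * 100 }'
}

# Placeholder t/s values, NOT measurements from this post.
speedup 37 10          # prints 3.70x
percent_change 8.3 10  # prints -17.0%
```

Passing the GPU figure as `NEW` and the CPU figure as `OLD` gives a factor above 1 when the GPU is faster and a negative percentage when it is slower.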
Key findings
- … `-C 0x3f`) is optimal; 8 threads cause a 55% tg regression on Snapdragon 8 Elite's asymmetric Oryon cores
- `q8_0` triggers `SET_ROWS` crash (issue #21501)
- … `-ngl > 0` on OpenCL
- `--mlock` causes OOM with GPU offload on 12 GB RAM
- `armv8.6-a+dotprod+fp16+i8mm` + KleidiAI provide meaningful CPU-side speedups

Full setup
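When adapting the setup to a different thread count, the affinity mask can be derived rather than hard-coded: `0x3f` from the findings above is the six lowest bits set. The sketch below computes such a mask; `cpu_mask` is a hypothetical helper, and it assumes the wanted cores are numbered contiguously from 0, which should be verified per device since the post does not state the core layout.

```shell
# cpu_mask N: hex bitmask with the N lowest bits set (cores 0..N-1).
# Assumption: the cores to use are the N lowest-numbered ones.
cpu_mask() {
    printf '0x%x\n' $(( (1 << $1) - 1 ))
}

cpu_mask 6   # prints 0x3f
cpu_mask 4   # prints 0xf
cpu_mask 8   # prints 0xff
```

The result can be passed to llama.cpp's `-C` (`--cpu-mask`) option together with a matching thread count.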
Everything is packaged with ready-to-use scripts:
GitHub: paragon83114/llama-adreno
The repo includes `server.sh`, `chat.sh`, `download-model.sh`, KV cache management scripts, and detailed documentation in the README.

Looking forward
PR #23501 (Qualcomm) adds flash attention, K-split, and `q4_0` KV for Adreno, which should significantly improve tg performance once merged.

Tested on: Xiaomi Pad 8 Pro, Snapdragon 8 Elite (SM8750P), 12 GB RAM, Adreno 830, Android 15, Termux (F-Droid).