RDNA4 Llama Experiments — Squeezing Every Token/s from the R9700 #21043
@JohnTDI-cpu thanks again. Do you mind sharing Hugging Face links to the two models you used?
All my stats seem to perform better overall. Also, I'm running this on a custom LACT R9700 profile. Differences I can tell which may have mattered are:
Models (GGUF)
System Configuration
Test Configuration
Environment
| Flag | Value |
|---|---|
| -t / threads | 1 |
| -ngl | 99 |
| -fa | 1 (flash attention on) |
| -p | 128,512,2048,8192 (prefill sizes) |
| -n | 128,512,2048 (decode / generation lengths) |
| -r | 3 (repetitions) |
Not set: VK_ICD_FILENAMES
Configs (columns in results)
| Column | Binary / batching |
|---|---|
| Stock RADV | Default llama-bench batching: -b 2048, -ub 512 (no extra flags). ggml-vulkan.cpp: rm_kq = 2 (upstream default). |
| RADV+ub2048 | Same binary as stock; add -ub 2048 and -b 16384. |
| RADV+rm_kq1+ub2048 | Rebuild with uint32_t rm_kq = 1 in ggml/src/ggml-vulkan/ggml-vulkan.cpp (line that defaults to 2); same flags as RADV+ub2048. |
Backend is Vulkan / RADV via the cmake build (GGML_VULKAN=ON).
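The three columns differ only in batching flags (and, for the third, in the rebuilt binary). Here is a minimal sketch of how the runs could be scripted. It uses only flags listed in the tables above. The binary name `llama-bench` on the PATH and the model path `model.gguf` are placeholders, not values taken from this thread.

```python
import shlex

# Flags shared by every run, taken from the Flag/Value table above.
COMMON = ["-t", "1", "-ngl", "99", "-fa", "1",
          "-p", "128,512,2048,8192",
          "-n", "128,512,2048",
          "-r", "3"]

# Extra batching flags per results column. The rm_kq=1 column uses the same
# flags as RADV+ub2048; the difference is the rebuilt binary, not the CLI.
CONFIGS = {
    "Stock RADV": [],
    "RADV+ub2048": ["-ub", "2048", "-b", "16384"],
    "RADV+rm_kq1+ub2048": ["-ub", "2048", "-b", "16384"],
}


def bench_command(config, model="model.gguf", binary="llama-bench"):
    """Return the argv list for one benchmark configuration.

    `model` and `binary` are hypothetical placeholder paths.
    """
    return [binary, "-m", model, *COMMON, *CONFIGS[config]]


if __name__ == "__main__":
    for name in CONFIGS:
        print(f"# {name}")
        print(shlex.join(bench_command(name)))
```

For the rm_kq=1 column, `binary` would point at the rebuilt executable.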
Detailed results: Qwen3.5-35B-A3B (MoE)
Decode
| Context | Stock RADV | RADV+ub2048 | RADV+rm_kq1+ub2048 |
|---|---|---|---|
| tg128 | 153.4 | 153.1 | 155.6 |
| tg512 | 152.3 | 151.7 | 151.8 |
| tg2048 | 149.7 | 150.4 | 152.3 |
Prefill
| Prompt | Stock RADV | RADV+ub2048 | RADV+rm_kq1+ub2048 |
|---|---|---|---|
| pp128 | 1839 | 1814 | 1742 |
| pp512 | 3314 | 3265 | 3257 |
| pp2048 | 3272 | 3964 | 3946 |
| pp8192 | 3131 | 3846 | 3832 |
Detailed results: Qwen3.5-27B (Dense)
Decode
| Context | Stock RADV | RADV+ub2048 | RADV+rm_kq1+ub2048 |
|---|---|---|---|
| tg128 | 32.25 | 32.17 | 32.06 |
| tg512 | 32.27 | 32.21 | 32.06 |
| tg2048 | 32.09 | 32.09 | 31.89 |
Prefill
| Prompt | Stock RADV | RADV+ub2048 | RADV+rm_kq1+ub2048 |
|---|---|---|---|
| pp128 | 841 | 821 | 830 |
| pp512 | 942 | 914 | 921 |
| pp2048 | 933 | 923 | 930 |
| pp8192 | 883 | 880 | 890 |
build: 48cda24c1 (8555)
Condensed comparison (three RADV configs)
All values are t/s from the detailed tables above. RADV+ub2048 and RADV+rm_kq1+ub2048 use absolute t/s with Δ vs Stock RADV in parentheses (percent, rounded).
| Model | Test | Stock RADV | RADV+ub2048 | RADV+rm_kq1+ub2048 |
|---|---|---|---|---|
| MoE 35B | tg128 | 153.4 | 153.1 (−0.2%) | 155.6 (+1.4%) |
| MoE 35B | pp512 | 3314 | 3265 (−1.5%) | 3257 (−1.7%) |
| MoE 35B | pp2048 | 3272 | 3964 (+21.2%) | 3946 (+20.6%) |
| MoE 35B | pp8192 | 3131 | 3846 (+22.8%) | 3832 (+22.4%) |
| Dense 27B | tg128 | 32.25 | 32.17 (−0.2%) | 32.06 (−0.6%) |
| Dense 27B | pp512 | 942 | 914 (−3.0%) | 921 (−2.2%) |
| Dense 27B | pp2048 | 933 | 923 (−1.1%) | 930 (−0.3%) |
| Dense 27B | pp8192 | 883 | 880 (−0.3%) | 890 (+0.8%) |
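The Δ values in parentheses are plain percent changes against the Stock RADV column. A small sketch of that arithmetic, using a few rows copied from the tables above (the function name `delta_pct` is illustrative). Because the table entries are already rounded, a recomputed delta can differ from the printed one in the last digit.

```python
def delta_pct(value, baseline):
    """Percent change of `value` relative to `baseline`, rounded to 0.1."""
    return round((value / baseline - 1.0) * 100.0, 1)


# (model, test) -> (stock, ub2048, rm_kq1+ub2048), copied from the tables above.
RESULTS = {
    ("MoE 35B", "tg128"): (153.4, 153.1, 155.6),
    ("MoE 35B", "pp8192"): (3131, 3846, 3832),
    ("Dense 27B", "pp512"): (942, 914, 921),
    ("Dense 27B", "pp8192"): (883, 880, 890),
}

if __name__ == "__main__":
    for (model, test), (stock, ub, rmkq) in RESULTS.items():
        print(f"{model} {test}: ub2048 {delta_pct(ub, stock):+.1f}%, "
              f"rm_kq1+ub2048 {delta_pct(rmkq, stock):+.1f}%")
```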
Here's something interesting for those folk with dual R9700s: messing with the power profiles can make a huge difference to TG. The results are... interesting, to say the least.
Before:
After:
So a 14% increase in TG, but a significant decrease in PP.
I don't have the tables to hand right now, but the change for Qwen3-30B-A3B-Q4_K_M was also interesting: 159 t/s TG on a single card, 152 t/s TG on two. I'm not entirely sure how to explain that.
Has anyone tried and tested the new Qwen 3.6 27B and 35B optimisations like MTP, Turboquant and Dflash?
AMD Radeon AI PRO R9700 (RDNA4, gfx1201) Benchmark & Optimization Results
Hardware:
Optimizations confirmed on RDNA4
All recommendations from this thread apply equally to gfx1201.
Results: Vulkan + MTP (spec-draft-n-max=3, parallel=1)
Key finding
MTP provides a consistent ~2× speedup on RDNA4, compensating significantly for the narrower memory bus. The R9700's tg/bandwidth ratio is slightly better than the 7900XTX's (69% vs 60%), suggesting RDNA4 is marginally more compute-efficient per GB/s.
Why R9700 is slower than 7900XTX for decode:
Here are my preliminary results from testing MTP. The best performance boost is actually for a single-GPU R9700 with 27B Q4. The setup I actually use daily, dual-GPU 35B Q8, gets a modest 6-7% boost at the expense of PP and TTFT. Do you guys have similar results? Any tips on making 35B get as much boost as 27B does?
MTP Benchmarks
Single R9700 + Qwen3.6-27B-UD-Q4_K_XL + KV Q4_0
Prompt Processing (PP)
TTFT (seconds)
Text Generation (TG)
TG Acceptance Rate
Dual R9700 + Qwen3.6-35B-A3B-UD-Q8_K_XL + KV F16
Prompt Processing (PP)
TTFT (seconds)
Text Generation (TG)
TG Acceptance Rate
After a few days of testing, I would like to share some of my thoughts. My system runs Ubuntu 26.04, and I upgraded the kernel to 7.0.0-15.
I tested Vulkan and ROCm, and I think ROCm is actually better when running Qwen3.6-27B (dense).
docker pull ghcr.io/ggml-org/llama.cpp:full-rocm (this was b9209 when I tested)
I ran "amd-smi version" in the container to check the ROCm version, and got:
First, the R9700 is optimized specifically for data types such as INT4, INT8, and FP8. Therefore, whether for model precision or KV cache precision, you'd better use Q4 or Q8. Do not use Q6 or the mixed-precision UD-Q4_K_XL; otherwise, performance will suffer significantly (I have already tested a lot), particularly when using the ROCm driver.
I used Qwen3.6-27B-Q8_0.gguf from unsloth/Qwen3.6-27B-MTP-GGUF. I believe Qwen3.6-27B-Q4_K_M.gguf would give better speed and a larger context window. Fortunately, 32GB of VRAM is just enough to accommodate a Q8 model combined with a q4_0 KV cache. However, you cannot set the parameters to
Now, see my test (Qwen3.6-27B-Q8_0.gguf + q4_0 KV cache).
Request Concurrency=1
Request Concurrency=1
Request Concurrency=1
Request Concurrency=1
The above is a single-request test for the Q8 model, which I consider sufficiently good. I'm too lazy to test Q4. Below are the results of two concurrent tests:
Request Concurrency=2
thread1:
thread2:
As shown above, under concurrent conditions, MTP for the second thread is extremely poor. I am not sure whether this is a bug. As pp continues to increase, tg deteriorates significantly, so I did not proceed with further testing.
Below is my Docker Compose configuration:
For parameters, refer to:
For KV cache type: as mentioned before, q4_0 is far faster than q4_1, iq4_nl, q5_0, and q5_1.
The "-sm row" option does not noticeably affect speed; "-sm layer" is the default.
Before reproducing, you should configure your machine:
Step 1: Enable Resizable BAR in BIOS
Step 2: Lock GPU to Highest Performance Level (maybe you don't need this step; see "Two key findings" below)
Method 1: Using rocm-smi (recommended)
Method 2: Directly via sysfs
Persistence:
Step 3: Enable ASPM
Method 1: Via Kernel Boot Parameters (recommended)
Method 2: Using a Script
Verification after Reboot:
Step 4: Disable ECC for RX 9700 Series GPUs (I'm not sure if this has any impact, but I did it anyway.)
Add the following parameter:
Verification after Reboot:
Finally, my /etc/default/grub contains:
Then I ran update-grub and verified that the kernel parameters appear correctly in /boot/grub/grub.cfg under the menu entries.
Two key findings! (I only tested ROCm)
First:
Second:
The latter is more power-efficient (idle power consumption typically hovers around 20W in
Shown below is the
The OP deserves a special gold star. These 2 optimizations boosted performance for Qwen3 122B from 15 t/s to 48 t/s, with 40 t/s achieved with just #1. And it worked even with RPC using 2 nodes.
(1) -ub 2048 | MoE 35B | RADV | +29% prefill pp2048 | -ub 2048 -b 16384
I saw significant performance improvements with other MoE models like Qwen 3 Next 80. I have not yet seen any boost with dense models, but I just started testing.
Node 1:
Node 2:
Tyvm for this post!
50+ experiments over several days to find every optimization that matters for llama.cpp Vulkan on RDNA4. All benchmarks were run and verified manually on real hardware. Claude (Anthropic) assisted throughout — helping analyze results, suggest hypotheses for unexpected findings (like the PCIe ASPM discovery), and structure this document. Full results below.
System Configuration
dc8d14c58 (build 8554)
cmake -DGGML_VULKAN=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
Driver identification
RADV reports:
AMD Radeon AI PRO R9700 (RADV GFX1201) (radv)
AMDVLK reports:
AMD Radeon AI PRO R9700 (AMD open-source driver)
All benchmarks use explicit VK_ICD_FILENAMES to guarantee driver selection.
Models Tested
Results: Qwen3.5-35B-A3B (MoE, 35B total, ~3.5B active)
Decode
FA ON, 3 reps, values in tokens/s.
Prefill
FA ON, 3 reps, values in tokens/s.
Results: Qwen3.5-27B (Dense, 27B)
Decode
Prefill
RADV vs AMDVLK
RADV wins overall. AMDVLK has a moderate decode advantage on MoE (+3.7%), but RADV's prefill is dramatically faster, especially on dense models where AMDVLK is nearly 4× slower.
Optimization Impact (RADV)
-ub 2048
rm_kq=1
rm_kq=1 code change
One line in ggml/src/ggml-vulkan/ggml-vulkan.cpp:
AMDVLK + rm_kq=1 (surprise finding)
rm_kq=1 has a large effect on AMDVLK dense decode (+13%), much more than RADV (+1%). This suggests AMDVLK's LLPC compiler benefits more from reduced register pressure on RDNA4.
Quality & VRAM Verification
Qwen3.5-35B-A3B — WikiText-2 Perplexity
PPL and VRAM identical across all configurations. No quality or memory impact from any optimization.
Reproduction
Exhaustive Flag Testing
Qwen3.5-35B-A3B (MoE) — Decode tg128, rm_kq=1 active
RADV experiments
gfx queue has zero effect on RADV 35B MoE decode. Disabling fusion hurts catastrophically.
AMDVLK experiments
gfx queue gives +4.7% on AMDVLK 35B MoE. No other flag breaks through 164 t/s.
Qwen3.5-27B (Dense) — Decode tg128, rm_kq=1 active
RADV experiments
Nothing moves RADV 27B decode. 29.3 t/s = hard BW ceiling (15.58 GiB × 29.3 = 456 GB/s = 71% of 640 GB/s).
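The ceiling estimate above multiplies model size by decode rate and compares the product with peak memory bandwidth. A hedged sketch of that arithmetic follows. It assumes every generated token reads the full weights once, and the function name and the `gib_as_gb` switch are illustrative. The 456 GB/s figure follows from using the 15.58 GiB size as if it were in GB; converting GiB to bytes first gives a somewhat higher utilization.

```python
GIB = 2**30   # bytes in one GiB
GB = 10**9    # bytes in one GB


def bw_utilization(model_gib, tokens_per_s, peak_gb_s, gib_as_gb=True):
    """Return (achieved GB/s, fraction of peak) implied by a decode rate.

    Assumes each generated token reads the whole model once.
    With gib_as_gb=True the size is used as if it were in GB, which is the
    convention behind the 456 GB/s figure; with False it is converted from GiB.
    """
    bytes_per_token = model_gib * (GB if gib_as_gb else GIB)
    achieved_gb_s = bytes_per_token * tokens_per_s / GB
    return achieved_gb_s, achieved_gb_s / peak_gb_s


if __name__ == "__main__":
    for flag in (True, False):
        gbs, frac = bw_utilization(15.58, 29.3, 640, gib_as_gb=flag)
        print(f"gib_as_gb={flag}: {gbs:.0f} GB/s, {frac:.0%} of peak")
```

Either convention works for comparing configurations, as long as the same one is used throughout.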
AMDVLK experiments
AMDVLK + rm_kq=1 without gfx = best dense decode (32.73 t/s, +13% over stock rm_kq=2!)
gfx queue HURTS dense AMDVLK by -8% — opposite of MoE where it helps +4.7%.
rm_kq impact across all configs
rm_kq=1 has the largest impact on AMDVLK dense decode (+13%). This suggests AMDVLK's LLPC compiler benefits significantly from reduced VGPR pressure on RDNA4 wave32 architecture. RADV's ACO compiler handles register allocation differently, gaining less from the same change.
Best Achievable Performance
35B MoE
27B Dense
Dense decode improved by +10.8% on RADV and +14.5% on AMDVLK (vs stock rm_kq=2 + ASPM default) from combined rm_kq=1 + PCIe ASPM performance mode.
Key findings
rm_kq=1 is the single most impactful code change: +1% RADV, +2% AMDVLK MoE, +13% AMDVLK dense.
PCIe ASPM Discovery
Setting PCIe ASPM to performance mode eliminates L1 exit latency:
ASPM L1 power saving adds latency to every PCIe transaction. Dense models suffer most because they read the entire model (~15.6 GB) every token with many small transactions. MoE models batch work more efficiently, hiding PCIe latency.
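Before benchmarking, it can help to confirm which ASPM policy is actually active. This is a minimal sketch that reads the `/sys/module/pcie_aspm/parameters/policy` file used elsewhere in this post. It assumes the usual kernel presentation, where all policies are listed on one line and the active one is wrapped in square brackets. The helper name is illustrative.

```python
import re
from pathlib import Path

ASPM_POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_aspm_policy(text):
    """Return the bracketed (active) policy from the sysfs policy string.

    Example input: "default performance [powersave] powersupersave".
    Returns None if no bracketed entry is found.
    """
    match = re.search(r"\[(\w+)\]", text)
    return match.group(1) if match else None


if __name__ == "__main__":
    if ASPM_POLICY_PATH.exists():
        policy = active_aspm_policy(ASPM_POLICY_PATH.read_text())
        print("active ASPM policy:", policy)
    else:
        print("pcie_aspm policy file not found (ASPM may be unavailable)")
```

Reading the file does not need root; writing a new policy to it does.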
This is a system-level optimization: no code change, no driver change. It persists until reboot. To make it permanent, add pcie_aspm.policy=performance to the kernel boot parameters.
Known Issues
Disabling coopmat (GGML_VK_DISABLE_COOPMAT=1) improves AMDVLK dense prefill by +17% (207→243), which suggests AMDVLK's cooperative matrix codegen is suboptimal for dense models. RADV's coopmat works correctly.
Exhaustive Experiment Log (50+ combinations tested)
Parameters with REAL impact
echo performance > /sys/module/pcie_aspm/parameters/policy
-ub 2048 -b 16384
GGML_VK_ALLOW_GRAPHICS_QUEUE=1
GGML_VK_DISABLE_COOPMAT=1
Parameters with ZERO impact (all tested, all confirmed ±0.3%)
RADV flags: gfx queue (on RADV), RADV_DEBUG=nocompute, RADV_PERFTEST=sam/bolist/localbos/dmashaders/nircache/hic/nogttspill, RADV_PROFILE_PSTATE
llama.cpp env vars: GGML_VK_DISABLE(F16/BF16/COOPMAT2/INTEGER_DOT_PRODUCT/ASYNC/GRAPH_OPTIMIZE), GGML_VK_FORCE_MMVQ, GGML_VK_DISABLE_MMVQ, GGML_VK_DMMV_LARGE, GGML_VK_ENABLE_MEMORY_PRIORITY, GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM, GGML_VK_FORCE_MAX_ALLOCATION_SIZE, GGML_VK_FORCE_MAX_BUFFER_SIZE, GGML_VK_SUBALLOCATION_BLOCK_SIZE (16MB and 1GB)
llama.cpp params: -t 1/4/12 (thread count), --no-host, -nopo (no-op-offload), -dio (direct-io), -mmp 0 (no mmap), -sm row (split mode), -b 1/2/512 (batch size), --prio 2 (scheduling priority), -ctk/-ctv q8_0/q4_0 (KV cache quant)
Code changes: rm_stdq=2, rm_kq_int=2, rm_stdq_int=2, rm_kq=3/4
System tuning: hugepages (16GB), transparent hugepages=always, CPU pinning (taskset), nice -n -20, GPU power profile (COMPUTE/3D_FULL_SCREEN)
DISABLE_FUSION is catastrophic: -18.5% on MoE, -5.1% on dense. Never disable.
Bandwidth utilization analysis
Dense models reach 79-83% BW utilization with ASPM fix. MoE models are lower (56-61%) due to dispatch overhead from expert routing. The remaining 17-20% gap on dense is primarily from:
s_wait_kmcnt per Q4K GEMV shader)
Please share your discoveries too. I'm curious what's the max we can get out of RDNA4.