RDNA4 Llama Experiments — Squeezing Every Token/s from the R9700 #21043

JohnTDI-cpu · 2026-03-26T20:35:48Z

50+ experiments over several days to find every optimization that matters for llama.cpp Vulkan on RDNA4. All benchmarks were run and verified manually on real hardware. Claude (Anthropic) assisted throughout — helping analyze results, suggest hypotheses for unexpected findings (like the PCIe ASPM discovery), and structure this document. Full results below.

System Configuration

Component	Details
GPU	AMD Radeon AI PRO R9700 (gfx1201, RDNA4, 32 GB GDDR6, 64 CUs)
Memory bandwidth	640 GB/s (MCLK 1258 MHz, level 5/5 — verified during every test)
PCIe	PCIe 5.0 x16, 32 GT/s
CPU	AMD Ryzen 9 9900X 12-Core
RAM	64 GB DDR5
OS	Ubuntu 24.04.4 LTS
Kernel	6.19.8-061908-generic
Mesa (RADV)	25.2.8-0ubuntu0.24.04.1
AMDVLK	Installed alongside RADV
llama.cpp	commit `dc8d14c58` (build 8554)
Build	`cmake -DGGML_VULKAN=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release`

Driver identification

RADV reports: AMD Radeon AI PRO R9700 (RADV GFX1201) (radv)
AMDVLK reports: AMD Radeon AI PRO R9700 (AMD open-source driver)

All benchmarks use explicit VK_ICD_FILENAMES to guarantee driver selection.

# RADV:
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json GGML_VK_VISIBLE_DEVICES=1
 
# AMDVLK:
VK_ICD_FILENAMES=/etc/vulkan/icd.d/amd_icd64.json  # use -dev Vulkan1 for dGPU

Models Tested

Model	Type	Total Params	Active/Token	File Size	Quantization
Qwen3.5-35B-A3B	MoE	34.66B	~3.5B	18.32 GiB	UD-Q4_K_XL
Qwen3.5-27B	Dense	26.90B	26.90B	15.58 GiB	Q4_K_M

Results: Qwen3.5-35B-A3B (MoE, 35B total, ~3.5B active)

Decode

FA ON, 3 reps, values in tokens/s.

Context	Stock RADV	Stock AMDVLK	RADV+ub2048	RADV+rm_kq1+ub2048	AMDVLK+rm_kq1
tg128	147.8	153.2	147.9	149.2	156.3
tg512	146.4	151.4	146.0	148.7	155.3
tg2048	143.9	148.4	144.4	146.8	—

Prefill

FA ON, 3 reps, values in tokens/s.

Prompt	Stock RADV	Stock AMDVLK	RADV+ub2048	RADV+rm_kq1+ub2048	FA OFF+ub2048
pp128	1,210	1,134	1,209	1,207	—
pp512	2,404	1,831	2,400	2,400	2,398
pp2048	2,381	1,823	3,074	3,075	2,983
pp8192	2,262	1,742	2,983	2,984	—

Results: Qwen3.5-27B (Dense, 27B)

Decode

Context	Stock RADV	Stock AMDVLK	RADV+ub2048	RADV+rm_kq1+ub2048	AMDVLK+rm_kq1
tg128	29.07	29.01	29.01	29.31	32.75
tg512	29.08	29.15	29.05	29.32	32.82
tg2048	28.97	28.96	28.98	29.25	—

Prefill

Prompt	Stock RADV	Stock AMDVLK	RADV+ub2048	RADV+rm_kq1+ub2048	FA OFF+ub2048
pp128	631	182	631	631	—
pp512	798	203	799	800	795
pp2048	799	202	823	823	806
pp8192	772	199	797	797	—

RADV vs AMDVLK

Metric	RADV	AMDVLK	Winner
35B MoE decode tg128	147.8	153.2	AMDVLK +3.7%
35B MoE prefill pp512	2,404	1,831	RADV +31%
35B MoE prefill pp2048	2,381	1,823	RADV +31%
27B Dense decode tg128	29.07	29.01	Same
27B Dense prefill pp512	798	203	RADV +293%

RADV wins overall. AMDVLK has a moderate decode advantage on MoE (+3.7%), but RADV's prefill is dramatically faster, especially on dense models where AMDVLK is nearly 4× slower.
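
The "Winner" column can be rechecked with a few lines of Python. This is only a sketch of the arithmetic: the numbers are copied from the table above, while the helper names and the 0.5% band used to call a result "Same" are assumptions made for illustration.

```python
def pct_gain(winner: float, loser: float) -> float:
    """Relative advantage of `winner` over `loser`, in percent."""
    return (winner / loser - 1.0) * 100.0


def winner_label(radv: float, amdvlk: float, tie_band: float = 0.5) -> str:
    """Name the faster driver; differences below `tie_band` percent count as a tie."""
    if radv >= amdvlk:
        gain, name = pct_gain(radv, amdvlk), "RADV"
    else:
        gain, name = pct_gain(amdvlk, radv), "AMDVLK"
    if gain < tie_band:  # assumed threshold, not from the original post
        return "Same"
    return f"{name} +{gain:.1f}%"


# (metric, RADV t/s, AMDVLK t/s) copied from the comparison table
rows = [
    ("35B MoE decode tg128", 147.8, 153.2),
    ("35B MoE prefill pp512", 2404, 1831),
    ("35B MoE prefill pp2048", 2381, 1823),
    ("27B Dense decode tg128", 29.07, 29.01),
    ("27B Dense prefill pp512", 798, 203),
]

for metric, radv, amdvlk in rows:
    print(f"{metric:26s} {winner_label(radv, amdvlk)}")
```

A +293% advantage is the same thing as AMDVLK running at roughly a quarter of RADV's speed, which is where the "nearly 4× slower" figure comes from.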

Optimization Impact (RADV)

Optimization	35B decode	35B prefill pp2048	27B decode	27B prefill pp2048
Stock	147.8	2,381	29.07	799
+ `-ub 2048`	147.9 (+0%)	3,074 (+29%)	29.01 (+0%)	823 (+3%)
+ `rm_kq=1`	149.2 (+1%)	3,075 (+0%)	29.31 (+1%)	823 (+0%)
+ FA ON	—	+3% vs FA OFF	—	+2% vs FA OFF

`rm_kq=1` code change

One line in ggml/src/ggml-vulkan/ggml-vulkan.cpp:

uint32_t rm_kq = 1; // was 2; reduces VGPR pressure, improves RDNA4 occupancy
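
For anyone who rebuilds often, the edit can be scripted. The sketch below assumes the upstream declaration reads exactly `uint32_t rm_kq = 2;` (whitespace aside); if the real line differs, the function reports zero replacements instead of guessing. The function name is hypothetical.

```python
import re

# Assumption: the declaration looks like "uint32_t rm_kq = 2;".
# The pattern requires "=" right after the name, so neighbours such as
# rm_kq_int are left untouched.
PATTERN = re.compile(r"(uint32_t\s+rm_kq\s*=\s*)2(\s*;)")


def patch_rm_kq(source: str, new_value: int = 1) -> tuple[str, int]:
    """Return (patched_source, number_of_replacements)."""
    return PATTERN.subn(rf"\g<1>{new_value}\g<2>", source)


if __name__ == "__main__":
    # Illustrative input only; the real file is ggml/src/ggml-vulkan/ggml-vulkan.cpp
    sample = "uint32_t rm_kq = 2;\nuint32_t rm_kq_int = 2;\n"
    patched, count = patch_rm_kq(sample)
    print(count, "replacement(s)")
    print(patched, end="")
```

To apply it for real, read the file, call `patch_rm_kq`, check that the count is exactly 1, and only then write the result back.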

AMDVLK + rm_kq=1 (surprise finding)

Model	AMDVLK stock	AMDVLK+rm_kq1	Improvement
35B MoE tg128	153.2	156.3	+2.0%
27B Dense tg128	29.01	32.75	+12.9%

rm_kq=1 has a large effect on AMDVLK dense decode (+13%), much more than RADV (+1%). This suggests AMDVLK's LLPC compiler benefits more from reduced register pressure on RDNA4.

Quality & VRAM Verification

Qwen3.5-35B-A3B — WikiText-2 Perplexity

Config	PPL	Model VRAM	KV VRAM	Compute VRAM	Total
Stock (rm_kq=2)	6.9472 ± 0.046	18,492 MiB	40 MiB	498 MiB	19,030 MiB
rm_kq=1	6.9472 ± 0.046	18,492 MiB	40 MiB	498 MiB	19,030 MiB
rm_kq=1 + ub=2048	6.9472 ± 0.046	18,492 MiB	40 MiB	498 MiB	19,030 MiB

PPL and VRAM identical across all configurations. No quality or memory impact from any optimization.

Reproduction

# Build
git clone https://github.com/ggerganov/llama.cpp.git && cd llama.cpp
cmake -B build -DGGML_VULKAN=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc) --target llama-bench
 
# Verify setup
cat /sys/class/drm/card1/device/pp_dpm_mclk | grep "*"  # must show 1258Mhz
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json \
GGML_VK_VISIBLE_DEVICES=1 ./build/bin/llama-bench --list-devices  # must show "(RADV GFX1201)"
 
# Stock benchmark
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json GGML_VK_VISIBLE_DEVICES=1 \
./build/bin/llama-bench -m MODEL.gguf -t 1 -ngl 99 -fa 1 \
  -p 128,512,2048,8192 -n 128,512,2048 -r 3
 
# Optimized (add -ub 2048 for prefill boost)
VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json GGML_VK_VISIBLE_DEVICES=1 \
./build/bin/llama-bench -m MODEL.gguf -t 1 -ngl 99 -fa 1 \
  -p 128,512,2048,8192 -n 128,512,2048 -ub 2048 -b 16384 -r 3
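
The two runs above can also be driven from a small script, which makes it easier to compare configurations. This is a minimal sketch: it relies on `llama-bench -o json`, and it assumes each JSON record carries `n_prompt`, `n_gen` and `avg_ts` fields; check the output of your build before trusting the parser. The function names are hypothetical.

```python
import json
import os
import subprocess

RADV_ICD = "/usr/share/vulkan/icd.d/radeon_icd.json"


def build_bench(model, optimized=False, bench="./build/bin/llama-bench"):
    """Return (env, argv) mirroring the stock / optimized commands above."""
    env = dict(os.environ, VK_ICD_FILENAMES=RADV_ICD, GGML_VK_VISIBLE_DEVICES="1")
    argv = [bench, "-m", model, "-t", "1", "-ngl", "99", "-fa", "1",
            "-p", "128,512,2048,8192", "-n", "128,512,2048", "-r", "3",
            "-o", "json"]
    if optimized:
        argv += ["-ub", "2048", "-b", "16384"]
    return env, argv


def summarize(json_text):
    """Map test names (pp512, tg128, ...) to average tokens/s.

    Assumes the JSON field names n_prompt, n_gen and avg_ts.
    """
    results = {}
    for row in json.loads(json_text):
        if row["n_prompt"]:
            name = f"pp{row['n_prompt']}"
        else:
            name = f"tg{row['n_gen']}"
        results[name] = row["avg_ts"]
    return results


def run(model, optimized=False):
    """Run llama-bench once and return the summarized results."""
    env, argv = build_bench(model, optimized)
    proc = subprocess.run(argv, env=env, check=True,
                          capture_output=True, text=True)
    return summarize(proc.stdout)
```

Calling `run("MODEL.gguf")` and `run("MODEL.gguf", optimized=True)` back to back gives two dictionaries that can be diffed per test.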

Exhaustive Flag Testing

Qwen3.5-35B-A3B (MoE) — Decode tg128, rm_kq=1 active

RADV experiments

Flag	tg128	vs stock 149.5
Stock (rm_kq=1, no env)	149.5	—
+ gfx queue	149.2	-0.2%
+ nocompute	149.3	-0.1%
+ gfx + nocompute	149.4	-0.1%
+ disable coopmat	149.3	-0.1%
+ force MMVQ	149.5	0%
+ disable int dot	149.3	-0.1%
+ FA OFF	149.3	-0.1%
+ disable F16	149.5	0%
+ q8_0 KV	148.8	-0.5%
+ disable graph opt	148.6	-0.6%
+ bolist	149.7	+0.1%
+ SAM	148.8	-0.5%
+ disable fusion	121.8	-18.5%

The gfx queue has no measurable effect on RADV 35B MoE decode. Disabling fusion is catastrophic (-18.5%).

AMDVLK experiments

Flag	tg128	vs stock 156.3
Stock (rm_kq=1, no env)	156.3	—
+ gfx queue	163.7	+4.7%
+ gfx + memory_priority	163.9	+4.9%
+ gfx + no-host	163.9	+4.9%
+ gfx + nopo	164.0	+4.9%
+ gfx + MMVQ disabled	164.0	+4.9%
+ gfx + coopmat2 disabled	163.7	+4.7%
+ gfx + F16 disabled	163.9	+4.9%
+ gfx + disable int dot	162.5	+4.0%
+ gfx + disable graph opt	157.3	+0.6%
+ gfx + disable coopmat	161.5	+3.3%
+ gfx + FA OFF	162.6	+4.0%
+ gfx + q4_0 KV	163.2	+4.4%
+ gfx + q8_0 KV	163.2	+4.4%
+ gfx + host memory	9.3	-94.0%
+ gfx + disable fusion	—	—

gfx queue gives +4.7% on AMDVLK 35B MoE. No other flag breaks through 164 t/s.

Qwen3.5-27B (Dense) — Decode tg128, rm_kq=1 active

RADV experiments

Flag	tg128	vs stock 29.30
Stock (rm_kq=1)	29.30	—
+ gfx queue	29.31	0%
+ nocompute	29.27	-0.1%
+ disable coopmat	29.26	-0.1%
+ force MMVQ	29.30	0%
+ disable int dot	29.32	+0.1%
+ FA OFF	29.35	+0.2%
+ disable F16	29.35	+0.2%
+ q8_0 KV	29.21	-0.3%
+ q4_0 KV	29.19	-0.4%
+ disable graph opt	28.99	-1.1%
+ disable fusion	27.81	-5.1%
+ SAM	29.27	-0.1%

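Nothing moves RADV 27B decode. 29.3 t/s looks like a hard BW ceiling: 15.58 GiB × 29.3 t/s ≈ 456 GiB/s, which is 71% of 640 GB/s if GiB is read as GB, or about 77% after converting to decimal units (≈ 490 GB/s).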

AMDVLK experiments

Flag	tg128	vs stock 32.73
Stock (rm_kq=1, NO gfx)	32.73	— (BEST!)
+ gfx queue	30.07	-8.1%
+ gfx + disable coopmat	29.93	-8.6%
+ gfx + force MMVQ	30.03	-8.2%
+ gfx + disable int dot	30.08	-8.1%
+ gfx + disable graph opt	29.72	-9.2%
+ gfx + FA OFF	30.06	-8.2%
+ gfx + q8_0 KV	29.99	-8.4%
+ gfx + disable fusion	28.61	-12.6%
+ gfx + memory_priority	30.04	-8.2%
rm_kq=2 (default, no gfx)	28.97	-11.5%

AMDVLK + rm_kq=1 without gfx = best dense decode (32.73 t/s, +13% over stock rm_kq=2!)
gfx queue HURTS dense AMDVLK by -8% — opposite of MoE where it helps +4.7%.

rm_kq impact across all configs

Model	Driver	rm_kq=2	rm_kq=1	Improvement
35B MoE	RADV	147.8	149.5	+1.1%
35B MoE	AMDVLK (no gfx)	153.2	156.3	+2.0%
35B MoE	AMDVLK+gfx	—	163.7	—
27B Dense	RADV	29.07	29.30	+0.8%
27B Dense	AMDVLK (no gfx)	28.97	32.73	+13.0%

rm_kq=1 has the largest impact on AMDVLK dense decode (+13%). This suggests AMDVLK's LLPC compiler benefits significantly from reduced VGPR pressure on RDNA4 wave32 architecture. RADV's ACO compiler handles register allocation differently, gaining less from the same change.

Best Achievable Performance

35B MoE

	RADV	AMDVLK
Best decode	149.5 (rm_kq=1)	163.7 (gfx+rm_kq=1)
Best prefill pp2048	3,075 (ub=2048)	2,170 (gfx+ub=2048)

27B Dense

	RADV	AMDVLK
Best decode	32.5 (rm_kq=1, ASPM perf)	33.2 (rm_kq=1, ASPM perf, NO gfx!)
Best prefill pp2048	993 (Mesa 25.3.6, ASPM, ub=2048)	207 (ub=2048)

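Dense decode improved by +10.8% on RADV (32.46 vs 29.30 t/s; the 29.30 baseline already includes rm_kq=1, and the gain is +11.7% against the stock rm_kq=2 figure of 29.07) and by +14.5% on AMDVLK (33.17 vs 28.97 t/s, stock rm_kq=2 with default ASPM), from the combination of rm_kq=1 and PCIe ASPM performance mode.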

Key findings

Decode is 100% memory bandwidth bound. No flag or parameter breaks through the ceiling.
rm_kq=1 is the single most impactful code change: +1% RADV, +2% AMDVLK MoE, +13% AMDVLK dense.
gfx queue helps AMDVLK MoE (+4.7%) but hurts AMDVLK dense (-8%). Zero effect on RADV.
Use gfx queue for MoE, disable for dense when running AMDVLK.
RADV is the best single driver (best prefill on all models, competitive decode).
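
The driver and queue recommendations above can be captured in a small helper. This is a sketch under the assumption that the single-GPU decode results in this post generalize; the ICD paths are the ones used earlier in the post, while the function name and the `"radv"`/`"amdvlk"` and `"moe"`/`"dense"` labels are made up for illustration.

```python
ICD_PATHS = {
    "radv": "/usr/share/vulkan/icd.d/radeon_icd.json",
    "amdvlk": "/etc/vulkan/icd.d/amd_icd64.json",
}


def decode_env(driver: str, model_kind: str) -> dict:
    """Environment variables suggested by the decode results in this post.

    driver: "radv" or "amdvlk"; model_kind: "moe" or "dense".
    """
    if model_kind not in ("moe", "dense"):
        raise ValueError(f"unknown model kind: {model_kind!r}")
    env = {"VK_ICD_FILENAMES": ICD_PATHS[driver]}
    # The graphics queue only helped AMDVLK on the MoE model (+4.7%).
    # It cost about 8% on AMDVLK dense and changed nothing on RADV.
    if driver == "amdvlk" and model_kind == "moe":
        env["GGML_VK_ALLOW_GRAPHICS_QUEUE"] = "1"
    return env


for combo in [("radv", "moe"), ("amdvlk", "moe"), ("amdvlk", "dense")]:
    print(combo, decode_env(*combo))
```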

PCIe ASPM Discovery

Setting PCIe ASPM to performance mode eliminates L1 exit latency:

echo "performance" | sudo tee /sys/module/pcie_aspm/parameters/policy

Model	Driver	ASPM default	ASPM perf	Change
27B Dense	RADV	29.30	32.46	+10.8%
27B Dense	AMDVLK	32.73	33.17	+1.3%
35B MoE	RADV	149.5	149.4	0%
35B MoE	AMDVLK+gfx	163.0	163.9	+0.5%

ASPM L1 power saving adds latency to every PCIe transaction. Dense models suffer most because they read the entire model (~15.6 GB) every token with many small transactions. MoE models batch work more efficiently, hiding PCIe latency.

This is a system-level optimization — no code change, no driver change. Persists until reboot. To make permanent: add pcie_aspm.policy=performance to kernel boot parameters.
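
Because the setting resets on reboot, it is worth checking before each benchmark session. The sketch below parses the policy file, in which the kernel marks the active policy with square brackets (for example `default [performance] powersave powersupersave`). The function names are hypothetical.

```python
import re
from pathlib import Path

POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_policy(text: str) -> str:
    """Extract the bracketed (active) entry from the policy file contents."""
    match = re.search(r"\[(\w+)\]", text)
    if match is None:
        raise ValueError(f"no active policy marked in {text!r}")
    return match.group(1)


def aspm_is_performance(path: Path = POLICY_PATH) -> bool:
    """True if the running kernel reports the 'performance' ASPM policy."""
    return active_policy(path.read_text()) == "performance"


# Example with a typical file body (not read from a real system):
print(active_policy("default performance [powersave] powersupersave"))
```

On a machine where the file exists, `aspm_is_performance()` can gate a benchmark script so dense-model runs are not silently taken in a power-saving policy.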

Known Issues

Kernel 6.19 RADV decode regression (theory): ~10% slower than kernel 6.17.0-14. AMDVLK unaffected. Suspected root cause in amdgpu DRM scheduler, but not yet confirmed via bisect.
AMDVLK dense prefill significantly slower: 4× slower than RADV on 27B dense (207 vs 823 t/s). Disabling coopmat (GGML_VK_DISABLE_COOPMAT=1) improves AMDVLK dense prefill by +17% (207→243) — suggests AMDVLK's cooperative matrix codegen is suboptimal for dense models. RADV's coopmat works correctly.
MCLK stuck on kernel 6.17: MCLK won't boost to 1258 MHz on kernel 6.17.0-14/19. Works fine on 6.19.8.
PCIe ASPM default = powersave: Resets on every reboot. Dense models lose 10% until set to performance.
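
The MCLK issue is easy to miss, so a scripted check helps. The sketch below parses `pp_dpm_mclk`, assuming the usual amdgpu layout of one `level: <N>Mhz` entry per line with `*` marking the active level. The intermediate clock values in the sample are invented; only the 1258 MHz top level comes from this post.

```python
import re


def active_mclk(pp_dpm_mclk: str):
    """Return (level, MHz) for the line marked with '*', or None if none is."""
    for line in pp_dpm_mclk.splitlines():
        match = re.match(r"\s*(\d+):\s*(\d+)\s*Mhz\s*\*", line, flags=re.IGNORECASE)
        if match:
            return int(match.group(1)), int(match.group(2))
    return None


# Illustrative contents; read /sys/class/drm/card1/device/pp_dpm_mclk on a real system
sample = "0: 96Mhz\n1: 456Mhz\n2: 772Mhz\n3: 875Mhz\n4: 1124Mhz\n5: 1258Mhz *\n"
level, mhz = active_mclk(sample)
print(f"active MCLK level {level}: {mhz} MHz")
if mhz < 1258:
    print("warning: memory clock is not at the 1258 MHz used for these benchmarks")
```

Note that the clock only boosts under load, so the check is meaningful while a benchmark is running, as in the "verified during every test" note in the system table.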

Exhaustive Experiment Log (50+ combinations tested)

Parameters with REAL impact

Discovery	Model	Driver	Gain	How
PCIe ASPM=performance	Dense (all)	RADV	+10.8% decode	`echo performance > /sys/module/pcie_aspm/parameters/policy`
PCIe ASPM=performance	30B MoE	RADV	+10% decode	same
rm_kq=1	Dense 27B	AMDVLK	+13% decode	1 line in ggml-vulkan.cpp
rm_kq=1	MoE 35B	AMDVLK	+2% decode	same
-ub 2048	MoE 35B	RADV	+29% prefill pp2048	`-ub 2048 -b 16384`
Mesa 25.3.6	MoE 35B	RADV	+48% prefill pp2048	custom Mesa build
Mesa 25.3.6	Dense 27B	RADV	+21% prefill pp2048	custom Mesa build
gfx queue	MoE 35B	AMDVLK	+4.7% decode	`GGML_VK_ALLOW_GRAPHICS_QUEUE=1`
disable coopmat	Dense 27B	AMDVLK	+17% prefill	`GGML_VK_DISABLE_COOPMAT=1`

Parameters with ZERO impact (all tested, all confirmed ±0.3%)

RADV flags: gfx queue (on RADV), RADV_DEBUG=nocompute, RADV_PERFTEST=sam/bolist/localbos/dmashaders/nircache/hic/nogttspill, RADV_PROFILE_PSTATE

llama.cpp env vars: GGML_VK_DISABLE(F16/BF16/COOPMAT2/INTEGER_DOT_PRODUCT/ASYNC/GRAPH_OPTIMIZE), GGML_VK_FORCE_MMVQ, GGML_VK_DISABLE_MMVQ, GGML_VK_DMMV_LARGE, GGML_VK_ENABLE_MEMORY_PRIORITY, GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM, GGML_VK_FORCE_MAX_ALLOCATION_SIZE, GGML_VK_FORCE_MAX_BUFFER_SIZE, GGML_VK_SUBALLOCATION_BLOCK_SIZE (16MB and 1GB)

llama.cpp params: -t 1/4/12 (thread count), --no-host, -nopo (no-op-offload), -dio (direct-io), -mmp 0 (no mmap), -sm row (split mode), -b 1/2/512 (batch size), --prio 2 (scheduling priority), -ctk/-ctv q8_0/q4_0 (KV cache quant)

Code changes: rm_stdq=2, rm_kq_int=2, rm_stdq_int=2, rm_kq=3/4

System tuning: hugepages (16GB), transparent hugepages=always, CPU pinning (taskset), nice -n -20, GPU power profile (COMPUTE/3D_FULL_SCREEN)

DISABLE_FUSION is catastrophic: -18.5% on MoE, -5.1% on dense. Never disable.

Bandwidth utilization analysis

Model	Best decode	Model size	BW used	BW peak	Utilization
35B MoE (AMDVLK+gfx)	163.8 t/s	~2.4 GB/token	393 GB/s	640 GB/s	61%
35B MoE (RADV)	149.5 t/s	~2.4 GB/token	359 GB/s	640 GB/s	56%
27B Dense (AMDVLK+ASPM)	33.2 t/s	15.58 GB/token	517 GB/s	640 GB/s	81%
27B Dense (RADV+ASPM)	32.5 t/s	15.58 GB/token	506 GB/s	640 GB/s	79%

Dense models reach 79-81% BW utilization with ASPM fix. MoE models are lower (56-61%) due to dispatch overhead from expert routing. The remaining 19-21% gap on dense is primarily from:

ACO compiler inefficiency (31 redundant s_wait_kmcnt per Q4K GEMV shader)
Cache line waste (Q4K 144B blocks in 192B cache lines = 75% utilization)
Memory controller overhead (DRAM refresh, row precharge)
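
The utilization column is just decode rate × bytes read per token ÷ peak bandwidth. The sketch below reproduces the table and also shows how the dense figure shifts if 15.58 is taken as the GiB file size from the model table rather than decimal GB; which unit is intended is not stated in the post, so both are shown. The helper name is hypothetical.

```python
GIB = 2**30
GB = 10**9
PEAK_GBPS = 640.0  # nominal peak from the system table


def utilization(tokens_per_s, size_per_token, unit=GB):
    """Return (GB/s, percent of peak) for a decode rate and data read per token."""
    gbps = tokens_per_s * size_per_token * unit / GB
    return gbps, 100.0 * gbps / PEAK_GBPS


rows = [
    ("35B MoE (AMDVLK+gfx)", 163.8, 2.4),
    ("35B MoE (RADV)", 149.5, 2.4),
    ("27B Dense (AMDVLK+ASPM)", 33.2, 15.58),
    ("27B Dense (RADV+ASPM)", 32.5, 15.58),
]
for name, tps, size in rows:
    gbps, pct = utilization(tps, size)
    print(f"{name:26s} {gbps:6.1f} GB/s  {pct:4.1f}%")

# Same RADV dense row, treating 15.58 as GiB and converting to decimal GB/s
gbps, pct = utilization(32.5, 15.58, unit=GIB)
print(f"27B Dense (RADV+ASPM), GiB input: {gbps:.1f} GB/s  {pct:.1f}%")

# Cache-line item from the list: a 144-byte Q4_K block in a 192-byte span
print(f"Q4_K span utilization: {144 / 192:.0%}")
```

With the GiB interpretation the dense models sit closer to 85% of peak, which would shrink the unexplained gap accordingly.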

Please share your discoveries too — I'm curious what's the max we can get out of RDNA4.

zedbytes · 2026-03-27T12:28:03Z

@JohnTDI-cpu thanks again, do you mind sharing huggingface links to the two models you used

1 reply

JohnTDI-cpu Mar 27, 2026
Author

https://huggingface.co/JohnTdi/Qwen3.5-Unsloth-GGUF-R9700-Benchmark

The 35B file was downloaded from unsloth/Qwen3.5-35B-A3B-GGUF on 2025-02-25; I see that unsloth has since updated the original file.

zedbytes · 2026-03-27T17:59:15Z

@JohnTDI-cpu

My numbers come out better overall. Note that I'm running this on a custom LACT R9700 profile:
power : 210W
GPU Clock Offset : -500 MHz
Maximum VRAM Clock : 2518 MHz
Minimum VRAM Clock : 194 MHz
GPU voltage Offset : -88 mV

Differences I can spot that may have mattered:
GGML_VK_ALLOW_GRAPHICS_QUEUE=1
Mesa RADV 26.0.3
llama.cpp 8555

Models (GGUF)

Role	File	Quant
MoE 35B	`Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf`	UD-Q4_K_XL
Dense 27B	`Qwen3.5-27B-Q4_K_M.gguf`	Q4_K_M

System Configuration

Component	Details
GPU	2× AMD Radeon AI PRO R9700 at PCIE gen 3 x8 ; AMD Radeon PRO W6600 at PCIE gen 3 x4
Memory bandwidth	640 GB/s (MCLK 1258 MHz, level 5/5)
PCIe	PCIe 3.0 x8/x8/x4
CPU	Intel Core i9-9900K (8-core / 16-thread)
RAM	64 GB DDR4
OS	Ubuntu 24.04.4 LTS
Kernel	6.19.8-061908-generic
Mesa (RADV)	Mesa 26.0.3 - kisak-mesa PPA (`mesa-vulkan-drivers`)
AMDVLK	Not installed
llama.cpp	commit `48cda24c1175363dd17925102dbe1da49279940e` (build 8555)
Build	`cmake -B build -DGGML_VULKAN=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release`

Test Configuration

Environment

Variable	Value
`GGML_VK_VISIBLE_DEVICES`	`1` — single R9700 (first PCIe slot)
`GGML_VK_ALLOW_GRAPHICS_QUEUE`	`1`

`llama-bench` flags (every run)

Flag	Value
`-t` / threads	`1`
`-ngl`	`99`
`-fa`	`1` (flash attention on)
`-p`	`128,512,2048,8192` (prefill sizes)
`-n`	`128,512,2048` (decode / generation lengths)
`-r`	`3` (repetitions)

Not set: VK_ICD_FILENAMES

Configs (columns in results)

Column	Binary / batching
Stock RADV	Default `llama-bench` batching: `-b` 2048, `-ub` 512 (no extra flags). `ggml-vulkan.cpp`: `rm_kq = 2` (upstream default).
RADV+ub2048	Same binary as stock; add `-ub 2048` and `-b 16384`.
RADV+rm_kq1+ub2048	Rebuild with `uint32_t rm_kq = 1` in `ggml/src/ggml-vulkan/ggml-vulkan.cpp` (line that defaults to `2`); same flags as RADV+ub2048.

Backend is Vulkan / RADV via the cmake build (GGML_VULKAN=ON).

Detailed results: Qwen3.5-35B-A3B (MoE)

Decode

Context	Stock RADV	RADV+ub2048	RADV+rm_kq1+ub2048
tg128	153.4	153.1	155.6
tg512	152.3	151.7	151.8
tg2048	149.7	150.4	152.3

Prefill

Prompt	Stock RADV	RADV+ub2048	RADV+rm_kq1+ub2048
pp128	1839	1814	1742
pp512	3314	3265	3257
pp2048	3272	3964	3946
pp8192	3131	3846	3832

Detailed results: Qwen3.5-27B (Dense)

Decode

Context	Stock RADV	RADV+ub2048	RADV+rm_kq1+ub2048
tg128	32.25	32.17	32.06
tg512	32.27	32.21	32.06
tg2048	32.09	32.09	31.89

Prefill

Prompt	Stock RADV	RADV+ub2048	RADV+rm_kq1+ub2048
pp128	841	821	830
pp512	942	914	921
pp2048	933	923	930
pp8192	883	880	890

build: 48cda24c1 (8555)

Condensed comparison (three RADV configs)

All values are t/s from the detailed tables above. RADV+ub2048 and RADV+rm_kq1+ub2048 use absolute t/s with Δ vs Stock RADV in parentheses (percent, rounded).

Model	Test	Stock RADV	RADV+ub2048	RADV+rm_kq1+ub2048
MoE 35B	tg128	153.4	153.1 (−0.2%)	155.6 (+1.4%)
MoE 35B	pp512	3314	3265 (−1.5%)	3257 (−1.7%)
MoE 35B	pp2048	3272	3964 (+21.2%)	3946 (+20.6%)
MoE 35B	pp8192	3131	3846 (+22.8%)	3832 (+22.4%)
Dense 27B	tg128	32.25	32.17 (−0.2%)	32.06 (−0.6%)
Dense 27B	pp512	942	914 (−3.0%)	921 (−2.2%)
Dense 27B	pp2048	933	923 (−1.1%)	930 (−0.3%)
Dense 27B	pp8192	883	880 (−0.3%)	890 (+0.8%)

7 replies

Neutralized Mar 27, 2026

Could there be an improvement that allows rm_kq to be set through an environment variable or a command-line argument? That would be interesting to test. By the way, for me the graphics queue brings dense-model performance down by about 25%, probably because I use two AMD GPUs that aren't evenly matched and are from different generations.

zedbytes Mar 28, 2026

I've done some more tests, also adding pp8192 and pp16384, running split across two R9700s with their own specific flags.

Here's the stats

Environment

Variable	Value
`GGML_VK_VISIBLE_DEVICES`	`2,1` — hide W6600; both R9700s visible to ggml as Vulkan0 + Vulkan1
`GGML_VK_ALLOW_GRAPHICS_QUEUE`	`1`

`llama-bench` flags (every run)

Flag	Value
`-t`	`1`
`-ngl`	`99`
`-fa`	`1` (flash attention on)
`-dev`	`Vulkan1/Vulkan0`
`-sm`	`row` (tensor split mode)
`-ts`	`0.5/0.5`
`-p`	`128,512,2048,8192,16384`
`-n`	`128,512,2048`
`-r`	`3`

Single R9700 vs Dual R9700

Stock RADV:

Model	Test	Single Stock	Dual Stock	Δ (dual vs single)
MoE 35B	tg128	155.2	114.2	−26.4%
MoE 35B	pp512	3316	2853	−14.0%
MoE 35B	pp2048	3265	4384	+34.3%
MoE 35B	pp8192	3131	4724	+50.9%
MoE 35B	pp16384	2929	4420	+50.9%
Dense 27B	tg128	32.03	26.7	−16.6%
Dense 27B	pp512	928	947	+2.0%
Dense 27B	pp2048	918	1423	+55.0%
Dense 27B	pp8192	875	1547	+76.8%
Dense 27B	pp16384	849	1519	+78.9%

RADV+ub2048:

Model	Test	Single RADV+ub2048	Dual RADV+ub2048	Δ (dual vs single)
MoE 35B	tg128	155.1	114.7	−26.0%
MoE 35B	pp512	3254	2950	−9.3%
MoE 35B	pp2048	3950	3798	−3.8%
MoE 35B	pp8192	3825	5224	+36.6%
MoE 35B	pp16384	3511	5097	+45.2%
Dense 27B	tg128	32.00	27.0	−15.6%
Dense 27B	pp512	910	940	+3.3%
Dense 27B	pp2048	927	945	+1.9%
Dense 27B	pp8192	882	1366	+54.9%
Dense 27B	pp16384	858	1433	+67.0%

RADV+rm_kq1+ub2048:

Model	Test	Single rm_kq1+ub2048	Dual rm_kq1+ub2048	Δ (dual vs single)
MoE 35B	tg128	154.7	114.9	−25.7%
MoE 35B	pp512	3261	2923	−10.4%
MoE 35B	pp2048	3947	3793	−3.9%
MoE 35B	pp8192	3828	5241	+36.9%
MoE 35B	pp16384	3512	5101	+45.2%
Dense 27B	tg128	32.05	26.9	−16.1%
Dense 27B	pp512	933	948	+1.6%
Dense 27B	pp2048	944	954	+1.1%
Dense 27B	pp8192	896	1374	+53.4%
Dense 27B	pp16384	864	1442	+67.0%

@Neutralized

by the way for me graphic queue on dense model brings down perf by about 25%, probably because i use 2 amd gpus that arent evenly matched and different gens also

Not quite the same as your setup, but I've tested dual R9700 with the graphics queue turned off for the 27B dense model, and here are the results. I'm still better off keeping the graphics queue flag on.

Stock RADV

Test	Gfx queue `=1`	Gfx queue unset	Δ (unset vs `=1`)
tg128	26.7	25.8	−3.4%
tg512	26.8	25.9	−3.4%
tg2048	26.8	25.7	−4.1%
pp128	741	747	+0.8%
pp512	947	956	+1.0%
pp2048	1423	1438	+1.1%
pp8192	1547	1570	+1.5%
pp16384	1519	1540	+1.4%

RADV + ub2048

Test	Gfx queue `=1`	Gfx queue unset	Δ (unset vs `=1`)
tg128	27.0	25.8	−4.4%
tg512	26.9	25.5	−5.2%
tg2048	26.5	25.6	−3.4%
pp128	719	735	+2.2%
pp512	940	940	0.0%
pp2048	945	936	−1.0%
pp8192	1366	1354	−0.9%
pp16384	1433	1423	−0.7%

RADV + rm_kq1 + ub2048

Test	Gfx queue `=1`	Gfx queue unset	Δ (unset vs `=1`)
tg128	26.9	25.7	−4.5%
tg512	26.7	25.8	−3.4%
tg2048	26.5	25.7	−3.0%
pp128	741	740	−0.1%
pp512	948	945	−0.3%
pp2048	954	944	−1.0%
pp8192	1374	1362	−0.9%
pp16384	1442	1429	−0.9%

Neutralized Mar 28, 2026

Interesting, my setup is 9070XT with 6700XT, i also use layer sm row which gives me better perf, but due to nature of both cards i do ts 0.55, 0.45 on average. Maybe graphic queue hurts in this setup because of the 6700XT which is older card.

zedbytes Apr 10, 2026

Tiny improvement, but worth noting.

	Original best performing run rm_kq1 + ub2048	Re-run (2026-04-10) rm_kq1 + ub2048
llama.cpp build	`48cda24c1` (8555)	`e34f04215` (8740)
Mesa (RADV)	26.0.3	26.0.4
cmake flags	`-DGGML_VULKAN=ON -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release`	`-DGGML_VULKAN=ON -DGGML_NATIVE=ON -DGGML_LTO=ON -DCMAKE_BUILD_TYPE=Release -DVulkan_GLSLC_EXECUTABLE=$HOME/opt/1.4.341.1/x86_64/bin/glslc`
GLSLC	Ubuntu default 2023 version	LunarG SDK 1.4.341.1 (`-DVulkan_GLSLC_EXECUTABLE`)
Shader extensions	Default	COOPMAT, COOPMAT2, INTEGER_DOT, BFLOAT16 all ON
Kernel	6.19.8-061908-generic	6.19.8-061908-generic (same)
GPU / env vars	Same as before	added `GGML_VK_DISABLE_MMVQ=1` (disables only the auto-selection of the quantized mat-vec (MMVQ) path while keeping the integer dot MMQ (mat-mat) path active )
llama-bench flags	`-t 1 -ngl 99 -fa 1 -p 128,512,2048,8192,16384 -n 128,512,2048 -r 3 -ub 2048 -b 16384`	Same

Model	Test	b8555 (orig)	b8740 + DISABLE_MMVQ	Δ
MoE 35B	tg128	154.7	156.0	+0.8%
MoE 35B	tg512	154.4	155.3	+0.6%
MoE 35B	tg2048	152.7	153.4	+0.5%
MoE 35B	pp128	1813	1860	+2.6%
MoE 35B	pp512	3261	3328	+2.1%
MoE 35B	pp2048	3947	3971	+0.6%
MoE 35B	pp8192	3828	3891	+1.6%
MoE 35B	pp16384	3512	3570	+1.7%
Dense 27B	tg128	32.05	32.17	+0.4%
Dense 27B	tg512	32.04	32.16	+0.4%
Dense 27B	tg2048	31.89	32.04	+0.5%
Dense 27B	pp128	838	840	+0.2%
Dense 27B	pp512	933	946	+1.4%
Dense 27B	pp2048	944	950	+0.7%
Dense 27B	pp8192	896	902	+0.7%
Dense 27B	pp16384	886	873	−1.5%

Dual GPU comparison

Model	Test	Single (b8740 + DISABLE_MMVQ)	Dual (b8740 + DISABLE_MMVQ)	Δ (dual vs single)
MoE 35B	tg128	156.0	115.4	−26.0%
MoE 35B	tg512	155.3	116.9	−24.7%
MoE 35B	tg2048	153.4	115.8	−24.5%
MoE 35B	pp128	1860	1391	−25.2%
MoE 35B	pp512	3328	2986	−10.3%
MoE 35B	pp2048	3971	3797	−4.4%
MoE 35B	pp8192	3891	5245	+34.8%
MoE 35B	pp16384	3570	5135	+43.8%
Dense 27B	tg128	32.17	26.29	−18.3%
Dense 27B	tg512	32.16	26.71	−17.0%
Dense 27B	tg2048	32.04	26.89	−16.1%
Dense 27B	pp128	840	747	−11.1%
Dense 27B	pp512	946	950	+0.4%
Dense 27B	pp2048	950	969	+2.0%
Dense 27B	pp8192	902	1395	+54.7%
Dense 27B	pp16384	873	1460	+67.2%

zedbytes Apr 10, 2026

Tried -sm tensor, but it's not worth sharing yet because Vulkan support isn't there.

I did compare -sm layer vs -sm row, though. Results are mixed, slightly better overall with layer.

MoE 35B — Decode

Context	`-sm row`	`-sm layer`	Δ
tg128	115.4	116.4	+0.8%
tg512	116.9	116.8	−0.1%
tg2048	115.8	115.8	0.0%

MoE 35B — Prefill

Prompt	`-sm row`	`-sm layer`	Δ
pp128	1391	1435	+3.2%
pp512	2986	2997	+0.4%
pp2048	3797	3836	+1.0%
pp8192	5245	5217	−0.5%
pp16384	5135	5135	0.0%

Dense 27B — Decode

Context	`-sm row`	`-sm layer`	Δ
tg128	26.3	27.2	+3.3%
tg512	26.7	26.8	+0.4%
tg2048	26.9	26.8	−0.4%

Dense 27B — Prefill

Prompt	`-sm row`	`-sm layer`	Δ
pp128	747	722	−3.3%
pp512	950	955	+0.5%
pp2048	969	965	−0.4%
pp8192	1395	1428	+2.4%
pp16384	1460	1473	+0.9%

digitalscream · 2026-04-17T11:29:26Z

Here's something interesting, for those folk with dual R9700s - messing with the power profiles can make a huge difference to TG.

sudo su
echo "high" > /sys/class/drm/card1/device/power_dpm_force_performance_level
echo "high" > /sys/class/drm/card2/device/power_dpm_force_performance_level

The results are...interesting, to say the least.

Before:

| model                          |       size |     params | backend    | ngl | type_k | type_v |     sm | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -----: | -: | --------------: | -------------------: |
| qwen3next 80B.A3B Q4_K - Medium |  34.21 GiB |    60.33 B | Vulkan     |  99 |   q8_0 |   q8_0 |    row |  1 |           pp512 |     1758.08 ± 117.77 |
| qwen3next 80B.A3B Q4_K - Medium |  34.21 GiB |    60.33 B | Vulkan     |  99 |   q8_0 |   q8_0 |    row |  1 |          pp1024 |      2735.68 ± 46.49 |
| qwen3next 80B.A3B Q4_K - Medium |  34.21 GiB |    60.33 B | Vulkan     |  99 |   q8_0 |   q8_0 |    row |  1 |          pp2048 |       3110.86 ± 1.91 |
| qwen3next 80B.A3B Q4_K - Medium |  34.21 GiB |    60.33 B | Vulkan     |  99 |   q8_0 |   q8_0 |    row |  1 |           tg128 |         88.16 ± 1.19 |

After:

| model                          |       size |     params | backend    | ngl | type_k | type_v |     sm | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -----: | -----: | -----: | -: | --------------: | -------------------: |
| qwen3next 80B.A3B Q4_K - Medium |  34.21 GiB |    60.33 B | Vulkan     |  99 |   q8_0 |   q8_0 |    row |  1 |           pp512 |       1817.02 ± 8.52 |
| qwen3next 80B.A3B Q4_K - Medium |  34.21 GiB |    60.33 B | Vulkan     |  99 |   q8_0 |   q8_0 |    row |  1 |          pp1024 |       2291.08 ± 5.69 |
| qwen3next 80B.A3B Q4_K - Medium |  34.21 GiB |    60.33 B | Vulkan     |  99 |   q8_0 |   q8_0 |    row |  1 |          pp2048 |       2589.24 ± 1.22 |
| qwen3next 80B.A3B Q4_K - Medium |  34.21 GiB |    60.33 B | Vulkan     |  99 |   q8_0 |   q8_0 |    row |  1 |           tg128 |        100.16 ± 0.24 |

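So a 14% increase in TG, but a significant decrease in PP at pp1024 and pp2048 (roughly -16% and -17%; pp512 actually rose about 3%, with far less run-to-run variance). I don't have the tables to hand right now, but the change for Qwen3-30B-A3B-Q4_K_M was also interesting; 159t/s TG on a single card, 152t/s TG on two.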
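
The before/after comparison can be computed straight from pasted `llama-bench` markdown output. The sketch below assumes the last two columns are `test` and `t/s` (in `mean ± stddev` form), as in the tables above; the rows are trimmed to those two columns for brevity and the function names are hypothetical.

```python
def parse_bench_table(markdown: str) -> dict:
    """Map the `test` column to mean t/s from llama-bench markdown rows."""
    results = {}
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) < 2 or "±" not in cells[-1]:
            continue  # header, separator, or unrelated line
        results[cells[-2]] = float(cells[-1].split("±")[0])
    return results


def compare(before: dict, after: dict) -> dict:
    """Percent change per test present in both runs."""
    return {t: (after[t] / before[t] - 1) * 100 for t in before if t in after}


BEFORE = """
| test | t/s |
| ---: | ---: |
| pp512 | 1758.08 ± 117.77 |
| pp1024 | 2735.68 ± 46.49 |
| pp2048 | 3110.86 ± 1.91 |
| tg128 | 88.16 ± 1.19 |
"""
AFTER = """
| pp512 | 1817.02 ± 8.52 |
| pp1024 | 2291.08 ± 5.69 |
| pp2048 | 2589.24 ± 1.22 |
| tg128 | 100.16 ± 0.24 |
"""

for test, delta in compare(parse_bench_table(BEFORE), parse_bench_table(AFTER)).items():
    print(f"{test:7s} {delta:+.1f}%")
```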

I'm not entirely sure how to explain that.

9 replies

zedbytes Apr 26, 2026

I tried the flag but had no luck on dual; same kernel and Mesa as well.
@digitalscream you can change many GPU settings in LACT. Mine is configured for the least amount of noise, which is probably why I take a 20-30 tok/s hit on dual. I'm waiting for a water block, but there's nothing on the market yet.

digitalscream Apr 26, 2026

Hmmm...didn't realise LACT can be run headless. I'll have a go at that this week...to be honest, I wonder if a lot of the problem is that the clock scaling works with usage, such that when running across two GPUs and usage doesn't go over 50%, neither does the core clock. It looks like LACT will allow us to set the max clock as soon as it hits (say) 20%, so that it idles correctly but boosts to the top as soon as it's in use.

I'm lucky in that I don't really care about the noise, because that machine is in the room with my 3D printers; even with the R9700s running at max cooling, they're never heard.

digitalscream Apr 26, 2026

OK, using LACT and being quite conservative about it - all I did was boost the max power to 330W and set the profile to COMPUTE rather than the default - the GPUs have got a lot more boost-y. The net effect is that, when using both GPUs together with Qwen 3.6 35B A3B Q8_0, an increase of 4800t/s -> 5100t/s PP and 102t/s -> 110t/s TG. Not amazing, but not nothing either, considering it's largely safe and free performance.

Just don't try playing with the memory clocks, because the detected defaults are very wrong in LACT - I tried mucking around without increasing past the detected max...the fans instantly went nuts (I could hear them through two closed doors), the GPU wasn't detected any more and even a reboot didn't fix it because the LACT service kept firing up. I had to boot into recovery mode and delete /etc/lact/config.yaml just to get it back.

On the bright side, the remote management feature is ace :)

opticblu May 12, 2026

you have to undervolt it too to get better clocks, but these things ramp up aggressively, like very aggressively, screenshot attached

not sure how it works with vulkan but HIP/ROCm builds already are screaming fast on the clocks, they seem to auto-overclock when not rendering (eg when it's just straight compute), 3.4ghz+

you're memory bandwidth limited, I'd try to eke out another 50-150 MHz from the VRAM and call it a day, you can try disabling ECC but it's like 1tok/s more decode

this is with no LACT/overclock, stock ubuntu 24.04 rocm stack, 3452MHz on GPU1 and 3393MHz on GPU2:

digitalscream May 12, 2026

@opticblu - yeah, the problem is that HIP/ROCm incurs such a large performance penalty that even without the Vulkan builds pushing dual-GPU clocks to the max, Vulkan is still significantly faster in both PP and TG. When under max load, the clocks rarely get above 2.4ghz with the stock power limit but they get to around 2.8GHz with the raised limit (even though they never get near that limit, usually cap out around 150-160W), and nvtop registers around 50% utilisation. Basically, raising the power cap seems to affect the scaling behaviour, which is absolutely fine by me.

Now, it'd be a different story if somebody with far more smarts than I could optimise --split-mode tensor for Vulkan or ROCm - that'd most likely allow much higher GPU utilisation. For now, though, it's more likely that we'll get DFlash support as the next performance boost on non-CUDA cards.

zedbytes · 2026-05-14T07:22:29Z

Has anyone tried and tested the new Qwen 3.6 27B and 35B optimisations like MTP, TurboQuant and DFlash?

0 replies

aguaishuo · 2026-05-15T03:37:34Z

AMD Radeon AI PRO R9700 (RDNA4, gfx1201) Benchmark & Optimization Results

Hardware:

GPU: AMD Radeon AI PRO R9700 — 32GB GDDR6, 256-bit bus (~576 GB/s)
CPU: AMD Ryzen 7 5700X
OS: Ubuntu, llama.cpp Vulkan (RADV Mesa 26.0.3 in Docker)
Model: Qwen3.6-27B-Q4_K_M

Optimizations confirmed on RDNA4

All recommendations from this thread apply equally to gfx1201.

Optimization	Effect
`echo performance > /sys/module/pcie_aspm/parameters/policy`	✅ effective
`echo high > /sys/class/drm/card1/device/power_dpm_force_performance_level`	✅ stable clocks
`-b 16384 -ub 2048`	✅ significant pp boost
`VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json`	✅ recommended (7 ICDs are loaded at once otherwise)
`GGML_VK_ALLOW_GRAPHICS_QUEUE=1` (do NOT set for dense models)	⚠️ confirmed -8% on dense

Results — Vulkan + MTP (spec-draft-n-max=3, parallel=1)

Metric	R9700 AI	7900XTX (this thread)	Ratio
Memory bandwidth	~576 GB/s	~960 GB/s	60%
tg non-MTP base	~20 tok/s	~29 tok/s	69%
tg with MTP	44–48 tok/s	~81 tok/s	~57%
pp (945 tokens)	~470–560 tok/s	~823 tok/s (pp2048)	—
MTP acceptance rate	95–97%	—	—

Key finding

MTP provides a consistent ~2× speedup on RDNA4 (from ~20 tok/s to 44–48 tok/s here), compensating significantly for the narrower memory bus.

The R9700 reaches 69% of the 7900XTX's baseline tg with only 60% of its memory bandwidth, suggesting RDNA4 is marginally more compute-efficient per GB/s than RDNA3.

Why the R9700 is slower than the 7900XTX for decode:
Despite being newer and marketed for AI (1531 TOPS INT4), the R9700 uses a 256-bit bus vs 384-bit on the 7900XTX. LLM token generation is memory-bandwidth bound; the AI accelerators help compute-bound tasks but not bandwidth-bound inference.
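The ratios in the table follow directly from the quoted numbers. A minimal sketch that recomputes them with awk; `ratio_pct` is a hypothetical helper name, and 46 tok/s is assumed as the midpoint of the 44–48 tok/s MTP range:

```shell
#!/usr/bin/env bash
# ratio_pct A B: A as a whole-number percentage of B (hypothetical helper).
ratio_pct() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.0f\n", 100 * a / b }'
}

echo "bandwidth:   $(ratio_pct 576 960)%"   # GB/s, R9700 vs 7900XTX
echo "tg baseline: $(ratio_pct 20 29)%"     # tok/s without MTP
echo "tg with MTP: $(ratio_pct 46 81)%"     # tok/s, 46 = assumed midpoint of 44-48
```

A baseline tg ratio above the bandwidth ratio is what the post reads as slightly better use of each GB/s on the R9700.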

1 reply

zedbytes May 16, 2026

@aguaishuo thanks for sharing
do you have tests on PP as well ? like PP8192 , pp65536 , pp131072 and pp262000

zedbytes · 2026-05-18T06:16:40Z

zedbytes
May 18, 2026

My preliminary results from testing MTP: the biggest boost is actually on a single R9700 with the 27B Q4, while the setup I use daily (dual GPU, 35B Q8) gets a modest 6-7% boost at the expense of PP and TTFT.

Do you guys have similar results? Any tips on making the 35B get as much of a boost as the 27B does?

MTP Benchmarks

llama.cpp:   release b9203 
AMD GPU:     Navi 48 [Radeon AI PRO R9700] (kernel 6.19.8-061908-generic)
RADV:        mesa vulkan driver v26.1.0

Single R9700 + Qwen3.6-27B-UD-Q4_K_XL + KV Q4_0

Prompt Processing (PP)

Test	baseline	mtp-1	mtp-2	mtp-3	mtp-4
pp128	736.8	692.1	698.1	697.3	716.9
pp2048	924.2	836.2	839.0	837.7	840.6
pp8192	888.9	803.5	801.4	803.4	803.1
pp16384	825.6	748.6	746.6	748.4	749.2

TTFT (seconds)

Test	baseline	mtp-1	mtp-2	mtp-3	mtp-4
pp128	0.3	0.3	0.3	0.3	0.3
pp2048	2.0	2.2	2.2	2.2	2.2
pp8192	9.0	10.0	10.0	10.0	10.0
pp16384	19.6	21.6	21.7	21.7	21.6

Text Generation (TG)

Test	baseline	mtp-1	mtp-2	mtp-3	mtp-4
tg128	28.2	41.6	44.9	47.4	46.4
tg512	29.6	44.3	51.7	51.2	55.5
tg4096	29.9	44.1	50.6	48.3	47.1
tg16384	29.8	44.6	50.4	49.5	46.0

TG Acceptance Rate

Test	baseline	mtp-1	mtp-2	mtp-3	mtp-4
tg128	-	95.4%	84.0%	81.1%	74.8%
tg512	-	88.9%	82.8%	72.1%	76.6%
tg4096	-	86.2%	77.8%	64.6%	59.4%
tg16384	-	87.3%	77.6%	67.9%	58.3%

Dual R9700 + Qwen3.6-35B-A3B-UD-Q8_K_XL + KV F16

Prompt Processing (PP)

Test	baseline	mtp-1	mtp-2	mtp-3	mtp-4
pp128	1027.2	842.0	855.8	860.3	836.9
pp2048	3195.7	2850.9	2864.4	2846.4	2849.7
pp8192	4121.6	3009.2	3027.8	3024.6	3051.7
pp16384	4700.5	2856.1	2860.3	2861.7	2874.3
pp65536	3248.8	1916.9	1913.7	1919.7	1921.4

TTFT (seconds)

Test	baseline	mtp-1	mtp-2	mtp-3	mtp-4
pp128	0.2	0.3	0.3	0.3	0.3
pp2048	0.6	0.7	0.7	0.7	0.7
pp8192	1.9	2.7	2.6	2.6	2.6
pp16384	3.5	5.7	5.7	5.7	5.6
pp65536	20.1	34.1	34.2	34.1	34.0

Text Generation (TG)

Test	baseline	mtp-1	mtp-2	mtp-3	mtp-4
tg128	72.8	76.7	83.9	82.8	71.0
tg2048	83.2	89.9	87.5	87.8	77.9
tg8192	83.7	89.2	91.1	82.2	77.1
tg16384	83.6	89.5	88.2	84.7	80.2
tg65536	83.5	88.9	89.3	87.8	75.6

TG Acceptance Rate

Test	baseline	mtp-1	mtp-2	mtp-3	mtp-4
tg128	-	100.0%	94.3%	93.0%	64.1%
tg2048	-	86.7%	73.2%	65.6%	54.1%
tg8192	-	85.3%	77.0%	57.4%	52.7%
tg16384	-	84.4%	75.2%	63.0%	55.4%
tg65536	-	85.0%	75.6%	64.9%	51.3%
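To compare how much MTP helps each setup, the gains can be computed from the rows above. A minimal sketch using the tg16384 and pp16384 rows; `gain_pct` is a hypothetical helper name, not part of llama.cpp:

```shell
#!/usr/bin/env bash
# gain_pct BASE NEW: signed percentage change from BASE to NEW (hypothetical helper).
gain_pct() {
  awk -v base="$1" -v new="$2" 'BEGIN { printf "%+.1f\n", 100 * (new / base - 1) }'
}

echo "27B single GPU, tg16384 baseline -> mtp-2: $(gain_pct 29.8 50.4)%"
echo "35B dual GPU,   tg16384 baseline -> mtp-1: $(gain_pct 83.6 89.5)%"
echo "27B single GPU, pp16384 baseline -> mtp-1: $(gain_pct 825.6 748.6)%"
echo "35B dual GPU,   pp16384 baseline -> mtp-1: $(gain_pct 4700.5 2856.1)%"
```

Laying the tg gain next to the pp loss for the same context length makes the trade-off for each setup easier to judge than reading the four tables separately.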

4 replies

digitalscream May 18, 2026

A few things:

It really depends on the content; in Vulkan, I've found that the 27B can do ~48t/s in text responses, ~62t/s when writing code.
With release 9200, prefill gets a boost to 950-1000t/s on my setup.
ROCm is your friend with dense models, especially on dual-GPU setups. Running -sm tensor with MTP gets 1000t/s+ prefill, 58-72t/s decode depending on whether the response is text or code (or a mixture).
MoE models will never see as much of an increase as dense models of similar size; it's the activated parameter size that matters, so you're effectively trying to boost a 3B model with a 0.8B model; the MTP overhead is just too much to get an appreciable increase.

zedbytes May 18, 2026

Thanks!
I can't for the life of me get ROCm with -sm tensor working with PP above 300-400 tok/s. TG does improve over Vulkan, both at baseline and with MTP. Can you share which version of ROCm you are using?
I used 7.2.3.

Maybe the reason for the low PP is that I'm using PCIe gen 3 at x8/x8 speeds!

digitalscream May 19, 2026

I don't actually have ROCm installed - I'm using a bare machine with just Vulkan installed, and for testing ROCm I'm using the Lemonade llama.cpp distro with ROCm bundled:

https://github.com/lemonade-sdk/llamacpp-rocm

Re: PCIE 3.0...yes, it's entirely possible that's at fault here.

zedbytes May 20, 2026

That worked! Cheers.

yiwiz-sai · 2026-05-19T01:56:07Z

yiwiz-sai
May 19, 2026

After a few days of testing, I would like to share some of my thoughts.

My system runs Ubuntu 26.04, and I upgraded the kernel to 7.0.0-15.

I tested Vulkan and ROCm, and I think ROCm is actually better when running Qwen3.6-27B (dense).

docker pull ghcr.io/ggml-org/llama.cpp:full-rocm
(this was b9209 when I tested)

I ran "amd-smi version" in the container to check the ROCm version and got:
AMDSMI Tool: 26.2.2+e1a6bc5663 | AMDSMI Library version: 26.2.2 | ROCm version: 7.2.1 | amdgpu version: Linuxversion7.0.0-15-generic(buildd@lcy02-amd64-048)(x86_64-linux-gnu-gcc(Ubuntu15.2.0-16ubuntu1)15.2.0,GNUld(GNUBinutilsforUbuntu)2.46)
...

First, the R9700 is optimized specifically for data types such as INT4, INT8, and FP8. Therefore, for both model precision and KV cache precision, you had better use Q4 or Q8. Do not use Q6 or the mixed-precision UD-Q4_K_XL; otherwise performance will suffer significantly (I have tested this a lot), particularly with ROCm.
https://www.amd.com/en/products/graphics/workstations/radeon-ai-pro/ai-9000-series/amd-radeon-ai-pro-r9700.html

I used Qwen3.6-27B-Q8_0.gguf from unsloth/Qwen3.6-27B-MTP-GGUF. I believe Qwen3.6-27B-Q4_K_M.gguf would give better speed and a larger context window.

Fortunately, 32GB of VRAM is just enough to accommodate a Q8 model combined with a q4_0 KV cache. However, you cannot set -ub 2048 -b 16384, or you will run out of VRAM. Additionally, you cannot exceed two checkpoints (-ctxcp 2). I always use a 64k context window, which is sufficient for my needs. If you can accept a 40k context window, you can switch the KV cache to q8_0.

Now, my tests (Qwen3.6-27B-Q8_0.gguf + q4_0 KV cache):

Request Concurrency=1
llama | prompt eval time = 3817.17 ms / 3666 tokens ( 1.04 ms per token, 960.40 tokens per second)
llama | eval time = 23372.95 ms / 1000 tokens ( 23.37 ms per token, 42.78 tokens per second)
llama | total time = 27190.12 ms / 4666 tokens

Request Concurrency=1
llama | prompt eval time = 24630.81 ms / 22108 tokens ( 1.11 ms per token, 897.57 tokens per second)
llama | eval time = 28044.28 ms / 1000 tokens ( 28.04 ms per token, 35.66 tokens per second)
llama | total time = 52675.09 ms / 23108 tokens

Request Concurrency=1
llama | prompt eval time = 63075.54 ms / 47679 tokens ( 1.32 ms per token, 755.90 tokens per second)
llama | eval time = 32893.92 ms / 1000 tokens ( 32.89 ms per token, 30.40 tokens per second)
llama | total time = 95969.46 ms / 48679 tokens

Request Concurrency=1
llama | prompt eval time = 82972.57 ms / 58723 tokens ( 1.41 ms per token, 707.74 tokens per second)
llama | eval time = 35708.77 ms / 1000 tokens ( 35.71 ms per token, 28.00 tokens per second)
llama | total time = 118681.34 ms / 59723 tokens

The above is a single-request test for the Q8 model, which I consider sufficiently good. I'm too lazy to test Q4.
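To compare runs without reading each log by hand, the timing lines can be reduced to one row per phase. A minimal sketch, assuming the log lines are whitespace-separated exactly as quoted above; `summarize_timings` is a hypothetical helper, not a llama.cpp tool:

```shell
#!/usr/bin/env bash
# summarize_timings: read llama-server timing lines on stdin and print
# the phase (pp or tg), token count, and recomputed tokens per second.
summarize_timings() {
  awk '
    /eval time =/ {
      kind = ($0 ~ /prompt eval time/) ? "pp" : "tg"
      ms = ""; n = ""
      for (i = 2; i <= NF; i++) {
        if ($i == "ms" && ms == "")    ms = $(i - 1)   # first "ms": elapsed time
        if ($i == "tokens" && n == "") n  = $(i - 1)   # first "tokens": token count
      }
      if (ms > 0) printf "%s %d tokens %.1f tok/s\n", kind, n, n / (ms / 1000)
    }'
}

# Example with the first request quoted above.
summarize_timings <<'EOF'
llama | prompt eval time = 3817.17 ms / 3666 tokens ( 1.04 ms per token, 960.40 tokens per second)
llama | eval time = 23372.95 ms / 1000 tokens ( 23.37 ms per token, 42.78 tokens per second)
EOF
```

In practice the input could come from something like `docker logs llama 2>&1 | summarize_timings`, using the container name from the compose file below.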

Below are the results with two concurrent requests:

Request Concurrency=2

thread1:
llama | prompt eval time = 7155.37 ms / 3680 tokens ( 1.94 ms per token, 514.30 tokens per second)
llama | eval time = 32801.60 ms / 1000 tokens ( 32.80 ms per token, 30.49 tokens per second)
llama | total time = 39956.97 ms / 4680 tokens
llama | draft acceptance rate = 0.77667 ( 699 accepted / 900 generated) ---> it is good !!!!!

thread2:
llama | prompt eval time = 6110.39 ms / 3680 tokens ( 1.66 ms per token, 602.25 tokens per second)
llama | eval time = 50425.68 ms / 1000 tokens ( 50.43 ms per token, 19.83 tokens per second)
llama | total time = 56536.08 ms / 4680 tokens
llama | draft acceptance rate = 0.31012 ( 481 accepted / 1551 generated) --->very bad !!!!!

As shown above, under concurrent requests the MTP draft acceptance rate for the second thread is extremely poor (31% vs 78%). I am not sure whether this is a bug. As pp continues to increase, tg deteriorates significantly, so I did not proceed with further testing.
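The acceptance figures can be pulled out of the server log and recomputed from the raw counts. A minimal sketch, assuming the "draft acceptance rate" lines keep the spacing shown in the quoted output; `acceptance_pct` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# acceptance_pct: read "draft acceptance rate" lines on stdin and print
# accepted/generated plus the recomputed percentage for each one.
acceptance_pct() {
  awk '
    /draft acceptance rate/ {
      acc = ""; gen = ""
      for (i = 2; i <= NF; i++) {
        if ($i ~ /^accepted/)  acc = $(i - 1)   # count before "accepted"
        if ($i ~ /^generated/) gen = $(i - 1)   # count before "generated)"
      }
      if (gen > 0) printf "%d/%d = %.1f%%\n", acc, gen, 100 * acc / gen
    }'
}

# The two lines from the concurrent test above.
acceptance_pct <<'EOF'
llama | draft acceptance rate = 0.77667 ( 699 accepted / 900 generated)
llama | draft acceptance rate = 0.31012 ( 481 accepted / 1551 generated)
EOF
```

The raw counts also show that the second thread generated 1551 draft tokens against 900 for the first, so it spent more draft work for fewer accepted tokens.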

Below is my Docker Compose configuration:
services:
  llama-server:
    image: ghcr.io/ggml-org/llama.cpp:full-rocm
    #image: ghcr.io/ggml-org/llama.cpp:full-vulkan
    container_name: llama
    restart: unless-stopped
    shm_size: '16gb'
    ulimits:
      memlock:
        soft: -1
        hard: -1
    devices:
      - /dev/kfd
      - /dev/dri
    ports:
      - "8080:8080"
    volumes:
      - /home/yourname/ai/hf:/root/.cache/huggingface
      # - /usr/share/vulkan/icd.d:/usr/share/vulkan/icd.d:ro
      # - /etc/vulkan/icd.d/amd_icd64.json:/etc/vulkan/icd.d/amd_icd64.json
    environment:
      - HF_HUB_OFFLINE=1
      # - VK_ICD_FILENAMES=/usr/share/vulkan/icd.d/radeon_icd.json
      # - VK_ICD_FILENAMES=/etc/vulkan/icd.d/amd_icd64.json
      # - GGML_VK_VISIBLE_DEVICES=0
      # - GGML_VK_ALLOW_GRAPHICS_QUEUE=1 # do not set this
    command:
      # -s -cl, -s means "run llama-server", -cl means "list models"
      -s -hf unsloth/Qwen3.6-27B-MTP-GGUF:Q8_0 -fa on -ngl 99 -ctk q4_0 -ctv q4_0 -t 8 -kvu -rea off --spec-type draft-mtp --spec-draft-n-max 3 --draft-p-min 0.5 --temp 0.8 --top-k 20 --top-p 0.95 --min-p 0.01 --no-mmap -ctxcp 2 -c 65536 --host 0.0.0.0 --port 8080 --metrics -np 2 -dev ROCm0 --fit-target 32
      # -dev Vulkan1
      # --no-warmup
      # -hf unsloth/Qwen3.6-27B-MTP-GGUF:UD-Q4_K_XL
      # -tools all

For the parameters, refer to:
https://manpages.debian.org/unstable/llama.cpp-tools/llama-server.1.en.html

On the KV cache type: as mentioned before, q4_0 is far faster than q4_1, iq4_nl, q5_0, or q5_1.
If you can accept a 40k context window, you can switch the KV cache to q8_0.

The "-sm row" option has no obvious impact on speed; "-sm layer" is the default.

Before reproducing, you should configure your machine:

Step 1: Enable Resizable BAR in BIOS
Navigate to Advanced → PCI in your BIOS settings and set Resizable BAR to Enabled.

Step 2: Lock GPU to Highest Performance Level (you may not need this step; see "Two key findings" below)
There are two common approaches:

Method 1: Using rocm-smi (Recommended)
rocm-smi --setperflevel high
This is simple and automatically applies to all AMD GPUs.

Method 2: Directly via sysfs
echo 'high' > /sys/class/drm/cardXXX/device/power_dpm_force_performance_level
You need to identify the correct cardXXX.
List symlinks to find the device directory: ls -la /sys/class/drm/cardXXX/device
Find your GPU’s PCI address: lspci | grep -i amd
Match the PCI address with the symlink directory to determine the correct cardXXX.
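The matching can also be scripted. A minimal sketch, assuming the usual sysfs layout in which `cardN/device` is a symlink to the PCI device directory and its `vendor` file holds the PCI vendor ID (0x1002 for AMD); `list_amd_cards` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# list_amd_cards [DRM_ROOT]: print "cardN <PCI address>" for every AMD GPU.
# DRM_ROOT defaults to /sys/class/drm; the argument exists only for testing.
list_amd_cards() {
  local drm_root="${1:-/sys/class/drm}" card
  for card in "$drm_root"/card[0-9]*; do
    # Connector entries (card1-DP-1, ...) have no vendor file; skip them.
    [ -r "$card/device/vendor" ] || continue
    # 0x1002 is AMD's PCI vendor ID.
    [ "$(cat "$card/device/vendor")" = "0x1002" ] || continue
    printf '%s %s\n' "$(basename "$card")" "$(basename "$(readlink -f "$card/device")")"
  done
}

list_amd_cards
```

The printed PCI address can be compared against the `lspci | grep -i amd` output to pick the right cardXXX for the echo command.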

Persistence:
Select either of the two commands above and add it to /etc/rc.local
Ensure the file starts with #!/bin/bash and is executable: chmod +x /etc/rc.local
Verification after Reboot:
Run rocm-smi and confirm that the performance level is set to high.

Step 3: Enable ASPM
First, set ASPM to Enabled in your BIOS under Advanced → Platform Misc Configuration. Then, enable performance mode in the OS using one of the two methods below:

Method 1: Via Kernel Boot Parameters (Recommended)
Edit /etc/default/grub and add the following parameter to the GRUB_CMDLINE_LINUX_DEFAULT line:
pcie_aspm.policy=performance
For this method, don't forget to run sudo update-grub afterwards.

Method 2: Using a Script
Log in and check the current policy:
cat /sys/module/pcie_aspm/parameters/policy
The default output is typically:
[default] performance powersave powersupersave
Set the policy to performance (this needs root, which sudo tee provides):
echo performance | sudo tee /sys/module/pcie_aspm/parameters/policy
Verify the change by reading the policy file again. The output should now show:
default [performance] powersave powersupersave
To make this method persistent, add the command to /etc/rc.local.

Verification after Reboot:
cat /sys/module/pcie_aspm/parameters/policy
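The verification can be reduced to a single check, since the kernel marks the active policy with square brackets as in the outputs quoted above. A minimal sketch; `active_aspm_policy` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# active_aspm_policy [FILE]: print the policy the kernel marks with [brackets].
# FILE defaults to the real sysfs path; the argument exists only for testing.
active_aspm_policy() {
  local policy_file="${1:-/sys/module/pcie_aspm/parameters/policy}"
  sed -n 's/.*\[\([a-z]*\)].*/\1/p' "$policy_file"
}

if [ "$(active_aspm_policy 2>/dev/null)" = "performance" ]; then
  echo "ASPM policy: performance"
else
  echo "ASPM policy is not set to performance" >&2
fi
```

A check like this could run at the top of a benchmark script so that results are never recorded under the default policy by accident.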

Step 4: Disable ECC for R9700 Series GPUs (I'm not sure if this has any impact, but I did it anyway.)
Edit /etc/default/grub and modify GRUB_CMDLINE_LINUX_DEFAULT.

Add the following parameter:
amdgpu.ras_enable=0

Verification after Reboot:
cat /sys/module/amdgpu/parameters/ras_enable

Finally, my /etc/default/grub contains:
GRUB_CMDLINE_LINUX_DEFAULT="amdgpu.runpm=0 pcie_aspm.policy=performance amdgpu.ras_enable=0"

Then I ran update-grub and verified that the kernel parameters appear correctly in /boot/grub/grub.cfg under the menu entries.
Remark:
The parameter amdgpu.runpm=0 is added to resolve GPU wake-up issues. On my system, without this setting, rocminfo cannot even detect the GPU.
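After rebooting, it is worth confirming that the running kernel actually received the parameters, since grub.cfg only shows what will be used at the next boot. A minimal sketch that checks /proc/cmdline for the three parameters listed above; `check_cmdline` is a hypothetical helper name:

```shell
#!/usr/bin/env bash
# check_cmdline FILE PARAM...: report whether each PARAM appears as a whole
# word in FILE (normally /proc/cmdline). Returns non-zero if any is missing.
check_cmdline() {
  local cmdline_file="$1" param missing=0
  shift
  for param in "$@"; do
    # Split the command line into one parameter per line, then match exactly.
    if tr ' ' '\n' < "$cmdline_file" | grep -qxF -- "$param"; then
      echo "ok       $param"
    else
      echo "MISSING  $param"
      missing=1
    fi
  done
  return "$missing"
}

check_cmdline /proc/cmdline \
  amdgpu.runpm=0 pcie_aspm.policy=performance amdgpu.ras_enable=0 \
  || echo "some parameters are not active on the running kernel"
```

Exact matching per parameter avoids false positives from substrings, for example a different value that merely starts with the expected one.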

Two key findings! (I only tested ROCm)

First:
I had previously set the -c parameter to 65536. When generating 2,000 tokens after a 60,000-token prompt, this occasionally resulted in VRAM overflow errors. I have since changed the setting to -c 60000; after several test runs generating 5,000 tokens, everything is running smoothly.

Second:
Running rocm-smi --setperflevel high did not yield better results than rocm-smi --setperflevel auto; in fact, it performed slightly worse.

Auto mode is also more power-efficient (idle power consumption typically hovers around 20W in auto mode, compared to 50W in high mode), and I found that it consistently delivers better performance during high-intensity computational tasks.

Shown below is rocm-smi output in auto mode during a run with a 60,000-token prompt and 5,000 generated tokens. As you can see, power consumption reached 298W, and the SCLK hit 3017 MHz.

rocm-smi
=========================================== ROCm System Management Interface ===========================================
===================================================== Concise Info =====================================================
Device	Node	IDs (DID, GUID)	Temp (Edge)	Power (Avg)	Partitions (Mem, Compute, ID)	SCLK	MCLK	Fan	Perf	PwrCap	VRAM%	GPU%
0	1	0x7551, 23334	70.0°C	298.0W	N/A, N/A, 0	3017Mhz	1258Mhz	70.98%	auto	300.0W	99%	100%

1 reply

digitalscream May 19, 2026

"I tested vulkan and rocm, I think rocm is actually better."

Not necessarily - ROCm is faster for large (>20B) dense models, Vulkan is faster for MoE models with <20B active parameters.

d-shehu · 2026-05-24T06:05:43Z

d-shehu
May 24, 2026

The OP deserves a special gold star. These 2 optimizations boosted performance for Qwen3 122B from 15 t/s to 48 t/s with 40 t/s achieved with just #1. And it worked even with RPC using 2 nodes.

(1) -ub 2048 | MoE 35B | RADV | +29% prefill pp2048 | -ub 2048 -b 16384
(2) GGML_VK_ALLOW_GRAPHICS_QUEUE=1

I saw significant performance improvements with other MoE models like Qwen 3 Next 80. I have not yet seen any boost with dense models, but I have only just started testing.

Node 1:
Ubuntu 24.04 HWE
2x R9700 AMD GPU
RADV Vulkan (AMD installer)
Llama.cpp b9222

Node 2:
Ubuntu 24.04 HWE
RTX 3090

Tyvm for this post!

0 replies

Replies: 8 comments · 23 replies
