Bad performance of Qwen3.5-122B on 7900 XTX #20013

shilga · 2026-03-01T16:04:27Z

Hi,
I'm seeing poor performance with the new Qwen3.5-122B on my 7900 XTX. I've tested ROCm and Vulkan with different quants, and the performance is bad either way: I get only 13 t/s, while the GPU sits at 30-40% load and several CPU cores are unused. I've seen other people report 20+ t/s on a 4070 Super, even though my 7900 XTX needs far less offloading thanks to its VRAM. I'm really not sure what is happening; this system usually doesn't perform this badly. With Qwen3-Coder-Next I get 30 t/s.

Would be nice if I could get some other input to compare, maybe there are even other people with a 7900 XTX.

fighter3005 · 2026-03-02T08:15:00Z

Perhaps unrelated, but I tested 35B on a W7800 48GB and on an RTX Pro 4000 (both with --cpu-moe): tg is about 1.5x faster on the RTX, while prompt processing is 5-6x faster. Even with the model fully loaded into the W7800's VRAM, I get ~300 t/s pp and ~40 t/s tg. That means pp is still 4x slower than on the RTX Pro 4000 WITH OFFLOADING... I am using the official docker compose file.

tomtom13 · 2026-04-26T23:49:57Z

FYI, if you have ANY offloading, none of your numbers are representative. Let us know what the numbers are with everything inside VRAM.

@fighter3005 That's an interesting bit of information. Can you elaborate on how you've set those up? I'm now pondering going 7900 XTX OR used 3090 (a few of them, to build a local AI rig; I don't want to waste more $ on cloud AI).
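To put the offloading point in numbers, here is a rough sketch of a bandwidth-only decode model. It assumes token generation is purely memory-bandwidth bound and that each token reads the active weights exactly once. The function name and every number in it (6 GB of active weights, 900 GB/s VRAM, 60 GB/s system RAM) are hypothetical placeholders, not measurements from this thread.

```python
def decode_ceiling_tps(active_gb, frac_in_vram, vram_gbps, ram_gbps):
    """Upper bound on decode tokens/s under a bandwidth-only model.

    active_gb    : GB of weights read per generated token (assumed)
    frac_in_vram : fraction of those weights that live in VRAM (0..1)
    vram_gbps    : VRAM bandwidth in GB/s (assumed)
    ram_gbps     : system RAM bandwidth in GB/s (assumed)
    """
    vram_time = active_gb * frac_in_vram / vram_gbps
    ram_time = active_gb * (1.0 - frac_in_vram) / ram_gbps
    return 1.0 / (vram_time + ram_time)


# Hypothetical inputs, chosen only to illustrate the shape of the effect.
everything_in_vram = decode_ceiling_tps(6.0, 1.0, 900.0, 60.0)
half_offloaded = decode_ceiling_tps(6.0, 0.5, 900.0, 60.0)

print(f"all in VRAM:    {everything_in_vram:.1f} t/s ceiling")
print(f"half offloaded: {half_offloaded:.1f} t/s ceiling")
```

Under these made-up inputs the ceiling falls from 150 t/s to under 19 t/s once half of the per-token reads come from system RAM, so the slow memory dominates and the GPU mostly waits. Real runs will land below either ceiling.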

segmond Apr 29, 2026

It is relevant. For some reason Qwen3.5 and Qwen3.6 are super slow on AMD with llama.cpp in comparison to Nvidia. I don't have an Nvidia GPU, so I can't check myself; I need to trust numbers I find on the internet. There is a bottleneck somewhere. Just so you have some numbers, the 7900 XTX gets:

Qwen3.5 27B UD-Q4_K_XL: 29t/s

Qwen3.5 35B UD-Q4_K_XL: 87t/s

Qwen3-Coder-Next 80B UD-Q4_K_XL: 28t/s (with MoE offload)

GPT-OSS 20B UD-Q4_K_XL: 150t/s

Trade your AMD for some Nvidia. I own AMD & CUDA GPUs and CUDA always beats AMD. On paper/specs they sometimes look the same, but in practice, CUDA wins. With AMD, you get cheap at the expense of speed. It's still much faster than running on system ram. But you just have to be happy with your performance and not jealous of CUDA performance.

shilga Apr 29, 2026
Author

That is only because AMD is an afterthought in most implementations. Same here: HIP uses the same kernels as CUDA with a glue layer, if I remember correctly. It works but isn't fast. And no, I will never get Nvidia again. I had a very bad experience with it in the past, and Nvidia is non-existent to me. Anyway, for the price of a used 3090 I can get two 7900 XTX or a new r9700 and have more VRAM, which always wins.

shilga Apr 29, 2026
Author

ooooh do tell ;) could you give me a simple head-to-head: some small model, let's say some qwen3.6 7b (I don't know how many parameters the small ones have), at UD q4xl on both 7900xtx VS 9070 XT please? (obviously on linux with the newest drivers, and I care about 32k prompt processing speed more than TG)

I've got my finger on the trigger to go for a triple 9070xt setup, and would like to know whether I'm about to score an own goal.

Not sure I can. The 9070 is in a rig I don't have access to right now, sorry. Just my opinion: the 9070 doesn't have enough VRAM to be a good option.

Look on localmaxxing.com to get some numbers

NickM-27 Apr 29, 2026

On my 7900XTX, Vulkan is faster than ROCm at everything after tuning batch sizes. Qwen 3.6 35B gets 2700 tok/s prefill and 120 tok/s generation with batch 2048 (b) and ubatch 1024 (ub).

tomtom13 Apr 30, 2026

@NickM-27 but please state at what input context size you actually get this pre-fill rate! It's one thing to process a 1024 token input, another to process a 64k token input... and for dev I'm usually way above 32k, sometimes touching 150k (yes, hallucination territory, or what my mate called "128k lala land").
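To show why the context size matters here, a quick back-of-the-envelope sketch: how long prefill would take at the quoted 2700 tok/s if that rate held at every prompt length. That constant rate is an assumption, and it is exactly the part in question, so treat these as best-case wait times. The helper name is made up for illustration.

```python
def prefill_seconds(prompt_tokens, pp_tokens_per_s):
    """Seconds to process a prompt at a fixed prefill rate.

    Assumes the rate does not degrade with context depth, which is
    optimistic for long prompts.
    """
    return prompt_tokens / pp_tokens_per_s


QUOTED_PP_RATE = 2700.0  # tok/s, the prefill figure quoted above

for prompt_tokens in (1024, 32_768, 150_000):
    wait = prefill_seconds(prompt_tokens, QUOTED_PP_RATE)
    print(f"{prompt_tokens:>7} tokens -> {wait:5.1f} s before the first output token")
```

Even in this best case a 150k prompt means close to a minute of waiting, and any drop in prefill rate at depth pushes that up proportionally. That is why a pp number without the prompt length attached is hard to compare.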

jack10768 · 2026-04-30T14:05:42Z

I am getting 230 t/s pp and 32 t/s decode with llama.cpp b89+, for qwen3.5 122b a10b UD-Q6_K_XL. Flash attention on, ctk bf16, ctv bf16, ctx 96k, batch 256, ubatch 64, parallel 1.

Software stack: ROCm 7.2.0, ubuntu 24.06, hwe 6.17, patched the amd dkms from the linux 7.0 one to fix the card staying pinned at 90 W, even when idle, whenever mmproj is loaded.

Hardware stack: 4x 9700, gigabyte mc62-g40, 5955wx, 128gb ddr4 2133 mhz.

tomtom13 May 1, 2026

@jack10768 could you tell me if you get better PP when you run the model on one 9700 vs two 9700s, and what sort of speedup you get from one vs two? I wanted to set up multiple 9070xt... but that single 9700 sounds sweet.

jack10768 May 1, 2026

Quick and dirty test - 3243 token prompt on qwen 3.6 27b UD-q8_k_xl llama.cpp:

1 card - PP 3243 tokens, 196.06/s, TG 5427 tokens, 9.31/s
2 cards - PP 3243 tokens, 354.77/s, TG 4779 tokens, 16.18/s
4 cards - PP 3243 tokens, 328.5/s, TG 5099 tokens, 15.17/s
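For anyone who wants the scaling spelled out, a small sketch that turns the three runs above into speedup and per-card efficiency relative to one card. The rates are copied from the lines above; the function and dictionary names are just for illustration. Note that the TG runs generated different token counts, so the TG comparison is only approximate.

```python
# cards -> (prompt processing tok/s, token generation tok/s), from the runs above
RUNS = {
    1: (196.06, 9.31),
    2: (354.77, 16.18),
    4: (328.5, 15.17),
}


def scaling_table(runs):
    """Speedup vs. the 1-card run and efficiency (speedup / card count)."""
    base_pp, base_tg = runs[1]
    table = {}
    for cards, (pp, tg) in sorted(runs.items()):
        pp_speedup = pp / base_pp
        tg_speedup = tg / base_tg
        table[cards] = {
            "pp_speedup": pp_speedup,
            "tg_speedup": tg_speedup,
            "pp_efficiency": pp_speedup / cards,
            "tg_efficiency": tg_speedup / cards,
        }
    return table


for cards, row in scaling_table(RUNS).items():
    print(
        f"{cards} card(s): PP x{row['pp_speedup']:.2f} "
        f"({row['pp_efficiency']:.0%} per card), "
        f"TG x{row['tg_speedup']:.2f} ({row['tg_efficiency']:.0%} per card)"
    )
```

Going from one to two cards gives roughly 1.8x on PP and 1.7x on TG, while four cards come in slightly below two on both, so per-card efficiency drops to around 40% at four cards for this model and prompt size.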

tomtom13 May 1, 2026

@jack10768 much, much obliged! rocm / vulkan?

jack10768 May 1, 2026

This is on rocm 7.2.0. I heard you can get better performance on vulkan, but I tend to work with pretty large prompts and apparently rocm is better once prompts get large.

tomtom13 May 1, 2026

yeah, I just wanted to see how scaling behaves on those cards. TBH, I was looking into 4x 9070xt... but seeing that performance collapses on 4 compared to 2 makes me wonder whether r7900 might be a better choice. 64gb would be enough for my use-case of maybe max 4 pipes and a circa 70b q6ud model with context up to 256k. However, I'm hoping to get better performance than a dgx spark, since $2400 is starting to touch that territory.
