Eval bug: Request for Qwen3-Next-80B-A3B Vulkan Inference Optimization #17751

@engrtipusultan

Name and Version

llama-server --version
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
version: 7243 (13628d8)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0 for x86_64-linux-gnu

Operating systems

Linux

GGML backends

Vulkan

Hardware

Device Name AMD Radeon Graphics
PCI (domain:bus:dev.func) 0000:03:00.0
DeviceID:RevID 0x15E7.0xC1
OpenGL Driver Version Mesa 25.3.0 - kisak-mesa PPA
gfx_target_version gfx90c

GPU Type APU
Family Raven (RV)
ASIC Name Renoir
Chip Class GFX9
Shader Engine (SE) 1
Shader Array (SA/SH) per SE 1
CU per SA 8
Total CU 8
RenderBackendPlus (RB+) 2 (16 ROPs)
Peak Pixel Fill-Rate 32 GP/s
GPU Clock 200-2000 MHz
Peak FP32 2048 GFLOPS

VRAM Type DDR4
VRAM Bit Width 128-bit
VRAM Vendor Unknown
VRAM Size 16384 MiB
Memory Clock 400-1333 MHz
ResizableBAR Enabled
ECC Memory Not Supported

L1 Cache (per CU) 16 KiB
L2 Cache 1024 KiB (4 Banks)

Supported Power Profiles[
"3D_FULL_SCREEN",
"VIDEO",
"VR",
"COMPUTE",
"CUSTOM",
]

Models

Qwen3-Next-80B-A3B

Problem description & steps to reproduce

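The current Vulkan implementation of Qwen3-Next-80B-A3B is not optimized: it is much slower than other Qwen A3B models on the same hardware. In the llama-bench runs below, it peaks at about 35 t/s for pp512 and about 8.8 t/s for tg128, while Qwen3-30B-A3B reaches about 89 t/s and 20.9 t/s. When time permits in the coming weeks or months, please consider optimizing it.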

First Bad Commit

N/A

Relevant log output

Qwen3-Next-80B-A3B-Instruct llama-bench

llama-bench -m /home/tipu/AI/models/unsloth/Qwen3-Next/Qwen3-Next-80B-A3B-Instruct-UD-Q4_K_XL.gguf -ngl 99 --ubatch-size 128,512 --batch-size 2048 --mmap 0 -fa 0,1 --prio 3
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3next ?B Q4_K - Medium     |  42.01 GiB |    79.67 B | Vulkan     |  99 |      128 |  0 |    0 |           pp512 |         32.24 ± 0.45 |
| qwen3next ?B Q4_K - Medium     |  42.01 GiB |    79.67 B | Vulkan     |  99 |      128 |  0 |    0 |           tg128 |          8.79 ± 0.00 |
| qwen3next ?B Q4_K - Medium     |  42.01 GiB |    79.67 B | Vulkan     |  99 |      128 |  1 |    0 |           pp512 |         32.28 ± 0.51 |
| qwen3next ?B Q4_K - Medium     |  42.01 GiB |    79.67 B | Vulkan     |  99 |      128 |  1 |    0 |           tg128 |          8.82 ± 0.01 |
| qwen3next ?B Q4_K - Medium     |  42.01 GiB |    79.67 B | Vulkan     |  99 |      512 |  0 |    0 |           pp512 |         35.20 ± 0.23 |
| qwen3next ?B Q4_K - Medium     |  42.01 GiB |    79.67 B | Vulkan     |  99 |      512 |  0 |    0 |           tg128 |          8.80 ± 0.01 |
| qwen3next ?B Q4_K - Medium     |  42.01 GiB |    79.67 B | Vulkan     |  99 |      512 |  1 |    0 |           pp512 |         35.16 ± 0.23 |
| qwen3next ?B Q4_K - Medium     |  42.01 GiB |    79.67 B | Vulkan     |  99 |      512 |  1 |    0 |           tg128 |          8.79 ± 0.01 |

build: 13628d8bd (7243)

Qwen3-30B-A3B-Thinking-2507 llama-bench (more than double the speed for tg, and for pp at n_ubatch 512)

llama-bench -m /home/tipu/AI/models/unsloth/Qwen3-Think-A3B-GGUF/Qwen3-30B-A3B-Thinking-2507-UD-Q4_K_XL.gguf -ngl 99 --ubatch-size 128,512 --batch-size 2048 --mmap 0 -fa 0,1 --prio 3
load_backend: loaded RPC backend from /home/tipu/Applications/llamacpp/libggml-rpc.so
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon Graphics (RADV RENOIR) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 0 | matrix cores: none
load_backend: loaded Vulkan backend from /home/tipu/Applications/llamacpp/libggml-vulkan.so
load_backend: loaded CPU backend from /home/tipu/Applications/llamacpp/libggml-cpu-haswell.so
| model                          |       size |     params | backend    | ngl | n_ubatch | fa | mmap |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -------: | -: | ---: | --------------: | -------------------: |
| qwen3moe 30B.A3B Q4_K - Medium |  16.49 GiB |    30.53 B | Vulkan     |  99 |      128 |  0 |    0 |           pp512 |         55.86 ± 0.91 |
| qwen3moe 30B.A3B Q4_K - Medium |  16.49 GiB |    30.53 B | Vulkan     |  99 |      128 |  0 |    0 |           tg128 |         20.83 ± 0.03 |
| qwen3moe 30B.A3B Q4_K - Medium |  16.49 GiB |    30.53 B | Vulkan     |  99 |      128 |  1 |    0 |           pp512 |         53.52 ± 0.67 |
| qwen3moe 30B.A3B Q4_K - Medium |  16.49 GiB |    30.53 B | Vulkan     |  99 |      128 |  1 |    0 |           tg128 |         20.72 ± 0.03 |
| qwen3moe 30B.A3B Q4_K - Medium |  16.49 GiB |    30.53 B | Vulkan     |  99 |      512 |  0 |    0 |           pp512 |         89.25 ± 0.20 |
| qwen3moe 30B.A3B Q4_K - Medium |  16.49 GiB |    30.53 B | Vulkan     |  99 |      512 |  0 |    0 |           tg128 |         20.87 ± 0.01 |
| qwen3moe 30B.A3B Q4_K - Medium |  16.49 GiB |    30.53 B | Vulkan     |  99 |      512 |  1 |    0 |           pp512 |         84.70 ± 0.53 |
| qwen3moe 30B.A3B Q4_K - Medium |  16.49 GiB |    30.53 B | Vulkan     |  99 |      512 |  1 |    0 |           tg128 |         20.75 ± 0.13 |

build: 13628d8bd (7243)
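
To put a number on the gap, here is a small sketch that computes the throughput ratio between the two models. The t/s values are copied by hand from the fa=0 rows of the two llama-bench tables above. The dictionary names and the `speed_ratios` helper are illustrative only and are not part of llama.cpp.

```python
# Mean t/s copied from the fa=0 rows of the two llama-bench tables.
# Keys are (n_ubatch, test). All names here are illustrative only.
qwen3_next_80b = {
    (128, "pp512"): 32.24,
    (128, "tg128"): 8.79,
    (512, "pp512"): 35.20,
    (512, "tg128"): 8.80,
}
qwen3_30b = {
    (128, "pp512"): 55.86,
    (128, "tg128"): 20.83,
    (512, "pp512"): 89.25,
    (512, "tg128"): 20.87,
}


def speed_ratios(slow, fast):
    """Return fast/slow throughput ratio for every configuration in `slow`."""
    return {key: fast[key] / slow[key] for key in slow}


ratios = speed_ratios(qwen3_next_80b, qwen3_30b)
for (n_ubatch, test), ratio in sorted(ratios.items()):
    print(f"n_ubatch={n_ubatch:<4} {test}: 30B-A3B runs at {ratio:.2f}x the t/s of 80B-A3B")
```

Both models activate roughly 3B parameters per token, so a ratio well above 1 on the same device suggests the gap comes from the Qwen3-Next code path and not from model size alone. That reading is an interpretation of these numbers, not a profiled result.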
