# Vulkan: Performance degradation when context increases a small amount #24005

@johnkarlhill

Name and Version

Windows 11
llama.cpp Vulkan
build: 55ac090 (9458) & build: 210a657 (9466)
built with Clang 19.1.5 for Windows x86_64

Operating systems

Windows

Which llama.cpp modules do you know to be affected?

llama-server

Command line

set GGML_VK_FORCE_MAX_ALLOCATION=1
set OMP_NUM_THREADS=8
set GGML_VK_DISABLE_COOPMAT=1

C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-27B-Q4_K_M-MTP.gguf --mmap 0 --repetitions 1 --n-prompt 512,1024,2048,4096,8192,16384,32768 --n-gen 512,1024,2048,4096
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf --mmap 0 --repetitions 1 --n-prompt 512,1024,2048,4096,8192,16384,32768 --n-gen 512,1024,2048,4096
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-Q4_K_M.gguf --mmap 0 --repetitions 1 --n-prompt 512,1024,2048,4096,8192,16384,32768 --n-gen 512,1024,2048,4096
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-27B-UD-Q4_K_XL.gguf --mmap 0 --repetitions 1 --n-prompt 512,1024,2048,4096,8192,16384,32768 --n-gen 512,1024,2048,4096
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf --mmap 0 --repetitions 1 --n-prompt 512,1024,2048,4096,8192,16384,32768 --n-gen 512,1024,2048,4096
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --mmap 0 --repetitions 1 --n-prompt 512,1024,2048,4096,8192,16384,32768 --n-gen 512,1024,2048,4096

C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-27B-Q4_K_M-MTP.gguf --mmap 0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-Q4_K_M.gguf --mmap 0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-27B-UD-Q4_K_XL.gguf --mmap 0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf --mmap 0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --mmap 0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768

C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --mmap 0 --cache-type-k q4_1 --cache-type-v q4_1 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-27B-Q4_K_M-MTP.gguf --mmap 0 --cache-type-k q8_0 --cache-type-v q8_0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
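
The command lines above differ only in the model file, the token counts, the depth list, and the KV cache type, so they could be generated instead of copied by hand. Below is a minimal Python sketch, assuming the same install paths as in this report. `bench_cmd` and `join_csv` are hypothetical helper names, and only flags that already appear in the commands above are used. The sketch builds argument lists and prints them; it does not run anything.

```python
# Dry-run sketch: builds llama-bench argument lists, does not execute them.
# Paths are copied from the report; helper names are illustrative only.
BENCH = r"C:\llama.cpp-prod\llama-bench.exe"
MODEL_DIR = r"C:\llama.cpp-prod\models"

# Environment variables set before the runs in the report.
ENV = {
    "GGML_VK_FORCE_MAX_ALLOCATION": "1",
    "OMP_NUM_THREADS": "8",
    "GGML_VK_DISABLE_COOPMAT": "1",
}

DEPTHS = [0, 4096, 8192, 16384, 32768]


def join_csv(values):
    """Format a list of integers as the comma-separated form llama-bench takes."""
    return ",".join(str(v) for v in values)


def bench_cmd(model, n_prompt, n_gen, n_depth=None, cache_type=None):
    """Return one llama-bench command as a list of arguments."""
    cmd = [BENCH, "--model", MODEL_DIR + "\\" + model, "--mmap", "0"]
    if cache_type is not None:
        # The report uses the same type for both K and V caches.
        cmd += ["--cache-type-k", cache_type, "--cache-type-v", cache_type]
    cmd += [
        "--repetitions", "1",
        "--n-prompt", join_csv(n_prompt),
        "--n-gen", join_csv(n_gen),
    ]
    if n_depth is not None:
        cmd += ["--n-depth", join_csv(n_depth)]
    return cmd


if __name__ == "__main__":
    for model in ["Qwen3.6-27B-Q4_K_M-MTP.gguf", "Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf"]:
        print(" ".join(bench_cmd(model, [0], [128], n_depth=DEPTHS)))
```

To actually run a command, the list could be passed to `subprocess.run(cmd, env={**os.environ, **ENV}, check=True)`.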

Problem description & steps to reproduce

As context size grows, throughput (tokens/s) drops for both prompt processing (PP) and token generation (TG), and the drop gets larger with each doubling of the context.

Prompt Processing (pp) Performance:

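| Context Size (tokens) | Tokens/s | Change from Previous (tokens/s) | % Change |
| --- | --- | --- | --- |
| 512 | 392.48 | - | - |
| 1,024 | 391.77 | -0.71 | -0.18% |
| 2,048 | 385.26 | -6.51 | -1.66% |
| 4,096 | 364.78 | -20.48 | -5.32% |
| 8,192 | 330.55 | -34.23 | -9.38% |
| 16,384 | 278.57 | -51.98 | -15.73% |
| 32,768 | 211.34 | -67.23 | -24.13% |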
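
The "Change from Previous" and "% Change" columns can be recomputed from the raw tokens/s figures. Below is a minimal Python sketch using the PP measurements above. `step_changes` is a hypothetical helper name; each percentage is taken relative to the previous row, which is assumed to be how the table was derived.

```python
# PP measurements from the table above: (context size, tokens/s).
PP = [
    (512, 392.48),
    (1024, 391.77),
    (2048, 385.26),
    (4096, 364.78),
    (8192, 330.55),
    (16384, 278.57),
    (32768, 211.34),
]


def step_changes(rows):
    """Return (context, tokens_per_s, delta, pct) per row.

    delta and pct compare each row with the one before it;
    both are None for the first row, which has no predecessor.
    """
    out = []
    prev = None
    for ctx, tps in rows:
        if prev is None:
            out.append((ctx, tps, None, None))
        else:
            delta = tps - prev
            out.append((ctx, tps, round(delta, 2), round(100.0 * delta / prev, 2)))
        prev = tps
    return out


if __name__ == "__main__":
    for ctx, tps, delta, pct in step_changes(PP):
        print(ctx, tps, delta, pct)
```

The same function works unchanged on the TG figures.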

Token Generation (tg) Performance:

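| Context Size (tokens) | Tokens/s | Change from Previous (tokens/s) | % Change |
| --- | --- | --- | --- |
| 128 | 26.76 | - | - |
| 4,096 | 22.72 | -4.04 | -15.10% |
| 8,192 | 19.57 | -3.15 | -13.86% |
| 16,384 | 15.25 | -4.32 | -22.07% |
| 32,768 | 10.57 | -4.68 | -30.69% |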
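
Tokens/s is a reciprocal quantity, so it can help to also view the TG figures as time per generated token. Below is a minimal Python sketch that converts the table above to milliseconds per token and reports how much latency each additional 1,024 tokens of context adds between consecutive rows. The helper names are hypothetical, and the context sizes are used exactly as labeled in the table.

```python
# TG measurements from the table above: (context size, tokens/s).
TG = [
    (128, 26.76),
    (4096, 22.72),
    (8192, 19.57),
    (16384, 15.25),
    (32768, 10.57),
]


def ms_per_token(tokens_per_s):
    """Convert throughput (tokens/s) to latency (milliseconds per token)."""
    return 1000.0 / tokens_per_s


def added_ms_per_1k_context(rows):
    """For each pair of consecutive rows, return (ctx_from, ctx_to, extra_ms),
    where extra_ms is the added per-token latency per 1,024 extra context tokens."""
    out = []
    for (c0, t0), (c1, t1) in zip(rows, rows[1:]):
        extra_ms = ms_per_token(t1) - ms_per_token(t0)
        out.append((c0, c1, 1024.0 * extra_ms / (c1 - c0)))
    return out


if __name__ == "__main__":
    for ctx, tps in TG:
        print(f"{ctx:>6}: {ms_per_token(tps):.2f} ms/token")
    for c0, c1, slope in added_ms_per_1k_context(TG):
        print(f"{c0:>6} -> {c1:>6}: +{slope:.2f} ms/token per 1,024 tokens of context")
```

Comparing the per-1,024-token increments across rows shows whether the per-token cost grows roughly linearly with context or faster than that.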

Direct Comparison at Matching Context Points:

| Context Size (tokens) | PP (tokens/s) | TG (tokens/s) | PP/TG Ratio |
| --- | --- | --- | --- |
| 4,096 | 364.78 | 22.72 | 16.06x |
| 8,192 | 330.55 | 19.57 | 16.89x |
| 16,384 | 278.57 | 15.25 | 18.27x |
| 32,768 | 211.34 | 10.57 | 19.99x |

Rate of Decline Analysis:
Prompt Processing Decline (4096→32768):
364.78 → 211.34 = -153.44 tokens/s (-42.06%)

Token Generation Decline (4096→32768):
22.72 → 10.57 = -12.15 tokens/s (-53.48%)
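
The overall decline figures and the PP/TG ratios can be checked with a few lines of arithmetic. Below is a minimal Python sketch using the 4,096 and 32,768 token measurements from this report. `relative_decline` and `pp_tg_ratio` are hypothetical helper names.

```python
def relative_decline(start, end):
    """Return (absolute change, percent change) going from start to end."""
    change = end - start
    return change, 100.0 * change / start


def pp_tg_ratio(pp_tps, tg_tps):
    """How many times faster prompt processing is than token generation."""
    return pp_tps / tg_tps


# Measurements at 4,096 and 32,768 tokens of context, taken from the report.
PP_4K, PP_32K = 364.78, 211.34
TG_4K, TG_32K = 22.72, 10.57

if __name__ == "__main__":
    for name, start, end in [("PP", PP_4K, PP_32K), ("TG", TG_4K, TG_32K)]:
        change, pct = relative_decline(start, end)
        print(f"{name}: {start} -> {end} = {change:.2f} tokens/s ({pct:.2f}%)")
    print(f"PP/TG at 4,096: {pp_tg_ratio(PP_4K, TG_4K):.2f}x")
    print(f"PP/TG at 32,768: {pp_tg_ratio(PP_32K, TG_32K):.2f}x")
```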

Models tested:
gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-Q4_K_M.gguf
Qwen3.6-27B-UD-Q4_K_XL.gguf
Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
Qwen3.6-27B-Q4_K_M-MTP.gguf

test3.txt
Verbose.txt

VP_VULKANINFO_Intel(R)_Arc(TM)_Pro_B70_Graphics_101_8517.json

First Bad Commit

No response

Relevant log output

Logs
