Name and Version
Windows 11
llama.cpp Vulkan
build: 55ac090 (9458) & build: 210a657 (9466)
built with Clang 19.1.5 for Windows x86_64
Operating systems
Windows
Which llama.cpp modules do you know to be affected?
llama-server
Command line
set GGML_VK_FORCE_MAX_ALLOCATION=1
set OMP_NUM_THREADS=8
set GGML_VK_DISABLE_COOPMAT=1
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-27B-Q4_K_M-MTP.gguf --mmap 0 --repetitions 1 --n-prompt 512,1024,2048,4096,8192,16384,32768 --n-gen 512,1024,2048,4096
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf --mmap 0 --repetitions 1 --n-prompt 512,1024,2048,4096,8192,16384,32768 --n-gen 512,1024,2048,4096
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-Q4_K_M.gguf --mmap 0 --repetitions 1 --n-prompt 512,1024,2048,4096,8192,16384,32768 --n-gen 512,1024,2048,4096
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-27B-UD-Q4_K_XL.gguf --mmap 0 --repetitions 1 --n-prompt 512,1024,2048,4096,8192,16384,32768 --n-gen 512,1024,2048,4096
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf --mmap 0 --repetitions 1 --n-prompt 512,1024,2048,4096,8192,16384,32768 --n-gen 512,1024,2048,4096
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --mmap 0 --repetitions 1 --n-prompt 512,1024,2048,4096,8192,16384,32768 --n-gen 512,1024,2048,4096
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-27B-Q4_K_M-MTP.gguf --mmap 0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-Q4_K_M.gguf --mmap 0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-27B-UD-Q4_K_XL.gguf --mmap 0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-35B-A3B-UD-Q4_K_M.gguf --mmap 0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --mmap 0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-27B-Q4_K_M-MTP.gguf --mmap 0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf --mmap 0 --cache-type-k q4_1 --cache-type-v q4_1 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
C:\llama.cpp-prod\llama-bench.exe --model C:\llama.cpp-prod\models\Qwen3.6-27B-Q4_K_M-MTP.gguf --mmap 0 --cache-type-k q8_0 --cache-type-v q8_0 --repetitions 1 --n-prompt 0 --n-gen 128 --n-depth 0,4096,8192,16384,32768
Problem description & steps to reproduce
As context size increases, throughput (tokens/s) for both prompt processing (PP) and token generation (TG) drops, and the drop gets steeper with each doubling of context.
Prompt Processing (pp) Performance:
| Context Size | Tokens/s | Change from Previous | % Change |
|---|---|---|---|
| 512 | 392.48 | - | - |
| 1,024 | 391.77 | -0.71 | -0.18% |
| 2,048 | 385.26 | -6.51 | -1.66% |
| 4,096 | 364.78 | -20.48 | -5.32% |
| 8,192 | 330.55 | -34.23 | -9.38% |
| 16,384 | 278.57 | -51.98 | -15.73% |
| 32,768 | 211.34 | -67.23 | -24.13% |
Token Generation (tg) Performance:
| Context Size | Tokens/s | Change from Previous | % Change |
|---|---|---|---|
| 128 | 26.76 | - | - |
| 4,096 | 22.72 | -4.04 | -15.10% |
| 8,192 | 19.57 | -3.15 | -13.86% |
| 16,384 | 15.25 | -4.32 | -22.07% |
| 32,768 | 10.57 | -4.68 | -30.69% |
Direct Comparison at Matching Context Points:
| Context Size | PP (tokens/s) | TG (tokens/s) | PP/TG Ratio |
|---|---|---|---|
| 4,096 | 364.78 | 22.72 | 16.06x |
| 8,192 | 330.55 | 19.57 | 16.89x |
| 16,384 | 278.57 | 15.25 | 18.27x |
| 32,768 | 211.34 | 10.57 | 19.99x |
Rate of Decline Analysis:
Prompt Processing Decline (4096→32768):
364.78 → 211.34 = -153.44 tokens/s (-42.06%)
Token Generation Decline (4096→32768):
22.72 → 10.57 = -12.15 tokens/s (-53.48%)
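
For anyone who wants to re-check the derived columns, here is a minimal Python sketch that recomputes the step changes, the PP/TG ratios, and the 4096→32768 decline from the raw tokens/s figures reported above. The function names (`step_changes`, `pp_tg_ratio`, `total_decline`) are made up for this illustration and are not part of llama.cpp. Values are rounded to two decimals, so a result may differ from a hand-computed figure in the last digit.

```python
# Throughput figures copied from the tables in this report (tokens/s).
pp = {512: 392.48, 1024: 391.77, 2048: 385.26, 4096: 364.78,
      8192: 330.55, 16384: 278.57, 32768: 211.34}
tg = {128: 26.76, 4096: 22.72, 8192: 19.57, 16384: 15.25, 32768: 10.57}


def step_changes(series):
    """Return (context, tokens/s, delta, % change) rows, each relative to the previous row."""
    rows = []
    prev = None
    for ctx, tps in series.items():
        if prev is None:
            rows.append((ctx, tps, None, None))
        else:
            delta = tps - prev
            rows.append((ctx, tps, round(delta, 2), round(100 * delta / prev, 2)))
        prev = tps
    return rows


def pp_tg_ratio(ctx):
    """PP throughput divided by TG throughput at the same context size."""
    return round(pp[ctx] / tg[ctx], 2)


def total_decline(series, start, end):
    """Absolute and percent change in tokens/s between two context sizes."""
    delta = series[end] - series[start]
    return round(delta, 2), round(100 * delta / series[start], 2)


if __name__ == "__main__":
    for name, series in (("pp", pp), ("tg", tg)):
        for row in step_changes(series):
            print(name, *row)
        print(name, "4096->32768:", total_decline(series, 4096, 32768))
    for ctx in (4096, 8192, 16384, 32768):
        print("pp/tg ratio at", ctx, "=", pp_tg_ratio(ctx))
```

The dictionaries hold only the single-repetition numbers from this report, so the output says nothing about run-to-run variance.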
Models tested:
gemma-4-26B-A4B-it-UD-Q4_K_XL.gguf
Qwen3.5-35B-A3B-uncensored-heretic-v2-Native-MTP-Preserved-Q4_K_M.gguf
Qwen3.6-27B-UD-Q4_K_XL.gguf
Qwen3.6-35B-A3B-UD-Q4_K_M.gguf
Qwen3.6-35B-A3B-UD-Q4_K_XL.gguf
Qwen3.6-27B-Q4_K_M-MTP.gguf
test3.txt
Verbose.txt
VP_VULKANINFO_Intel(R)_Arc(TM)_Pro_B70_Graphics_101_8517.json
First Bad Commit
No response
Relevant log output
Logs